WEKA Validates 10x Throughput on OCI H100

June 9, 2026 | By LDS Team

Relevance Score: 6.2

WEKA and Oracle Cloud Infrastructure announced joint production-scale benchmarks showing that WEKA's NeuralMesh platform with Augmented Memory Grid on OCI H100 bare-metal infrastructure delivered significant throughput and density improvements versus DRAM-only configurations. According to the joint press release published on PR Newswire and WEKA's newsroom on June 9, 2026, the tests on a nine-node OCI bare-metal H100 cluster (72 GPUs) with 100,000-token context windows showed 10x more concurrent users, 10x higher token throughput, and 7x more tokens per GPU compared to the DRAM-only baseline. The Oracle blog post supplying system details notes the configuration used 16x Gen4 NVMe drives per node pooled into an Augmented Memory Grid. Pablo Selem, senior director, software development, Oracle Cloud Infrastructure, is quoted in the announcement on the benchmarks' implications for memory bottlenecks and GPU utilization.

What happened

WEKA and Oracle Cloud Infrastructure published joint production-scale benchmark results showing that WEKA's NeuralMesh platform with Augmented Memory Grid on OCI H100 bare-metal servers materially increased inference density and throughput compared with a DRAM-only baseline. Per the joint press release on PR Newswire and WEKA's newsroom (June 9, 2026), the nine-node OCI bare-metal H100 cluster (72 GPUs) with 100,000-token context windows delivered 10x more concurrent users, 10x higher token throughput, and 7x more tokens per GPU versus DRAM-only configurations. The announcement quotes Pablo Selem, senior director, software development, Oracle Cloud Infrastructure: "Enterprise AI workloads are pushing context windows and GPU utilization to new limits." He presents the benchmarks as evidence that removing memory bottlenecks lets customers support larger inference workloads.

Technical details

Per the Oracle blog and WEKA release, the validated cluster used 9 OCI bare-metal nodes with 8 H100 GPUs per node (72 GPUs total) and 16 Gen4 NVMe drives per node (3.84 TiB each) pooled into an Augmented Memory Grid, expanding the active cache working set from 8.64 TiB of DRAM to 287 TiB of usable NVMe in the test. The announcement states that NeuralMesh with Augmented Memory Grid reached approximately 2,000,000 tokens per second on OCI versus under 200,000 tokens per second for the DRAM-only baseline, and scaled past 5,000 concurrent users versus about 600 on the baseline. The Oracle blog also references prior validation work in 2025 showing nearly 20x faster time to first token (TTFT) at 128K context in vLLM comparisons, framing the current effort as a production-scale evaluation of serving economics and SLO behavior.
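The reported figures can be cross-checked with simple arithmetic. The sketch below is illustrative only: it takes the numbers exactly as quoted above, treats "under 200,000" and "about 600" as round values, and uses constant and function names that are ours, not the vendors'.

```python
# Figures as quoted in the article. "under 200,000", "about 600" and
# "past 5,000" are bounds or approximations, so derived ratios are rough.
NODES = 9
GPUS_PER_NODE = 8
DRIVES_PER_NODE = 16
DRIVE_TIB = 3.84
DRAM_TIB = 8.64
USABLE_NVME_TIB = 287
AMG_TOKENS_PER_S = 2_000_000
BASELINE_TOKENS_PER_S = 200_000
AMG_USERS = 5_000
BASELINE_USERS = 600


def summarize():
    """Derive cluster totals and ratios from the quoted figures."""
    gpus = NODES * GPUS_PER_NODE
    raw_nvme_tib = NODES * DRIVES_PER_NODE * DRIVE_TIB
    return {
        "gpus": gpus,
        "raw_nvme_tib": round(raw_nvme_tib, 2),
        # Share of raw flash that ends up as usable cache in the test.
        "usable_fraction": round(USABLE_NVME_TIB / raw_nvme_tib, 2),
        # Growth of the cache working set relative to DRAM alone.
        "cache_expansion": round(USABLE_NVME_TIB / DRAM_TIB, 1),
        "throughput_ratio": round(AMG_TOKENS_PER_S / BASELINE_TOKENS_PER_S, 1),
        "tokens_per_s_per_gpu": round(AMG_TOKENS_PER_S / gpus),
        "user_ratio": round(AMG_USERS / BASELINE_USERS, 1),
    }


summary = summarize()
for name, value in summary.items():
    print(f"{name}: {value}")
```

Because the user counts are given only as "past 5,000" and "about 600", the derived user ratio is a rough floor and should not be read as a check on the headline 10x figure. The 7x tokens-per-GPU figure cannot be derived from the numbers quoted here.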

Editorial analysis: Industry context

Companies building for long-context and agentic AI workloads face a common bottleneck: limited cache capacity and eviction behavior force expensive GPU recomputation of context that has already been processed, known in the industry as the prefill toll. As a broader industry pattern, extending working memory with NVMe-backed layers, as WEKA and OCI demonstrate, is a practical lever for increasing serving density and amortizing GPU cost across more users and tokens, but it depends on careful engineering of caching policies, NVMe throughput, and application-level SLO targets. Independent verification, workload diversity, and tail-latency measurements remain important for practitioners because flash-backed memory approaches can widen variance in per-request latency even as they improve aggregate throughput.
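The link between cache capacity and recomputation can be illustrated with a toy model. The sketch below assumes a plain LRU policy, one fixed-size cache entry per user session, and uniformly random request arrivals; none of this is taken from the WEKA or Oracle material, and real KV-cache managers are considerably more involved.

```python
import random
from collections import OrderedDict


def count_recomputes(requests, capacity):
    """Count cache misses; each miss stands in for one full prefill recompute."""
    cache = OrderedDict()  # session id -> placeholder for cached KV state
    misses = 0
    for session in requests:
        if session in cache:
            cache.move_to_end(session)     # hit: mark as most recently used
        else:
            misses += 1                    # miss: context must be prefilled again
            cache[session] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used session
    return misses


# Hypothetical workload: 500 sessions issuing 20,000 requests at random.
rng = random.Random(0)
sessions = 500
requests = [rng.randrange(sessions) for _ in range(20_000)]

small = count_recomputes(requests, capacity=50)   # cache holds 10% of sessions
large = count_recomputes(requests, capacity=500)  # cache holds every session
print(f"recomputes with small cache: {small}")
print(f"recomputes with large cache: {large}")
```

With the larger capacity, every session is prefilled only once, while the smaller cache keeps evicting sessions that return later. The model ignores the latency of reading from a slower tier, which is exactly the trade-off the tail-latency caveat above is about.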

What to watch

For practitioners

Monitor these indicators to judge generalisability and operational risk:

• Independent third-party benchmarks that replicate the 9-node, 72-GPU, 100,000-token configuration and publish latency percentiles.
• Reported SLO and tail-latency behavior under mixed workloads, not just aggregate tokens-per-second.
• Cost-per-token and cost-per-concurrent-user comparisons that include NVMe capacity and OCI bare-metal pricing.
• Integration and compatibility notes for popular serving frameworks such as vLLM and other cache-aware runtimes.
• Customer case studies or marketplace listings on Oracle Cloud Marketplace that show real-world adoption and operational recipes.
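For readers who want to check tail latency in their own tests, a minimal sketch of a percentile summary follows. It uses only the Python standard library; the function name and the synthetic log-normal samples are assumptions for illustration and do not represent measured WEKA or OCI data.

```python
import random
import statistics


def latency_percentiles(samples_ms):
    """Return p50, p95 and p99 of per-request latencies in milliseconds."""
    # n=100 yields 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Synthetic, made-up latencies: same median scale, different tail spread.
rng = random.Random(1)
narrow = [rng.lognormvariate(4.0, 0.2) for _ in range(5_000)]
wide = [rng.lognormvariate(4.0, 0.8) for _ in range(5_000)]

for label, samples in (("narrow tail", narrow), ("wide tail", wide)):
    stats = latency_percentiles(samples)
    mean = statistics.fmean(samples)
    print(label, f"mean={mean:.1f}", {k: round(v, 1) for k, v in stats.items()})
```

Reporting p95 and p99 next to the mean makes it visible when a configuration improves aggregate throughput while a minority of requests get much slower.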

Bottom line

Per the companies' joint announcement and the accompanying Oracle blog, the Augmented Memory Grid approach materially increased tokens served and concurrent-user capacity on OCI H100 hardware in the vendors' production-scale tests. Editorial analysis: organizations evaluating long-context inference should view NVMe-backed augmentation as a credible architectural option for raising serving density, while treating vendor benchmarks as a starting point that requires independent validation against their own workloads and SLOs.

Key Points

1. WEKA and OCI reported 10x improvements in concurrent users and token throughput on a 72-GPU H100 cluster, improving serving density.
2. Augmented Memory Grid extends cache from 8.64 TiB DRAM to 287 TiB usable NVMe, raising aggregate tokens but requiring careful latency and SLO validation.
3. For practitioners, NVMe-backed memory layers are a practical way to lower cost per token, but independent benchmarks and tail-latency measures are essential.

Scoring Rationale

A joint vendor announcement from WEKA and Oracle reports 10x inference throughput gains on OCI H100 hardware using NVMe-backed KV cache augmentation. The technical approach is directly relevant to practitioners running long-context LLMs and the specific numbers are striking, but the results are vendor-reported without independent validation, which places this story in the solid range rather than the notable range.

Sources

Public references used for this report.

weka.io: WEKA and Oracle Cloud Infrastructure Validate 10x Throughput Gains

prnewswire.com: WEKA and Oracle Cloud Infrastructure Validate 10x Throughput Gains for Long-Context AI Inference

blogs.oracle.com: Scaling Long-Context Inference on OCI with WEKA Augmented Memory Grid
