WEKA Validates 10x Throughput on OCI H100

WEKA and Oracle Cloud Infrastructure announced joint production-scale benchmarks showing that WEKA's NeuralMesh platform with Augmented Memory Grid on OCI H100 bare-metal infrastructure delivered significant throughput and density improvements versus DRAM-only configurations. According to the joint press release published on PR Newswire and WEKA's newsroom on June 9, 2026, the tests on a nine-node OCI bare-metal H100 cluster (72 GPUs) with 100,000-token context windows showed 10x more concurrent users, 10x higher token throughput, and 7x more tokens per GPU compared to the DRAM-only baseline. The Oracle blog post supplying system details notes the configuration used 16x Gen4 NVMe drives per node pooled into an Augmented Memory Grid. Pablo Selem, senior director, software development, Oracle Cloud Infrastructure, is quoted in the announcement on the benchmarks' implications for memory bottlenecks and GPU utilization.
What happened
WEKA and Oracle Cloud Infrastructure published joint production-scale benchmark results showing that WEKA's NeuralMesh platform with Augmented Memory Grid on OCI H100 bare-metal servers materially increased inference density and throughput compared with a DRAM-only baseline. Per the joint press release on PR Newswire and WEKA's newsroom (June 9, 2026), the nine-node OCI bare-metal H100 cluster (72 GPUs) with 100,000-token context windows delivered 10x more concurrent users, 10x higher token throughput, and 7x more tokens per GPU versus DRAM-only configurations. The announcement quotes Pablo Selem, senior director, software development, Oracle Cloud Infrastructure: "Enterprise AI workloads are pushing context windows and GPU utilization to new limits." According to the announcement, Selem attributes the benchmark gains to removing memory bottlenecks so that customers can support larger inference workloads.
Technical details
Per the Oracle blog and WEKA release, the validated cluster used 9 OCI bare-metal nodes with 8 H100 GPUs per node (72 GPUs total) and 16 Gen4 NVMe drives per node (3.84 TiB each) pooled into an Augmented Memory Grid, expanding the active cache working set from 8.64 TiB of DRAM to 287 TiB of usable NVMe in the test. The announcement states that NeuralMesh with Augmented Memory Grid reached approximately 2,000,000 tokens per second on OCI versus under 200,000 tokens per second for the DRAM-only baseline, and scaled past 5,000 concurrent users versus about 600 on the baseline. The Oracle blog also references prior validation work in 2025 showing nearly 20x faster time to first token (TTFT) at 128K context in vLLM comparisons, and frames the current effort as a production-scale evaluation of serving economics and SLO behavior.
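The figures quoted above can be cross-checked with simple arithmetic. The sketch below is illustrative only: the input values are the ones reported in the announcement, while the variable names and every derived ratio are this article's own calculations, not numbers published by WEKA or Oracle.

```python
# Inputs as quoted in the announcement and Oracle blog.
NODES = 9
GPUS_PER_NODE = 8
DRIVES_PER_NODE = 16
DRIVE_TIB = 3.84            # capacity per NVMe drive
DRAM_CACHE_TIB = 8.64       # DRAM-only cache working set
NVME_USABLE_TIB = 287       # usable NVMe working set in the test

AMG_TOKENS_PER_S = 2_000_000       # "approximately"
BASELINE_TOKENS_PER_S = 200_000    # baseline was reported as "under" this
AMG_USERS = 5_000                  # "scaling past"
BASELINE_USERS = 600               # "about"

# Derived values (not reported by the vendors).
gpus = NODES * GPUS_PER_NODE
raw_nvme_tib = NODES * DRIVES_PER_NODE * DRIVE_TIB
cache_expansion = NVME_USABLE_TIB / DRAM_CACHE_TIB
# The baseline is an upper bound, so this ratio is a lower bound.
throughput_ratio_floor = AMG_TOKENS_PER_S / BASELINE_TOKENS_PER_S
user_ratio = AMG_USERS / BASELINE_USERS

print(f"GPUs: {gpus}")
print(f"Raw NVMe capacity: {raw_nvme_tib:.2f} TiB")
print(f"Cache working-set expansion: {cache_expansion:.1f}x")
print(f"Throughput ratio (at least): {throughput_ratio_floor:.1f}x")
print(f"Concurrent-user ratio (approx.): {user_ratio:.1f}x")
```

Because the announcement gives bounds ("under 200,000", "past 5,000", "about 600") rather than exact values, the ratios here are rough and should be read alongside the headline multipliers rather than as a replacement for them. The source does not explain how the 287 TiB usable figure relates to raw drive capacity.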
Editorial analysis: Industry context
Companies building for long-context and agentic AI workloads face a common bottleneck: limited cache capacity and eviction behavior that force expensive GPU recomputation, known in the industry as the prefill toll. As an industry pattern, extending working memory with NVMe-backed layers, as WEKA and OCI demonstrate, is a practical lever to increase serving density and amortize GPU cost across more users and tokens, but it depends on careful engineering of caching policies, NVMe throughput, and application-level SLO targets. Independent verification, workload diversity, and tail-latency measurements remain important for practitioners because flash-backed memory approaches can widen variance in per-request latency even as they improve aggregate throughput.
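The link between cache capacity and recomputation can be illustrated with a toy model. The sketch below assumes a simple least-recently-used (LRU) cache keyed by session, where every miss stands in for one prefill recomputation. It is not WEKA's implementation, and the function name, the access pattern, and the capacities are hypothetical values chosen for illustration.

```python
from collections import OrderedDict


def count_prefill_recomputes(requests, capacity):
    """Count cache misses for a stream of session IDs under LRU eviction.

    Each miss models one prefill that must be recomputed on the GPU.
    """
    cache = OrderedDict()
    recomputes = 0
    for session in requests:
        if session in cache:
            # Hit: mark the session as most recently used.
            cache.move_to_end(session)
        else:
            # Miss: recompute, store, and evict the oldest entry if over capacity.
            recomputes += 1
            cache[session] = True
            if len(cache) > capacity:
                cache.popitem(last=False)
    return recomputes


# Hypothetical workload: 8 sessions served round-robin for 10 rounds.
requests = [session for _ in range(10) for session in range(8)]

small_cache = count_prefill_recomputes(requests, capacity=4)
large_cache = count_prefill_recomputes(requests, capacity=8)

print(f"capacity=4 -> {small_cache} recomputes out of {len(requests)} requests")
print(f"capacity=8 -> {large_cache} recomputes out of {len(requests)} requests")
```

In this deliberately adversarial pattern, a cache smaller than the active session set misses on every request, while a cache that holds all sessions only pays the initial cold misses. Real serving workloads are less regular, so the gap will differ in practice.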
What to watch
Practitioners can monitor these indicators to judge generalisability and operational risk:
- Independent third-party benchmarks that replicate the 9-node, 72-GPU, 100,000-token configuration and publish latency percentiles.
- Reported SLO and tail-latency behavior under mixed workloads, not just aggregate tokens-per-second.
- Cost-per-token and cost-per-concurrent-user comparisons that include NVMe capacity and OCI bare-metal pricing.
- Integration and compatibility notes for popular serving frameworks such as vLLM and other cache-aware runtimes.
- Customer case studies or marketplace listings on Oracle Cloud Marketplace that show real-world adoption and operational recipes.
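Two of the indicators above, latency percentiles and cost per token, are straightforward to compute from one's own measurements. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the latency samples are synthetic, and the hourly cluster price is a placeholder rather than actual OCI pricing.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def cost_per_million_tokens(cluster_usd_per_hour, tokens_per_second):
    """Convert an hourly cluster cost and sustained throughput into USD per 1M tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    tokens_per_hour = tokens_per_second * 3600
    return cluster_usd_per_hour / tokens_per_hour * 1_000_000


# Synthetic latencies (ms): mostly fast requests with a slow tail.
latencies_ms = [100] * 95 + [2000] * 5

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)
print(f"mean={mean_ms:.0f} ms, p50={p50_ms} ms, p99={p99_ms} ms")

# Placeholder price: NOT a real OCI rate. Substitute the actual cluster cost.
HYPOTHETICAL_CLUSTER_USD_PER_HOUR = 1000.0
usd = cost_per_million_tokens(HYPOTHETICAL_CLUSTER_USD_PER_HOUR, 2_000_000)
print(f"cost per 1M tokens at the placeholder price: ${usd:.4f}")
```

The synthetic sample shows why percentiles matter: the median looks healthy while the 99th percentile exposes the slow tail that an aggregate tokens-per-second number would hide.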
Bottom line
Per the companies' joint announcement and accompanying Oracle blog, the Augmented Memory Grid approach materially increased tokens served and concurrent-user capacity on OCI H100 hardware in the vendors' production-scale tests. Editorial analysis: organizations evaluating long-context inference should view NVMe-backed augmentation as a vendor-demonstrated architectural option to raise serving density, while treating vendor benchmarks as a starting point that requires independent validation against their own workloads and SLOs.
Scoring Rationale
The announcement reports a significant vendor-validated improvement in long-context inference density on OCI H100 hardware, which is directly relevant to practitioners running large-context LLMs. The score reflects meaningful operational implications but is tempered because results come from vendor-led benchmarks and require independent replication.


