NVIDIA Introduces Inference Context Memory Storage

NVIDIA has introduced the Inference Context Memory Storage (ICMS) platform within its Rubin architecture to address the memory scaling limits of agentic AI. The platform creates a G3.5 tier of Ethernet-attached flash built on BlueField-4 and Spectrum-X. Prestaging KV cache in this tier is claimed to deliver up to 5x higher tokens per second (TPS) and up to 5x better power efficiency for long-context inference. Vendors plan to offer compatible systems in the second half of this year, with implications for datacentre design and orchestration.
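To see why long-context KV cache strains GPU HBM, a common back-of-the-envelope estimate is 2 (one key and one value) × layers × KV heads × head dimension × bytes per element, per token. The sketch below applies it to a hypothetical model shape (80 layers, 8 KV heads, head dimension 128, FP16) chosen only for illustration; these numbers are assumptions, not figures from NVIDIA or the source.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   num_tokens: int, bytes_per_elem: int = 2) -> int:
    """Approximate KV cache size for one sequence.

    The leading factor of 2 accounts for storing both a key and a value
    vector per layer, per KV head, per token. bytes_per_elem=2 assumes FP16/BF16.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_elem


# Hypothetical model shape, for illustration only.
per_token = kv_cache_bytes(80, 8, 128, 1)
total = kv_cache_bytes(80, 8, 128, 128 * 1024)  # a 128K-token context

print(per_token)        # → 327680 (bytes, i.e. 320 KiB per token)
print(total / 2**30)    # → 40.0 (GiB for a single 128K-token sequence)
```

Under these assumptions a single long context consumes tens of GiB, so a handful of concurrent agent sessions can exhaust HBM, which is the pressure an offload tier is meant to relieve.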
Key Points
- Introduces the ICMS G3.5 tier to offload KV cache from GPU HBM to Ethernet-attached flash
- Reduces latency and cost by prestaging context, enabling up to 5x TPS and 5x power efficiency
- Requires topology-aware orchestration and BlueField-4/Spectrum-X integration, altering datacentre layout and capacity planning
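The offload-and-prestage idea in the key points can be illustrated with a toy two-tier cache: a small fast tier standing in for GPU HBM, a large slow tier standing in for the G3.5 flash tier, and a `prestage` call that copies a context into the fast tier before it is requested. The class and method names are hypothetical, and this is not how ICMS or BlueField-4 is actually programmed; it only shows why prestaging takes slow-tier reads off the request path.

```python
from collections import OrderedDict


class TieredKVCache:
    """Hypothetical two-tier KV cache (not an NVIDIA API).

    fast: small, LRU-evicted tier standing in for GPU HBM.
    slow: large tier standing in for Ethernet-attached flash; keeps every entry.
    """

    def __init__(self, fast_capacity: int) -> None:
        self.fast_capacity = fast_capacity
        self.fast = OrderedDict()  # context_id -> KV bytes, oldest first
        self.slow = {}             # context_id -> KV bytes (durable copy)
        self.slow_reads = 0        # slow-tier reads that stalled a request

    def _promote(self, context_id: str, kv: bytes) -> None:
        # Insert or refresh in the fast tier, evicting least-recently-used entries.
        self.fast[context_id] = kv
        self.fast.move_to_end(context_id)
        while len(self.fast) > self.fast_capacity:
            self.fast.popitem(last=False)  # the copy remains in the slow tier

    def put(self, context_id: str, kv: bytes) -> None:
        self.slow[context_id] = kv
        self._promote(context_id, kv)

    def prestage(self, context_id: str) -> None:
        # Copy a context up ahead of time (e.g. before an agent's next turn).
        # This slow-tier read happens off the request path, so it is not counted.
        if context_id not in self.fast and context_id in self.slow:
            self._promote(context_id, self.slow[context_id])

    def get(self, context_id: str) -> bytes:
        if context_id in self.fast:
            self.fast.move_to_end(context_id)
            return self.fast[context_id]
        # Miss: the request waits on a slow-tier read.
        self.slow_reads += 1
        kv = self.slow[context_id]
        self._promote(context_id, kv)
        return kv


if __name__ == "__main__":
    cache = TieredKVCache(fast_capacity=2)
    for cid in ["a", "b", "c"]:
        cache.put(cid, cid.encode())  # fast tier ends up holding b, c
    cache.get("a")                    # cold read from the slow tier; evicts b
    cache.prestage("b")               # bring b back ahead of time; evicts c
    cache.get("b")                    # served from the fast tier
    print(cache.slow_reads)           # → 1
```

With room for two contexts, the cold read of `a` stalls once, while `b` is prestaged and then served without adding to `slow_reads`. A real deployment would also have to decide *when* to prestage, which is where the topology-aware orchestration mentioned above comes in.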
Scoring Rationale
An official NVIDIA platform that adds a novel memory tier with measurable TPS and efficiency gains; planned vendor support makes it practically deployable.