NVIDIA Targets Agentic Inference with Blackwell Ultra

NVIDIA announced the Blackwell Ultra platform, together with the rack-scale GB300 NVL72 and HGX B300 NVL16 systems, in a May 2026 company release; NVIDIA's marketing materials claim up to 50x higher throughput per megawatt and 35x lower cost per token versus the prior Hopper generation (NVIDIA blog and press release). CryptoBriefing and other coverage note partner activity, including VAST Data integrations and cloud deployments by Microsoft, CoreWeave and Oracle Cloud Infrastructure. Editorial analysis: Agentic inference workloads emphasize long-lived context, memory bandwidth and low-latency data access rather than short-lived stateless throughput, shifting the bottlenecks that infrastructure teams must plan for.
What happened
NVIDIA announced the Blackwell Ultra AI factory platform in May 2026, introducing the GB300 NVL72 rack-scale solution and the HGX B300 NVL16 system, per NVIDIA's May 8, 2026 press release and company blog. NVIDIA's materials report that the GB300 NVL72 delivers up to 50x higher throughput per megawatt and 35x lower cost per token for agentic inference workloads compared with the Hopper generation (NVIDIA blog; NVIDIA press release). "We designed Blackwell Ultra for this moment, it's a single versatile platform that can easily and efficiently do pretraining, post-training and reasoning AI inference," said Jensen Huang, founder and CEO of NVIDIA (NVIDIA press release).
CryptoBriefing and NVIDIA blog coverage describe partner activity around the stack, citing integrations with VAST Data and deployments or planned availability from Microsoft, CoreWeave and Oracle Cloud Infrastructure (CryptoBriefing; NVIDIA blog). Developer and vendor materials also show ecosystem components, such as the NVIDIA Dynamo inference framework and enterprise reference architectures, that support rack-scale Blackwell Ultra deployments (NVIDIA developer blog; NVIDIA Enterprise Reference Architecture page).
Technical details
Per NVIDIA's product materials, the GB300 NVL72 connects 72 Blackwell Ultra GPUs in a rack-scale design and pairs them with high-capacity memory subsystems; the vendor states that Blackwell Ultra GPUs feature 288 GB of HBM3e memory per GPU in some configurations and emphasizes test-time scaling for reasoning and long-context inference (NVIDIA press release; NVIDIA resources page). NVIDIA also highlights the HGX B300 NVL16 as offering multiple-fold improvements in inference throughput and memory capacity over the Hopper generation (NVIDIA press release).
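As a rough sanity check on what the per-GPU figure implies at rack scale, the short sketch below multiplies the vendor-stated 288 GB of HBM3e per GPU across the 72 GPUs of a GB300 NVL72. The GPU count and per-GPU capacity come from NVIDIA's cited materials; everything else is back-of-envelope arithmetic.

```python
# Back-of-envelope aggregate memory for a GB300 NVL72 rack.
# The 288 GB/GPU figure is from NVIDIA's materials cited above;
# the rest is simple arithmetic, not a vendor specification.

GPUS_PER_RACK = 72      # GB300 NVL72 rack-scale design
HBM_PER_GPU_GB = 288    # vendor-stated HBM3e capacity per GPU

total_hbm_tib = GPUS_PER_RACK * HBM_PER_GPU_GB / 1024
print(f"Aggregate HBM per rack: {total_hbm_tib:.1f} TiB")
# -> roughly 20 TiB of HBM to hold model weights plus KV cache
```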
The DeepSeek developer blog documents complementary model-side trends: the DeepSeek-V4 family supports context windows of up to 1M tokens and uses mixture-of-experts (MoE) and hybrid attention techniques to reduce per-token FLOPs and KV-cache memory burden, which the developer frames as necessary for long-context and agentic workflows (DeepSeek blog). MLPerf and third-party benchmark commentary referenced in vendor posts show Blackwell-class hardware delivering strong scaling in recent inference suites, per Nebius and MLPerf v6.0 commentary (Nebius blog snippet; MLPerf reporting).
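The DeepSeek blog does not spell out V4's attention internals here, so the sketch below illustrates one generic mechanism by which hybrid or grouped-query attention shrinks the KV cache: reducing the number of KV heads. The model dimensions are hypothetical, chosen only to show the scale of the effect at a 1M-token context.

```python
# Generic illustration of how fewer KV heads shrink the KV cache.
# These model dimensions are hypothetical, NOT DeepSeek-V4's config.

def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # 2x for separate K and V tensors; fp16/bf16 -> 2 bytes per element
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem

full_mha = kv_cache_bytes(layers=64, kv_heads=64, head_dim=128, seq_len=1_000_000)
grouped  = kv_cache_bytes(layers=64, kv_heads=8,  head_dim=128, seq_len=1_000_000)

print(f"Full multi-head KV cache: {full_mha / 1e12:.1f} TB per sequence")
print(f"Grouped-query KV cache:   {grouped / 1e12:.2f} TB per sequence")
# 64 -> 8 KV heads cuts the cache 8x (about 2.1 TB down to 0.26 TB here),
# which is the difference between impossible and feasible at 1M tokens.
```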
Editorial analysis - technical context
Agentic inference workloads combine large, long-context models, persistent or searchable context stores, and low-latency tool integrations. A consistent industry pattern is that as context windows and multi-step execution grow, pressure shifts away from single-GPU raw throughput toward system-level designs that balance compute, KV-cache capacity, memory bandwidth and networked storage performance. Rack-scale GPU assemblies, larger per-GPU HBM capacity, and co-designed inference software are common responses across recent platform releases.
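To make that shift concrete, the sketch below estimates how many concurrent long-context agent sessions a fixed rack-level HBM budget can hold once model weights are resident. Every figure here (weights footprint, per-token KV size, context length) is an illustrative assumption, not a measured or vendor number.

```python
# Illustrative capacity planning: KV cache as the limiting resource.
# All numbers below are assumptions for the sake of the arithmetic.

RACK_HBM_GB     = 72 * 288      # aggregate HBM from the rack sketch above
WEIGHTS_GB      = 1_500         # assumed resident model-weight footprint
KV_PER_TOKEN_MB = 0.26          # assumed KV bytes/token after cache reduction
CONTEXT_TOKENS  = 1_000_000     # one long-context agent session

kv_per_session_gb = KV_PER_TOKEN_MB * CONTEXT_TOKENS / 1024
sessions = (RACK_HBM_GB - WEIGHTS_GB) / kv_per_session_gb
print(f"KV cache per 1M-token session: {kv_per_session_gb:.0f} GB")
print(f"Concurrent sessions per rack:  {sessions:.0f}")
# Under these assumptions, memory capacity (not FLOPs) caps concurrency.
```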
Context and significance
Public reporting frames Blackwell Ultra as a vendor-level response to the transition from conversational generative workloads to autonomous, agentic systems that maintain state across multi-step tasks (CryptoBriefing; NVIDIA blog). For practitioners, this matters because the dominant cost and performance constraints for deployed agents are increasingly about sustained inference at scale, cache management, and low-latency access to external data and tools rather than raw training FLOPs alone.
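Because the vendor figures are relative (50x, 35x), a hedged absolute sketch can help with planning: the snippet below shows how a throughput-per-megawatt figure and an electricity price fold into an energy cost per token. Both inputs are assumptions picked to demonstrate the arithmetic, not vendor or benchmark numbers.

```python
# How throughput-per-megawatt translates into energy cost per token.
# Every input is an assumption chosen only to show the arithmetic.

TOKENS_PER_SEC_PER_MW = 5_000_000   # assumed sustained decode throughput
POWER_PRICE_PER_MWH   = 80.0        # assumed $/MWh electricity price

tokens_per_mwh = TOKENS_PER_SEC_PER_MW * 3600
cost_per_million_tokens = POWER_PRICE_PER_MWH / (tokens_per_mwh / 1e6)
print(f"Energy cost: ${cost_per_million_tokens:.4f} per million tokens")
# A 50x gain in throughput per MW would cut this energy component 50x,
# though total cost per token also includes hardware amortization.
```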
What to watch
- Adoption signals: independent MLPerf v6.0 and cloud provider benchmark results for GB300 NVL72 and HGX B300 NVL16 hardware (MLPerf; Nebius commentary).
- Ecosystem maturity: availability of rack-scale reference architectures and managed GB300-based services from major clouds (NVIDIA Enterprise Reference Architecture; NVIDIA press materials).
- Model-engine co-design: whether more models publish optimizations for KV-cache and hybrid-attention patterns for 1M-token contexts, following examples like DeepSeek-V4 (DeepSeek blog).
Editorial analysis: Observers will also monitor third-party cost-per-token measurements and real-world latency under multi-tool, multi-step agent workloads to validate vendor performance claims at production scale.
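A minimal way to gather such measurements in-house is to time each model call and tool call in an agent loop and report tail latencies. The harness below is a sketch using only the Python standard library; call_model and call_tool are hypothetical stand-ins for whatever model endpoint and tool backend a given stack uses.

```python
# Minimal per-step latency harness for a multi-step agent loop.
# call_model / call_tool are hypothetical stand-ins for a real stack.
import random
import statistics
import time

def timed(fn, *args):
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

def run_episode(call_model, call_tool, steps=8):
    """Run one multi-step episode; return per-step latencies in seconds."""
    latencies = []
    state = "task description"
    for _ in range(steps):
        action, t_model = timed(call_model, state)  # model picks next action
        state, t_tool = timed(call_tool, action)    # tool executes it
        latencies.append(t_model + t_tool)
    return latencies

def report(latencies):
    qs = statistics.quantiles(latencies, n=100)
    print(f"p50={qs[49]*1e3:.1f} ms  p95={qs[94]*1e3:.1f} ms  p99={qs[98]*1e3:.1f} ms")

# Demo with stub callables that sleep to simulate model/tool latency.
fake_model = lambda s: (time.sleep(random.uniform(0.001, 0.010)), "action")[1]
fake_tool  = lambda a: (time.sleep(random.uniform(0.001, 0.005)), "obs")[1]
samples = [t for _ in range(20) for t in run_episode(fake_model, fake_tool)]
report(samples)
```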
Scoring Rationale
This is a major hardware-platform announcement with vendor claims of large performance and cost gains for long-context, agentic inference. It materially affects infrastructure planning for production agents and cloud providers, though independent benchmarks and real-world deployments will determine actual impact.