NVIDIA and Google Cloud Expand AI Hypercomputer Platform

At Google Cloud Next 2026, Google Cloud announced an expansion of its AI Hypercomputer infrastructure, adding TPU 8t and TPU 8i, A5X bare-metal instances powered by NVIDIA Vera Rubin NVL72, Axion N4A VMs, Google Compute Engine 4th-generation VMs, Virgo Network, Google Cloud Managed Lustre, Z4M VMs, and a Dedicated KV Cache, according to a Google Cloud blog post. NVIDIA's blog detailed the A5X (Vera Rubin) systems, saying they can scale to about 80,000 Rubin GPUs in a single site and up to 960,000 Rubin GPUs across multisite clusters, and claiming up to 10x lower inference cost per token and 10x higher token throughput per megawatt versus the prior generation. Google and NVIDIA also previewed Gemini on Google Distributed Cloud, the Gemini Enterprise Agent Platform, confidential VMs with NVIDIA Blackwell GPUs, and support for NVIDIA Nemotron and NeMo. A Google press release and PR Newswire report that Thinking Machines Lab will expand its use of the AI Hypercomputer with A4X Max VMs backed by NVIDIA GB300 NVL72, with roughly 2x training and serving speed reported in early tests.
What happened
Google Cloud announced a major expansion of its AI Hypercomputer infrastructure at Google Cloud Next 2026, outlining both new Google-designed hardware and broader integrations with NVIDIA accelerators, per a Google Cloud blog post dated April 22, 2026. The Google Cloud post lists new capacity and components including TPU 8t and TPU 8i, A5X bare-metal instances powered by NVIDIA Vera Rubin NVL72, Axion N4A VMs, Google Compute Engine 4th-generation VMs, Virgo Network, Google Cloud Managed Lustre, Z4M VMs, and a Dedicated KV Cache.
NVIDIA's corporate blog described the same announcements and provided additional scale and performance claims for the Vera Rubin systems: NVIDIA stated the A5X architecture can scale to roughly 80,000 Rubin GPUs in a single-site cluster and up to 960,000 Rubin GPUs across multisite clusters, and that A5X delivers up to 10x lower inference cost per token and 10x higher token throughput per megawatt versus the prior generation (NVIDIA blog, Apr 22, 2026). The NVIDIA post also previewed Gemini on Google Distributed Cloud, the Gemini Enterprise Agent Platform, confidential VMs using NVIDIA Blackwell GPUs, and support for NVIDIA microservices and model tooling such as Nemotron and NeMo.
Thinking Machines Lab announced an expanded agreement to use Google Cloud's AI Hypercomputer, saying it will run on A4X Max VMs with NVIDIA GB300 NVL72; a Google press release and PR Newswire state TML observed about 2x faster training and serving in early tests compared with prior-generation GPUs. The press materials quote Myle Ott, Founding Researcher at Thinking Machines Lab: "By leveraging A4X Max and the AI Hypercomputer integrated stack, Google Cloud got us running at record speed with the reliability we demand." Google Cloud's Mark Lohmeyer is also quoted in the company's materials (Google Cloud blog; PR Newswire).
Editorial analysis - technical context
The announcements combine three technical trends that practitioners should note: co-designed hardware-software stacks, hyperscale networking, and integrated on-prem/cloud continuity. The first appears in Google and NVIDIA language around rack-scale Vera Rubin systems and custom instances such as A5X and A4X Max, which the vendors frame as reducing token-level inference cost through tighter hardware-software codesign (Google Cloud blog; NVIDIA blog). The second is hyperscale fabric: Google highlights Virgo Network and Jupiter-like weight-transfer networking to enable near-instantaneous model-weight movement, which is critical for large-model distributed training and reinforcement-learning workloads (Google Cloud blog; PR Newswire). The third shows up in Gemini on Google Distributed Cloud, which extends the same models and runtimes to customer-controlled environments (NVIDIA blog).
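The two efficiency metrics the vendors cite are simple ratios, and it is worth being precise about how they are derived. A minimal sketch of the arithmetic follows; every number in it is a hypothetical placeholder, not a published Google or NVIDIA figure:

```python
# Back-of-envelope model for the two metrics cited in the announcements:
# inference cost per token and token throughput per megawatt.
# All inputs below are illustrative placeholders, not vendor data.

def cost_per_token(hourly_instance_cost_usd: float, tokens_per_second: float) -> float:
    """USD per generated token for one instance at sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_instance_cost_usd / tokens_per_hour

def tokens_per_megawatt(tokens_per_second: float, power_draw_kw: float) -> float:
    """Sustained tokens/s normalized to 1 MW of rack power."""
    return tokens_per_second * (1000.0 / power_draw_kw)

# Hypothetical comparison: a new generation that serves 10x the tokens
# at the same instance cost and power draw as the prior generation.
prev = dict(cost=98.32, tps=12_000, kw=120.0)
new = dict(cost=98.32, tps=120_000, kw=120.0)

print(f"prev cost/token: ${cost_per_token(prev['cost'], prev['tps']):.2e}")
print(f"new  cost/token: ${cost_per_token(new['cost'], new['tps']):.2e}")
print(f"prev tokens/s per MW: {tokens_per_megawatt(prev['tps'], prev['kw']):,.0f}")
print(f"new  tokens/s per MW: {tokens_per_megawatt(new['tps'], new['kw']):,.0f}")
```

The point of the sketch is that "10x lower cost per token" and "10x higher throughput per megawatt" collapse to the same underlying claim when price and power are held constant; real deployments should measure all three inputs for their own workloads.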
A broader industry pattern is visible here: companies building at the frontier increasingly package fast local storage, RDMA-capable VMs, and dedicated caching subsystems alongside accelerators so that communication and I/O do not become bottlenecks. The public accounts emphasize Google Cloud Managed Lustre, Z4M local-SSD VMs, and the Dedicated KV Cache as pieces of that pattern (Google Cloud blog).
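To see why a dedicated KV-cache tier matters at all, consider the memory footprint of cached attention keys and values during long-context serving. A minimal sizing sketch for a hypothetical 70B-class model with grouped-query attention; the shape parameters are illustrative assumptions, not specs from any system in the announcements:

```python
# Rough KV-cache sizing for a decoder-only transformer. The model shape is
# a hypothetical 70B-class configuration, chosen only for illustration.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, dtype_bytes: int = 2) -> int:
    # Factor of 2 accounts for both the key and the value tensor
    # cached at every layer for every token in the context.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

gib = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128,
                     seq_len=128_000, batch=32) / 2**30
print(f"KV cache for this shape: {gib:,.0f} GiB")  # ~1,250 GiB at FP16
```

At FP16 this shape needs roughly 1.25 TiB of KV cache for a 32-request batch at 128k context, far beyond a single accelerator's HBM, which is why offload and dedicated cache tiers keep appearing in these stacks.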
Context and significance
For large-model research and production teams, the announcements formalize another hyperscaler push to offer end-to-end "AI factory" stacks that include specialized accelerators, networking, storage, orchestration, and agent runtimes. SiliconANGLE characterizes the partnership language as extending a decade-long coengineering effort between NVIDIA and Google Cloud that began with earlier GPU generations and now targets "agentic" and "physical AI" workloads (SiliconANGLE, Apr 23, 2026). The Thinking Machines Lab agreement provides a concrete customer example and an early performance data point (PR Newswire; Google Cloud press corner).
For practitioners evaluating options, the new instance families and network fabric matter most when workload economics hinge on inference cost per token, or when reinforcement learning and continuous training require very fast weight transfers. The vendors' public claims about 10x improvements and multisite GPU scaling are high-impact performance assertions sourced to NVIDIA and Google materials; they should be validated with independent benchmarks and workload-specific tests before making operational commitments, along the lines of the harness sketched below.
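As a starting point for that validation, here is a minimal timing harness of the kind one might run against any serving endpoint. The `generate` callable is a hypothetical stand-in for whatever client call a given stack exposes, and the hourly cost figure is a placeholder, not a quoted price:

```python
# Workload-specific validation sketch: measure your own prompts and derive
# cost per token directly, rather than relying on vendor-published ratios.

import time

def measure(generate, prompts, hourly_cost_usd):
    """Return (tokens/sec, USD/token) for a batch of representative prompts.

    `generate` is assumed to run one request and return the number of
    tokens it produced (a stand-in, not a real client API).
    """
    total_tokens = 0
    start = time.perf_counter()
    for prompt in prompts:
        total_tokens += generate(prompt)
    elapsed = time.perf_counter() - start
    tps = total_tokens / elapsed
    usd_per_token = hourly_cost_usd / (tps * 3600)
    return tps, usd_per_token

# Usage with a dummy generator, just to show the shape of the result:
fake_generate = lambda prompt: 256  # pretend every request yields 256 tokens
print(measure(fake_generate, ["sample prompt"] * 20, hourly_cost_usd=98.32))
```

Running this against two instance generations with the same representative prompt mix yields a directly comparable cost-per-token ratio for your workload, which can then be set against the vendors' 10x figure.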
What to watch
- Availability and pricing timelines for A5X and A4X Max instances across regions and as bare-metal offerings (Google Cloud blog; NVIDIA blog).
- Third-party benchmark results that confirm the 10x inference-cost and throughput claims NVIDIA published (NVIDIA blog).
- Additional customer case studies beyond Thinking Machines Lab showing end-to-end operational cost and reliability at scale (PR Newswire).
- How Gemini Enterprise Agent Platform integrations and NVIDIA microservices (Nemotron, NeMo) are packaged for hybrid and edge deployments via Google Distributed Cloud (Google and NVIDIA posts).
Scoring Rationale
The announcements materially expand hyperscaler infrastructure for large-model training and agentic workloads, with vendor claims of large efficiency and scale gains and a named customer (Thinking Machines Lab) reporting roughly 2x performance. This matters for practitioners planning frontier training and deployment, but the vendor claims require independent validation.