Google Expands A5X Infrastructure With NVIDIA Vera Rubin GPUs

Google and NVIDIA announced A5X bare-metal instances powered by NVIDIA Vera Rubin NVL72 rack-scale systems, according to Google Cloud and NVIDIA blog posts published at Google Cloud Next 2026. Per those posts, A5X pairs ConnectX-9 NICs with Google's Virgo data-center fabric to support up to 80,000 Rubin GPUs in a single-site cluster and up to 960,000 Rubin GPUs across multisite clusters. NVIDIA and Google claim the A5X platform delivers 10x lower inference cost per token and 10x higher token throughput per megawatt than the prior generation. The announcement also ties A5X into Google's AI Hypercomputer stack and NVIDIA and Google Cloud tooling for agentic and physical AI, including integrations with Gemini and NVIDIA's model and runtime ecosystem.
What happened
Google Cloud announced new AI infrastructure at Google Cloud Next 2026, including A5X bare-metal instances built on NVIDIA Vera Rubin NVL72 rack-scale systems, in a blog post authored by Amin Vahdat and Mark Lohmeyer on April 22, 2026. According to Google's post and NVIDIA's corporate blog, the A5X platform will use ConnectX-9 network adapters and Google's Virgo networking fabric to scale to 80,000 Rubin GPUs within a single-site cluster and up to 960,000 Rubin GPUs across multisite clusters.
Technical details
Editorial analysis - technical context: The announcements describe a rack-scale design (NVL72) combined with high-performance NICs (ConnectX-9) and a purpose-built fabric (Virgo). Industry reporting from Wccftech, SiliconAngle, and Google Cloud's own blog emphasizes that these elements are intended to enable very large, federated GPU clusters for agentic and physical AI workloads. From a practitioner perspective, the move to rack-scale GPU systems plus high-speed network accelerators follows a broader trend: denser hardware packaging and richer network fabrics are used to drive down per-token inference cost and raise throughput at megawatt scale.
Context and significance
Industry context
Google and NVIDIA have collaborated on accelerated cloud infrastructure for roughly a decade, a partnership documented in NVIDIA and Google Cloud posts and summarized by SiliconAngle. The combined stack ties A5X into Google's AI Hypercomputer portfolio and into NVIDIA's model/runtime ecosystem, including references to Gemini, NeMo, and enterprise agent platforms in NVIDIA's blog. For enterprises and ML engineering teams, the core significance is that hyperscale cloud providers are offering configurations that promise substantially lower inference energy and cost at extreme scale, while also bundling ecosystem software to support agentic workflows.
What to watch
For practitioners: availability, pricing, and regional limits will be critical to adoption. Public posts provide the headline scale numbers but do not enumerate per-customer quotas, pricing, or early-access performance benchmarks. Observers should watch for published throughput and latency microbenchmarks on common agentic workloads, the availability of software primitives for multi-site model parallelism, and how orchestration layers (Vertex AI, GKE, DGX Cloud) integrate ConnectX-9 and Virgo networking features. Also monitor announcements on confidential VM support with Blackwell GPUs and the rollout schedule for A5X in Google Cloud regions, which Google's product posts indicate but do not finalize.
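Until vendors publish throughput microbenchmarks, teams can stand up a minimal harness of their own. The sketch below is a generic timing loop, not anything A5X-specific: it times a stubbed `generate` callable (a placeholder for a real model endpoint, whose name and return convention are assumptions here) and reports sustained tokens per second.

```python
import time
from typing import Callable

def measure_tokens_per_sec(generate: Callable[[int], int],
                           batch_size: int,
                           warmup: int = 2,
                           iters: int = 10) -> float:
    """Time repeated calls to generate(batch_size), which returns the
    number of tokens produced, and report sustained tokens/sec."""
    for _ in range(warmup):          # warm caches/compilation before timing
        generate(batch_size)
    start = time.perf_counter()
    total_tokens = 0
    for _ in range(iters):
        total_tokens += generate(batch_size)
    elapsed = time.perf_counter() - start
    return total_tokens / elapsed

# Stub standing in for a real inference call: pretends each request
# in the batch yields 128 tokens.
def fake_generate(batch_size: int) -> int:
    return batch_size * 128

rate = measure_tokens_per_sec(fake_generate, batch_size=8)
print(f"{rate:.0f} tokens/sec")
```

In practice the stub would be replaced by a call into the serving stack under test, and the same harness run across batch sizes and sequence lengths to expose the workload sensitivity the public posts do not cover.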
Reported claims and sources
- The scale figures and efficiency claims are stated in Google Cloud's April 22, 2026 blog post by Amin Vahdat and Mark Lohmeyer and in NVIDIA's corporate blog summary from April 22, 2026. (Google Cloud blog; NVIDIA blog.)
- Independent coverage summarizing these claims appears in SiliconAngle, Wccftech, Seeking Alpha, and other trade outlets that covered Google Cloud Next 2026.
Editorial analysis: While vendor claims highlight 10x improvements in cost and throughput per megawatt, implementing and validating those gains for specific workloads will depend on real-world factors such as model parallelism strategy, batch sizing, communication patterns, and end-to-end software stack efficiency. Teams planning to target extreme-scale agentic inference should prioritize end-to-end testing on representative pipelines and confirm integration of orchestration and fault-tolerance features across multi-site deployments.
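As a sanity check on vendor-style "per megawatt" and "per token" metrics, the following back-of-envelope sketch shows how the two headline figures decompose. Every number in it is an illustrative assumption, not a published A5X or prior-generation specification.

```python
# Back-of-envelope decomposition of "tokens per megawatt" and
# "cost per token". All inputs are illustrative assumptions.

def tokens_per_sec_per_mw(tokens_per_sec_per_gpu: float,
                          watts_per_gpu: float) -> float:
    """Sustained tokens/sec delivered per megawatt of GPU power draw."""
    gpus_per_mw = 1_000_000 / watts_per_gpu
    return tokens_per_sec_per_gpu * gpus_per_mw

def cost_per_million_tokens(tokens_per_sec_per_gpu: float,
                            dollars_per_gpu_hour: float) -> float:
    """USD per one million generated tokens at sustained throughput."""
    tokens_per_hour = tokens_per_sec_per_gpu * 3600
    return dollars_per_gpu_hour / tokens_per_hour * 1_000_000

# Assumed baseline vs. a hypothetical 10x-throughput successor at the
# same per-GPU power draw (placeholder figures, not vendor data).
base = tokens_per_sec_per_mw(tokens_per_sec_per_gpu=500, watts_per_gpu=1000)
new = tokens_per_sec_per_mw(tokens_per_sec_per_gpu=5000, watts_per_gpu=1000)
print(new / base)  # 10.0 by construction
```

The point of the exercise is that the claimed 10x only materializes if per-GPU throughput on the team's actual model, batch size, and parallelism layout scales accordingly; measuring `tokens_per_sec_per_gpu` on a representative pipeline is the load-bearing step.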
Scoring Rationale
This is a notable infrastructure advance: near-million-GPU multisite scaling materially raises the ceiling for agentic and physical AI deployments. The story affects capacity planning and cost models for large inference workloads, though practical impact depends on availability, pricing, and real-world benchmarks.