Google Unveils A5X Scaling to 960,000 GPUs

Google Cloud and NVIDIA unveiled the `A5X` AI infrastructure, a rack-scale offering built on NVIDIA Vera Rubin NVL72 systems, ConnectX-9 SuperNICs, and Google networking to deliver up to 10x higher token throughput per megawatt and 10x lower inference cost per token versus the prior generation. A5X supports clusters of up to 80,000 Rubin GPUs at a single site and federated multi-site deployments of up to 960,000 Rubin GPUs. The launch also previews Google Gemini running on Blackwell and Blackwell Ultra GPUs in Google Distributed Cloud, and introduces confidential VMs with Blackwell hardware for encrypted inference. The stack targets agentic and physical AI workloads, enabling large-scale training, ultra-low-cost inference, and on-premises or sensitive-data deployments for enterprises building AI factories, robots, and digital twins.
What happened
Google Cloud and NVIDIA announced their next-generation AI infrastructure offering, `A5X`, a rack-scale system built on `Vera Rubin NVL72` hardware, ConnectX-9 SuperNICs, and Google networking. The codesigned stack claims up to 10x higher token throughput per megawatt and 10x lower inference cost per token compared to the previous generation. A5X scales to 80,000 Rubin GPUs in a single-site cluster and to 960,000 Rubin GPUs across multi-site clusters, and the launch previews `Gemini` running on Blackwell and Blackwell Ultra GPUs within Google Distributed Cloud, along with confidential VMs on Blackwell hardware for encrypted inference.
Technical details
The announcement emphasizes extreme codesign across chips, systems, and software to reach the stated efficiency gains. Key components include:
- `Vera Rubin NVL72` rack-scale systems providing the GPU building blocks
- ConnectX-9 SuperNICs paired with Google's next-generation Virgo networking fabric for high-throughput, low-latency interconnect
- Multi-site orchestration that federates up to 960,000 Rubin GPUs, and single-site clusters up to 80,000 GPUs (see the sketch after this list)
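
The two published maxima imply a federation factor of 12, assuming the multi-site figure is reached by pooling full single-site clusters. The announcement does not state a site count, so the sketch below is back-of-the-envelope arithmetic, not a disclosed topology:

```python
# Back-of-the-envelope check on the published scaling figures.
# The 12-site federation factor is inferred, not announced.
SINGLE_SITE_GPUS = 80_000   # max Rubin GPUs per site (announced)
MULTI_SITE_GPUS = 960_000   # max federated Rubin GPUs (announced)

sites_needed = MULTI_SITE_GPUS // SINGLE_SITE_GPUS
print(f"Implied federation factor: {sites_needed} full-size sites")
# -> Implied federation factor: 12 full-size sites
```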
What practitioners need to know
The 10x claims target token throughput per megawatt and inference cost per token; those metrics matter for inference economics and for running long-horizon agentic workloads. Running Gemini on Blackwell hardware inside Google Distributed Cloud and confidential VMs gives enterprises a path to deploy frontier models close to sensitive data while retaining hardware-level confidentiality. The stack integrates with agent frameworks and with NVIDIA offerings such as the Nemotron model family and the NeMo framework for model development and optimization.
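
To see why throughput per megawatt drives cost per token, a toy model helps. All inputs below (power price, baseline throughput) are illustrative assumptions, not figures from the announcement; only the 10x multiplier comes from the claim:

```python
# Toy inference-economics model: power cost per token falls linearly
# as token throughput per megawatt rises. All inputs are hypothetical.
POWER_PRICE_USD_PER_MWH = 80.0   # assumed electricity price
BASELINE_TOKENS_PER_MWH = 2e9    # assumed prior-gen throughput
CLAIMED_SPEEDUP = 10.0           # the announced 10x figure

def power_cost_per_million_tokens(tokens_per_mwh: float) -> float:
    """Electricity cost (USD) to generate one million tokens."""
    return POWER_PRICE_USD_PER_MWH / tokens_per_mwh * 1e6

prev = power_cost_per_million_tokens(BASELINE_TOKENS_PER_MWH)
new = power_cost_per_million_tokens(BASELINE_TOKENS_PER_MWH * CLAIMED_SPEEDUP)
print(f"prior gen: ${prev:.4f} per 1M tokens (power only)")
print(f"A5X claim: ${new:.4f} per 1M tokens (power only)")
# Power is only one input to cost per token; the 10x cost claim
# presumably also folds in utilization, capex, and software gains.
```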
Context and significance
This is a clear infrastructure play to make hyperscale model training and inference more economical and to enable "AI factories": production deployments that coordinate agents, digital twins, and robots in industrial settings. The multi-site pooling number, 960,000 GPUs, is notable because it signals both ambition and an engineering bet on wide-area GPU fabric, global orchestration, and power-efficient datacenter interconnect. For cloud users, lower inference cost per token can change the unit economics of large-scale chatbots, real-time agents, and robotics workloads. For competitors, the combination of confidential Blackwell VMs and on-premises Distributed Cloud support raises the bar for private, regulated deployments.
Tradeoffs and open questions
Scaling to near-million-GPU federations is technically difficult. Key operational constraints include cross-site latency, model parallel synchronization, failure modes at scale, and realistic utilization that delivers the claimed cost savings. Details on pricing, region availability, SLAs, scheduler semantics for multi-site jobs, and support for open-source model runtimes were not disclosed in full. The confidential Blackwell capability is strategically important for regulated industries, but auditors and security teams will ask for attestation details and threat models.
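
One way to make the cross-site constraint concrete is to estimate gradient-synchronization time over a wide-area link. The bandwidth, latency, and model-size figures below are illustrative assumptions chosen to show the shape of the problem, not measurements of Google's fabric:

```python
# Rough estimate of per-step cross-site gradient sync time for
# synchronous data parallelism over a WAN link. All inputs hypothetical.
PARAMS = 2e12              # assumed 2T-parameter model
BYTES_PER_GRAD = 2         # bf16 gradients
WAN_BANDWIDTH_BPS = 10e12  # assumed 10 Tbit/s cross-site link
WAN_RTT_S = 0.020          # assumed 20 ms round trip

payload_bits = PARAMS * BYTES_PER_GRAD * 8
sync_time_s = payload_bits / WAN_BANDWIDTH_BPS + WAN_RTT_S
print(f"~{sync_time_s:.2f} s per step just to move gradients cross-site")
# With steps measured in seconds, naive synchronous training across
# sites stalls on the WAN; hierarchical or asynchronous schemes are
# needed, which is why scheduler semantics for multi-site jobs matter.
```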
What to watch
Watch for availability dates, pricing and instance types for A5X bare metal, the software toolchain Google exposes for cross-site orchestration, and benchmarks on real-world agentic workloads. Also watch how competitors respond on multi-site scaling and confidential GPU offerings.
Bottom line
A5X is an infrastructure-first push to make very large training and inference workloads cheaper and to support sensitive, distributed deployments of frontier models. For teams building agentic systems, robotics, or large-scale inference services, the combination of codesigned hardware, high-performance networking, and confidential GPU options materially changes the deployment choices available today.
Scoring Rationale
This is a major infrastructure announcement that materially affects cost and scale for training and inference, especially for agentic and physical AI. The near-million-GPU federation and confidential Blackwell support are industry-significant, though the outcome depends on availability, pricing, and operational practicality.


