Google Releases Eighth-Generation TPU Chips for Agentic AI

Google unveils its eighth-generation Tensor Processing Units, split into a training-focused TPU 8t and an inference-optimized TPU 8i. The new systems integrate Google's custom Arm-based Axion host CPUs and the Virgo Network fabric to remove host and interconnect bottlenecks. TPU 8t scales to superpods of 9,600 chips delivering 121 ExaFlops and 2 PB of shared high-bandwidth memory for frontier-model training. TPU 8i is designed for low-latency, multi-agent inference with 384 MB of on-chip SRAM, doubled interconnect bandwidth, and a Boardfly topology that halves network diameter. Both chips target Google's Gemini models and will be accessible via bare-metal and cloud offerings, with native support for JAX, PyTorch, and vLLM. Availability is expected later this year.
What happened
Google announced its eighth-generation Tensor Processing Units, introducing two purpose-built chips: `TPU 8t` for large-scale training and `TPU 8i` for low-latency inference. The systems integrate Google's custom Arm-based Axion host CPUs and the Virgo Network fabric to address host and network bottlenecks. A single TPU 8t superpod now scales to 9,600 chips, providing 121 ExaFlops of compute and 2 PB of shared high-bandwidth memory. The TPU 8i brings 384 MB of on-chip SRAM, interconnect bandwidth doubled to 19.2 Tb/s, and a Boardfly topology that reduces network diameter by over 50%.
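As a rough sanity check on the headline training numbers, the per-chip figures implied by the pod-level specs follow from simple arithmetic; these are back-of-the-envelope estimates, not official per-chip specifications:

```python
# Derive approximate per-chip figures from the announced pod-level specs.
# These are inferred estimates, not official per-chip specifications.
pod_chips = 9_600
pod_exaflops = 121        # ExaFlops across a full superpod
pod_hbm_pb = 2            # PB of shared high-bandwidth memory

flops_per_chip_pf = pod_exaflops * 1_000 / pod_chips    # ExaFlops -> PetaFlops
hbm_per_chip_gb = pod_hbm_pb * 1_000_000 / pod_chips    # PB -> GB (decimal)

print(f"~{flops_per_chip_pf:.1f} PFLOPs and ~{hbm_per_chip_gb:.0f} GB HBM per chip")
# -> ~12.6 PFLOPs and ~208 GB HBM per chip
```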
Technical details
The two-chip split reflects a design choice to specialize hardware by operational phase. TPU 8t emphasizes pod-level scale, near-linear scaling via TPUDirect and the Virgo fabric, and faster storage access to shorten model-development cycles. TPU 8i prioritizes in-memory working sets and latency, keeping active agent state on-chip to avoid remote-memory hits during multi-agent coordination. Google exposes bare-metal access and native software support for JAX, PyTorch, and vLLM, enabling direct hardware control and optimized runtimes. Key hardware and system features include (a minimal sharding sketch follows the list):
- High-scale training: superpods up to 9,600 chips, 121 ExaFlops, 2 PB shared HBM
- Low-latency inference: 384 MB on-chip SRAM, 19.2 Tb/s interconnect, Boardfly topology
- Host-level improvements: integrated Axion Arm CPUs to eliminate data-prep stalls
- Network and I/O: Virgo Network fabric, TPUDirect for 10x faster storage access, near-linear scaling toward large logical clusters
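Since JAX is supported natively, pod-scale data parallelism should map onto the hardware through JAX's standard sharding APIs. The sketch below is generic, current JAX rather than anything TPU 8-specific; the mesh shape and array sizes are illustrative assumptions:

```python
# Generic JAX sharding sketch: data-parallel matmul across visible accelerators.
# Nothing here is TPU 8-specific; the 1-D mesh and sizes are illustrative.
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = np.array(jax.devices())            # all chips visible to this process
mesh = Mesh(devices, axis_names=("data",))   # 1-D mesh: pure data parallelism

# Shard the batch dimension of the activations across the "data" axis;
# replicate the weights on every device.
x_sharding = NamedSharding(mesh, P("data", None))
w_sharding = NamedSharding(mesh, P(None, None))

x = jax.device_put(jnp.ones((8_192, 4_096)), x_sharding)
w = jax.device_put(jnp.ones((4_096, 4_096)), w_sharding)

@jax.jit
def forward(x, w):
    # jit compiles a single program that runs on every device in the mesh;
    # any needed collective communication is inserted by the compiler.
    return jnp.dot(x, w)

y = forward(x, w)
print(y.sharding)   # output stays sharded along the batch dimension
```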
Context and significance
This launch signals Google's push to align hardware tightly with the emerging agentic AI workload class, where fleets of specialized models preserve state, coordinate, and perform long-horizon reasoning. By separating training and inference silicon and integrating first-party host CPUs, Google is optimizing for the distinct memory and latency profiles that make agentic systems expensive on generic GPUs. The move also strengthens Google Cloud's competitive positioning against GPU incumbents; TPU 8i is explicitly framed as a first-class inference alternative, and the broader AI Hypercomputer stack (Axion VMs, Virgo fabric, managed parallel file systems) reduces end-to-end operational friction for enterprises running thousands of agents.
Why it matters for practitioners
Expect meaningful changes to cost-performance tradeoffs for large-model training and production agent deployments. Teams that can leverage bare-metal TPU 8t superpods will see faster iteration on frontier models, while latency-sensitive services can benefit from TPU 8i's on-chip working-set strategy. The promise of native support for common ML frameworks lowers the integration burden, but third-party benchmarks and real-world TCO analyses will be essential to validate claims.
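To illustrate what "native support" would mean in practice, here is what serving looks like with stock vLLM today. The API shown is current vLLM; whether this exact code runs unchanged on TPU 8i is an assumption, and the model name is a placeholder:

```python
# Standard vLLM serving sketch. The API shown is stock vLLM today;
# running it unchanged on TPU 8i is an assumption, and the model id
# below is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-2-9b-it")              # placeholder model id
params = SamplingParams(temperature=0.7, max_tokens=256)

outputs = llm.generate(["Summarize the TPU 8 announcement."], params)
print(outputs[0].outputs[0].text)
```

If the "native support" claim holds, the integration cost for existing vLLM deployments would reduce to a hardware target change rather than a code rewrite.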
What to watch
Availability and pricing later this year, the extent of bare-metal and cloud instance SKUs, independent benchmark results versus contemporary GPU offerings, and how quickly major model providers and open-source runtimes optimize for the new interconnect and memory models. Also watch Nvidia and other silicon vendors for competing inference-optimized hardware or pricing adjustments.
Scoring Rationale
This is a major infrastructure announcement with direct implications for training and inference economics, especially for agentic workloads. It strengthens Google Cloud's competitive position and may force GPU vendors to respond. The score reflects high practitioner relevance without reaching paradigm-shifting levels.