Google's TPU 8t and 8i: Training Powerhouse vs. Inference Specialist

Google unveiled its eighth-generation TPU line at Cloud Next 2026, introducing the training-focused TPU 8t with a 2.7× performance boost and massive scaling, and the inference-optimized TPU 8i with three times the on-chip SRAM and a roughly 80% improvement in performance per dollar for agentic AI workloads, while positioning the chips as a complement to Nvidia's offerings rather than a replacement.

Machine Heart

At Google Cloud Next 2026, Google announced the eighth-generation Tensor Processing Unit (TPU) family, splitting the product line into two dedicated chips: TPU 8t for large-scale model training and TPU 8i for latency-sensitive inference.

TPU 8t – training engine

Performance gain: Compared with the previous‑generation Ironwood TPU, TPU 8t delivers a 2.7× increase in raw throughput.

Scale: A single TPU 8t super-node can expand to 9,600 chips and 2 PB of shared high-bandwidth memory, providing 121 exaFLOPS of compute power.
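
The announced totals imply some rough per-chip figures; a quick back-of-envelope check (derived from the article's pod numbers, not official per-chip specs):

```python
# Per-chip figures implied by the announced pod totals (derived, not official).
chips = 9_600
shared_hbm_gb = 2_000_000   # 2 PB of pooled high-bandwidth memory, in GB
pod_petaflops = 121_000     # 121 exaFLOPS, in PFLOPS

print(f"~{shared_hbm_gb / chips:.0f} GB HBM per chip")   # ~208 GB
print(f"~{pod_petaflops / chips:.1f} PFLOPS per chip")   # ~12.6 PFLOPS
```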

Utilization: Integrated 10× faster storage access and TPUDirect data loading improve end‑to‑end system utilization.
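
The article does not describe the TPUDirect API itself, but the underlying idea, overlapping host-side data loading with accelerator compute so the chips are never starved, can be sketched generically in Python:

```python
import queue
import threading

def prefetch(batches, depth=2):
    """Overlap data loading with compute by filling a small buffer from a
    background thread (a generic pipelining sketch, not the TPUDirect API)."""
    buf = queue.Queue(maxsize=depth)
    sentinel = object()

    def producer():
        for b in batches:
            buf.put(b)          # blocks when the buffer is full
        buf.put(sentinel)

    threading.Thread(target=producer, daemon=True).start()
    while (item := buf.get()) is not sentinel:
        yield item

# for batch in prefetch(data_loader()):  # compute runs while the next
#     train_step(batch)                  # batch is still being loaded
```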

Near‑linear scaling: The new Virg network combined with JAX and Pathways enables near‑linear scaling to millions of chips.
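
As a minimal illustration of the programming model, ordinary JAX sharding spreads a jit-compiled step across however many devices are visible; Pathways-scale, multi-pod orchestration is beyond what a snippet can show:

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Shard a batch across all visible devices; jit-compiled code then runs
# one shard per device, the mechanism JAX uses to scale out.
devices = np.array(jax.devices())
mesh = Mesh(devices, axis_names=("data",))
batch = jax.device_put(
    jnp.ones((len(devices) * 8, 1024)),
    NamedSharding(mesh, P("data", None)),
)

@jax.jit
def step(x):
    return jnp.tanh(x @ x.T)  # stand-in for a real training step

out = step(batch)  # runs in parallel across the mesh
```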

Architecture: A dedicated SparseCore accelerates irregular embedding lookups; overlapped VPU/MXU execution balances vector and matrix operations; native FP4 support reduces memory-bandwidth pressure while preserving model accuracy.
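
For context, an "irregular embedding lookup" is a data-dependent gather; a toy JAX version of the operation SparseCore-class hardware accelerates (the table size and ids here are made up):

```python
import jax.numpy as jnp

vocab, dim = 50_000, 256
table = jnp.ones((vocab, dim))            # embedding table
ids = jnp.array([3, 41_927, 17, 3])       # arbitrary, possibly repeated rows
vectors = jnp.take(table, ids, axis=0)    # data-dependent gather, shape (4, 256)
```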

Reliability: A "good utilization" metric above 97%, backed by real-time telemetry with automatic fault detection and redirection.
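
Google has not published the exact formula behind the metric, but "good utilization" is commonly understood as goodput: the fraction of wall-clock time spent on useful work. An illustrative calculation:

```python
def good_utilization(productive_s: float, total_s: float) -> float:
    # Fraction of wall-clock time spent on useful training work
    # (an illustrative definition, not Google's published formula).
    return productive_s / total_s

day = 24 * 3600
print(f"{good_utilization(day - 1800, day):.1%}")  # 30 min lost to faults -> 97.9%
```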

TPU 8i – inference engine

Memory-wall breakthrough: 384 MB of on-chip SRAM (three times the previous generation) and 288 GB of high-bandwidth memory keep the active working set on silicon, cutting latency for large-context decoding.
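
A rough sizing exercise shows why this hierarchy matters for large-context decoding. With hypothetical model dimensions (none of these come from the article), the KV cache for one long sequence dwarfs 384 MB of SRAM but fits comfortably in 288 GB of HBM:

```python
# KV-cache size for one sequence; every parameter here is hypothetical.
layers, kv_heads, head_dim = 48, 8, 128
bytes_per_value = 1          # fp8-style cache
context_tokens = 128_000

kv_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value * context_tokens
print(f"{kv_bytes / 1e9:.1f} GB per sequence")  # ~12.6 GB: HBM-resident,
                                                # with hot tiles staged in SRAM
```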

Axion-based CPUs: Custom Arm-based Axion CPUs double the number of physical hosts per server and use NUMA isolation for predictable performance.
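
NUMA isolation itself is a standard OS-level technique: keep a workload's threads and memory on one node so it never pays cross-node latency. A generic Linux sketch (the CPU range is a made-up example):

```python
import os

# Pin this process to the CPUs of one NUMA node so its memory
# allocations stay node-local (CPU ids 0-35 are a made-up example).
os.sched_setaffinity(0, set(range(36)))  # 0 = current process
```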

Boardfly network: Inter‑chip bandwidth raised to 19.2 Tb/s; network diameter reduced by >50%, enabling unified low‑latency operation for massive Mixture‑of‑Experts (MoE) models.
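
The bandwidth and diameter numbers matter most for expert parallelism, where every decode step shuffles tokens between devices with an all-to-all. A toy JAX sketch of that routing step (gating, capacity limits, and the physical network are all omitted):

```python
import jax
import jax.numpy as jnp
from functools import partial

n = jax.local_device_count()

@partial(jax.pmap, axis_name="ep")
def route(tokens):
    # tokens per device: (n, d), where row i is destined for device i's expert.
    return jax.lax.all_to_all(tokens, "ep", split_axis=0, concat_axis=0)

x = jnp.arange(float(n * n * 4)).reshape(n, n, 4)
y = route(x)  # each device now holds the rows routed to its expert
```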

Collective Acceleration Engine (CAE): Offloads global collective operations, cutting their latency by up to 5× and directly increasing throughput for millions of concurrent agents.
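
The article does not detail the CAE interface; in software terms, the operations being offloaded are ordinary collectives, such as a cross-chip reduction in JAX:

```python
import jax
import jax.numpy as jnp
from functools import partial

# A cross-device reduction of the kind a collective engine offloads.
@partial(jax.pmap, axis_name="chips")
def global_mean(x):
    return jax.lax.pmean(x, axis_name="chips")

print(global_mean(jnp.arange(float(jax.local_device_count()))))
```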

Cost efficiency: Performance per dollar improves by roughly 80%, allowing providers to serve more customers at the same cost.
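
To make the per-dollar claim concrete (with an arbitrary baseline price):

```python
# An ~80% perf-per-dollar gain, applied to an arbitrary $1.00 baseline.
baseline = 1.00                  # $ per 1M tokens served
new_cost = baseline / 1.8        # 80% more work per dollar
print(f"${new_cost:.2f} per 1M tokens")  # ~$0.56, i.e. ~44% cheaper
```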

Google has been designing its own AI processors since 2015 and began offering them to Cloud customers in 2018 to lessen reliance on external vendors such as Nvidia. The new TPU 8t and 8i are intended to supplement existing Nvidia‑based infrastructure rather than replace it, and Google also plans to make Nvidia’s latest Vera Rubin chips available later this year.

Public reaction highlighted a perceived shift in AI compute bottlenecks from FLOPs to memory bandwidth and latency, and many commenters noted that the dual‑chip strategy could intensify competition with Nvidia.

Tags: Agentic AI, Inference, Google Cloud, Training, AI Hardware, Chip Architecture, TPU