Google’s 8th‑Gen TPU Splits Training and Inference – A Direct Challenge to Nvidia’s One‑Chip Dominance
At Next 2026, Google unveiled the 8th‑generation TPU, splitting training and inference across two dedicated chips: TPU 8t, which delivers 121 ExaFLOPS for massive models, and TPU 8i, built around ultra‑low‑latency memory. The move improves performance, efficiency, and ecosystem support, signals a shift toward specialized AI hardware, and intensifies competition with Nvidia.
Launch and overall strategy
On April 22, 2026, at the Next 2026 conference, Google announced the 8th‑generation TPU, abandoning the previous “one‑chip‑does‑all” model and introducing two dedicated chips: TPU 8t for large‑model training and TPU 8i for inference.
TPU 8t – training chip
Co‑designed with Broadcom, TPU 8t links 9,600 chips into a single logical cluster that shares 2 PB of high‑bandwidth memory and delivers 121 ExaFLOPS of compute, roughly three times the overall compute of the 7th‑gen Ironwood TPU (https://mp.weixin.qq.com/s?__biz=MzAxNzU3NjcxOA==&mid=2650764675&idx=1&sn=fe9fab54acb232efea28a384f959f7d4). Power efficiency improves by up to 2×. The chip adds native FP4 support, and a built‑in SparseCore accelerator handles irregular memory‑access patterns such as embedding lookups. Autonomous fault‑tolerant routing, driven by real‑time telemetry and optical circuit switching (OCS), can reconfigure the hardware topology without human intervention.
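The announcement quotes only cluster‑level figures, so the per‑chip numbers in the sketch below are a rough back‑of‑envelope division rather than specifications from Google, and the precision behind the 121 ExaFLOPS headline (likely a low‑precision format such as FP8 or FP4) is an assumption.

```python
# Rough per-chip figures implied by the TPU 8t cluster numbers above.
# Cluster-level values (9,600 chips, 2 PB shared HBM, 121 ExaFLOPS) come from
# the article; dividing them down to per-chip numbers is our own estimate, and
# the precision behind the ExaFLOPS figure is an assumption.

chips = 9_600
cluster_flops = 121e18          # 121 ExaFLOPS across the whole cluster
cluster_hbm_bytes = 2e15        # 2 PB of shared high-bandwidth memory

per_chip_flops = cluster_flops / chips
per_chip_hbm_bytes = cluster_hbm_bytes / chips

print(f"~{per_chip_flops / 1e15:.1f} PFLOPS per chip")     # ~12.6 PFLOPS
print(f"~{per_chip_hbm_bytes / 1e9:.0f} GB HBM per chip")  # ~208 GB
```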
TPU 8i – inference chip
Developed with MediaTek, TPU 8i targets the inference “memory wall”. Each chip carries 288 GB of HBM plus 384 MB of on‑chip SRAM (a threefold increase over the previous generation), keeping a model’s core working set on‑chip and halving latency. A hierarchical “boardfly” network groups four chips into a unit; 36 such units form a larger cluster in which any two chips can reach each other in at most seven hops. A new collective‑communication engine reduces on‑chip communication latency fivefold. Compared with the prior generation, TPU 8i improves cost‑performance by 80 % and performance per watt by 117 %.
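To make the “memory wall” claim concrete, the sketch below checks whether a hypothetical serving workload fits within the per‑chip budgets quoted above. The 288 GB HBM and 384 MB SRAM figures come from the article; the model size, quantization, batch, and sequence settings are illustrative assumptions only.

```python
# Back-of-envelope check of the TPU 8i per-chip memory budget quoted above.
# The 288 GB HBM and 384 MB SRAM figures come from the article; the model
# size, quantization, batch, and sequence settings below are purely
# illustrative assumptions, not a description of any real deployment.

HBM_BYTES = 288 * 1024**3    # 288 GB of HBM per chip (from the article)
SRAM_BYTES = 384 * 1024**2   # 384 MB of on-chip SRAM per chip (from the article)

# Hypothetical model: 70B parameters stored as 1-byte (8-bit) weights.
weight_bytes = 70e9 * 1

# Hypothetical KV cache: 32 concurrent requests x 8,192 tokens,
# 80 layers x 8 KV heads x 128 dims, keys + values, 1 byte each.
kv_bytes = 32 * 8192 * 80 * 8 * 128 * 2 * 1

print(f"weights:   {weight_bytes / 1e9:6.1f} GB")
print(f"KV cache:  {kv_bytes / 1e9:6.1f} GB")
print(f"fits in one chip's HBM: {weight_bytes + kv_bytes < HBM_BYTES}")
print(f"on-chip SRAM for the hottest tensors: {SRAM_BYTES / 1e6:.0f} MB")
```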
Ecosystem and software support
Both chips are fabricated on TSMC’s 2 nm process and paired with Google’s custom Arm‑based Axion CPU and a fourth‑generation liquid‑cooling system. The TPU 8 series natively supports PyTorch 2.x, removing the need for the torch_xla compatibility layer, and integrates the Pallas kernel‑development toolkit for fine‑grained memory control. Google plans to make the chips available to customers in the second half of 2026 and reach mass production by the end of 2027.
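The article names Pallas but shows no code. As a rough illustration of the kind of kernel‑level control Pallas already offers on current TPUs, here is a minimal element‑wise kernel written against the existing jax.experimental.pallas API; nothing in it is specific to the TPU 8 series.

```python
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl

def add_kernel(x_ref, y_ref, o_ref):
    # Refs point at blocks staged into fast on-chip memory; read both inputs,
    # add them, and write the result to the output block.
    o_ref[...] = x_ref[...] + y_ref[...]

@jax.jit
def add(x, y):
    # pallas_call compiles the kernel and manages data movement between HBM
    # and on-chip memory; out_shape declares the output buffer.
    return pl.pallas_call(
        add_kernel,
        out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
    )(x, y)

x = jnp.arange(8, dtype=jnp.float32)
y = jnp.ones(8, dtype=jnp.float32)
print(add(x, y))  # [1. 2. 3. 4. 5. 6. 7. 8.]
```

On existing TPUs, Pallas lets the kernel author control which blocks of an array are staged into on‑chip memory and how the computation is tiled, which is the kind of fine‑grained memory control the article refers to.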
Industry impact
The training‑inference split reflects a broader industry shift toward specialized AI silicon: Amazon pairs Trainium for training with Inferentia for inference, Microsoft is building in‑house accelerators, and Nvidia’s Blackwell series is likewise moving toward workload‑specialized designs. Google’s dual‑chip strategy intensifies this competition, moving the AI compute market from single‑vendor dominance toward a multi‑player arena and promising lower cost and higher efficiency for large‑scale AI agents.
