Google’s 8th‑Gen TPU Splits Training and Inference – A Direct Challenge to Nvidia’s One‑Chip Dominance

At Next 2026, Google unveiled the 8th-generation TPU, splitting training and inference across two dedicated chips: TPU 8t, delivering 121 ExaFLOPS for large-model training, and TPU 8i, built around low-latency memory for inference. The move boosts performance, efficiency, and ecosystem support, signaling a shift toward specialized AI hardware and intensifying competition with Nvidia.

Architects' Tech Alliance

Launch and overall strategy

On April 22, 2026 at the Next 2026 conference, Google announced the 8th‑generation TPU, abandoning the previous “one‑chip‑does‑all” model and introducing two dedicated chips: TPU 8t for large‑model training and TPU 8i for inference.

TPU 8t – training chip

Co‑designed with Broadcom, TPU 8t links 9,600 chips into one logical cluster that shares 2 PB of high‑bandwidth memory and delivers 121 ExaFLOPS of compute—about three times the overall compute performance of the 7th‑gen Ironwood TPU (https://mp.weixin.qq.com/s?__biz=MzAxNzU3NjcxOA==&mid=2650764675&idx=1&sn=fe9fab54acb232efea28a384f959f7d4). Power efficiency improves by up to 2×. The chip adds native FP4 support, and a built‑in SparseCore accelerator targets irregular memory‑access patterns. Autonomous fault‑tolerant routing, driven by real‑time telemetry and optical circuit switching (OCS), can reconfigure the hardware topology without human intervention.
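The headline figures imply substantial per-chip resources. A quick back-of-the-envelope calculation (assuming decimal units and an even split across the cluster, neither of which Google has confirmed):

```python
# Rough per-chip figures for the TPU 8t cluster, derived only from the
# article's headline numbers; assumes decimal units and an even split.
TOTAL_CHIPS = 9_600
TOTAL_HBM_PB = 2             # shared high-bandwidth memory, in petabytes
TOTAL_COMPUTE_EFLOPS = 121   # cluster-wide, likely at low precision such as FP4

hbm_per_chip_gb = TOTAL_HBM_PB * 1e6 / TOTAL_CHIPS                  # PB -> GB
compute_per_chip_pflops = TOTAL_COMPUTE_EFLOPS * 1e3 / TOTAL_CHIPS  # EF -> PF

print(f"HBM per chip:     ~{hbm_per_chip_gb:.0f} GB")              # ~208 GB
print(f"Compute per chip: ~{compute_per_chip_pflops:.1f} PFLOPS")  # ~12.6 PFLOPS
```

Roughly 208 GB of HBM and 12.6 PFLOPS per chip—numbers suggesting the "2 PB shared" claim is about a pooled, cluster-wide memory space rather than oversized individual chips.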

TPU 8i – inference chip

Developed with MediaTek, TPU 8i tackles the inference "memory wall". Each chip carries 288 GB of HBM plus 384 MB of on‑chip SRAM—a threefold increase over the previous generation—letting a model's core working set stay on‑chip and halving latency. A hierarchical network groups four chips into a unit; 36 such units form a 144‑chip cluster in which any two chips are at most seven hops apart. A new collective‑communication engine cuts on‑chip latency fivefold. Compared with the prior generation, TPU 8i improves cost‑performance by 80% and performance per watt by 117%.

Ecosystem and software support

Both chips are fabricated on TSMC's 2 nm process and paired with Google's custom Arm‑based Axion CPU and a fourth‑generation liquid‑cooling system. The TPU 8 series natively supports PyTorch 2.x, eliminating the torch_xla compatibility layer, and integrates the Pallas kernel‑development toolkit for fine‑grained memory control. Google plans to make the chips available to customers in the second half of 2026 and reach mass production by the end of 2027.

Industry impact

The training‑inference split reflects a broader industry shift toward specialized AI silicon. Competitors such as Amazon’s Trainium + Inferentia, Microsoft’s in‑house chips, and Nvidia’s Blackwell series also pursue inference‑focused designs. Google’s dual‑chip strategy intensifies competition, moving the AI compute market from single‑vendor dominance to a multi‑player arena and promising lower cost and higher efficiency for large‑scale AI agents.

AI hardware, Nvidia competition, AI Accelerators, Google TPU, TPU 8i, TPU 8t, Training vs Inference
Written by

Architects' Tech Alliance

Sharing project experience and insights into cutting-edge architectures, with a focus on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, and industry practices and solutions.
