DeepHub IMBA
Mar 25, 2026 · Artificial Intelligence
TPU Architecture and Pallas Kernels: From Memory Hierarchy to FlashAttention
This article explains why TPU programming differs from GPU programming, describes the explicit HBM-to-VMEM-to-register data movement that TPUs require, and introduces the Pallas grid/BlockSpec/Ref model. It then walks through four progressively more complex kernels (an element-wise add, a tiled dot product, a fused RMSNorm with scratch memory, and a production-grade FlashAttention implementation), showing how each kernel maps onto the TPU memory hierarchy and exploits Pallas features such as input_output_aliases and PrefetchScalarGridSpec.
FlashAttention · JAX · Memory Hierarchy
20 min read
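
As a taste of what follows, here is a minimal sketch of the simplest pattern the article builds on: an element-wise add expressed through pl.pallas_call. The shapes and names below are illustrative assumptions, not the article's exact code.

```python
import jax
import jax.numpy as jnp
from jax.experimental import pallas as pl

def add_kernel(x_ref, y_ref, o_ref):
    # Refs point at blocks already staged in VMEM;
    # every read and write through them is explicit.
    o_ref[...] = x_ref[...] + y_ref[...]

@jax.jit
def add(x, y):
    return pl.pallas_call(
        add_kernel,
        out_shape=jax.ShapeDtypeStruct(x.shape, x.dtype),
    )(x, y)

# Illustrative shape chosen to respect TPU tiling (last dim 128).
x = jnp.arange(8 * 128, dtype=jnp.float32).reshape(8, 128)
print(jnp.allclose(add(x, x), x + x))  # True
```

With no BlockSpec, Pallas hands the kernel each whole array as a single VMEM block, which only works while the operands fit in VMEM; the grid and BlockSpec machinery covered later in the article lifts that restriction.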
