Why Training Large Language Models Feels Like Alchemy—and How to Master It

This article breaks down the hardware bottlenecks of large‑scale LLM training, explains the Roofline performance model, arithmetic intensity, and how computation and communication costs interact on GPUs and TPUs, offering concrete formulas and examples for efficient scaling.


As model sizes grow, engineering expertise becomes as crucial as research insight for large‑language‑model (LLM) development. This analysis of the steep engineering challenges of training LLMs is inspired by the technical blog at https://jax-ml.github.io/scaling-book/. The performance of LLM training on hardware is limited by three primary factors:

Compute speed – the number of floating‑point operations a processor can perform per second (FLOPs/s).

Bandwidth – the data transfer rate between memory, caches, and chips (bytes/s).

Total memory capacity – the maximum amount of data a device can hold (bytes).

These constraints define a Roofline model with an upper bound on runtime (the sum of compute and communication times) and a lower bound (the longer of the two). Optimizing training involves overlapping computation with communication to approach the lower bound.

Where Does Training Time Go?

1. Computation

Deep‑learning models consist of massive matrix multiplications, measured in FLOPs. For example, an NVIDIA H100 can deliver roughly 9.89 × 10¹⁴ bfloat16 FLOPs per second, while a TPU v6e offers about 9.1 × 10¹⁴ FLOPs/s. A model requiring 1 × 10¹² FLOPs would finish in ~1 ms on either accelerator.
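The compute-time estimate is just total FLOPs divided by the accelerator's peak rate; a minimal sketch in Python, using the peak figures quoted above:

```python
# Estimate compute time as total FLOPs / peak FLOPs per second.
H100_BF16_FLOPS = 9.89e14    # NVIDIA H100 peak bf16 FLOPs/s (figure quoted above)
TPU_V6E_BF16_FLOPS = 9.1e14  # TPU v6e peak bf16 FLOPs/s

def compute_time(total_flops: float, peak_flops_per_s: float) -> float:
    """Lower-bound compute time in seconds, assuming full utilization."""
    return total_flops / peak_flops_per_s

print(compute_time(1e12, H100_BF16_FLOPS))     # ~0.00101 s, i.e. about 1 ms
print(compute_time(1e12, TPU_V6E_BF16_FLOPS))  # ~0.00110 s
```

In practice, achieved FLOPs/s falls short of peak, so this is a floor on runtime, not a prediction.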

2. Intra‑chip communication

Data must move between on‑chip memory (e.g., HBM) and compute cores. H100 provides ~3.35 TB/s HBM bandwidth; TPU v6e provides ~1.6 TB/s. This transfer time becomes significant for large tensors.
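The corresponding transfer-time estimate is bytes moved divided by bandwidth; a sketch using the HBM figures above (the example matrix shape is an illustrative assumption, not from the original):

```python
# Time to move data between HBM and compute cores: bytes / (bytes per second).
H100_HBM_BW = 3.35e12    # bytes/s
TPU_V6E_HBM_BW = 1.6e12  # bytes/s

def transfer_time(num_bytes: float, bandwidth: float) -> float:
    """Time in seconds to move num_bytes at the given bandwidth."""
    return num_bytes / bandwidth

# A bf16 [8192, 8192] matrix occupies 2 * 8192 * 8192 bytes (~134 MB).
nbytes = 2 * 8192 * 8192
print(transfer_time(nbytes, H100_HBM_BW))  # ~4.0e-5 s, about 40 microseconds
```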

3. Inter‑chip communication

When a model spans multiple accelerators, tensors are exchanged across links such as ICI, DCN, or PCIe, each with its own bytes‑per‑second rate. This adds an additional communication component to the total runtime.

Estimating Total Runtime

Lower bound: the larger of the compute‑time and communication‑time estimates.

Upper bound: the sum of compute time and communication time.

In practice, overlapping compute and communication can bring actual runtime close to the lower bound, often within a factor of two of the ideal.
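These two bounds can be written directly as a sketch (the example inputs are illustrative):

```python
def runtime_bounds(t_math: float, t_comms: float) -> tuple[float, float]:
    """Return (lower, upper) runtime bounds in seconds.
    Lower bound assumes perfect overlap of compute and communication;
    upper bound assumes no overlap at all."""
    return max(t_math, t_comms), t_math + t_comms

lower, upper = runtime_bounds(t_math=1.0e-3, t_comms=4.0e-5)
print(lower, upper)  # compute-dominated here, so overlap nearly hides communication
```

Since the upper bound is at most twice the lower bound, any schedule lands within a factor of two of the ideal, as noted above.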

Arithmetic Intensity

Arithmetic intensity (or operational intensity) measures FLOPs per byte transferred. High intensity means the algorithm performs many calculations per memory access, leading to a compute‑bound regime; low intensity indicates a communication‑bound regime. The peak intensity of a device is its maximum FLOPs/s divided by its bandwidth.

For TPU v5e, peak compute is 1.97 × 10¹⁴ FLOPs/s and bandwidth is 8.2 × 10¹¹ bytes/s, giving a peak intensity of ~240 FLOPs/byte. Any algorithm below this threshold will be limited by bandwidth.
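The threshold falls straight out of the two quoted numbers:

```python
PEAK_FLOPS = 1.97e14  # TPU v5e peak bf16 FLOPs/s
HBM_BW = 8.2e11       # TPU v5e HBM bandwidth, bytes/s

# Peak (critical) arithmetic intensity: FLOPs/s divided by bytes/s.
peak_intensity = PEAK_FLOPS / HBM_BW
print(peak_intensity)  # ~240 FLOPs/byte; kernels below this are bandwidth-bound
```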

Roofline Diagram

The Roofline chart plots arithmetic intensity on the x‑axis (log scale) and achievable throughput on the y‑axis. Three regions appear:

Red: bandwidth‑limited (communication‑bound).

Yellow: performance improves with higher bandwidth.

Green: compute‑limited (hardware fully utilized).

[Figure: Roofline diagram]

Matrix Multiplication Example

Consider multiplying X (bf16[B,D]) by Y (bf16[D,F]) to produce Z (bf16[B,F]). The operation reads 2DF + 2BD bytes, performs 2BDF FLOPs, and writes 2BF bytes, so the arithmetic intensity is 2BDF / (2BD + 2DF + 2BF). Assuming B ≪ D,F, the denominator is dominated by 2DF and the intensity ≈ B. On TPU v5e with bfloat16, a batch size > 240 tokens therefore makes the kernel compute‑bound.
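A sketch of the intensity calculation (the example shapes are illustrative assumptions, not from the original):

```python
def matmul_intensity(B: int, D: int, F: int, dtype_bytes: int = 2) -> float:
    """Arithmetic intensity of Z[B,F] = X[B,D] @ Y[D,F], in FLOPs per byte."""
    flops = 2 * B * D * F
    bytes_moved = dtype_bytes * (B * D + D * F + B * F)  # read X, read Y, write Z
    return flops / bytes_moved

print(matmul_intensity(128, 8192, 8192))  # ~124 < 240: bandwidth-bound on TPU v5e
print(matmul_intensity(512, 8192, 8192))  # ~455 > 240: compute-bound
```

Note how the exact intensity (~124 at B=128) sits close to the B ≈ intensity approximation, which holds when B ≪ D, F.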

In a distributed setting with two TPUs, each chip holds half of the D dimension, computes a partial matmul, and the partial results are then summed across chips. Compute time halves, while communication time is the time to exchange the partial [B,F] matrices over the inter‑chip link. Solving the resulting inequality shows that when the embedding dimension D > 8755, compute dominates; otherwise the workload is communication‑bound.
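The threshold can be reproduced under that setup, assuming a TPU v5e peak of 1.97e14 FLOPs/s and an inter‑chip (ICI) bandwidth of 4.5e10 bytes/s; the ICI figure is an assumption consistent with the quoted result, not stated in the text:

```python
C = 1.97e14     # TPU v5e peak bf16 FLOPs/s
W_ICI = 4.5e10  # assumed inter-chip (ICI) bandwidth per chip, bytes/s

# Each of 2 chips computes 2*B*(D/2)*F = B*D*F FLOPs, then exchanges a bf16
# [B, F] partial result (2*B*F bytes). Compute dominates when
#   B*D*F / C  >  2*B*F / W_ICI,   i.e.   D > 2*C / W_ICI.
d_threshold = 2 * C / W_ICI
print(d_threshold)  # ~8755.6, matching the D > 8755 condition
```

Notably, B and F cancel: only the contracted dimension D determines which regime the sharded matmul lands in.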

Thus, in multi‑accelerator training, the key to avoiding a communication bottleneck is to ensure the model’s dimensions (especially D) are large enough relative to the hardware’s bandwidth, not merely increasing batch size.

Understanding these principles—Roofline analysis, arithmetic intensity, and the balance between compute and communication—enables practitioners to design scaling strategies that fully exploit modern GPU/TPU clusters.


Tags: GPU, performance engineering, Distributed Computing, LLM training, TPU, Arithmetic intensity, Roofline model
Written by

Baobao Algorithm Notes

Author of the BaiMian large model, offering technology and industry insights.
