Parallel Training of 100B‑Parameter Models: Intra‑Node Tensor Parallelism and Inter‑Node Data Parallelism
Training 100‑billion‑parameter Transformers is constrained by GPU memory more than by compute. Fitting such a model requires tensor parallelism within each node and data parallelism across nodes, combined with pipeline parallelism, gradient accumulation, and careful framework choices that balance memory, bandwidth, and compute overheads.
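
To make the memory constraint concrete, the following rough sketch estimates the model state for a 100B-parameter model. The numbers are assumptions, not figures from this article: mixed-precision Adam at 16 bytes per parameter (fp16 weights and gradients plus fp32 master weights, momentum, and variance), 8 GPUs per node, and a hypothetical 4-stage pipeline. Activations, buffers, and fragmentation are ignored, so real usage would be higher.

```python
# Back-of-the-envelope estimate of model-state memory.
# All constants are illustrative assumptions, not measurements.

# Assumed mixed-precision Adam layout, in bytes per parameter:
#   2 (fp16 weights) + 2 (fp16 gradients)
#   + 12 (fp32 master weights, momentum, variance)
BYTES_PER_PARAM = 2 + 2 + 12


def model_state_gb(num_params: float) -> float:
    """Total weights + gradients + optimizer state, in GB (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM / 1e9


def per_gpu_state_gb(num_params: float, tensor_parallel: int,
                     pipeline_parallel: int) -> float:
    """Model state held by one GPU when the model is sharded across
    tensor-parallel and pipeline-parallel ranks.

    Plain data parallelism replicates the model, so the data-parallel
    degree does not appear here.
    """
    return model_state_gb(num_params) / (tensor_parallel * pipeline_parallel)


NUM_PARAMS = 100e9  # 100B parameters

total_gb = model_state_gb(NUM_PARAMS)
tp_only_gb = per_gpu_state_gb(NUM_PARAMS, tensor_parallel=8, pipeline_parallel=1)
tp_pp_gb = per_gpu_state_gb(NUM_PARAMS, tensor_parallel=8, pipeline_parallel=4)

print(f"total model state:        {total_gb:.0f} GB")
print(f"per GPU, TP=8 only:       {tp_only_gb:.0f} GB")
print(f"per GPU, TP=8 and PP=4:   {tp_pp_gb:.0f} GB")
```

Under these assumptions, 8-way tensor parallelism alone still leaves about 200 GB of model state per GPU, which is why the approach also leans on pipeline parallelism; data parallelism across nodes then adds throughput without reducing the per-GPU footprint.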
