Parallel Training of 100B‑Parameter Models: Intra‑Node Tensor Parallelism and Inter‑Node Data Parallelism
Training 100‑billion‑parameter Transformers is limited by GPU memory rather than compute, requiring a mix of tensor parallelism within nodes and data parallelism across nodes, along with pipeline parallelism, gradient accumulation, and careful framework choices to balance memory, bandwidth, and compute overheads.
Memory Limits: Why a 100B‑Parameter Model Won’t Fit on a Single GPU
A 100‑billion‑parameter Transformer requires roughly 400 GB of FP32 weights (200 GB in bfloat16). Beyond weights, training needs optimizer states (doubling memory), activations (200–500 GB depending on batch size and sequence length), and gradient buffers, pushing total memory needs to 800 GB–1.2 TB. An H100 GPU offers only 80 GB, so even 10–15 GPUs are needed just for data storage.
Data Parallelism: The Simplest Baseline
In data parallelism each GPU holds a full model copy, processes different batches, and synchronizes gradients via an all‑reduce. For a 100B model, each replica consumes 3.2 TB across 16 GPUs, and gradient synchronization alone transfers about 200 GB per step, consuming ~0.22 s on a 900 GB/s NVLink link—roughly 40 % of the step time. Scaling beyond 100B makes this approach untenable.
Tensor Parallelism: Splitting Weight Matrices Across GPUs
Tensor parallelism (intra‑layer parallelism) shards each layer’s weight matrix across GPUs. For a Transformer’s multi‑head attention projection, the matrix [hidden_dim, num_heads * head_dim] is split column‑wise, e.g., GPU 0 holds columns 0‑N/2 and GPU 1 holds N/2‑N. Forward passes compute partial outputs that are concatenated; backward passes split gradients similarly. Communication uses all‑gather instead of all‑reduce, and with 8‑way tensor parallelism each GPU stores only 1/8 of the parameters, eliminating full‑model replication.
Pipeline Parallelism: Dividing Layers Across GPUs
Pipeline parallelism assigns different Transformer blocks to different GPUs. Forward activations flow left‑to‑right, gradients flow right‑to‑left. While memory per GPU drops proportionally (e.g., a 100‑layer model split 4‑way reduces memory by 4×), pipeline bubbles can cause low GPU utilization—often only one GPU works at a time, dropping utilization to ~25 %.
Practical Best‑Fit: Tensor Parallelism + Data Parallelism
Most teams training 100B‑scale models adopt a hybrid strategy: tensor parallelism inside each node (leveraging fast NVLink at ~900 GB/s) and data parallelism across nodes (using slower cluster interconnects at 400–800 Gbps). This keeps model shards on each GPU, confines high‑frequency tensor‑parallel communication to the fast intra‑node link, and performs a single data‑parallel synchronization per step.
Gradient Accumulation: Simulating Larger Batches Without Extra Memory
Gradient accumulation aggregates gradients over multiple micro‑batches before an optimizer step, effectively increasing batch size without requiring the memory for a single large batch. It also helps keep pipeline stages busy, though it multiplies the number of forward‑backward passes per update, increasing wall‑clock time.
Choosing a Parallel Strategy: Decision Tree
Can the model fit on one GPU? (Rare for 100B unless int4 quantized) → Use data parallelism only.
Can it fit on a single node (e.g., 8 GPUs)? → Apply intra‑node tensor parallelism (e.g., 8‑way) plus inter‑node data parallelism.
Is it still too large for a node? → Combine 4‑way tensor parallelism, 2‑way pipeline parallelism, and broader data parallelism.
Is the model deep with low latency? → Consider lightweight pipeline parallelism after measuring actual overhead.
Is network bandwidth ample? → 3‑D parallelism (tensor + pipeline + data) may be justified.
Why Communication Becomes the Bottleneck
On an 8‑GPU node, a typical step includes ~300 ms compute, ~200 ms gradient all‑reduce, ~50 ms tensor‑parallel all‑gather (overlapped), and 0–100 ms pipeline bubbles. Total step time reaches 500–600 ms, with computation accounting for only ~50 %; improving compute speed yields diminishing returns because network latency dominates.
Frameworks: PyTorch FSDP vs. Megatron‑LM
FSDP (Fully Sharded Data Parallel) shards optimizer state, gradients, and parameters across GPUs, offering memory efficiency with simpler usage but potentially higher synchronization cost if not tuned. Megatron‑LM provides explicit tensor, pipeline, and sequence parallelism, giving fine‑grained control at the cost of complexity. Large‑scale training typically picks one of these based on model‑framework compatibility.
Conclusion
The primary bottleneck for 100B‑parameter models is memory capacity and the ability to keep GPUs busy while moving data. Parallel strategy choices—data, tensor, and pipeline parallelism, along with gradient accumulation—determine how model state is stored and communicated. Successful teams focus on reducing waiting time rather than merely increasing raw throughput.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
DeepHub IMBA
A must‑follow public account sharing practical AI insights. Follow now. internet + machine learning + big data + architecture = IMBA
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
