How to Pick the Right Parallelism for 7B‑70B Models: DP, TP, PP, ZeRO & FSDP

This guide walks engineers through the memory, compute and bandwidth limits of training 7B‑70B models and compares data parallelism (DP/DDP), tensor parallelism (TP), pipeline parallelism (PP), the ZeRO stages and FSDP. It shows how to calculate GPU memory, estimate communication overhead, configure each strategy and avoid common pitfalls, so you can decide which parallelism to use on a multi‑GPU or multi‑node cluster.

MaGe Linux Operations

Why Distributed Training Is Needed

Training large language models hits three hard constraints:

Memory wall : model + optimizer + gradients + activations exceed a single GPU’s memory.

Compute wall : one epoch can take weeks on a single GPU.

Bandwidth wall : inter‑GPU communication becomes the bottleneck.

Distributed training replaces a single GPU with many GPUs, but each parallelism strategy solves a different part of the problem.

Parallelism Strategies and Their Use Cases

DP / DDP – solves the compute wall. Suitable when the model fits on one GPU but training is slow.

Tensor Parallel (TP) – splits each layer across GPUs to overcome the memory wall when a model does not fit on a single GPU (e.g., 70B+).

Pipeline Parallel (PP) – splits the model across stages to handle very deep models or batch‑size‑1 scenarios.

ZeRO‑1/2/3 – data‑parallel optimisations that shard optimizer state, gradients and/or parameters. ZeRO‑3 gives full parameter sharding.

FSDP – PyTorch native implementation equivalent to ZeRO‑3.

3D Parallel (TP + PP + ZeRO) – combines the three strategies to address all three walls for industrial‑scale models (70B‑175B).

Memory Calculation (Must‑Know Formula)

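Total GPU memory (bytes) = Model weights + Optimizer states + Gradients + Activations
                         = 2P + 8P (Adam) + 2P + Activations
                         P = number of parameters
                         FP16 weights and gradients: 2 bytes per parameter each
                         Adam state: two FP32 moments, 2 × 4 bytes per parameter
                         Activations grow with batch × seq × hidden × L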

Example for a 7B FP16 model on a single A100 80 GB:

Model weights: 14 GB

Adam optimizer (FP32): 56 GB

Gradients: 14 GB

Activations (L=32, batch=4, seq=2048): ≈ 20 GB

Total ≈ 104 GB

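Even without activations, weights, gradients and Adam state add up to 84 GB, which already exceeds one 80 GB GPU. Full training of a 7B model with Adam therefore needs at least two 80 GB GPUs with the state sharded between them, or CPU offload.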
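The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch that assumes the same accounting as the example (FP16 weights and gradients, two FP32 Adam moments, decimal gigabytes) and leaves activations out; the function name is illustrative.

```python
GB = 1e9  # decimal gigabytes, matching the figures in the example


def static_training_memory_gb(num_params: float) -> dict:
    """Estimate weights, gradients and Adam state for FP16 training with FP32 Adam moments."""
    weights = 2 * num_params / GB     # FP16: 2 bytes per parameter
    gradients = 2 * num_params / GB   # FP16: 2 bytes per parameter
    adam_state = 8 * num_params / GB  # two FP32 moments: 2 x 4 bytes per parameter
    return {
        "weights": weights,
        "gradients": gradients,
        "adam_state": adam_state,
        "total": weights + gradients + adam_state,
    }


est = static_training_memory_gb(7e9)
for name, gigabytes in est.items():
    print(f"{name}: {gigabytes:.0f} GB")
print("fits on one 80 GB GPU:", est["total"] < 80)
```

Changing the argument to 13e9 or 70e9 gives the matching figures for larger models, before any sharding is applied.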

Data Parallel (DDP) Walk‑through

Each GPU loads the full model.

Data is sharded with DistributedSampler.

Forward and backward passes compute gradients locally.

All‑reduce synchronises gradients – the most critical step.

Each GPU updates its parameters independently.
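The five steps can be traced without any GPU. The sketch below simulates them in plain Python for a one‑parameter model y ≈ w·x: the strided split imitates DistributedSampler, and all_reduce_mean is a stand‑in for the real collective. All names and the toy data are illustrative assumptions, not PyTorch APIs.

```python
def local_gradient(w, shard):
    """Gradient of the mean squared error for y ~ w * x over one rank's shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for the gradient all-reduce: every rank ends up with the average."""
    avg = sum(values) / len(values)
    return [avg] * len(values)


def ddp_step(weights, shards, lr):
    """One simulated DDP step; weights[i] is the replica held by rank i."""
    grads = [local_gradient(w, shard) for w, shard in zip(weights, shards)]
    synced = all_reduce_mean(grads)
    # Every rank applies the same averaged gradient, so replicas stay identical
    return [w - lr * g for w, g in zip(weights, synced)]


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # toy (x, y) pairs
world_size = 2
# Strided split: rank r takes samples r, r + world_size, ...
shards = [data[rank::world_size] for rank in range(world_size)]
weights = ddp_step([0.0] * world_size, shards, lr=0.01)
single_gpu = 0.0 - 0.01 * local_gradient(0.0, data)
print(weights, single_gpu)
```

With equal‑sized shards, the averaged gradient equals the full‑batch gradient, which is why a DDP run should track the single‑GPU baseline at the same global batch size.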

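Key DDP parameters (PyTorch 2.0+):

find_unused_parameters=True – needed only when some parameters receive no gradient in an iteration (for example, conditional branches). DDP then searches the autograd graph for them on every step, which adds roughly 10‑20 % overhead, so leave it off otherwise.

gradient_as_bucket_view=True – makes each gradient a view into DDP's all‑reduce buckets, which avoids a gradient copy and lowers peak memory. Bucketed all‑reduce itself is always on; the bucket size is set with bucket_cap_mb.

static_graph=True – declares that the set of used and unused parameters is the same in every iteration, which lets DDP skip the per‑step search and apply further optimisations.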

DeepSpeed ZeRO‑3 for a 70B Model

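Launch command (2 nodes with 8 GPUs each; a multi‑node run also needs a hostfile, passed with --hostfile or placed at DeepSpeed's default /job/hostfile):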

deepspeed --num_gpus=8 --num_nodes=2 train_zero3.py

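Key ds_config_zero3.json options (excerpt). The "auto" values are filled in by the Hugging Face Trainer / Accelerate integration; with plain DeepSpeed, replace them with explicit numbers: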

{
  "bf16": {"enabled": true},
  "optimizer": {"type": "AdamW", "params": {"lr": 1e-5, "betas": [0.9, 0.95], "weight_decay": 0.1}},
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {"device": "cpu", "pin_memory": true},
    "offload_param": {"device": "cpu", "pin_memory": true},
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_prefetch_bucket_size": "auto",
    "stage3_max_live_parameters": 1e9,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}

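Important flags:

stage: 3 – full ZeRO (parameter + gradient + optimizer‑state sharding).

offload_optimizer with device: "cpu" and pin_memory: true – moves optimizer shards to pinned CPU memory, saving >30 % GPU memory at ~20 % speed cost.

offload_param – also moves parameter shards to CPU; leave it out if GPU memory allows, because it adds further PCIe traffic.

overlap_comm: true – overlaps communication with computation.

stage3_prefetch_bucket_size: "auto" – sizes the buffer used to prefetch parameters, which reduces waiting time.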
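To see why stage 3 is the one that matters for a 70B model, it helps to compare per‑GPU memory across the ZeRO stages. The sketch below reuses the accounting from the memory section (2 bytes per parameter for weights, 2 for gradients, 8 for Adam state) and assumes the 16 GPUs of the launch command; it ignores activations, buffers and fragmentation, and the function name is illustrative.

```python
def zero_memory_per_gpu_gb(num_params, num_gpus, stage):
    """Per-GPU weights + gradients + Adam state in GB for ZeRO stage 0-3."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2 or 3")
    weights, grads, optim = 2.0, 2.0, 8.0  # bytes per parameter
    if stage >= 1:
        optim /= num_gpus    # ZeRO-1 shards optimizer state
    if stage >= 2:
        grads /= num_gpus    # ZeRO-2 also shards gradients
    if stage >= 3:
        weights /= num_gpus  # ZeRO-3 also shards parameters
    return num_params * (weights + grads + optim) / 1e9


for stage in range(4):
    gigabytes = zero_memory_per_gpu_gb(70e9, num_gpus=16, stage=stage)
    print(f"ZeRO-{stage}: {gigabytes:.2f} GB per GPU")
```

Under these assumptions only stage 3 brings the static state of a 70B model below 80 GB per GPU on 16 devices, and activations still have to fit in what remains.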

FSDP (PyTorch Native, Recommended After 2024)

Minimal example (PyTorch 2.0+), launched with torchrun so that LOCAL_RANK is set; the data loader is assumed to be built elsewhere:

import functools
import os

import torch
from torch.distributed import init_process_group
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy
from torch.distributed.fsdp import MixedPrecision, BackwardPrefetch
from torch.distributed.fsdp.wrap import transformer_auto_wrap_policy
from transformers import AutoModelForCausalLM
# Decoder-layer class used by the auto-wrap policy in step 3
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

# 1. Initialise process group
init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# 2. Load model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf", torch_dtype=torch.bfloat16)

# 3. Auto‑wrap each transformer layer
wrap_policy = functools.partial(
    transformer_auto_wrap_policy, transformer_layer_cls={LlamaDecoderLayer})

mp_policy = MixedPrecision(param_dtype=torch.bfloat16,
                           reduce_dtype=torch.bfloat16,
                           buffer_dtype=torch.bfloat16)

model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # equivalent to ZeRO‑3
    mixed_precision=mp_policy,
    auto_wrap_policy=wrap_policy,
    backward_prefetch=BackwardPrefetch.BACKWARD_PRE,
    forward_prefetch=True,
    device_id=local_rank,
    limit_all_gathers=True,
    use_orig_params=True)

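optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# 4. Training loop; `loader` is assumed to be a DataLoader (built elsewhere)
#    that yields dicts of tensors including `labels`, so `.loss` is populated
for batch in loader:
    batch = {k: v.to(local_rank) for k, v in batch.items()}
    loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()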

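Key FSDP parameters:

sharding_strategy=FULL_SHARD – shards parameters, gradients and optimizer state, giving the same memory savings as ZeRO‑3.

auto_wrap_policy – wraps each transformer layer as its own FSDP unit, so only one layer's full parameters are gathered at a time. Without it the whole model is a single unit and every all‑gather materialises all parameters, which defeats the memory saving.

backward_prefetch=BACKWARD_PRE – issues the next all‑gather before the current layer's gradient computation, trading some memory for more overlap (~10‑15 % faster).

limit_all_gathers=True – rate‑limits in‑flight all‑gather ops to avoid OOM.

use_orig_params=True – exposes the original parameters instead of FSDP's flattened ones; needed for per‑parameter optimizer groups (for example, no weight decay on norm layers), for mixing frozen and trainable parameters in one unit, and for torch.compile.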
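The effect of the wrap policy on memory can be estimated with simple arithmetic. The sketch below is a rough upper bound under stated assumptions: equal‑sized layers, BF16 parameters, embeddings ignored, and exactly one wrapped unit gathered at a time (prefetching can keep two alive). The 80‑layer figure is the depth of Llama‑2‑70B, and the function name is illustrative.

```python
def fsdp_peak_param_gb(total_params, num_layers, num_gpus, wrap_per_layer, bytes_per_param=2):
    """Rough peak parameter memory per GPU under FULL_SHARD.

    Peak = the local shard of the whole model + one fully gathered FSDP unit.
    """
    shard = total_params * bytes_per_param / num_gpus
    unit_params = total_params / num_layers if wrap_per_layer else total_params
    gathered = unit_params * bytes_per_param
    return (shard + gathered) / 1e9


per_layer = fsdp_peak_param_gb(70e9, num_layers=80, num_gpus=16, wrap_per_layer=True)
whole_model = fsdp_peak_param_gb(70e9, num_layers=80, num_gpus=16, wrap_per_layer=False)
print(f"per-layer wrap:   {per_layer:.2f} GB")
print(f"whole-model wrap: {whole_model:.2f} GB")
```

The gap between the two numbers is the reason the example passes a transformer‑layer wrap policy rather than wrapping the model once.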

3D Parallelism (TP + PP + ZeRO) for 70B+ Models

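Implemented via Megatron‑LM or DeepSpeed + Accelerate. The snippet below is schematic: it shows the intended layout, and the tensor_parallel and pipeline_parallel key names should be checked against the schema of the tool and version you use. Note that DeepSpeed's own pipeline engine rejects ZeRO stages 2 and 3, so combine PP with ZeRO‑1 there. Example YAML snippet: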

compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED
deepspeed_config:
  gradient_accumulation_steps: 16
  zero_optimization:
    stage: 3
    offload_optimizer: {device: cpu}
  tensor_parallel:
    tp_size: 2
  pipeline_parallel:
    pp_size: 2
mixed_precision: bf16

Communication breakdown for TP=2, PP=2, DP=4 on 16 GPUs (TP × PP × DP = 2 × 2 × 4, i.e. 2 nodes × 8 GPUs):

TP : each transformer layer performs 2 all‑reduce operations in the forward pass and 2 more in the backward pass (high‑frequency, bandwidth‑sensitive).

PP : point‑to‑point transfers between pipeline stages (lower frequency).

ZeRO‑3 DP : all‑gather / reduce‑scatter across 4 GPUs (medium frequency).

Overall communication cost order: TP > ZeRO‑3 > PP.
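Which GPUs talk to each other follows from how global ranks are mapped onto the three dimensions. The sketch below assumes one common layout (TP index varies fastest, then DP, then PP), which keeps each TP pair on adjacent ranks of the same node; frameworks differ in their ordering, so treat the function names and the layout as illustrative.

```python
def rank_to_coords(rank, tp_size, pp_size, dp_size):
    """Map a global rank to (dp, pp, tp) indices with TP fastest and PP slowest."""
    world = tp_size * pp_size * dp_size
    if not 0 <= rank < world:
        raise ValueError(f"rank must be in [0, {world})")
    tp = rank % tp_size
    dp = (rank // tp_size) % dp_size
    pp = rank // (tp_size * dp_size)
    return dp, pp, tp


def tp_groups(tp_size, pp_size, dp_size):
    """Ranks that all-reduce together inside each tensor-parallel group."""
    world = tp_size * pp_size * dp_size
    return [list(range(start, start + tp_size)) for start in range(0, world, tp_size)]


for rank in range(16):
    dp, pp, tp = rank_to_coords(rank, tp_size=2, pp_size=2, dp_size=4)
    print(f"rank {rank:2d}: dp={dp} pp={pp} tp={tp}")
print(tp_groups(2, 2, 4))
```

With 8 GPUs per node, this layout puts pipeline stage 0 on the first node and stage 1 on the second, so the bandwidth‑hungry TP all‑reduces stay inside a node and only the lighter point‑to‑point PP traffic crosses nodes.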

Evaluating Training Effectiveness

Efficiency Metrics

Throughput – tokens / second / GPU. Higher is better; proportional to hardware compute.

MFU – Model FLOPs Utilisation. Healthy values: 40‑50 % on A100, 50‑60 % on H100.

HFU – Hardware FLOPs Utilisation. Unlike MFU, it also counts FLOPs spent on recomputation (for example, gradient checkpointing), so HFU ≥ MFU.

Step time – time per training step; should be stable and as low as possible.

Communication % – comm_time / step_time; <30 % is considered good.
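MFU can be estimated from throughput alone. The sketch below uses the common rule of thumb of about 6 FLOPs per parameter per token for a forward and backward pass (attention FLOPs ignored) and the A100's 312 TFLOPS dense BF16 peak; the 3,000 tokens/s/GPU figure is a made‑up input for illustration, not a measurement.

```python
A100_BF16_PEAK_FLOPS = 312e12  # dense BF16 peak of an A100


def mfu(tokens_per_sec_per_gpu, num_params, peak_flops=A100_BF16_PEAK_FLOPS):
    """Model FLOPs Utilisation from the ~6 FLOPs per parameter per token approximation."""
    model_flops_per_sec = 6 * num_params * tokens_per_sec_per_gpu
    return model_flops_per_sec / peak_flops


# Hypothetical 7B run at 3,000 tokens/s per GPU
print(f"{mfu(3000, 7e9):.1%}")  # prints 40.4%
```

Pass a different peak_flops value for other accelerators or precisions.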

Sample throughput measurement after 100 steps:

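import time
import torch
# `train_step` is a placeholder for your forward/backward/optimizer step;
# batch_size is per GPU, so the product below is cluster-wide throughput
torch.cuda.synchronize()
start = time.time()
for step, batch in enumerate(loader, start=1):
    train_step(batch)
    if step == 100:
        torch.cuda.synchronize()  # wait for queued GPU work before reading the clock
        t = time.time() - start
        tokens_per_sec = 100 * batch_size * seq_len * num_gpus / t
        print(f"throughput: {tokens_per_sec:.0f} tokens/s ({tokens_per_sec / num_gpus:.0f} per GPU)")
        break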

Memory Monitoring

print(f"GPU {rank} mem: {torch.cuda.max_memory_allocated()/1e9:.2f} GB")
# or torch.cuda.memory_summary()

Convergence Monitoring

Track loss curves, gradient norms, per‑GPU loss synchronisation, and evaluation metrics across epochs to ensure distributed training matches a single‑GPU baseline.
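One simple way to make "matches a single‑GPU baseline" concrete is to compare the two loss curves at the same steps. The sketch below is an assumption about how such a check could look: the loss values are made up, and the 5 % tolerance is an arbitrary choice to tune per project.

```python
def max_relative_gap(baseline_losses, distributed_losses):
    """Largest relative difference between two loss curves sampled at the same steps."""
    if len(baseline_losses) != len(distributed_losses):
        raise ValueError("curves must be sampled at the same steps")
    return max(abs(d - b) / abs(b) for b, d in zip(baseline_losses, distributed_losses))


baseline = [2.50, 2.10, 1.80, 1.65]     # made-up single-GPU losses
distributed = [2.52, 2.08, 1.83, 1.66]  # made-up multi-GPU losses at the same steps
gap = max_relative_gap(baseline, distributed)
print(f"max relative gap: {gap:.2%}")
assert gap < 0.05, f"curves diverge by {gap:.1%}"
```

The comparison is only meaningful when both runs use the same global batch size, learning‑rate schedule and data order.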

Common Pitfalls (12 Practical Tips)

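NCCL timeout : raise the collective timeout with the timeout argument of torch.distributed.init_process_group, make sure InfiniBand is not disabled (NCCL_IB_DISABLE=0), and set NCCL_DEBUG=INFO to see which rank or link stalls.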

CPU offload slowdown : avoid offloading when PCIe bandwidth is the bottleneck; use more GPUs or NVMe offload instead.

ZeRO‑3 checkpoint incompatibility : call ds_engine.save_checkpoint on every rank, because each rank writes its own shard and a rank‑0‑only call hangs; resume with ds_engine.load_checkpoint.

Wrong activation‑checkpointing granularity : apply gradient checkpointing at transformer‑layer level, not per linear layer, to balance memory saving and speed.

TP/PP mis‑alignment : ensure num_attention_heads % tp_size == 0 and hidden_size % tp_size == 0. If not, choose a different parallelism.

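Missing use_orig_params in FSDP : set use_orig_params=True when the optimizer uses per‑parameter groups, when a wrapped unit mixes frozen and trainable parameters, or when using torch.compile.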

Incorrect gradient_accumulation_steps : the effective batch size is micro‑batch × accumulation steps × data‑parallel replicas; when it grows, the learning rate usually needs rescaling.

Unbalanced data with DistributedSampler : use length‑aware or group‑by‑length samplers.

FP16 overflow : prefer BF16; if using FP16 enable GradScaler and monitor loss.item() for NaNs.

Multi‑node init hang : verify network connectivity, open required ports, or test with the Gloo backend.

TP size not dividing hidden size : TP must be a factor of both head count and hidden dimension.

Slow ZeRO‑3 checkpointing : enable stage3_gather_16bit_weights_on_model_save: true and use the safetensors format.
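Several of these pitfalls can be caught before a job is launched. The sketch below is a hypothetical pre‑flight helper, not part of any library: it checks the TP divisibility rules from the list, confirms that the GPU count splits evenly into TP × PP groups, and computes the effective batch size. The head count and hidden size in the example are those of Llama‑2‑70B.

```python
def check_parallel_config(world_size, tp_size, pp_size, num_heads, hidden_size):
    """Return a list of problems; an empty list means the layout is consistent."""
    problems = []
    if num_heads % tp_size != 0:
        problems.append(f"num_attention_heads={num_heads} is not divisible by tp_size={tp_size}")
    if hidden_size % tp_size != 0:
        problems.append(f"hidden_size={hidden_size} is not divisible by tp_size={tp_size}")
    if world_size % (tp_size * pp_size) != 0:
        problems.append(f"world_size={world_size} is not divisible by tp_size * pp_size")
    return problems


def effective_batch_size(micro_batch, grad_accum_steps, world_size, tp_size=1, pp_size=1):
    """Samples per optimizer step: micro-batch x accumulation steps x data-parallel replicas."""
    dp_size = world_size // (tp_size * pp_size)
    return micro_batch * grad_accum_steps * dp_size


print(check_parallel_config(16, tp_size=2, pp_size=2, num_heads=64, hidden_size=8192))
print(check_parallel_config(16, tp_size=3, pp_size=2, num_heads=64, hidden_size=8192))
print(effective_batch_size(micro_batch=4, grad_accum_steps=16, world_size=16, tp_size=2, pp_size=2))
```

Only the data‑parallel replicas multiply the batch size; GPUs inside the same TP or PP group work on the same samples.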

Optimization Directions (From "It Works" to "It Works Well")

Mixed‑precision (BF16) – 30‑50 % speedup, zero extra cost.

Flash Attention 2 – +30 % speed, –20 % memory.

Gradient checkpointing – –30‑50 % memory, –20 % speed.

ZeRO‑2 / FSDP – ZeRO‑2 shards optimizer state and gradients; FSDP (FULL_SHARD) also shards parameters.

CPU offload for optimizer – saves memory, ~20 % slower.

Fused optimizers (e.g., FusedAdam) – fuse the per‑parameter update kernels to shorten the optimizer step; optimizer‑state memory is unchanged.

Communication overlap – 10‑20 % speedup.

Sequence parallelism – splits activations along the sequence dimension; worth adding for long sequences (roughly >8k tokens).

ZeRO‑Infinity / ZeRO++ – extreme optimisation.

Full 3D parallelism – ultimate scalability for >70B models.

Decision Flow

Run a single‑GPU baseline (or plain DDP if the model fits on one GPU) and verify convergence.

Add ZeRO‑2 or FSDP to fit memory.

Introduce TP/PP when the model still does not fit, or when sharding communication dominates step time.

Scale to multi‑node; validate network bandwidth.

Deploy monitoring (throughput, memory, loss) and enable elastic scheduling.

Final Recommendations by Model Size

< 7B – single‑GPU training or LoRA fine‑tuning; add DDP only to go faster.

7B‑13B – data parallelism with ZeRO‑2.

13B‑70B – FSDP or ZeRO‑3 with TP=2‑4.

70B+ – 3D parallelism (TP + PP + ZeRO‑3).
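The size bands above can be written down as a small lookup, which is handy in launch scripts that pick a config automatically. This is a sketch of the recommendations in this section only: the bands overlap at 13B and 70B, so the boundary handling below is an assumption, and the function name is hypothetical.

```python
def recommend_strategy(params_billion):
    """Return the parallelism suggested in this guide for a given model size."""
    if params_billion <= 0:
        raise ValueError("params_billion must be positive")
    if params_billion < 7:
        return "single GPU or LoRA"
    if params_billion <= 13:
        return "ZeRO-2 data parallelism"
    if params_billion < 70:
        return "FSDP or ZeRO-3 with TP=2-4"
    return "3D parallelism (TP + PP + ZeRO)"


for size in (3, 7, 13, 34, 70, 175):
    print(f"{size}B -> {recommend_strategy(size)}")
```

Treat the result as a starting point; available GPU memory, interconnect and sequence length can shift a model into the neighbouring band.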

Common mistakes to avoid:

Using ZeRO‑3 for small models – communication overhead outweighs benefits.

Applying DDP to models that cannot fit – leads to OOM.

Skipping a single‑GPU baseline before scaling – a working baseline separates model and data bugs from genuine distributed‑training bugs.

When distributed training itself becomes a bottleneck (high cost, slow speed, hard debugging), reconsider the model size or use techniques such as LoRA, knowledge distillation, or quantisation instead of pushing for ever larger models.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
