How Distributed Training Powers Massive Language Models: Concepts, Strategies, and Code

This article explains why single‑machine resources are insufficient for training ever‑larger language models, introduces the fundamentals of distributed training systems, details various parallel strategies such as data, model, pipeline, and hybrid parallelism, and provides practical PyTorch code and memory‑optimization techniques to accelerate large‑scale model training.

Huawei Cloud Developer Alliance

Distributed Training Overview

As language‑model parameters and required training data grow rapidly, a single machine cannot meet the resource demands. Distributed training systems split a training task into sub‑tasks and allocate them to many compute devices to handle massive computation and memory requirements.

[Figure: Distributed training illustration]

System Architecture

Single-device training runs on a CPU, GPU, TPU, or NPU. In a distributed setting, multiple devices (possibly spread across several servers) work in parallel: each performs part of the computation, and the partial results are merged so that the final outcome matches what a single device would produce.

Training Speed Formula

Total training speed ∝ single‑device compute speed × number of devices × multi‑device acceleration ratio.

[Figure: Speed formula illustration]
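As a quick illustration of the formula (all numbers below are made up, not measured): 8 devices with a 90% multi-device acceleration ratio deliver roughly 7.2 times the throughput of a single device.

# Toy calculation of the training-speed formula; the values are purely illustrative.
single_device_throughput = 100.0   # samples processed per second on one device
num_devices = 8
acceleration_ratio = 0.9           # < 1.0 because of communication and synchronization overhead

total_throughput = single_device_throughput * num_devices * acceleration_ratio
print(total_throughput)            # 720.0 samples/s, i.e. a 7.2x speedup over one device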

Parallel Strategies

Data Parallelism

Each device holds a full model replica and processes a distinct mini-batch. After the forward and backward passes, gradients are averaged across devices so every replica applies the same parameter update. The PyTorch DistributedSampler implementation below shows how each rank draws its own shard of the dataset, and the job is launched with torch.distributed.launch.

import math

import torch
import torch.distributed as dist
from torch.utils.data import Sampler


class DistributedSampler(Sampler):
    def __init__(self, dataset, num_replicas=None, rank=None, shuffle=True, seed=0):
        if num_replicas is None:
            if not dist.is_available():
                raise RuntimeError("Requires distributed package to be available")
            num_replicas = dist.get_world_size()
        if rank is None:
            if not dist.is_available():
                raise RuntimeError("Requires distributed package to be available")
            rank = dist.get_rank()
        self.dataset = dataset
        self.num_replicas = num_replicas
        self.rank = rank
        self.epoch = 0
        self.num_samples = int(math.ceil(len(self.dataset) * 1.0 / self.num_replicas))
        self.total_size = self.num_samples * self.num_replicas
        self.shuffle = shuffle
        self.seed = seed

    def __iter__(self):
        if self.shuffle:
            # All ranks use the same seed + epoch, so every process generates the
            # identical permutation and then keeps a disjoint slice of it.
            g = torch.Generator()
            g.manual_seed(self.seed + self.epoch)
            indices = torch.randperm(len(self.dataset), generator=g).tolist()
        else:
            indices = list(range(len(self.dataset)))
        # Pad with repeated leading samples so the index list divides evenly across replicas.
        indices += indices[:(self.total_size - len(indices))]
        assert len(indices) == self.total_size
        # Each rank keeps every num_replicas-th index, starting from its own rank offset.
        indices = indices[self.rank:self.total_size:self.num_replicas]
        assert len(indices) == self.num_samples
        return iter(indices)

    def __len__(self):
        return self.num_samples

    def set_epoch(self, epoch):
        # Call once per epoch, before iterating, so each epoch gets a different shuffle.
        self.epoch = epoch

Launch command:

CUDA_VISIBLE_DEVICES=0,1 python -m torch.distributed.launch --nproc_per_node=2 main.py
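To complete the picture, here is a minimal sketch of the training loop that would run under the launch command above, wrapping the model in DistributedDataParallel so gradients are averaged automatically. The model, dataset, and hyperparameters are placeholders, and the sketch assumes the launcher exposes the local rank through the LOCAL_RANK environment variable (older versions of torch.distributed.launch pass a --local_rank argument instead).

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset

dist.init_process_group(backend="nccl")              # launcher supplies MASTER_ADDR, RANK, etc.
local_rank = int(os.environ["LOCAL_RANK"])           # assumption: set by the launcher
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda()           # placeholder model
model = DDP(model, device_ids=[local_rank])          # gradients are all-reduced during backward
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

dataset = TensorDataset(torch.randn(256, 1024), torch.randn(256, 1024))
sampler = DistributedSampler(dataset)                # the sampler implemented above
loader = DataLoader(dataset, batch_size=16, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)                         # reshuffle differently each epoch
    for x, y in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x.cuda()), y.cuda())
        loss.backward()
        optimizer.step()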

Model Parallelism

Model parameters are split across devices to overcome single-device memory limits. The two main forms are layer-wise partitioning (pipeline parallelism) and intra-layer partitioning (tensor parallelism); typical examples partition the embedding table, the large matrix multiplications, and the softmax layer. A column-parallel linear-layer sketch follows the figure below.

[Figure: Data parallel example]
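As a concrete, deliberately simplified illustration of tensor parallelism, the sketch below splits one linear layer column-wise across the ranks of an already-initialized torch.distributed process group. The class name and sizes are made up for illustration, and only the forward pass is shown, since dist.all_gather as used here does not propagate gradients back through the collective.

import torch
import torch.distributed as dist
import torch.nn as nn
import torch.nn.functional as F

class ColumnParallelLinear(nn.Module):
    # Each rank stores only its own column slice of the full (out_features x in_features) weight.
    def __init__(self, in_features, out_features):
        super().__init__()
        world_size = dist.get_world_size()
        assert out_features % world_size == 0
        self.out_per_rank = out_features // world_size
        self.weight = nn.Parameter(torch.empty(self.out_per_rank, in_features))
        nn.init.normal_(self.weight, std=0.02)

    def forward(self, x):
        local_out = F.linear(x, self.weight)                     # this rank's slice of the output
        gathered = [torch.empty_like(local_out) for _ in range(dist.get_world_size())]
        dist.all_gather(gathered, local_out)                     # collect every rank's slice
        return torch.cat(gathered, dim=-1)                       # full output available on all ranks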

Pipeline Parallelism

Model layers are divided into consecutive stages placed on different devices, forming a pipeline through which activations flow. Scheduling techniques such as GPipe and 1F1B (one forward, one backward) reduce pipeline bubbles and improve device utilization; a simplified stage-splitting sketch follows the figure below.

[Figure: GPipe pipeline example]
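The following sketch shows the core idea behind GPipe-style scheduling on an assumed 2-GPU machine: the model is cut into two stages that live on different devices, and each mini-batch is split into micro-batches so the stages can overlap work. A real scheduler (GPipe or 1F1B) also interleaves the backward passes of different micro-batches, which this simplified loop omits.

import torch
import torch.nn as nn

# Two pipeline stages placed on different devices (layer sizes are illustrative).
stage0 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
stage1 = nn.Sequential(nn.Linear(4096, 1024)).to("cuda:1")

def pipelined_forward(batch, num_microbatches=4):
    outputs = []
    for micro in batch.chunk(num_microbatches):        # split the mini-batch into micro-batches
        hidden = stage0(micro.to("cuda:0"))            # stage 0 runs on GPU 0
        outputs.append(stage1(hidden.to("cuda:1")))    # stage 1 runs on GPU 1
    return torch.cat(outputs)

out = pipelined_forward(torch.randn(32, 1024))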

Hybrid Parallelism

Hybrid parallelism combines data, tensor, and pipeline parallelism. Large-scale models such as BLOOM were trained with the Megatron-DeepSpeed framework and the ZeRO optimizer, achieving efficient training across hundreds of GPUs. An illustrative DeepSpeed configuration sketch follows the figure below.

[Figure: BLOOM hybrid parallel architecture]
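As a rough sketch of how such a setup is wired together in practice, the snippet below shows a minimal DeepSpeed configuration that switches on ZeRO and BF16 mixed precision. The values are illustrative, not BLOOM's actual settings, and the script is meant to be started with the deepspeed launcher so that the distributed environment is already set up.

import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},                               # mixed-precision training
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "zero_optimization": {"stage": 1},                       # partition optimizer states across ranks
}

model = torch.nn.Linear(1024, 1024)                          # placeholder model
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)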

Memory Optimizations

Training large models with Adam requires storing, for every parameter, the gradient plus first- and second-moment estimates, so optimizer state dominates memory usage. Mixed-precision training (FP16/BF16) reduces memory pressure and speeds up computation, with dynamic loss scaling preserving numerical stability, while activation checkpointing trades recomputation for further memory savings.

[Figure: Mixed-precision optimization diagram]
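Below is a minimal sketch of these two techniques using standard PyTorch APIs: torch.cuda.amp for mixed precision with dynamic loss scaling, and torch.utils.checkpoint for activation recomputation. The model, data, and learning rate are placeholders.

import torch
from torch.cuda.amp import autocast, GradScaler
from torch.utils.checkpoint import checkpoint

model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.ReLU(),
                            torch.nn.Linear(1024, 1024)).cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scaler = GradScaler()                                 # dynamic loss scaling for FP16 gradients

x = torch.randn(8, 1024, device="cuda")
target = torch.randn(8, 1024, device="cuda")

optimizer.zero_grad()
with autocast():                                      # run the forward pass in reduced precision
    # Recompute this block's activations during backward instead of storing them.
    out = checkpoint(model, x, use_reentrant=False)
    loss = torch.nn.functional.mse_loss(out, target)

scaler.scale(loss).backward()                         # scale the loss to avoid FP16 underflow
scaler.step(optimizer)                                # unscale gradients, then optimizer step
scaler.update()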

Additional Distributed Tensor APIs

import torch
from torch.distributed._tensor import DTensor, DeviceMesh, Shard, distribute_tensor

# Build a one-dimensional device mesh over four GPUs and shard the tensor along
# dimension 0, so each device holds one row-wise slice of the full tensor as a DTensor.
device_mesh = DeviceMesh("cuda", [0, 1, 2, 3])
rowwise_tensor = distribute_tensor(torch.randn(888, 12), device_mesh=device_mesh, placements=[Shard(0)])

Effective distributed training must overcome the compute, memory, and communication walls to fully utilize cluster resources.