Scaling Deep Learning Models: From Depth to Width and Parallelism Strategies
The article reviews how deep learning models have grown deeper and wider, discusses the memory and bandwidth limits of single GPUs, and explains pipeline and sharding techniques—including GPU clusters and TPU pods—to efficiently train large‑scale models in industrial settings.
Since the introduction of ResNet, deep learning models have become increasingly deep, built from long stacks of repeated residual blocks in computer vision and stacked attention layers in NLP. This trend strains single-GPU memory: a model like BERT-large can no longer be trained at practical batch sizes without memory optimizations.
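A rough budget makes the memory pressure concrete. The sketch below uses hypothetical but typical numbers: BERT-large has roughly 340M parameters, and fp32 training with Adam keeps about four values per parameter (weights, gradients, and two moment estimates) before counting any activations.

```python
def training_state_gb(num_params: float, bytes_per_value: int = 4,
                      copies: int = 4) -> float:
    """fp32 Adam keeps ~4 copies per parameter:
    weights, gradients, and the two moment estimates (m, v)."""
    return num_params * bytes_per_value * copies / 1024**3

# BERT-large: ~340M parameters -> ~5 GB of optimizer/weight state alone.
# Activations at long sequence lengths and large batches come on top
# of this and typically dominate, quickly exhausting a 16 GB GPU.
print(f"{training_state_gb(340e6):.1f} GB")
```

This is why techniques like mixed precision and gradient checkpointing are usually the first line of defense before resorting to model parallelism.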
To address depth growth, engineers adopt model parallelism by splitting a model into pipeline stages across multiple GPUs. Both TensorFlow and PyTorch provide device‑placement controls, enabling straightforward implementation of pipeline parallelism, though achieving high GPU utilization requires careful balancing of compute load and inter‑device communication.
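To see why utilization needs care, here is a minimal sketch (plain Python, hypothetical numbers) of a GPipe-style schedule: a model split into S stages processes M micro-batches, and the idle "bubble" fraction (S − 1)/(S + M − 1) shrinks as the micro-batch count grows.

```python
def pipeline_stats(num_stages: int, num_microbatches: int):
    """GPipe-style forward schedule: with S stages and M micro-batches,
    the pipeline completes in S + M - 1 ticks instead of S * M sequential
    ones, and the idle fraction ("bubble") is (S - 1) / (S + M - 1)."""
    ticks = num_stages + num_microbatches - 1
    bubble = (num_stages - 1) / ticks
    return ticks, bubble

# More micro-batches amortize the pipeline fill/drain bubble:
for m in (1, 4, 32):
    ticks, bubble = pipeline_stats(num_stages=4, num_microbatches=m)
    print(f"microbatches={m:2d} ticks={ticks:2d} bubble={bubble:.0%}")
```

With a single micro-batch, three of four GPUs sit idle at any moment (75% bubble); with 32 micro-batches the bubble drops below 10%, which is why pipeline frameworks split the batch before feeding the stages.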
Beyond depth, models are also expanding in width, especially in recommendation systems with large embedding tables and in NLP models such as T5 that increase attention heads and feed‑forward dimensions. Embedding layers can be sharded across host memory and trained with data parallelism, while dense layers require more complex sharding on GPUs, introducing additional communication overhead.
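A common pattern for wide embedding tables is row sharding: each shard owns the rows whose id maps to it, and lookups are routed to the owning shard. The sketch below is a single-process illustration with hypothetical shapes; in a real system each shard lives in a different host's memory and the routed lookups become batched cross-host exchanges.

```python
import numpy as np

class ShardedEmbedding:
    """Row-sharded embedding: shard s owns every id with id % num_shards == s.
    Single-process sketch; real systems place one shard per host/device."""

    def __init__(self, vocab_size: int, dim: int, num_shards: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.num_shards = num_shards
        # Shard s stores rows s, s + num_shards, s + 2*num_shards, ...
        self.shards = [
            rng.normal(size=(len(range(s, vocab_size, num_shards)), dim))
            for s in range(num_shards)
        ]

    def lookup(self, ids):
        # Route each id to its owning shard, then to its local row index.
        return np.stack([self.shards[i % self.num_shards][i // self.num_shards]
                         for i in ids])

emb = ShardedEmbedding(vocab_size=1000, dim=8, num_shards=4)
vecs = emb.lookup([3, 42, 999])
print(vecs.shape)  # (3, 8)
```

Because gradients for an embedding lookup touch only the rows that were read, each shard can apply its updates locally, which is what makes the data-parallel treatment of embeddings cheaper than sharding dense layers.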
Google’s TPU pods illustrate a hardware answer to the "wide" problem: thousands of TPU chips are organized into pods with uniform, high‑bandwidth chip‑to‑chip links, effectively creating a "big round bowl" that can absorb extremely wide models, including Mixture‑of‑Experts (MoE) layers that generate tens of gigabytes of all‑to‑all traffic per training step.
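The all-to-all figure can be sanity-checked with simple arithmetic. The numbers below are hypothetical but typical: a global batch of 1M tokens, model dimension 4096, bf16 activations, and each token dispatched to one expert, with one dispatch and one combine exchange per MoE layer.

```python
def moe_all_to_all_gb(tokens: int, d_model: int, bytes_per_elem: int = 2,
                      num_moe_layers: int = 1) -> float:
    """Each MoE layer performs two all-to-all exchanges (dispatch + combine),
    each moving every token's d_model-sized activation across the pod."""
    per_layer = 2 * tokens * d_model * bytes_per_elem
    return per_layer * num_moe_layers / 1024**3

# 1M tokens, d_model=4096, bf16 -> ~15 GB per MoE layer per step,
# so a model with several MoE layers moves tens of GB every step.
print(f"{moe_all_to_all_gb(1_000_000, 4096):.0f} GB")
```

Traffic at this scale is only sustainable on an interconnect with uniformly high bisection bandwidth, which is exactly what the pod topology provides.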
The author concludes that the driving forces behind complex model parallelism are limited GPU memory and constrained cross‑GPU bandwidth, and that only a tightly integrated hardware‑software stack—whether GPU clusters with sophisticated pipeline/sharding strategies or TPU pods—can fully exploit the potential of ever‑larger deep learning models.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.