Artificial Intelligence · 6 min read

Scaling Deep Learning Models: From Depth to Width and Parallelism Strategies

The article reviews how deep learning models have grown both deeper and wider, discusses the memory and bandwidth limits of a single GPU, and explains pipeline and sharding techniques, along with the hardware that supports them (GPU clusters and TPU pods), for training large-scale models efficiently in industrial settings.

DataFunSummit

Since the introduction of ResNet, deep learning models have grown steadily deeper, with long stacks of residual blocks in computer vision and stacked attention layers in NLP. This trend pushes against single-GPU memory limits: training a model the size of BERT-large, with its weights, gradients, optimizer state, and activations, can exceed one GPU's capacity without optimization.
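To make the memory pressure concrete, here is a back-of-envelope sketch of the persistent training state for a BERT-large-sized model. The figures are assumptions for illustration: roughly 340M parameters, fp32 storage, and an Adam-style optimizer that keeps a gradient plus two moment buffers per parameter. Activations come on top of this and usually dominate at large batch sizes.

```python
def training_state_gib(n_params, bytes_per_param=4, copies=4):
    """Persistent per-parameter training state in GiB.

    copies = 4 assumes fp32 weights + gradients + Adam's two moment
    buffers (m and v); activation memory is NOT included.
    """
    return n_params * bytes_per_param * copies / 2**30

# Illustrative: a ~340M-parameter model needs ~5.1 GiB of optimizer
# state alone, before a single activation is stored.
print(round(training_state_gib(340e6), 1))  # ≈ 5.1
```

With long sequences and large batches, the activations saved for backpropagation can add many more gigabytes, which is why such models strain a single GPU during training.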

To address depth growth, engineers adopt model parallelism by splitting a model into pipeline stages across multiple GPUs. Both TensorFlow and PyTorch provide device‑placement controls, enabling straightforward implementation of pipeline parallelism, though achieving high GPU utilization requires careful balancing of compute load and inter‑device communication.
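The utilization cost of pipelining can be captured with a simple formula. This is a minimal sketch under idealized assumptions: perfectly balanced stages and no communication cost, as in a GPipe-style schedule where a batch is split into micro-batches that flow through the stages. The function name is ours, not from any framework.

```python
def pipeline_utilization(num_stages, num_microbatches):
    """Ideal utilization of a GPipe-style pipeline.

    With perfectly balanced stages, each stage is busy for
    num_microbatches time slots out of a total of
    (num_microbatches + num_stages - 1) slots; the remainder is the
    "pipeline bubble" at fill and drain.
    """
    return num_microbatches / (num_microbatches + num_stages - 1)

# With 4 stages, feeding one batch as a single micro-batch leaves each
# GPU busy only 25% of the time; splitting it into 16 micro-batches
# raises ideal utilization to about 84%.
print(pipeline_utilization(4, 1))   # 0.25
print(round(pipeline_utilization(4, 16), 2))  # 0.84
```

This is why, in practice, pipeline parallelism is paired with micro-batching: more micro-batches shrink the bubble, but each boundary between stages still costs an activation transfer.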

Beyond depth, models are also expanding in width, especially in recommendation systems with large embedding tables and in NLP models such as T5 that increase attention heads and feed‑forward dimensions. Embedding layers can be sharded across host memory and trained with data parallelism, while dense layers require more complex sharding on GPUs, introducing additional communication overhead.
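The embedding-sharding idea above can be sketched in a few lines. This is a simplified illustration, not any framework's API: we assume a hypothetical modulo placement where row `i` of the embedding table lives on host `i % num_hosts`, and show the per-host routing of a lookup batch that precedes the actual exchange.

```python
def shard_for(embedding_id, num_hosts):
    # Hypothetical placement rule: row i lives on host i % num_hosts.
    return embedding_id % num_hosts

def route_lookups(batch_ids, num_hosts):
    """Group a batch of embedding lookups by owning host.

    This grouping is the gather step that precedes the cross-host
    exchange when a large embedding table is sharded across host
    memory under data-parallel training.
    """
    buckets = {h: [] for h in range(num_hosts)}
    for i in batch_ids:
        buckets[shard_for(i, num_hosts)].append(i)
    return buckets

# Six lookups split across two hosts:
print(route_lookups([0, 1, 2, 3, 4, 5], 2))  # {0: [0, 2, 4], 1: [1, 3, 5]}
```

Sharding dense layers is harder because, unlike embedding rows, their outputs must be recombined with collectives (all-gather or all-reduce) on every forward and backward pass, which is the extra communication overhead the article refers to.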

Google’s TPU pods illustrate a hardware solution to the "wide" problem: thousands of TPU chips are organized into pods with uniform, high‑bandwidth inter‑chip links within each pod, effectively creating a "big round bowl" that can handle extremely wide models, including Mixture‑of‑Experts (MoE) layers that generate tens of gigabytes of all‑to‑all traffic per training step.
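The "tens of gigabytes per step" claim is easy to sanity-check with a rough estimate. The figures below are illustrative assumptions, not measurements: one million tokens per step, a hidden size of 8192, bf16 activations (2 bytes), and two exchanges per MoE layer (tokens dispatched to their experts, results sent back).

```python
def moe_all_to_all_gib(tokens_per_step, hidden_dim,
                       bytes_per_elem=2, exchanges=2):
    """Rough per-step all-to-all volume for one MoE layer, in GiB.

    Each token's activation vector is sent to its assigned expert and
    the expert's output is returned (exchanges=2); capacity factors
    and multiple experts per token would only increase this.
    """
    return tokens_per_step * hidden_dim * bytes_per_elem * exchanges / 2**30

# Illustrative: 1M tokens/step, hidden size 8192, bf16 activations
# gives roughly 30 GiB of all-to-all traffic per MoE layer per step.
print(round(moe_all_to_all_gib(1_000_000, 8192), 1))  # ≈ 30.5
```

Sustaining that volume every step across many MoE layers is exactly the workload the uniform high-bandwidth interconnect of a TPU pod is built for.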

The author concludes that the driving forces behind complex model parallelism are limited GPU memory and constrained cross‑GPU bandwidth, and that only a tightly integrated hardware‑software stack—whether GPU clusters with sophisticated pipeline/sharding strategies or TPU pods—can fully exploit the potential of ever‑larger deep learning models.

Tags: deep learning, Mixture of Experts, GPU, pipeline, model parallelism, TPU
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
