Efficient Training of Large Models with the Open‑Source Distributed Framework Easy Parallel Library (EPL)
This article introduces the challenges of scaling deep‑learning model training, explains the design and components of the open‑source Easy Parallel Library (EPL) that unifies data, pipeline, and operator‑split parallelism, and demonstrates its best‑practice results on large‑scale classification, BERT‑large, and massive multimodal models.
The talk presents the Easy Parallel Library (EPL), an open‑source distributed deep‑learning framework that unifies multiple parallel strategies—data parallelism, pipeline parallelism, and operator‑split parallelism—allowing them to be combined and nested with minimal user code changes.
It first outlines the rapid growth of model parameters and the resulting training challenges, including the limits of single‑GPU training, the need for model parallelism, and the drawbacks of existing frameworks that support only a single parallel strategy or require extensive code modifications.
EPL’s architecture is described in four layers: an easy‑to‑use interface compatible with TensorFlow; an intermediate‑representation layer that converts the model and its parallel strategies into TaskGraph, ParallelStrategy, and VirtualDevice abstractions; a parallel‑engine layer that performs strategy exploration along with memory and communication optimizations; and a runtime layer that generates a distributed TensorFlow graph for execution.
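The three middle‑layer abstractions can be sketched as plain data structures. The names TaskGraph, ParallelStrategy, and VirtualDevice come from the talk; the fields and composition below are illustrative assumptions, not EPL's actual internals.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sketch of EPL's middle-layer abstractions.
# Field names and structure are assumptions for illustration only.

@dataclass
class VirtualDevice:
    """A group of physical GPUs that one TaskGraph is placed on."""
    gpu_ids: List[int]

@dataclass
class ParallelStrategy:
    """How a TaskGraph is parallelized: 'replicate' or 'split'."""
    kind: str          # "replicate" (data parallel) or "split" (operator split)
    device_count: int  # GPUs consumed by one instance of the strategy

@dataclass
class TaskGraph:
    """A model sub-graph paired with its strategy and device placement."""
    name: str
    strategy: ParallelStrategy
    device: VirtualDevice

# Example: a 2-stage pipeline, each stage replicated across 2 GPUs (4 GPUs total).
stages = [
    TaskGraph("stage_0", ParallelStrategy("replicate", 2), VirtualDevice([0, 1])),
    TaskGraph("stage_1", ParallelStrategy("replicate", 2), VirtualDevice([2, 3])),
]
total_gpus = sum(len(s.device.gpu_ids) for s in stages)
print(total_gpus)  # 4
```

Separating the strategy (ParallelStrategy) from the placement (VirtualDevice) is what lets strategies be nested: the same TaskGraph can be replicated across one virtual device while its operators are split inside another.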
The framework provides two primitive annotations, replicate for data parallelism and split for operator‑split parallelism, which users can combine to express any parallel configuration. Examples show simple data parallelism, nested pipeline‑plus‑data parallelism, and split‑plus‑data parallelism, each requiring only a few lines of annotation.
EPL also includes advanced optimizations such as automatic gradient checkpoint selection, ZeRO memory‑saving levels (V0‑V2), CPU offload, fine‑grained communication grouping, and topology‑aware All2All operators, all configurable without model‑side changes.
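The memory savings from ZeRO‑style partitioning are easy to estimate with the standard mixed‑precision Adam breakdown: 2 bytes of fp16 parameters, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state per parameter. How EPL's V0–V2 levels map onto these partitioning stages is an assumption here, not stated in the talk.

```python
# Back-of-envelope ZeRO memory model (mixed-precision Adam):
# fp16 params (2 B) + fp16 grads (2 B) + fp32 optimizer states (12 B)
# per parameter. Successive stages partition more of this state across
# the data-parallel group. Mapping EPL's V0-V2 onto these stages is an
# assumption for illustration.

def bytes_per_param(stage: int, n_gpus: int) -> float:
    """Per-GPU bytes per parameter under ZeRO-style partitioning."""
    params, grads, opt = 2.0, 2.0, 12.0
    if stage >= 1:           # partition optimizer states
        opt /= n_gpus
    if stage >= 2:           # also partition gradients
        grads /= n_gpus
    if stage >= 3:           # also partition parameters
        params /= n_gpus
    return params + grads + opt

# A 10B-parameter model on 64 GPUs:
n_params, n_gpus = 10e9, 64
for stage in range(4):
    gb = bytes_per_param(stage, n_gpus) * n_params / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

Partitioning only the optimizer states already removes the largest term (12 of the 16 bytes per parameter), which is why even the mildest memory‑saving level pays off without adding much communication.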
Performance case studies demonstrate that on a 64‑GPU cluster, EPL achieves up to 14.8× speedup for a large‑scale image classification model, 2.32× improvement for BERT‑large using mixed pipeline‑data parallelism, and successful training of trillion‑parameter multimodal models (M6) with only a handful of code modifications, leveraging MoE, checkpointing, offload, and communication optimizations.
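One reason mixed pipeline‑data parallelism helps cases like BERT‑large is micro‑batching: with p pipeline stages and m micro‑batches, a GPipe‑style schedule leaves a fraction (p − 1) / (m + p − 1) of each step idle. The formula is standard pipeline‑parallelism context, not a number reported in the talk.

```python
# Idle "bubble" fraction of a GPipe-style pipeline schedule with
# `stages` pipeline stages and `micro_batches` micro-batches per step.
# Standard formula, shown here as context for the BERT-large case;
# the specific stage/micro-batch counts below are illustrative.

def bubble_fraction(stages: int, micro_batches: int) -> float:
    return (stages - 1) / (micro_batches + stages - 1)

for m in (1, 4, 16):
    print(f"4 stages, {m:>2} micro-batches: "
          f"{bubble_fraction(4, m):.0%} of step time idle")
```

With a single micro‑batch, a 4‑stage pipeline idles 75% of the time; raising the micro‑batch count drives the bubble toward zero, which is what makes combining pipeline stages with data‑parallel replicas profitable in practice.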
The framework is open‑source, with example code and a paper published at USENIX ATC '22, and the presenters invite the community to try EPL and join their discussion groups.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.