How Alibaba’s TePDist Automates Distributed Deep Learning for Large Models
Alibaba Cloud’s PAI platform has released TePDist, an HLO‑based automatic distributed deep‑learning system. It decouples strategy search from model code, uses a client/server architecture, supports SPMD and pipeline parallelism, performs well on GPT, MoE and other models, and is now open source.
Alibaba Cloud Machine Learning Platform PAI has officially released TePDist, a self‑developed, HLO‑based fully automatic distributed deep‑learning system.
TePDist decouples distributed strategy search from the language the user builds the model in. This keeps it general while letting it automatically explore high‑performance strategies within an acceptable search time, with no changes to the model code.
TePDist is more than a distributed compiler: it includes its own runtime to execute the parallel strategies it finds. It follows a client/server architecture. The client converts the user’s model into HLO IR; the server receives the HLO IR, searches for a distributed parallel strategy, and applies it. Any client that can emit HLO IR can therefore connect to the server.
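The client/server split can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the JSON "IR" merely stands in for HLO IR, and frontend_a_to_ir, frontend_b_to_ir and ToyServer are hypothetical names, not TePDist APIs. It shows only that two different front ends can feed the same server as long as they emit the same IR.

```python
import json
from typing import Dict, List


def frontend_a_to_ir(layers: List[str]) -> str:
    """Hypothetical front end A: lowers a list of layer names to a JSON 'IR'."""
    ops = [{"id": i, "op": name} for i, name in enumerate(layers)]
    return json.dumps({"ops": ops})


def frontend_b_to_ir(graph: Dict[int, str]) -> str:
    """Hypothetical front end B: a different model description, same 'IR'."""
    ops = [{"id": i, "op": graph[i]} for i in sorted(graph)]
    return json.dumps({"ops": ops})


class ToyServer:
    """Stand-in for the server side: it only ever sees the IR text."""

    def plan(self, ir_text: str, num_devices: int) -> Dict[int, int]:
        ops = json.loads(ir_text)["ops"]
        # Placeholder 'strategy' (an assumption, not TePDist's search):
        # assign ops to devices in contiguous, equally sized blocks.
        per_device = -(-len(ops) // num_devices)  # ceiling division
        return {op["id"]: op["id"] // per_device for op in ops}


if __name__ == "__main__":
    ir_a = frontend_a_to_ir(["matmul", "relu", "matmul", "softmax"])
    ir_b = frontend_b_to_ir({0: "matmul", 1: "relu", 2: "matmul", 3: "softmax"})
    server = ToyServer()
    # Both clients produce identical IR, so the server plans them identically.
    print(ir_a == ir_b, server.plan(ir_a, num_devices=2))
```

Because the server depends only on the IR text, adding a third front end in this sketch would require no change on the server side, which is the property the architecture is described as providing.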
Functionally, TePDist consists of two parts: (1) strategy search on HLO IR, covering SPMD (data parallelism and sharding) and pipeline parallelism, which produces an execution plan in the form of a task graph; (2) an efficient distributed execution engine that runs that plan. It offers multiple optimization levels: higher levels prioritise strategy quality, while lower levels rely on heuristics for a faster search.
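The trade‑off between optimization levels can be illustrated with a toy pipeline‑partitioning problem: split a chain of layers into contiguous stages so that the most expensive stage is as cheap as possible. This is a sketch under stated assumptions, not TePDist's algorithm. The article does not describe the actual search methods, and the function names, the opt_level threshold and the layer costs below are all hypothetical.

```python
from functools import lru_cache
from typing import List, Tuple

Plan = Tuple[int, Tuple[int, ...]]  # (bottleneck stage cost, cut positions)


def search_exact(costs: List[int], stages: int) -> Plan:
    """Quality-first search: dynamic programming over all contiguous splits."""
    prefix = [0]
    for c in costs:
        prefix.append(prefix[-1] + c)

    @lru_cache(maxsize=None)
    def best(i: int, k: int) -> Plan:
        # Best plan for the first i layers split into k non-empty stages.
        if k == 1:
            return (prefix[i], ())
        options = []
        for j in range(k - 1, i):
            prev_cost, prev_cuts = best(j, k - 1)
            last_stage = prefix[i] - prefix[j]
            options.append((max(prev_cost, last_stage), prev_cuts + (j,)))
        return min(options)

    return best(len(costs), stages)


def search_heuristic(costs: List[int], stages: int) -> Plan:
    """Speed-first search: one greedy pass that closes a stage at the average."""
    target = sum(costs) / stages
    cuts: List[int] = []
    acc = 0
    for i, c in enumerate(costs):
        acc += c
        layers_left = len(costs) - (i + 1)
        cuts_left = stages - 1 - len(cuts)
        # Cut when the stage is full, or when every remaining layer
        # must become its own stage to reach the requested stage count.
        if cuts_left and (acc >= target or layers_left == cuts_left):
            cuts.append(i + 1)
            acc = 0
    bounds = [0, *cuts, len(costs)]
    bottleneck = max(sum(costs[a:b]) for a, b in zip(bounds, bounds[1:]))
    return (bottleneck, tuple(cuts))


def plan(costs: List[int], stages: int, opt_level: int) -> Plan:
    """Hypothetical dispatcher: levels >= 2 pay for the exhaustive search."""
    if opt_level >= 2:
        return search_exact(costs, stages)
    return search_heuristic(costs, stages)


if __name__ == "__main__":
    layer_costs = [1, 1, 1, 1, 6, 2]  # made-up per-layer costs
    print(plan(layer_costs, stages=2, opt_level=2))
    print(plan(layer_costs, stages=2, opt_level=0))
```

On these made‑up costs the exact search cuts after the fourth layer for a bottleneck of 8, while the greedy pass cuts after the fifth layer and ends up with a bottleneck of 10. The greedy pass touches each layer once, whereas the exact search considers every split point for every stage count, which is the kind of quality‑versus‑search‑time trade the optimization levels expose.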
In performance tests with mixed SPMD + pipeline strategies, TePDist reached 62% of peak capability on GPT and 58% on MoE models. Its generality was also validated on VGG‑19, DNABert and UNet.
Recognising the industrial potential of large models such as ChatGPT, Alibaba Cloud PAI has also open‑sourced TePDist to help AI developers build faster, better automatic distributed systems.
Open‑source repository: https://github.com/alibaba/TePDist
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
