How Whale Enables Efficient Giant Model Training on Heterogeneous GPUs

The article introduces Whale, an open‑source distributed training framework that unifies multiple parallelism strategies, uses hardware‑aware load balancing to accelerate giant models like BERT‑Large and the trillion‑parameter M6 on heterogeneous GPU clusters, and details its architecture, planning, and real‑world performance gains.

Alibaba Cloud Big Data AI Platform

Jul 12, 2022

The paper "Whale: Efficient Giant Model Training over Heterogeneous GPUs", from the Alibaba Cloud Machine Learning PAI team, was recently accepted at USENIX ATC'22.

Whale (open-sourced as EPL) is a distributed training framework developed in-house by the PAI team. It unifies multiple parallelism strategies, optimizes memory, compute and communication, and offers a simple annotation-based interface for combining strategies. Its hardware-aware load-balancing algorithm speeds up training of BERT-Large, ResNet-50 and GNMT by 1.2-1.4× on heterogeneous GPUs. Whale also made it possible to train the trillion-parameter M6 model on 480 V100 GPUs in three days, a reported saving of more than 80% in compute resources and a roughly 11× improvement in training efficiency.

Background and Challenges

Model parameter counts have exploded. Before 2012, the compute used to train leading models doubled roughly every two years, in line with Moore's law; since 2012 it has doubled about every 3.4 months, far outpacing hardware advances. Existing frameworks (Horovod, TensorFlow Estimator, PyTorch DDP, GPipe, PipeDream, Mesh TensorFlow, FlexFlow, OneFlow, MindSpore) each support only a limited set of parallel strategies, and switching or combining strategies requires extensive code changes. The problem is harder still on heterogeneous GPU clusters.

Solution

Whale introduces two distributed primitives (replicate and split) and a strategy‑annotation mechanism that lets users express and combine parallelism with a few lines of code. The runtime automatically integrates annotations into the computation graph, performs hardware‑aware automatic parallel strategy selection, and balances load across heterogeneous GPUs.

Technical Architecture

Interface Layer: TensorFlow‑based programming interface with easy parallelism annotations.

Intermediate Representation Layer: Converts models and strategies into TaskGraph, VirtualDevices, and abstract policies.

Parallel Engine Layer: Explores strategies, optimizes memory, compute, and communication, and generates distributed graphs.

Runtime Execution Engine: Transforms the distributed graph into a TFGraph and executes it with the TensorFlow runtime.

Parallel Planner

The planner takes the model, its annotations, and the available hardware resources as input. It maps physical devices to VirtualDevices, partitions the model into TaskGraphs, inserts bridge layers between adjacent TaskGraphs when their tensor shapes do not match, and produces an efficient execution plan.
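The device-mapping step can be pictured with a small sketch. The `VirtualDevice` dataclass and `plan_virtual_devices` function below are hypothetical names invented for this example, and the rule that surplus GPUs become whole-model replicas is an assumption of the sketch, not a description of Whale's actual planner.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class VirtualDevice:
    task_graph: str   # name of the TaskGraph this group serves
    replica: int      # index of the whole-model replica
    gpus: List[str]   # physical GPUs backing this virtual device


def plan_virtual_devices(gpus: List[str],
                         requests: List[Tuple[str, int]]) -> List[VirtualDevice]:
    """Assign physical GPUs to one VirtualDevice per TaskGraph.

    `requests` lists (task_graph_name, device_count) pairs. Assumption of
    this sketch: if the cluster holds k times the requested GPUs, the
    whole layout is repeated k times as model replicas.
    """
    per_replica = sum(count for _, count in requests)
    if per_replica == 0 or len(gpus) % per_replica != 0:
        raise ValueError("GPU count must be a positive multiple of the request")
    plan, cursor = [], 0
    for replica in range(len(gpus) // per_replica):
        for name, count in requests:
            plan.append(VirtualDevice(name, replica, gpus[cursor:cursor + count]))
            cursor += count
    return plan


# Eight GPUs, two pipeline stages of two GPUs each: two model replicas result.
gpus = [f"gpu:{i}" for i in range(8)]
plan = plan_virtual_devices(gpus, [("stage0", 2), ("stage1", 2)])
for vd in plan:
    print(vd)
```

Keeping the TaskGraph-to-VirtualDevice mapping separate from the physical GPU list is what lets the same annotated model run on a different cluster without code changes.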

Hardware‑Aware Load Balancing

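Whale balances load at two levels. Intra-TaskGraph balancing evens out the FLOP-based compute load across the GPUs that serve one TaskGraph, either by adjusting per-GPU batch sizes for replicated TaskGraphs or by splitting a dimension unevenly for sharded ones. Inter-TaskGraph balancing evens out pipeline stages: earlier stages, which typically have to hold activations for more in-flight micro-batches, are placed on GPUs with larger memory, and work is divided in proportion to each GPU's compute capability.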
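The intra-TaskGraph case can be sketched as a proportional split of the global batch. The function name `balance_batch` and the relative throughput figures are illustrative assumptions, and the sketch ignores memory limits, which a real balancer would also have to respect.

```python
def balance_batch(global_batch, gpu_flops):
    """Split a global batch across GPUs in proportion to compute capability.

    `gpu_flops` maps a GPU name to a relative throughput figure (any unit).
    Largest-remainder rounding keeps the shares summing to `global_batch`.
    """
    total = sum(gpu_flops.values())
    if total <= 0:
        raise ValueError("total compute capability must be positive")
    exact = {g: global_batch * f / total for g, f in gpu_flops.items()}
    shares = {g: int(x) for g, x in exact.items()}
    leftover = global_batch - sum(shares.values())
    # Hand the leftover samples to the GPUs with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda g: exact[g] - shares[g], reverse=True)
    for g in by_fraction[:leftover]:
        shares[g] += 1
    return shares


# Illustrative numbers only: one GPU three times as fast as the other.
print(balance_batch(64, {"fast_gpu": 3.0, "slow_gpu": 1.0}))
# {'fast_gpu': 48, 'slow_gpu': 16}
```

With equal batch sizes the faster GPU would sit idle while the slower one finishes each step; giving it three quarters of the batch lets both finish at about the same time.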

Application Example

With four lines of code, Whale enables mixed data-parallel and expert-parallel training of the M6 model. The trillion-parameter model was pre-trained on 480 V100 GPUs in three days, saving more than 80% of compute resources and improving training efficiency by about 11×. Scaling to 512 GPUs produced a usable 10-trillion-parameter model in ten days.

Whale is open‑source (EPL) and aims to become a cornerstone for large‑scale deep‑learning training.
