How Alibaba’s Whale Framework Cuts Large‑Model Training Costs by 80%

Alibaba Cloud’s PAI team and the DAMO Academy introduced the low‑carbon M6 trillion‑parameter multimodal model, demonstrating that their self‑developed Whale framework can train such massive models on just 480 V100 GPUs, reducing energy consumption by over 80% and boosting training efficiency nearly eleven‑fold.

Alibaba Cloud Developer

Aug 17, 2021

Alibaba Cloud’s PAI team and the DAMO Academy recently released a low‑carbon version of the giant M6 model, sharply reducing the energy needed to train trillion‑parameter models. Using the self‑developed Whale framework on only 480 GPUs, they trained a multimodal trillion‑parameter model, with roughly ten times as many parameters as the human brain has neurons, cutting energy use by more than 80% and improving efficiency nearly 11× compared with existing overseas solutions.

M6 is the first multimodal large model to be commercialized in China. It has cognitive and creative abilities beyond those of traditional AI, excels at painting, writing, and Q&A, and has broad application prospects in e‑commerce, manufacturing, literature, and the arts.

1. Model Development Trends and Challenges

1. Model Development Trends

Before 2012, the compute used to train leading models doubled roughly every two years, in line with Moore’s law.

After 2012, it doubled about every 3.4 months, far outpacing hardware advances.
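
To make the gap concrete, the two doubling periods above can be converted into yearly growth factors. The short calculation below is only an illustration; the function name is ours and not from the article.

```python
def annual_growth(doubling_months: float) -> float:
    """Growth factor over 12 months for a quantity that doubles every `doubling_months`."""
    return 2 ** (12 / doubling_months)

moore = annual_growth(24)        # doubling every two years
post_2012 = annual_growth(3.4)   # doubling every 3.4 months

print(f"Doubling every 24 months:  {moore:.2f}x per year")
print(f"Doubling every 3.4 months: {post_2012:.2f}x per year")
```

A two-year doubling period means growth of about 1.4× per year, while a 3.4-month doubling period compounds to more than 11× per year.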

In the past year, parameter counts have surged: Google, Nvidia, Alibaba, and the Zhiyuan Institute (BAAI) have released trillion‑parameter models, and other organizations have released models with hundreds of billions of parameters. Larger models yield lower perplexity and higher translation quality, as Nvidia’s BERT tests and Google’s GShard MoE‑Transformer results show.

2. Challenges of Training Large Models

Training difficulty: a single GPU’s memory cannot hold a full model replica, so data parallelism alone is insufficient.

Need for new parallel strategies to store and train models across multiple GPUs.

Providing simple, user‑friendly interfaces for distributed model deployment.

Improving computational and communication efficiency for ultra‑large models.

High cost: a trillion‑parameter model requires ~4 TB of parameters and gradients, demanding massive GPU memory and expensive hardware (e.g., 3072 Nvidia A100 or 2048 Google TPU v3).

Reducing cost and increasing training speed with fewer resources.
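
A back-of-the-envelope estimate shows where a figure of about 4 TB can come from. The article does not say which numeric precision it assumes, so the sketch below shows two possible readings. The 32 GB per-GPU capacity is our assumption, not a number from the article, and optimizer state and activations would add to these totals.

```python
import math

PARAMS = 10**12   # one trillion parameters

def terabytes(num_values: int, bytes_per_value: int) -> float:
    """Size in decimal terabytes (1 TB = 10**12 bytes)."""
    return num_values * bytes_per_value / 10**12

# Two readings that both give 4 TB (the article does not say which it means):
fp32_weights = terabytes(PARAMS, 4)                                   # FP32 weights only
fp16_weights_and_grads = terabytes(PARAMS, 2) + terabytes(PARAMS, 2)  # FP16 weights + FP16 gradients

GPU_MEMORY_GB = 32   # assumption: 32 GB per GPU
min_gpus = math.ceil(fp32_weights * 1000 / GPU_MEMORY_GB)

print(f"FP32 weights alone:            {fp32_weights:.0f} TB")
print(f"FP16 weights + FP16 gradients: {fp16_weights_and_grads:.0f} TB")
print(f"GPUs needed just to hold 4 TB: {min_gpus}")
```

Under these assumptions, more than a hundred GPUs are needed simply to store the tensors, before any memory is spent on activations or optimizer state.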

Existing distributed training frameworks (Horovod, TensorFlow Estimator, PyTorch DDP, GPipe, PipeDream, Mesh TensorFlow, FlexFlow, OneFlow, MindSpore) have limitations such as support for only a single parallelism mode, high entry barriers, large migration costs, and suboptimal performance.

2. PAI’s Self‑Developed Whale Framework

1. Whale Architecture

Whale unifies multiple parallel strategies in a high‑performance distributed training framework, addressing the challenges above.

Unified abstraction and encapsulation of various parallel strategies.

TensorFlow‑compatible distributed parallel interface; users add a few annotations to enable rich parallelism.

Scheduling and communication optimization based on model structure and network topology.

Whale consists of four modules:

API: provides simple interfaces for combining mixed parallel strategies.

Whale IR: transforms parallel strategies into internal representations using TaskGraph, Multi‑Dimension, and VirtualDevices.

Whale Engine: builds distributed execution graphs from Whale IR.

Runtime: converts execution graphs to TensorFlow graphs and invokes TensorFlow’s runtime.

2. Whale Easy‑to‑Use Interfaces

Key primitives:

cluster: configure virtual device partitioning.

replica: data parallelism.

stage: divide TaskGraph.

pipeline: pipeline parallelism.

split: operator splitting.

These primitives can be combined to implement various parallel strategies, such as pure data parallelism, pipeline parallelism, and hybrid pipeline + data parallelism.

3. Whale Training Workflow

Parallel strategy configuration via Whale API with minimal annotations.

Model partitioned into multiple TaskGraphs, each supporting distinct parallel strategies.

Virtual resource division: each TaskGraph maps to a Virtual Device.

Physical device assignment based on GPU resources and network topology.

Distributed execution graph construction using graph editing tools (copy, split, insert communication nodes).

Execution of the final distributed graph via TensorFlow runtime.
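
The middle steps of this workflow, mapping each TaskGraph to a virtual device and then to physical GPUs, can be illustrated with a toy placement routine. This is a simplified sketch under our own assumptions, not Whale's scheduler: the `TaskGraph` dataclass, `assign_devices`, and the device naming are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class TaskGraph:
    name: str
    gpus_needed: int   # size of the virtual device this TaskGraph maps to

def assign_devices(task_graphs, nodes, gpus_per_node):
    """Map each TaskGraph's virtual device to consecutive physical GPUs.

    GPUs are enumerated node by node, so a virtual device whose slice
    aligns with a node boundary stays on a single machine.
    """
    physical = [f"node{n}:gpu{g}" for n in range(nodes) for g in range(gpus_per_node)]
    needed = sum(tg.gpus_needed for tg in task_graphs)
    if needed > len(physical):
        raise ValueError(f"need {needed} GPUs, only {len(physical)} available")
    mapping, cursor = {}, 0
    for tg in task_graphs:
        mapping[tg.name] = physical[cursor:cursor + tg.gpus_needed]
        cursor += tg.gpus_needed
    return mapping

plan = assign_devices(
    [TaskGraph("stage0", 4), TaskGraph("stage1", 4)], nodes=2, gpus_per_node=4
)
for name, gpus in plan.items():
    print(name, gpus)
```

Here each pipeline stage lands on its own node, so traffic inside a stage stays on one machine and only activations passed between stages cross the network. A real scheduler would weigh the actual network topology rather than rely on enumeration order.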

3. Pre‑training the Trillion‑Parameter M6 Model

To reduce compute demand, Whale implements a sparsely activated Mixture‑of‑Experts (MoE) layer: a gating router sends each token to only its top‑k experts (k = 1 or 2), so only a small fraction of the layer’s expert parameters is used for any given token.
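
Top-k gating itself is simple to sketch. The function below is a minimal illustration of one common variant, in which the gate weights are a softmax over the selected experts only. The article does not describe M6's exact gating function, so treat the details and the example logits as assumptions.

```python
import math

def top_k_gating(logits, k=2):
    """Pick the k highest-scoring experts for one token.

    `logits` holds one router score per expert. Returns a list of
    (expert_index, weight) pairs whose weights sum to 1.
    """
    top = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:k]
    peak = logits[top[0]]                       # subtract the max for numerical stability
    exps = [math.exp(logits[e] - peak) for e in top]
    total = sum(exps)
    return [(e, x / total) for e, x in zip(top, exps)]

router_scores = [0.1, 2.0, -1.0, 1.0]           # made-up scores for 4 experts
routes = top_k_gating(router_scores, k=2)
for expert, weight in routes:
    print(f"expert {expert}: weight {weight:.3f}")
```

With four experts and k = 2, only half of the experts run for this token. With many more experts and the same k, the share of expert compute per token shrinks accordingly, which is where the savings come from.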

Whale’s MoE layer supports expert parallelism, distributing experts across multiple devices, while data parallelism boosts concurrency. The M6 model uses a hybrid DP + EP strategy: MoE layers employ expert parallelism, other layers use data parallelism.
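
How data parallelism and expert parallelism interact in an MoE layer can be simulated in a single process. In the toy below, every worker holds its own shard of tokens (DP) and owns one expert (EP), and tokens travel to their expert and back through two all-to-all exchanges. This is an illustrative sketch only: real systems use collective communication libraries, and the token names, expert assignments, and `all_to_all` helper are invented for the example.

```python
def all_to_all(send):
    """send[src][dst] -> recv[dst][src], like a collective all-to-all exchange."""
    n = len(send)
    return [[send[src][dst] for src in range(n)] for dst in range(n)]

NUM_WORKERS = 4   # one expert per worker (expert parallelism)

# Data parallelism: every worker holds its own shard of (token, expert_id) pairs.
# Expert ids would come from the gating router; here they are hard-coded.
shards = [
    [("a0", 1), ("a1", 3)],
    [("b0", 0), ("b1", 1)],
    [("c0", 2), ("c1", 2)],
    [("d0", 0), ("d1", 3)],
]

# Step 1: each worker buckets its tokens by destination expert.
send = [[[tok for tok, e in shard if e == dst] for dst in range(NUM_WORKERS)]
        for shard in shards]

# Step 2: all-to-all delivers every token to the worker that owns its expert.
recv = all_to_all(send)

# Step 3: each expert processes what it received (a stand-in transformation).
processed = [[[f"E{w}({tok})" for tok in bucket] for bucket in recv[w]]
             for w in range(NUM_WORKERS)]

# Step 4: a second all-to-all returns results to the workers that sent the tokens.
returned = all_to_all(processed)

for w, buckets in enumerate(returned):
    print(f"worker {w}:", [out for bucket in buckets for out in bucket])
```

Because the exchange involves every pair of workers, its cost depends heavily on the network layout, which is why the efficiency of the All2All operator matters for MoE training.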

Training optimizations include:

Auto Gradient Checkpoint for activation memory savings.

Group‑wise Apply to reduce optimizer memory usage.

CPU Offload for optimizer state and weight memory.

Communication pooling to control data block size and concurrency.

DP + EP hybrid parallelism to lower compute demand.

Grouped fusion communication, half‑precision communication, topology‑aware All2All operators for communication efficiency.

Mixed‑precision and compiler optimizations for faster training.
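
As one concrete example from this list, half-precision communication halves the bytes sent per value. The sketch below uses Python's standard `struct` module to compare FP32 and FP16 payloads for a few made-up gradient values. It illustrates the general idea only and says nothing about how Whale implements it.

```python
import struct

def encode(values, fmt):
    """Serialize floats with struct format 'f' (FP32) or 'e' (FP16)."""
    return struct.pack(f"<{len(values)}{fmt}", *values)

def decode(payload, fmt):
    """Inverse of encode: recover the list of floats from the byte payload."""
    size = struct.calcsize(f"<{fmt}")
    return list(struct.unpack(f"<{len(payload) // size}{fmt}", payload))

grads = [0.125, -0.5, 3.0e-3, 1.75]   # made-up gradient values

fp32_payload = encode(grads, "f")
fp16_payload = encode(grads, "e")
restored = decode(fp16_payload, "e")
max_error = max(abs(a - b) for a, b in zip(grads, restored))

print(len(fp32_payload), "bytes in FP32 vs", len(fp16_payload), "bytes in FP16")
print("max rounding error:", max_error)
```

The trade-off is precision and range: FP16 rounds values to about three significant decimal digits and cannot represent magnitudes above roughly 65504, so real systems have to guard the cast.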

Using Whale, the M6 model was pretrained on 480 V100 GPUs in three days, achieving over 80% resource savings and nearly an 11× speedup compared with previous approaches using thousands of A100 GPUs or TPU v3 pods.

4. Conclusion

Model parameter scales continue to grow, making large‑model training a key trend. Whale unifies diverse parallel strategies, offers simple annotation‑based interfaces, and performs hardware‑aware optimizations, enabling efficient training of trillion‑parameter models with modest resources. Future work will expand Whale’s capabilities in scale, speed, and cost‑effectiveness, and apply it to more business scenarios.
