How Alibaba’s Low‑Carbon M6 Model Trains a Trillion‑Parameter AI with 80% Less Energy

Alibaba’s DAMO Academy unveiled the low‑carbon M6 multimodal model, a trillion‑parameter AI trained on just 480 V100 GPUs, achieving over 80% energy reduction and 11‑fold speedup compared to prior trillion‑parameter efforts, and already powering e‑commerce and manufacturing design tools.

Low‑Carbon Training of the M6 Trillion‑Parameter Multimodal Model

On 25 June 2021, Alibaba DAMO Academy released a “low‑carbon” version of its M6 giant model. The model has roughly one trillion parameters (about ten times the number of neurons in the human brain) and supports multimodal tasks such as image generation, text generation, and visual‑language understanding.

Hardware and Compute Efficiency

Training used 480 NVIDIA V100 32 GB GPUs in Alibaba's EFLOPS cluster.

Energy consumption was reduced by more than 80 % compared with prior trillion‑parameter training runs, which required 3072 NVIDIA A100 GPUs or 2048 Google TPU v3 cores.

Effective training speedup ≈11× relative to those baselines.
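
To make the headline numbers concrete, here is a back‑of‑envelope calculation showing how an 80 %+ reduction can fall out of the card counts alone. The TDP values are the published per‑card specifications, but the run length is a purely hypothetical assumption; none of these numbers come from the article.

V100_TDP_KW, A100_TDP_KW = 0.300, 0.400    # published per-card TDP, in kilowatts
v100_cards, a100_cards = 480, 3072         # card counts from the article

def energy_mwh(cards, tdp_kw, days):
    # cards * kW * hours -> kWh, then / 1000 -> MWh
    return cards * tdp_kw * days * 24 / 1000

baseline = energy_mwh(a100_cards, A100_TDP_KW, days=30)  # hypothetical run length
m6_run = energy_mwh(v100_cards, V100_TDP_KW, days=30)
print(f"reduction: {1 - m6_run / baseline:.0%}")         # ~88% under these assumptions

With equal run lengths the duration cancels and the ratio reduces to cards × TDP; the real figure also depends on utilization, training time, and cooling overhead, all of which this sketch ignores.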

Algorithmic Optimizations

The efficiency gains stem from three main improvements to the Mixture‑of‑Experts (MoE) framework:

Expert‑parallel strategy: parallelizes the routing of tokens to multiple expert sub‑networks, increasing model capacity without a proportional increase in compute (see the sketch after this list).

Accelerated linear‑algebra kernels: custom kernels for dense and sparse matrix multiplication that exploit GPU Tensor Cores.

Mixed‑precision training and half‑precision communication: uses FP16/BF16 for the forward and backward passes and halves the volume of data exchanged between GPUs relative to FP32, while preserving model quality (quality degradation below 0.1 % on standard benchmarks).
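
The following minimal sketch illustrates the two ingredients easiest to show in code: top‑1 token routing across experts, and half‑precision execution via autocast. It is an illustrative PyTorch toy, not M6's implementation; the class and parameter names (Top1MoE, d_model, num_experts, and so on) are invented for this example, and a real expert‑parallel system would shard the experts across devices and exchange tokens with collective communication.

import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    # Toy mixture-of-experts layer: each token is routed to exactly one expert.
    def __init__(self, d_model, d_ff, num_experts):
        super().__init__()
        self.gate = nn.Linear(d_model, num_experts)   # learned router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                             # x: (tokens, d_model)
        probs = F.softmax(self.gate(x), dim=-1)       # routing probabilities
        gate_vals, expert_idx = probs.max(dim=-1)     # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e                    # tokens assigned to expert e
            if mask.any():                            # only routed experts compute
                out[mask] = (gate_vals[mask, None] * expert(x[mask])).to(out.dtype)
        return out

moe = Top1MoE(d_model=512, d_ff=2048, num_experts=8)
tokens = torch.randn(64, 512)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):  # mixed precision
    y = moe(tokens)

Because each token activates only one expert, total FLOPs stay close to those of a dense layer of the same width while the parameter count scales with the number of experts; that is the capacity‑without‑compute trade the article describes.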

Training Procedure

Key hyper‑parameters (as reported):

model_size    = 1e12     # parameters (~1 trillion)
batch_size    = 2048
learning_rate = 1e-4
precision     = "fp16"   # half-precision training, as described above
optimizer     = "AdamW"
num_epochs    = 30
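
As a hedged illustration of how these settings typically combine in PyTorch, the snippet below wires the reported optimizer and precision choices together. The model here is a placeholder, and the GradScaler pattern is standard PyTorch FP16 practice rather than something the article specifies.

import torch

model = torch.nn.Linear(512, 512)            # placeholder for the real network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # reported settings
scaler = torch.cuda.amp.GradScaler()         # loss scaling for fp16 stability

def train_step(batch, target):
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = torch.nn.functional.mse_loss(model(batch), target)
    scaler.scale(loss).backward()            # scale loss to avoid fp16 underflow
    scaler.step(optimizer)                   # unscale, then apply AdamW update
    scaler.update()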

Training was performed on the Alibaba Cloud PAI platform with the EFLOPS cluster, combining distributed data parallelism with the expert‑parallel MoE routing described above.
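
One common way to combine the two forms of parallelism is to arrange the ranks in a 2‑D grid, with one process group per row for expert parallelism and one per column for data parallelism. The sketch below shows that bookkeeping with torch.distributed; the grid layout and function name are assumptions for illustration, not a description of PAI or EFLOPS internals.

import torch.distributed as dist

def build_groups(world_size, expert_parallel_size):
    # Returns (data_parallel_group, expert_parallel_group) for this rank.
    # Ranks in the same row hold different experts; ranks in the same
    # column hold replicas of the same experts.
    rank = dist.get_rank()
    dp_group = ep_group = None
    for start in range(0, world_size, expert_parallel_size):    # rows
        ranks = list(range(start, start + expert_parallel_size))
        group = dist.new_group(ranks)       # every rank must call new_group
        if rank in ranks:
            ep_group = group
    for offset in range(expert_parallel_size):                  # columns
        ranks = list(range(offset, world_size, expert_parallel_size))
        group = dist.new_group(ranks)
        if rank in ranks:
            dp_group = group
    return dp_group, ep_group

Gradients for the shared (non‑expert) weights are then all‑reduced over the data‑parallel group, while routed tokens travel between experts over the expert‑parallel group (typically with dist.all_to_all).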

Performance and Applications

Benchmarks show comparable or slightly better accuracy on multimodal tasks (e.g., VQAv2, COCO captioning) relative to larger‑scale models.

Deployed as an AI design assistant on the “Rhino Manufacturing” platform for rapid fashion design and virtual try‑on, shortening design cycles.

Integrated into Alipay and Taobao for cross‑modal search, copywriting, and image generation.

Future Directions

DAMO Academy plans to further lower carbon footprints, expand real‑world deployments, and investigate theoretical aspects of general‑purpose large models.

Tags: multimodal AI, Mixture of Experts, large model, GPU efficiency, low‑carbon AI, M6
Written by ITPUB