MoE LLMs: How Alibaba Cloud & NVIDIA Megatron-Core Accelerate Training

This article reviews the evolution of Mixture-of-Experts (MoE) models, details Alibaba Cloud’s collaboration with NVIDIA’s Megatron-Core to build a high-performance MoE framework, and presents extensive training optimizations, benchmark results, conversion tools, and best-practice guidelines for large-scale LLM development and deployment.

Alibaba Cloud Big Data AI Platform

Introduction

The Alibaba Cloud AI platform (PAI) team and the NVIDIA Megatron-Core team present their joint work on implementing and optimizing Mixture-of-Experts (MoE) large language models (LLMs).

MoE Overview

MoE splits a dense model into multiple expert sub-networks. A routing mechanism activates only a subset of experts for each token, so only those experts' parameters are computed and updated. This lowers the computational cost of training and inference relative to a dense model with the same total parameter count.
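The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the Megatron-Core implementation: the function names (`top_k_route`, `moe_forward`), the toy experts, and the choice to renormalize the selected probabilities are all assumptions made for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k):
    """Pick the k most probable experts for one token.

    Returns [(expert_index, weight), ...] with weights renormalized to sum to 1
    (one common convention; other routers keep the raw probabilities).
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]

def moe_forward(token, router_logits, experts, k=2):
    """Weighted sum of the outputs of the k selected experts only."""
    out = [0.0] * len(token)
    for idx, weight in top_k_route(router_logits, k):
        expert_out = experts[idx](token)  # the other experts are never evaluated
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out

# Toy experts (hypothetical): each one just scales its input by a constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 1.0]
routes = top_k_route(logits, k=2)
output = moe_forward([1.0], logits, experts, k=2)
```

With four experts and k=2, each token touches only half of the expert parameters, which is where the compute saving comes from.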

MoE diagram

Historical Development

Concepts of combining multiple experts, proposed in the early 1990s, laid the theoretical foundation. By 2017, the growth of deep learning had exposed the capacity limits of dense models. Google first combined MoE with RNNs, then brought MoE into Transformers with GShard and Switch Transformer, scaling parameter counts to the trillion level.

MoE history

Key Research Works

Sparsely‑Gated MoE introduced softmax gating, later improved with noise injection and auxiliary loss for balanced expert load.
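As a rough sketch of the two ideas in this paragraph, the snippet below adds input-dependent Gaussian noise to the router logits before the top-k selection, and computes a balance penalty as the squared coefficient of variation of per-expert importance. The function names and the use of population variance are assumptions for illustration; real implementations differ in detail.

```python
import math
import random

def softplus(x):
    """Smooth, always-positive transform used to scale the noise."""
    return math.log1p(math.exp(x))

def noisy_top_k_gates(clean_logits, noise_logits, k, rng):
    """Perturb logits with scaled Gaussian noise, keep the top k, softmax over them."""
    noisy = [c + rng.gauss(0.0, 1.0) * softplus(n)
             for c, n in zip(clean_logits, noise_logits)]
    top = sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)[:k]
    m = max(noisy[i] for i in top)
    exps = {i: math.exp(noisy[i] - m) for i in top}
    z = sum(exps.values())
    # Experts outside the top k get a gate of exactly zero.
    return [exps[i] / z if i in exps else 0.0 for i in range(len(noisy))]

def importance_cv_squared(batch_gates):
    """Balance penalty: (variance / mean^2) of the summed gate value per expert."""
    n_experts = len(batch_gates[0])
    importance = [sum(g[e] for g in batch_gates) for e in range(n_experts)]
    mean = sum(importance) / n_experts
    var = sum((v - mean) ** 2 for v in importance) / n_experts
    return var / (mean ** 2)

rng = random.Random(0)
gates = noisy_top_k_gates([1.0, 0.5, -0.5, 0.0], [0.0] * 4, k=2, rng=rng)
balanced = importance_cv_squared([[1.0, 0.0], [0.0, 1.0]])
skewed = importance_cv_squared([[1.0, 0.0], [1.0, 0.0]])
```

The penalty is zero when every expert receives the same total gate weight and grows as traffic concentrates on a few experts, so adding it to the training loss pushes the router toward a balanced load.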

Google's GShard added expert capacity limits, residual bypass, top-k routing, Sinkhorn load-balancing, and local-group dispatching to reduce communication.
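To make the capacity limit concrete, here is a small sketch under assumed conventions: capacity is the average tokens per expert multiplied by a capacity factor, tokens are assigned first come, first served, and tokens that overflow are returned separately so they can skip the expert and continue through the residual path. The function names and the exact rounding are assumptions for the example.

```python
import math

def expert_capacity(num_tokens, num_experts, capacity_factor=1.0, top_k=1):
    """Maximum number of tokens a single expert may accept in one batch."""
    return math.ceil(capacity_factor * num_tokens * top_k / num_experts)

def dispatch_with_capacity(assignments, num_experts, capacity):
    """Assign tokens in order; overflow tokens are dropped from the expert path.

    assignments: expert index chosen for each token (top-1 routing for simplicity).
    Returns (kept, dropped) where kept is a list of (token_index, expert_index).
    """
    load = [0] * num_experts
    kept, dropped = [], []
    for token_index, expert in enumerate(assignments):
        if load[expert] < capacity:
            load[expert] += 1
            kept.append((token_index, expert))
        else:
            dropped.append(token_index)  # would bypass the expert via the residual
    return kept, dropped

capacity = expert_capacity(num_tokens=8, num_experts=4, capacity_factor=1.25)
kept, dropped = dispatch_with_capacity([0, 0, 0, 0, 1, 2, 3, 1], 4, capacity)
```

Here the capacity is 3, so the fourth token routed to expert 0 overflows. A fixed capacity keeps tensor shapes static, at the cost of dropping tokens when routing is skewed.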

Switch Transformer adopted aggressive top‑1 routing and expert dropout to cut computation.

MegaBlocks expressed MoE layers as block-sparse matrix operations, using the blocked compressed sparse row (BCSR) format for efficient GEMM.

Research diagram

Megatron‑Core MoE Features

Megatron-Core provides a lightweight, modular framework that supports expert parallelism alongside data, tensor, pipeline, and sequence parallelism; dropless token routing; several router options (top-k and Sinkhorn routing, with auxiliary losses such as z-loss); GroupedGEMM for experts that receive variable numbers of tokens; and optimized CUDA kernels.
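The idea behind dropless routing with a grouped GEMM can be illustrated in plain Python: tokens are permuted so that each expert's tokens are contiguous, every expert processes its own variable-length group with no capacity limit, and the results are scattered back to the original order. This is only a conceptual sketch with hypothetical names; the real kernels run all groups in a single fused GPU launch rather than a Python loop.

```python
def permute_by_expert(tokens, expert_ids, num_experts):
    """Group tokens by expert; returns permuted tokens, the permutation, and group sizes."""
    order = sorted(range(len(tokens)), key=lambda i: expert_ids[i])  # stable sort
    counts = [0] * num_experts
    for expert in expert_ids:
        counts[expert] += 1
    return [tokens[i] for i in order], order, counts

def matvec(weight, x):
    """Dense matrix-vector product on nested lists."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def grouped_expert_forward(tokens, expert_ids, expert_weights):
    """Apply each expert to its contiguous token group, then restore token order."""
    permuted, order, counts = permute_by_expert(tokens, expert_ids, len(expert_weights))
    outputs = [None] * len(tokens)
    start = 0
    for expert, group_size in enumerate(counts):
        # One "group" of the grouped GEMM: group_size can differ per expert.
        for pos in range(start, start + group_size):
            outputs[order[pos]] = matvec(expert_weights[expert], permuted[pos])
        start += group_size
    return outputs, counts

tokens = [[1, 0], [0, 1], [1, 1]]
expert_ids = [1, 0, 1]
expert_weights = [
    [[2, 0], [0, 2]],  # expert 0: doubles its input
    [[0, 1], [1, 0]],  # expert 1: swaps the two components
]
outputs, counts = grouped_expert_forward(tokens, expert_ids, expert_weights)
```

Because group sizes follow the actual routing decisions, no token is dropped and no padding is needed, which is the property the GroupedGEMM support is meant to exploit.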

Megatron-Core architecture

PAI‑Megatron‑Patch Toolchain

The PAI‑Megatron‑Patch library converts HuggingFace model checkpoints to Megatron‑Core format, handling layernorm, attention, and MLP weight mapping, enabling seamless migration of open‑source models such as Mixtral 8×7B.
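The core of such a conversion is a renaming pass over the checkpoint's parameter names. The sketch below shows the pattern with regular-expression rules. Every key name on both sides is an assumption chosen for illustration and is not taken from PAI-Megatron-Patch; a real converter typically also has to reshape, concatenate, or shard tensors, which this sketch omits.

```python
import re

# Illustrative (pattern, replacement) rules. All key names are assumptions.
RENAME_RULES = [
    (r"^model\.layers\.(\d+)\.input_layernorm\.weight$",
     r"decoder.layers.\1.input_layernorm.weight"),
    (r"^model\.layers\.(\d+)\.block_sparse_moe\.gate\.weight$",
     r"decoder.layers.\1.mlp.router.weight"),
    (r"^model\.layers\.(\d+)\.block_sparse_moe\.experts\.(\d+)\.w2\.weight$",
     r"decoder.layers.\1.mlp.experts.\2.linear_fc2.weight"),
]

def convert_state_dict(source_state):
    """Rename parameters by the first matching rule; report names with no rule."""
    converted, unmapped = {}, []
    for name, tensor in source_state.items():
        for pattern, replacement in RENAME_RULES:
            if re.match(pattern, name):
                converted[re.sub(pattern, replacement, name)] = tensor
                break
        else:
            unmapped.append(name)  # surface gaps instead of silently dropping weights
    return converted, unmapped

# Strings stand in for tensors in this toy checkpoint.
toy_checkpoint = {
    "model.layers.0.input_layernorm.weight": "ln0",
    "model.layers.0.block_sparse_moe.gate.weight": "router0",
    "model.layers.0.block_sparse_moe.experts.3.w2.weight": "expert3_fc2",
    "lm_head.weight": "head",
}
converted, unmapped = convert_state_dict(toy_checkpoint)
```

Returning the unmapped names makes it easy to check that every weight in the source checkpoint has been accounted for before training starts.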

Patch conversion

Experimental Results

Training Mixtral 8×7B (8 experts) on 16 GPUs with tensor parallelism TP=4 and expert parallelism EP=4 converged to a loss of about 1.9 after 2.4K steps. Subsequent pre-training and fine-tuning stages showed stable loss reduction. On the HumanEval code-generation benchmark, fine-tuning raised accuracy from 45.73% to 53.05%, a gain of 7.32 percentage points, outperforming MegaBlocks under comparable resources.
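The parallel layout implied by these numbers can be worked out with simple arithmetic. The helper below is a hypothetical sketch: it assumes a pipeline-parallel size of 1 (the source does not state it) and that experts are split evenly across expert-parallel ranks.

```python
def parallel_layout(world_size, tp, ep, num_experts, pp=1):
    """Derive the data-parallel size and experts per expert-parallel rank.

    pp defaults to 1, which is an assumption not stated in the source.
    """
    if world_size % (tp * pp) != 0:
        raise ValueError("world_size must be divisible by tp * pp")
    if num_experts % ep != 0:
        raise ValueError("num_experts must be divisible by ep")
    return {
        "data_parallel": world_size // (tp * pp),
        "experts_per_ep_rank": num_experts // ep,
    }

layout = parallel_layout(world_size=16, tp=4, ep=4, num_experts=8)
```

Under these assumptions, 16 GPUs with TP=4 leave a data-parallel size of 4, and with EP=4 each expert-parallel rank holds 2 of the 8 experts.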

Performance chart

Best‑Practice Guide

Alibaba Cloud PAI offers a complete AI development workflow: data ingestion from OSS/NAS, distributed training on PAI‑DLC or DSW, checkpoint export, offline inference, evaluation, and one‑click deployment to EAS for online serving.

Workflow diagram

Future Directions

The teams will continue to deepen their collaboration, aiming to further improve training efficiency for both dense and sparse models and to contribute to AGI research. Developers are invited to join the open-source community.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
