MoE LLMs: How Alibaba Cloud & NVIDIA Megatron-Core Accelerate Training

This article reviews the evolution of Mixture-of-Experts (MoE) models, details Alibaba Cloud’s collaboration with NVIDIA’s Megatron-Core to build a high-performance MoE framework, and presents extensive training optimizations, benchmark results, conversion tools, and best-practice guidelines for large-scale LLM development and deployment.

Alibaba Cloud Big Data AI Platform

Mar 26, 2024

Introduction

Alibaba Cloud AI Platform PAI and NVIDIA Megatron-Core teams present their joint work on implementing and optimizing Mixture-of-Experts (MoE) large language models (LLMs).

MoE Overview

MoE splits a dense model into multiple expert sub‑networks and activates only a subset of experts per token through a routing mechanism. Because each token touches only its selected experts, fewer parameters are computed and updated per token, which lowers the computational cost of training and inference.
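The routing step can be illustrated with a minimal sketch. It assumes a softmax router with top‑k selection and renormalized weights, which is one common variant rather than the specific router used by any framework named in this article. The toy experts and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return [(expert_index, weight), ...] for one token.

    The k highest-probability experts are kept and their
    probabilities are renormalized to sum to 1.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_forward(x, router_logits, experts, k=2):
    """Weighted sum of the outputs of the k selected experts."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))


# Toy example: four "experts" that simply scale a scalar input.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]
selected = route_top_k(logits, k=2)
output = moe_forward(1.0, logits, experts, k=2)
```

Only the two selected experts are evaluated for this token; the other two contribute no computation, which is where the savings over a dense layer come from.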

Historical Development

The idea of combining multiple experts dates to the early 1990s and laid the theoretical foundation. By 2017, the growth of deep learning had exposed the capacity limits of dense models. Google first combined MoE with RNNs, then brought MoE into Transformers with GShard and Switch Transformer, scaling parameter counts to the trillion level.

Key Research Works

Sparsely‑Gated MoE introduced softmax gating, later improved with noise injection and auxiliary loss for balanced expert load.
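A rough sketch of the two ideas follows. It makes two simplifying assumptions: the noise uses a fixed standard deviation rather than a learned one, and the auxiliary loss uses the form popularized by Switch Transformer (number of experts times the sum of f_i * P_i), swapped in here as a compact stand‑in for a balancing loss. Function names are illustrative.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def add_gate_noise(logits, rng, std=1.0):
    """Perturb router logits with Gaussian noise (fixed std is a simplification)."""
    return [x + rng.gauss(0.0, std) for x in logits]


def load_balance_loss(batch_logits):
    """num_experts * sum_i f_i * P_i over a batch of per-token router logits.

    f_i: fraction of tokens whose top-1 choice is expert i.
    P_i: mean router probability assigned to expert i.
    """
    num_tokens = len(batch_logits)
    num_experts = len(batch_logits[0])
    counts = [0] * num_experts
    prob_sums = [0.0] * num_experts
    for logits in batch_logits:
        probs = softmax(logits)
        counts[probs.index(max(probs))] += 1
        for i, p in enumerate(probs):
            prob_sums[i] += p
    f = [c / num_tokens for c in counts]
    p_mean = [s / num_tokens for s in prob_sums]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p_mean))


balanced = [[5.0, 0.0], [0.0, 5.0]]  # tokens split evenly across 2 experts
skewed = [[5.0, 0.0], [5.0, 0.0]]    # every token prefers expert 0
loss_balanced = load_balance_loss(balanced)
loss_skewed = load_balance_loss(skewed)
noisy = add_gate_noise([0.0, 0.0], random.Random(0), std=0.1)
```

The loss is smallest when tokens and router probability mass are spread evenly, so adding it to the training objective nudges the router away from collapsing onto a few experts.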

Google's GShard added expert capacity limits, residual bypass, top‑k routing, Sinkhorn load‑balancing, and local‑group dispatching to reduce communication.
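To make the capacity limit concrete, here is a small sketch assuming top‑1 assignments and a capacity of ceil(capacity_factor * tokens / experts). The capacity factor value and the function names are illustrative assumptions, not values taken from the source.

```python
import math


def expert_capacity(num_tokens, num_experts, capacity_factor=1.25):
    """Maximum number of tokens a single expert may accept."""
    return math.ceil(capacity_factor * num_tokens / num_experts)


def dispatch_with_capacity(assignments, num_experts, capacity):
    """Split token indices into per-expert buckets and an overflow list.

    assignments[t] is the expert chosen for token t (top-1 for brevity).
    Tokens that arrive after an expert is full are not processed by any
    expert; they would pass through the residual connection only.
    """
    buckets = [[] for _ in range(num_experts)]
    overflow = []
    for token, expert in enumerate(assignments):
        if len(buckets[expert]) < capacity:
            buckets[expert].append(token)
        else:
            overflow.append(token)
    return buckets, overflow


assignments = [0, 0, 0, 0, 1, 2, 0, 3]
capacity = expert_capacity(len(assignments), num_experts=4, capacity_factor=1.25)
buckets, overflow = dispatch_with_capacity(assignments, 4, capacity)
```

With eight tokens and four experts the capacity is 3, so two of the five tokens routed to expert 0 overflow. This is the trade‑off that dropless routing schemes are designed to avoid.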

Switch Transformer adopted aggressive top‑1 routing and expert dropout to cut computation.

MegaBlocks expressed MoE layers as block‑sparse matrices using BCSR format for efficient GEMM.

Megatron‑Core MoE Features

Megatron‑Core provides a lightweight, modular framework that supports expert parallelism alongside data, tensor, pipeline, and sequence parallelism; dropless token routing; multiple routing and load‑balancing options (top‑k, Sinkhorn, z‑loss); GroupedGEMM for variable‑length per‑expert inputs; and optimized CUDA kernels.
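The combination of dropless routing and grouped GEMM can be pictured as permute, compute per group, unpermute. The sketch below gives only the reference semantics in plain Python, assuming top‑1 routing. It is not Megatron‑Core's implementation, and a real grouped GEMM kernel would process all groups in one launch instead of a loop.

```python
def matvec(weight, vec):
    """Multiply a rows x cols nested-list matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def grouped_expert_forward(tokens, assignments, expert_weights):
    """Dropless top-1 MoE forward pass.

    1. Permute: sort token indices by assigned expert so each expert's
       tokens form one contiguous, variable-length group.
    2. Compute: apply each expert's weight to its own group.
    3. Unpermute: write results back in the original token order.
    """
    order = sorted(range(len(tokens)), key=lambda t: assignments[t])
    tokens_per_expert = [assignments.count(e) for e in range(len(expert_weights))]
    outputs = [None] * len(tokens)
    start = 0
    for expert, count in enumerate(tokens_per_expert):
        for t in order[start:start + count]:
            outputs[t] = matvec(expert_weights[expert], tokens[t])
        start += count
    return outputs, tokens_per_expert


# Two experts over 2-dimensional tokens: expert 0 is identity, expert 1 doubles.
expert_weights = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0]],
]
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
assignments = [1, 0, 1]
outputs, tokens_per_expert = grouped_expert_forward(tokens, assignments, expert_weights)
```

No token is dropped and no group is padded to a fixed capacity; the group sizes simply follow the router's decisions, which is why a kernel that handles variable‑length groups is needed.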

PAI‑Megatron‑Patch Toolchain

The PAI‑Megatron‑Patch library converts HuggingFace model checkpoints to Megatron‑Core format, handling layernorm, attention, and MLP weight mapping, enabling seamless migration of open‑source models such as Mixtral 8×7B.
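The kind of weight mapping involved can be sketched as a rename‑and‑fuse pass over a state dict. Everything below is illustrative: the key names on both sides are hypothetical placeholders, weights are plain nested lists, and Q/K/V are fused by simple row concatenation, whereas a real converter must follow the target framework's attention‑head layout. This is not the PAI‑Megatron‑Patch code.

```python
import re

# Hypothetical renaming rules (source pattern -> target template).
# Real key names depend on the model and the library versions in use.
RENAME_RULES = [
    (r"^model\.layers\.(\d+)\.input_layernorm\.weight$",
     r"decoder.layers.\1.input_layernorm.weight"),
    (r"^model\.layers\.(\d+)\.mlp\.experts\.(\d+)\.up_proj\.weight$",
     r"decoder.layers.\1.mlp.experts.\2.fc1.weight"),
]

QKV_PATTERN = re.compile(r"^model\.layers\.(\d+)\.self_attn\.([qkv])_proj\.weight$")


def convert_state_dict(source):
    """Map a {name: rows} dict to target names, fusing Q/K/V row-wise."""
    target = {}
    qkv_parts = {}  # layer index -> {"q": rows, "k": rows, "v": rows}
    for name, rows in source.items():
        match = QKV_PATTERN.match(name)
        if match:
            qkv_parts.setdefault(match.group(1), {})[match.group(2)] = rows
            continue
        for pattern, template in RENAME_RULES:
            new_name, hits = re.subn(pattern, template, name)
            if hits:
                target[new_name] = rows
                break
        else:
            raise KeyError(f"no mapping rule for {name}")
    for layer, parts in qkv_parts.items():
        # Plain concatenation; a real converter must match the target head layout.
        target[f"decoder.layers.{layer}.self_attention.qkv.weight"] = (
            parts["q"] + parts["k"] + parts["v"]
        )
    return target


source = {
    "model.layers.0.input_layernorm.weight": [[1.0]],
    "model.layers.0.self_attn.q_proj.weight": [[1.0, 0.0]],
    "model.layers.0.self_attn.k_proj.weight": [[0.0, 1.0]],
    "model.layers.0.self_attn.v_proj.weight": [[2.0, 2.0]],
}
converted = convert_state_dict(source)
```

Raising on an unmapped key is a deliberate choice in this sketch: silently skipping a tensor during conversion tends to surface much later as a hard‑to‑diagnose quality regression.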

Experimental Results

When training Mixtral 8×7B with 8 experts on 16 GPUs (TP=4, EP=4), the loss converged to around 1.9 after 2.4K steps. Subsequent pre‑training and fine‑tuning stages showed stable loss reduction. On the HumanEval code‑generation benchmark, fine‑tuned models improved from 45.73 % to 53.05 % accuracy, outperforming MegaBlocks under comparable resources.
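The arithmetic behind such a configuration can be checked with a few lines. This sketch rests on two assumptions the source does not state: the pipeline‑parallel size is 1, and expert‑parallel groups are carved out of the data‑parallel ranks. The helper name is hypothetical.

```python
def moe_layout(world_size, tensor_parallel, pipeline_parallel,
               expert_parallel, num_experts):
    """Derive data-parallel size and experts per rank, validating divisibility.

    Assumes expert-parallel groups are formed inside the data-parallel
    dimension (an assumption of this sketch, not a statement about any library).
    """
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel:
        raise ValueError("world_size must be divisible by TP * PP")
    data_parallel = world_size // model_parallel
    if data_parallel % expert_parallel:
        raise ValueError("data-parallel size must be divisible by EP")
    if num_experts % expert_parallel:
        raise ValueError("num_experts must be divisible by EP")
    return {
        "data_parallel": data_parallel,
        "experts_per_rank": num_experts // expert_parallel,
    }


# 16 GPUs, TP=4, EP=4, 8 experts; PP=1 is assumed here.
layout = moe_layout(world_size=16, tensor_parallel=4, pipeline_parallel=1,
                    expert_parallel=4, num_experts=8)
```

Under these assumptions each expert‑parallel rank hosts two of the eight experts, and the four tensor‑parallel groups double as the expert‑parallel group.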

Best‑Practice Guide

Alibaba Cloud PAI offers a complete AI development workflow: data ingestion from OSS/NAS, distributed training on PAI‑DLC or DSW, checkpoint export, offline inference, evaluation, and one‑click deployment to EAS for online serving.

Future Directions

The teams will continue to deepen collaboration, aiming to further improve dense and sparse model training efficiency and contribute to AGI research, inviting developers to join the open‑source community.


