How veGiantModel Boosts Large Language Model Training Up to 6.9× Faster

The article introduces Volcano Engine's veGiantModel, a high‑performance large‑model training framework built on PyTorch, Megatron and DeepSpeed, details its distributed parallel strategies, hardware setups, benchmark results showing up to 6.9× speedup over Megatron and DeepSpeed, and provides open‑source links for further use.

Volcano Engine Developer Services

Mar 16, 2022

Background

In recent years, breakthroughs in NLP such as BERT, GPT, and GPT‑3 have shown that larger models tend to perform better, leading to rapid growth in model size and associated challenges in memory, compute, and communication.

veGiantModel: Volcano Engine's Large-Model Training Framework

To address these challenges, ByteDance's AML team developed veGiantModel, a high‑performance training framework based on PyTorch, Megatron, and DeepSpeed. Its key features include:

Support for three distributed parallel strategies—data parallelism, operator splitting, and pipeline parallelism—along with automated and customizable parallel policies.

Integration of ByteCCL, a high‑performance asynchronous communication library, delivering 1.2×‑3.5× higher throughput compared to other open‑source frameworks.

More flexible pipeline support that reduces development effort.

Efficient scaling on GPUs for models ranging from billions to hundreds of billions of parameters.

Low bandwidth requirements without strong dependence on RDMA for private deployments.

ByteCCL, an upgraded version of BytePS, optimizes communication primitives such as allgather and alltoall for various GPU topologies.

veGiantModel Performance

Hardware Configuration

Benchmarks were conducted on in‑house servers using A100 and V100 GPUs:

V100 test: 8× Tesla V100 32 GB per machine, 100 Gb/s network.

A100 test: 8× Ampere A100 40 GB per machine, 800 Gb/s network.

Model and Baseline Selection

The test model was GPT-13B, trained with a sequence length of 256 and a global batch size of 1536. The baselines were two widely used open-source frameworks, Megatron and DeepSpeed.
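To see why a 13B-parameter model cannot simply be replicated on every GPU, a back-of-the-envelope memory estimate helps. The sketch below assumes mixed-precision training with Adam, for which a commonly cited figure is 16 bytes of model state per parameter (fp16 weights and gradients plus fp32 master weights, momentum, and variance). The source does not state the optimizer or precision used, so treat these numbers as an illustration only; activations and buffers are ignored.

```python
import math

# Assumption: mixed-precision Adam, not confirmed by the article.
# 2 B fp16 weights + 2 B fp16 gradients + 12 B fp32 optimizer state.
BYTES_PER_PARAM = 2 + 2 + 12

params = 13_000_000_000                        # GPT-13B
model_state_gb = params * BYTES_PER_PARAM / 1e9  # decimal gigabytes

# Minimum number of GPUs needed just to hold model state (no activations).
min_v100 = math.ceil(model_state_gb / 32)      # 32 GB V100
min_a100 = math.ceil(model_state_gb / 40)      # 40 GB A100

print(f"model state: {model_state_gb:.0f} GB")
print(f"at least {min_v100} V100s or {min_a100} A100s to hold it")
```

Under these assumptions the model state alone is several times larger than a single GPU's memory, which is why the benchmark configurations split the model across devices.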

Test Results

Model: GPT‑13B

Megatron: v2.4, tensor‑model‑parallel‑size = 4, pipeline‑model‑parallel‑size = 4

DeepSpeed: v0.4.2, using the default ZeRO‑3 configuration.

Running environments included V100/TCP, V100/RDMA, A100/TCP, and A100/RDMA with 4‑node clusters.
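As a rough check on the Megatron baseline above, the data-parallel degree follows from the GPU count and the two model-parallel sizes. The sketch below assumes that all 32 GPUs (4 nodes with 8 GPUs each) form a single job and that the global batch is split evenly across data-parallel replicas. The function name is illustrative and is not part of Megatron.

```python
def data_parallel_size(world_size: int, tensor_parallel: int, pipeline_parallel: int) -> int:
    """Return the number of model replicas given the total GPU count."""
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world size must be divisible by tensor * pipeline size")
    return world_size // model_parallel


nodes, gpus_per_node = 4, 8
world_size = nodes * gpus_per_node                 # 32 GPUs in total

dp = data_parallel_size(world_size, tensor_parallel=4, pipeline_parallel=4)
global_batch = 1536
per_replica_batch = global_batch // dp             # samples per replica per step

print(f"data-parallel size: {dp}, batch per replica: {per_replica_batch}")
```

With these settings each model replica spans 16 GPUs, leaving two replicas that each process half of the global batch.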

Measured in samples per second, veGiantModel outperforms both Megatron and DeepSpeed on V100 and A100 alike, reaching up to 6.9× higher throughput. Its performance is also far less sensitive to network bandwidth: veGiantModel's throughput varies by less than 10%, whereas DeepSpeed's throughput can drop by as much as 5× at lower bandwidth.

Why veGiantModel Is Faster

ByteCCL provides high‑performance asynchronous communication.

Customizable parallel strategies allow aggressive, workload-specific performance tuning.

When data, operator-splitting, and pipeline parallelism are combined, veGiantModel automatically adjusts how the parallel groups are placed on the cluster topology according to the available cross-node bandwidth.
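The effect of placement can be illustrated with a small sketch. It is not veGiantModel's algorithm, which the article does not describe; it only shows, for a hypothetical rank layout, which parallel groups end up spanning more than one machine. It assumes that operator splitting corresponds to what Megatron calls tensor parallelism and that it is the most communication-intensive of the three strategies, so it benefits most from staying inside a node. All names here are illustrative.

```python
def groups_crossing_nodes(sizes: dict, order: list, gpus_per_node: int) -> dict:
    """For each parallel dimension, report whether any of its groups spans nodes.

    `order` lists dimensions from innermost (adjacent ranks) to outermost.
    Ranks are assumed to be assigned to nodes in contiguous blocks.
    """
    # Stride of each dimension in the flattened rank index.
    strides, stride = {}, 1
    for name in order:
        strides[name] = stride
        stride *= sizes[name]
    world_size = stride

    crossing = {}
    for name in order:
        crossing[name] = False
        for rank in range(world_size):
            coord = (rank // strides[name]) % sizes[name]
            base = rank - coord * strides[name]
            # Members of this rank's group differ only along `name`.
            members = [base + i * strides[name] for i in range(sizes[name])]
            if len({m // gpus_per_node for m in members}) > 1:
                crossing[name] = True
                break
    return crossing


sizes = {"tensor": 4, "pipeline": 4, "data": 2}   # 32 GPUs, 8 per node

# Tensor parallelism innermost: its groups stay inside one machine.
tensor_inside = groups_crossing_nodes(sizes, ["tensor", "data", "pipeline"], 8)
# Tensor parallelism outermost: its groups are forced across the network.
tensor_outside = groups_crossing_nodes(sizes, ["pipeline", "data", "tensor"], 8)

print(tensor_inside)
print(tensor_outside)
```

In the first layout only pipeline groups communicate across machines, while in the second the tensor-parallel groups do. A placement policy that accounts for cross-node bandwidth would be expected to prefer the first layout when the network is slow.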

Resources

veGiantModel is open‑source on GitHub:

Detailed usage instructions and a quick start for GPT pre‑training are available on Volcano Engine’s Machine Learning Platform (public beta):
