How veGiantModel Speeds Up Large Language Model Training by Up to 6.9×
This article introduces Volcano Engine's veGiantModel, a high-performance large-model training framework built on PyTorch, Megatron, and DeepSpeed. It covers the framework's distributed parallel strategies, the benchmark hardware, and results showing up to 6.9× the throughput of Megatron and DeepSpeed, and it points to the open-source release.
Background
In recent years, breakthroughs in NLP such as BERT, GPT, and GPT‑3 have shown that larger models tend to perform better, leading to rapid growth in model size and associated challenges in memory, compute, and communication.
Volcano Engine Large Model Training Framework veGiantModel
To address these challenges, ByteDance's AML team developed veGiantModel, a high‑performance training framework based on PyTorch, Megatron, and DeepSpeed. Its key features include:
Support for three distributed parallel strategies (data parallelism, operator splitting, which is also known as tensor model parallelism, and pipeline parallelism), along with automated and customizable parallel policies.
Integration of ByteCCL, a high‑performance asynchronous communication library, delivering 1.2×‑3.5× higher throughput compared to other open‑source frameworks.
More flexible pipeline support that reduces development effort.
Efficient scaling on GPUs for models ranging from billions to hundreds of billions of parameters.
Low bandwidth requirements without strong dependence on RDMA for private deployments.
ByteCCL, an upgraded version of BytePS, optimizes communication primitives such as allgather and alltoall for various GPU topologies.
veGiantModel Performance
Hardware Configuration
Benchmarks were conducted on in‑house servers using A100 and V100 GPUs:
V100 test: 8× Tesla V100 32 GB per machine, 100 Gb/s network.
A100 test: 8× Ampere A100 40 GB per machine, 800 Gb/s network.
Model and Baseline Selection
The GPT‑13B model (seq length 256, global batch size 1536) served as the test model. Baselines were the popular open‑source Megatron and DeepSpeed frameworks.
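A rough calculation helps show why a 13B-parameter model cannot be trained on a single one of these GPUs. The sketch below assumes mixed-precision training with Adam, at about 16 bytes of model state per parameter (fp16 weights and gradients plus fp32 master weights, momentum, and variance). The article does not state the optimizer or precision used, so treat the constant as an assumption; activation memory is ignored.

```python
import math

# Assumption (not stated in the article): mixed-precision Adam.
# 2 bytes fp16 weights + 2 bytes fp16 gradients
# + 12 bytes fp32 master weights, momentum and variance.
BYTES_PER_PARAM = 2 + 2 + 12


def model_state_gb(num_params: float) -> float:
    """Memory for weights, gradients and optimizer state in GB (10**9 bytes).

    Activations and framework overhead are excluded.
    """
    return num_params * BYTES_PER_PARAM / 1e9


def min_gpus_for_states(num_params: float, gpu_mem_gb: float) -> int:
    """Lower bound on the number of GPUs needed just to hold the model states."""
    return math.ceil(model_state_gb(num_params) / gpu_mem_gb)


params = 13e9  # GPT-13B
print(model_state_gb(params))            # 208.0
print(min_gpus_for_states(params, 32))   # 7  (V100 32 GB)
print(min_gpus_for_states(params, 40))   # 6  (A100 40 GB)
```

Under these assumptions the model states alone exceed one machine's single-GPU memory several times over, which is why the states must be split or sharded across GPUs.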
Test Results
Model: GPT‑13B
Megatron: v2.4, tensor‑model‑parallel‑size = 4, pipeline‑model‑parallel‑size = 4
DeepSpeed: v0.4.2, using the default ZeRO‑3 configuration.
The test environments were V100/TCP, V100/RDMA, A100/TCP, and A100/RDMA, each on a 4-node cluster.
Throughput (samples/s) results show that veGiantModel consistently outperforms Megatron and DeepSpeed on both V100 and A100, reaching up to 6.9× their throughput. veGiantModel is also less sensitive to network bandwidth: bandwidth changes its throughput by less than 10%, whereas DeepSpeed's throughput can drop by up to 5× at lower bandwidth.
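The Megatron baseline settings imply a data-parallel degree, because Megatron derives it as the world size divided by the product of the tensor and pipeline parallel sizes. The sketch below works through that arithmetic, assuming all 32 GPUs of the 4-node cluster are used; the function name is hypothetical and only for illustration.

```python
def data_parallel_size(nodes: int, gpus_per_node: int,
                       tensor_parallel: int, pipeline_parallel: int) -> int:
    """Data-parallel degree left over after tensor and pipeline splitting."""
    world_size = nodes * gpus_per_node
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world size must be divisible by tensor * pipeline size")
    return world_size // model_parallel


# 4 nodes x 8 GPUs, tensor-model-parallel-size = 4, pipeline-model-parallel-size = 4
dp = data_parallel_size(nodes=4, gpus_per_node=8,
                        tensor_parallel=4, pipeline_parallel=4)
print(dp)          # 2

# Global batch size 1536 is shared by the data-parallel replicas.
print(1536 // dp)  # 768 samples per replica per step
```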
Why veGiantModel Is Faster
ByteCCL provides high‑performance asynchronous communication.
Customizable parallel strategies enable extreme performance tuning.
When employing data, operator‑split, and pipeline parallelism, veGiantModel automatically adjusts topology placement based on cross‑node bandwidth.
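One common placement heuristic, shown here only as an illustration and not as veGiantModel's actual algorithm, is to keep the most communication-heavy groups inside a node and let only pipeline traffic cross the slower inter-node links. The sketch below lays out ranks with the operator-split index varying fastest and reports which parallel axes span more than one node. The function names are hypothetical, and the sizes are borrowed from the Megatron baseline settings above.

```python
from itertools import product


def place_ranks(tp: int, dp: int, pp: int, gpus_per_node: int) -> dict:
    """Map each global rank to its parallel coordinates and node.

    The tensor-parallel (operator-split) index varies fastest, so each such
    group occupies consecutive ranks; the pipeline stage varies slowest.
    """
    placement = {}
    coords = product(range(pp), range(dp), range(tp))
    for rank, (stage, dp_idx, tp_idx) in enumerate(coords):
        placement[rank] = {
            "pp": stage,
            "dp": dp_idx,
            "tp": tp_idx,
            "node": rank // gpus_per_node,
        }
    return placement


def axes_crossing_nodes(placement: dict) -> set:
    """Return the parallel axes whose communication groups span several nodes."""
    crossing = set()
    axes = ("tp", "dp", "pp")
    for axis in axes:
        others = [a for a in axes if a != axis]
        groups = {}
        # Ranks that share the other two coordinates form one group on this axis.
        for info in placement.values():
            key = tuple(info[o] for o in others)
            groups.setdefault(key, set()).add(info["node"])
        if any(len(nodes) > 1 for nodes in groups.values()):
            crossing.add(axis)
    return crossing


layout = place_ranks(tp=4, dp=2, pp=4, gpus_per_node=8)
print(axes_crossing_nodes(layout))  # {'pp'}
```

With this layout each pipeline stage fills exactly one 8-GPU node, so operator-split and data-parallel collectives stay on intra-node links and only point-to-point pipeline transfers cross the network.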
Resources
veGiantModel is open‑source on GitHub:
Detailed usage instructions and a quick start for GPT pre‑training are available on Volcano Engine’s Machine Learning Platform (public beta):
