Boost Multimodal Model Training Efficiency with Offline Sequence Packing and Mixed‑Modality Data
Baidu's Baige team introduces an extended multimodal data loader, automated ShareGPT format conversion, and offline sequence packing techniques that together roughly double token throughput, shorten SFT training time by a factor of five to six, and improve GPU utilization and stability for large vision‑language models.
Multimodal Data Loading for Complex Dialogues
The original Megatron‑Energon loader supports only single‑turn QA samples with at most one image or video. To train vision‑language models (VLMs) on heterogeneous data (text, images, video), the Baige team extended the built‑in DataLoader in the AIAK‑Training‑LLM + Megatron + Energon stack:
Added separate encoders for text‑image, text‑video and other modality pairs.
Enabled full multi‑turn dialogue structures.
Allowed an arbitrary number of images or video frames per sample.
Provided a customizable prompt‑template mechanism so ShareGPT‑style conversations can be ingested without manual preprocessing.
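The loader extensions above amount to routing each sample to a modality‑specific encoder. A minimal sketch of that dispatch pattern, assuming a simple `Sample` record; the encoder names and placeholder token strings here are illustrative, not the actual AIAK‑Training‑LLM API:

```python
from dataclasses import dataclass, field

@dataclass
class Sample:
    turns: list                                   # multi-turn dialogue: list of {"role", "text"} dicts
    images: list = field(default_factory=list)    # any number of images per sample
    videos: list = field(default_factory=list)    # any number of video clips per sample

def encode_text_image(sample):
    # Prepend a placeholder image marker; a real encoder would emit vision tokens.
    return f"<img x {len(sample.images)}> " + " ".join(t["text"] for t in sample.turns)

def encode_text_video(sample):
    return f"<vid x {len(sample.videos)}> " + " ".join(t["text"] for t in sample.turns)

def encode_text_only(sample):
    return " ".join(t["text"] for t in sample.turns)

def encode(sample):
    """Route each sample to the encoder matching its modality pair."""
    if sample.images:
        return encode_text_image(sample)
    if sample.videos:
        return encode_text_video(sample)
    return encode_text_only(sample)
```

Because the dialogue is a list of turns rather than a single QA pair, the same path handles multi‑turn conversations and an arbitrary number of attached images or frames.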
Automated Conversion of ShareGPT‑style Datasets
Megatron‑Energon’s native conversion requires manual field mapping. Baige’s automation script accepts a path to a ShareGPT‑style multimodal dataset and instantly produces Energon‑compatible binary files. The script serializes text, image and video tokens into a unified schema, eliminating repetitive preprocessing and ensuring data consistency.
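The conversion step can be sketched as a mapping from the common ShareGPT layout (`{"conversations": [{"from": ..., "value": ...}], "images": [...]}`) into one flat record per sample. The field names and JSON‑lines output below are assumptions for illustration; Baige's actual script writes Energon‑compatible binary files:

```python
import json

# Map ShareGPT speaker tags to normalized roles.
ROLE_MAP = {"human": "user", "gpt": "assistant", "system": "system"}

def convert_sharegpt_record(record):
    """Flatten one ShareGPT-style record into a unified text/image/video schema."""
    return {
        "turns": [
            {"role": ROLE_MAP.get(m["from"], m["from"]), "text": m["value"]}
            for m in record.get("conversations", [])
        ],
        "images": record.get("images", []),
        "videos": record.get("videos", []),
    }

def convert_file(src_path, dst_path):
    """Convert a JSON-lines ShareGPT dump, one unified record per output line."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            dst.write(json.dumps(convert_sharegpt_record(json.loads(line))) + "\n")
```

Centralizing the field mapping in one converter is what removes the per‑dataset manual preprocessing and keeps text, image, and video references consistent across samples.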
Sequence Packing for Token‑Efficiency
Encoded lengths differ dramatically (e.g., video tokens can reach tens of thousands while text is usually a few hundred). Naïve mixed‑batch training wastes compute on padding. AIAK‑Training‑LLM introduces multimodal Sequence Packing that concatenates several short sequences—such as consecutive dialogue turns—into a single sequence whose length approaches the model’s maximum token limit. This aligns batch lengths, reduces padding, and increases effective token density.
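The core packing idea can be shown with a simple first‑fit‑decreasing bin‑packing sketch over token lengths; this is a common way to implement sequence packing, not necessarily the exact heuristic AIAK‑Training‑LLM uses:

```python
def pack_sequences(lengths, max_len):
    """Group sequence indices into packs whose total token count stays within
    max_len, placing longest sequences first to minimize padding waste."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    packs, used = [], []
    for i in order:
        # First fit: put the sequence into the first pack with enough room.
        for p, total in enumerate(used):
            if total + lengths[i] <= max_len:
                packs[p].append(i)
                used[p] += lengths[i]
                break
        else:
            # No pack fits: open a new one.
            packs.append([i])
            used.append(lengths[i])
    return packs
```

For example, sequences of 900, 300, 500, and 100 tokens with a 1,000‑token limit fit into two packs instead of four padded batches, so almost every position in the batch carries a real token.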
Offline Packing on a CPU Cluster
Instead of performing packing on‑the‑fly during training (which stalls GPUs), the packing step is executed once on a CPU cluster before training starts. The workflow is:
Tokenize all modalities.
Apply Sequence Packing to produce packed sequences.
Persist the packed results to storage.
During SFT or RL training, GPUs load the pre‑packed data directly, removing runtime packing overhead.
This decouples data preparation from GPU computation and lowers overall cluster cost.
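The three steps above can be sketched end to end. The whitespace tokenizer and JSON‑lines output are stand‑ins (a real pipeline would use the model's tokenizer, expand image/video tokens, and fan the work out across CPU workers), but the tokenize → pack → persist structure is the same:

```python
import json

def tokenize(text):
    # Stand-in tokenizer: whitespace split. In practice this runs on CPU
    # workers with the model's tokenizer plus image/video token expansion.
    return text.split()

def pack(tokenized, max_len):
    """Concatenate consecutive samples until adding the next would exceed max_len."""
    packed, cur = [], []
    for toks in tokenized:
        if cur and len(cur) + len(toks) > max_len:
            packed.append(cur)
            cur = []
        cur = cur + toks
    if cur:
        packed.append(cur)
    return packed

def run_offline_packing(texts, out_path, max_len=4096):
    tokenized = [tokenize(t) for t in texts]   # step 1: tokenize all samples
    packed = pack(tokenized, max_len)          # step 2: sequence packing
    with open(out_path, "w") as f:             # step 3: persist packed results
        for seq in packed:
            f.write(json.dumps(seq) + "\n")
    return packed
```

Because the output file is written once before training, the GPU side only streams pre‑packed sequences, which is what removes the runtime packing stall.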
Measured Performance Gains
Token throughput roughly doubled; total SFT job duration reduced by ~5–6×.
GPU utilization increased because the data pipeline remained continuously full.
Training stability improved as offline packing eliminates dynamic memory‑overflow risks.
Full‑Stage Multimodal Training Support
The enhancements are applicable to both Supervised Fine‑Tuning (SFT) and Reinforcement Learning (RL) stages. They have been validated on several open‑source multimodal models, including QwenVL, InternVL, QianfanVL, LLaVA‑OneVision and the Wan series.