Boost Multimodal Model Training Efficiency with Offline Sequence Packing and Mixed‑Modality Data

Baidu's Baige team introduces an extended multimodal data loader, automated ShareGPT-format conversion, and offline sequence packing techniques that together roughly double token throughput, shorten SFT training time by up to ~6×, and improve GPU utilization and stability for large vision-language models.


Multimodal Data Loading for Complex Dialogues

The original Megatron‑Energon loader supports only single‑turn QA samples with at most one image or video. To train vision‑language models (VLMs) on heterogeneous data (text, images, video), the Baige team extended the built‑in DataLoader in the AIAK‑Training‑LLM + Megatron + Energon stack (a sketch of the resulting sample structure follows the list below):

Added separate encoders for text‑image, text‑video and other modality pairs.

Enabled full multi‑turn dialogue structures.

Allowed an arbitrary number of images or video frames per sample.

Provided a customizable prompt‑template mechanism so ShareGPT‑style conversations can be ingested without manual preprocessing.
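To make the resulting sample structure concrete, here is a minimal Python sketch of a multi‑turn, mixed‑modality sample and an encoder that renders it through a prompt template. The names (MultiTurnVLMSample, encode_sample, PROMPT_TEMPLATE) are illustrative assumptions, not the actual AIAK‑Training‑LLM or Megatron‑Energon API.

```python
from dataclasses import dataclass, field
from typing import Dict, List

import torch

# Illustrative chat template; placeholder tokens such as "<image>" are assumed
# to already be embedded in the user-turn text, ShareGPT style.
PROMPT_TEMPLATE = "<|user|>\n{user}\n<|assistant|>\n{assistant}\n"


@dataclass
class MultiTurnVLMSample:
    """One training sample: N dialogue turns plus any number of images / video clips."""
    turns: List[Dict[str, str]]                                     # [{"user": ..., "assistant": ...}, ...]
    images: List[torch.Tensor] = field(default_factory=list)        # each (C, H, W)
    video_frames: List[torch.Tensor] = field(default_factory=list)  # each (T, C, H, W)


def encode_sample(sample: MultiTurnVLMSample, tokenizer) -> dict:
    """Render every turn through the prompt template, then tokenize the dialogue once."""
    text = "".join(
        PROMPT_TEMPLATE.format(user=t["user"], assistant=t["assistant"])
        for t in sample.turns
    )
    input_ids = tokenizer.encode(text)
    return {
        "input_ids": torch.tensor(input_ids, dtype=torch.long),
        "images": sample.images,             # handed to the vision encoder
        "video_frames": sample.video_frames,
    }
```

Because the number of turns, images, and frames is unbounded in this layout, a single sample can carry an entire ShareGPT conversation without being split into separate single‑turn records.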

Automated Conversion of ShareGPT‑style Datasets

Megatron‑Energon’s native conversion requires manual field mapping. Baige’s automation script accepts a path to a ShareGPT‑style multimodal dataset and instantly produces Energon‑compatible binary files. The script serializes text, image and video tokens into a unified schema, eliminating repetitive preprocessing and ensuring data consistency.
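The script itself is not shown in the post, but a hedged sketch of the general idea might look like the following: read a ShareGPT‑style JSON file and write tar shards in a WebDataset‑style layout of the kind Energon‑compatible datasets commonly use. Field names such as "conversations" and "images" follow the common ShareGPT layout; the function names are hypothetical.

```python
import io
import json
import tarfile
from pathlib import Path


def convert_sharegpt_to_shards(json_path: str, out_dir: str, shard_size: int = 1000):
    """Split a ShareGPT-style JSON list into tar shards of `shard_size` samples each."""
    samples = json.loads(Path(json_path).read_text(encoding="utf-8"))
    Path(out_dir).mkdir(parents=True, exist_ok=True)

    for shard_start in range(0, len(samples), shard_size):
        shard_path = Path(out_dir) / f"shard-{shard_start // shard_size:06d}.tar"
        with tarfile.open(shard_path, "w") as tar:
            for i, sample in enumerate(samples[shard_start:shard_start + shard_size]):
                key = f"{shard_start + i:09d}"
                # Store the dialogue turns as JSON alongside the raw media files.
                _add_bytes(tar, f"{key}.json",
                           json.dumps({"conversations": sample["conversations"]}).encode())
                for j, img_path in enumerate(sample.get("images", [])):
                    _add_bytes(tar, f"{key}.{j}.jpg", Path(img_path).read_bytes())


def _add_bytes(tar: tarfile.TarFile, name: str, data: bytes):
    """Write an in-memory byte string into the tar archive under `name`."""
    info = tarfile.TarInfo(name=name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))
```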

Sequence Packing for Token‑Efficiency

Encoded lengths differ dramatically (e.g., video tokens can reach tens of thousands while text is usually a few hundred). Naïve mixed‑batch training wastes compute on padding. AIAK‑Training‑LLM introduces multimodal Sequence Packing that concatenates several short sequences—such as consecutive dialogue turns—into a single sequence whose length approaches the model’s maximum token limit. This aligns batch lengths, reduces padding, and increases effective token density.
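At its core this is a bin‑packing problem over tokenized sample lengths. The sketch below uses a simple first‑fit‑decreasing heuristic and assumes samples are already tokenized; it is illustrative only, not the AIAK‑Training‑LLM packing implementation.

```python
from typing import List


def pack_sequences(lengths: List[int], max_len: int) -> List[List[int]]:
    """Group sample indices into bins whose total token count stays within max_len."""
    # Sort longest-first so large video-token samples claim bins early.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins, bin_room = [], []
    for idx in order:
        n = lengths[idx]
        # First-fit-decreasing: place the sample in the first bin with enough room.
        for b, room in enumerate(bin_room):
            if n <= room:
                bins[b].append(idx)
                bin_room[b] -= n
                break
        else:
            bins.append([idx])
            bin_room.append(max_len - n)
    return bins


# Example: three short text samples and one long video sample, 32K context.
print(pack_sequences([300, 18000, 450, 3800], max_len=32768))
```

Each resulting bin becomes one training sequence, so batches contain almost no padding tokens and every forward pass processes close to the model's maximum context length.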

Offline Packing on a CPU Cluster

Instead of performing packing on‑the‑fly during training (which stalls GPUs), the packing step is executed once on a CPU cluster before training starts. The workflow is:

Tokenize all modalities.

Apply Sequence Packing to produce packed sequences.

Persist the packed results to storage.

During SFT or RL training, GPUs load the pre‑packed data directly, removing runtime packing overhead.

This decouples data preparation from GPU computation and lowers overall cluster cost.
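A simplified view of the persist/load split could look like the sketch below: the persist step runs once on the CPU cluster and writes the packed token stream plus bin offsets to disk, while the load step is called from the GPU training job. File names and function names are assumptions for illustration.

```python
from pathlib import Path

import numpy as np


def persist_packed(packed_bins, token_ids_by_sample, out_dir: str):
    """Offline step: concatenate each bin's token IDs and save one flat array plus offsets."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    flat, offsets = [], [0]
    for bin_indices in packed_bins:
        for idx in bin_indices:
            flat.extend(token_ids_by_sample[idx])
        offsets.append(len(flat))
    np.save(out / "packed_tokens.npy", np.asarray(flat, dtype=np.int32))
    np.save(out / "packed_offsets.npy", np.asarray(offsets, dtype=np.int64))


def load_packed_sequence(data_dir: str, bin_id: int) -> np.ndarray:
    """Training-time step: memory-map the flat array and slice out one packed bin."""
    data = np.load(Path(data_dir) / "packed_tokens.npy", mmap_mode="r")
    offsets = np.load(Path(data_dir) / "packed_offsets.npy")
    return data[offsets[bin_id]:offsets[bin_id + 1]]
```

Because the training job only slices a memory‑mapped array, the GPU side does no tokenization or packing work at all.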

Measured Performance Gains

Token throughput roughly doubled, and total SFT job duration was cut by a factor of roughly 5–6.

GPU utilization increased because the data pipeline remained continuously full.

Training stability improved, since offline packing eliminates the risk of dynamic memory overflow at runtime.

Full‑Stage Multimodal Training Support

The enhancements are applicable to both Supervised Fine‑Tuning (SFT) and Reinforcement Learning (RL) stages. They have been validated on several open‑source multimodal models, including QwenVL, InternVL, QianfanVL, LLaVA‑OneVision and the Wan series.
