LoongForge: Open‑Source Multimodal Training Framework Runs on GPU and Kunlun XPU with 45% Speedup
LoongForge is an open‑source, Megatron‑based multimodal training framework that unifies LLM, VLM, VLA, and diffusion models, runs on both NVIDIA GPUs and Baidu Kunlun XPUs from a single codebase, and delivers 15%‑45% end‑to‑end training acceleration with >90% linear‑scaling efficiency on thousand‑card clusters.
When models begin to understand images, video, and the physical world, the training stack and model form diverge, creating a structural mismatch that LoongForge aims to resolve.
1. Industry background
Multimodal as the new foundation: Early multimodal models (e.g., InternVL, Qwen3‑VL) attached a visual encoder to a frozen LLM, keeping training objectives and representation spaces separate. Newer models such as Ernie 4.5, Qwen3.6, and Kimi K2.6 embed vision and language in a single pre‑training process, making multimodality a core structural component.
Heterogeneous compute supply: Kunlun XPU P800 chips have moved from pilot projects to large‑scale deployment, with thousand‑card clusters now common for large‑model training. This diversification forces training frameworks to support cross‑platform execution – a single codebase must run reliably on both GPU and XPU.
2. Core challenges in the multimodal era
Iteration speed vs. engineering complexity: Megatron‑based frameworks tightly couple model definition with distributed strategies; adding a new model often requires deep code changes and weeks of adaptation. FSDP offers rapid model onboarding but suffers from communication and memory bottlenecks at extreme scale, forcing a trade‑off between fast iteration and high performance.
Hidden performance loss from heterogeneous structures: Multimodal data mixes single‑image, multi‑image, video, and pure‑text samples, producing vastly different sequence lengths. Traditional data parallelism distributes samples evenly by count, but because attention cost grows quadratically with sequence length, the actual compute load can differ dramatically across ranks, leaving some devices idle and inflating compute cost, as the sketch below illustrates.
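A toy calculation makes the imbalance concrete. The sketch below (illustrative Python only; the sample lengths and the greedy heuristic are assumptions, not LoongForge's scheduler) compares naive round‑robin sample distribution with a cost‑aware greedy assignment when attention cost scales with the square of sequence length:
# Illustrative only: why equal sample counts do not mean equal compute.
import heapq
import random

random.seed(0)
# A mixed multimodal batch: pure text (~1K tokens), multi-image (~8K), video (~32K).
seq_lens = [random.choice([1024, 8192, 32768]) for _ in range(64)]
costs = [n * n for n in seq_lens]            # attention cost grows roughly quadratically
num_ranks = 8

# Naive data parallelism: equal sample counts per rank (round-robin).
naive = [0] * num_ranks
for i, c in enumerate(costs):
    naive[i % num_ranks] += c

# Cost-aware alternative: greedy "longest job first" onto the least-loaded rank.
balanced = [0] * num_ranks
heap = [(0, r) for r in range(num_ranks)]
for c in sorted(costs, reverse=True):
    load, r = heapq.heappop(heap)
    balanced[r] = load + c
    heapq.heappush(heap, (balanced[r], r))

# Step time is set by the slowest rank; max/mean load shows the idle time paid.
mean = sum(costs) / num_ranks
print(f"round-robin  max/mean load: {max(naive) / mean:.2f}")
print(f"cost-aware   max/mean load: {max(balanced) / mean:.2f}")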
Sunk cost of cross‑platform migration: Community frameworks are often tightly bound to a specific hardware ecosystem; porting to domestic chips typically requires maintaining separate code branches. Even after migration, the lack of deep framework‑level optimizations leaves a noticeable gap between "can run" and "can run efficiently".
3. LoongForge positioning and core value
LoongForge, released by Baidu Baige, is built on the Megatron engine and re‑architected for native multimodal scenarios. It has been production‑validated on both GPU and Kunlun XPU clusters ranging from a few cards to >5,000 cards, covering LLM, VLM, VLA and diffusion workloads.
Unified: Supports 20+ model families (DeepSeek, Qwen, InternVL, LLaVA‑OV, ERNIE, MiniMax, MIMO, Pi0.5, WAN, etc.) with a single codebase that spans pre‑training to SFT and works on both NVIDIA GPU and Kunlun XPU.
Efficient: End‑to‑end training acceleration of 15%‑45% on mainstream models, up to 4.8× on cutting‑edge architectures like DeepSeek V3.2, and >90% linear scaling on a 5,000‑card Kunlun P800 cluster.
Easy to use: Model definition is expressed in declarative YAML; new components are registered via configuration without touching low‑level code, shrinking adaptation cycles from weeks to days.
4. Architecture and key technical capabilities
4.1 Model layer – unified abstraction
All multimodal models share a common backbone: an LLM core with modality‑specific encoders and adapters. LoongForge introduces three logical sub‑layers:
Encoder: perception encoders (e.g., ViT) for images and video.
Foundation: the language backbone.
OmniCombinationModel: a scheduler that composes encoder and foundation, handling parallelism and data flow.
A single YAML file can describe the entire network; the framework automatically generates the appropriate parallel strategy and data routing.
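As a rough mental model of how the three sub‑layers fit together, here is a minimal PyTorch sketch; the class and method names (including the HuggingFace‑style embed_tokens / inputs_embeds interface) are assumptions for illustration, not LoongForge's actual API:
import torch
from torch import nn

class OmniCombinationSketch(nn.Module):
    """Illustrative composition: encoder -> projector -> language backbone."""
    def __init__(self, image_encoder: nn.Module, projector: nn.Module, foundation: nn.Module):
        super().__init__()
        self.image_encoder = image_encoder   # Encoder sub-layer, e.g. a ViT
        self.projector = projector           # adapter mapping vision features to the LLM hidden size
        self.foundation = foundation         # Foundation sub-layer: the language backbone

    def forward(self, input_ids, pixel_values, image_token_mask):
        text_embeds = self.foundation.embed_tokens(input_ids)
        vision_embeds = self.projector(self.image_encoder(pixel_values))
        # Splice vision embeddings into the positions reserved by image placeholder tokens.
        text_embeds = text_embeds.clone()
        text_embeds[image_token_mask] = vision_embeds.reshape(-1, text_embeds.size(-1))
        return self.foundation(inputs_embeds=text_embeds)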
4.2 System layer – end‑to‑end optimization
CCT (Computation‑Communication‑Transfer) parallelism: Breaks the "memory vs. communication" trade‑off in MoE long‑sequence training by offloading activation memory and overlapping compute, communication, and data transfer (a simplified sketch follows this list). On an A800 cluster, Qwen3‑30B‑A3B with a 32K context gains 16% performance while competing solutions run out of memory.
ChunkPipe pipeline parallelism: Converts the linear memory growth of ultra‑long sequences into a fixed overhead, eliminating the need for sequence parallelism and enabling 1M‑token training on modest clusters.
DSA operator fusion: For sparse‑attention models like DeepSeek V3.2, LoongForge fuses the indexing, sparse attention, MQA, and sequence‑stitching kernels, delivering ~5× end‑to‑end speedup over baselines without fused CUDA kernels.
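To make the CCT idea above more tangible, the following sketch overlaps activation offload (a device‑to‑host copy on a side CUDA stream) with the next layer's compute. It is a simplified illustration under assumed interfaces, not the CCT scheduler itself:
import torch

def forward_with_offload(layers, x):
    """Run layers in sequence while streaming each activation to host memory."""
    copy_stream = torch.cuda.Stream()
    cpu_copies = []
    for layer in layers:
        y = layer(x)                                   # compute on the default stream
        copy_stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(copy_stream):
            y.record_stream(copy_stream)               # keep y's memory alive until the copy finishes
            # The device-to-host copy runs on its own stream and overlaps with the
            # next layer's compute; a pinned-memory destination is needed for the
            # transfer to be truly asynchronous.
            cpu_copies.append(y.to("cpu", non_blocking=True))
        x = y
    torch.cuda.current_stream().wait_stream(copy_stream)  # drain outstanding transfers
    return x, cpu_copies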
4.3 Hardware layer – one code, many platforms
On GPU, LoongForge directly leverages PyTorch/CUDA and the native Megatron implementation for maximum performance. For Kunlun XPU, a lightweight XPU_Plugin abstracts the hardware differences, allowing the same Megatron engine to run without code changes: switching backends is a matter of setting an environment variable.
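Conceptually, the switch looks like the sketch below: a thin plugin registry picks the backend from an environment variable, so the training code itself never branches on hardware. The variable and class names here are hypothetical, not LoongForge's actual interface:
import os

class CudaBackend:
    device_type = "cuda"

class XpuBackend:
    device_type = "xpu"

_BACKENDS = {"cuda": CudaBackend, "xpu": XpuBackend}

def get_backend():
    # e.g. LOONGFORGE_DEVICE=xpu python pretrain.py ...  (names here are made up for illustration)
    name = os.environ.get("LOONGFORGE_DEVICE", "cuda")
    return _BACKENDS[name]()

backend = get_backend()
print("training on:", backend.device_type)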
5. Performance numbers (same hardware, optimal configs)
Qwen3‑30B‑A3B (MoE) – 32K sequence – 16% faster than community baselines.
DeepSeek V3.2 (MoE) – 8K sequence – 4.8× faster.
Qwen3‑Next (MoE) – 32K sequence – 15% faster.
Qwen3‑VL‑30B‑A3B (VLM) – 32K sequence – 45% faster.
PI0.5 (VLA) – BF16 – 49% faster.
Across a variety of models, LoongForge consistently yields 15%‑45% end‑to‑end training acceleration, and on a 5,000‑card Kunlun P800 cluster achieves >90% linear scaling.
6. Real‑world cases
LLaVA‑OneVision‑2.0: An open‑source, full‑frame‑rate multimodal model for video understanding. LoongForge's heterogeneous parallelism and load balancing cut video token consumption dramatically, matching Qwen3‑VL quality while reducing cost and latency.
LLaVA‑OneVision‑1.5: Introduced a new RICE‑ViT encoder; the team switched encoders in a few days and completed 8B VLM Stage‑1.5 pre‑training on 128 A800 GPUs, demonstrating plug‑and‑play capability.
Qianfan‑VL series: Enterprise‑grade multimodal models (3B, 8B, 70B) trained on >5,000 Kunlun P800 cards, processing >3 trillion multimodal tokens with 90%+ cluster efficiency, validating the framework's stability at massive scale.
7. Operation demo – YAML‑driven, out‑of‑the‑box
Model wiring – a single YAML defines the encoder, projector and foundation. Switching the language backbone from Qwen3 to DeepSeek only requires changing one line:
defaults:
- ../../models/[email protected]_encoder: qwen3_vit
- ../../models/[email protected]_projector: qwen_mlp_adapter
- ../../models/[email protected]: qwen3_30b_a3b
# to switch the backbone to DeepSeek, replace the line above with:
# - ../../models/[email protected]: deepseek_v3
- _self_
Training arguments (Megatron compatible):
TRAINING_ARGS=(
--training-phase sft
--seq-length 32768
--micro-batch-size 1
...
)
MODEL_PARALLEL_ARGS=(
--tensor-model-parallel-size 1
--pipeline-model-parallel-size 2
--expert-model-parallel-size 8
...
)
Component-level configuration (Hydra extensions):
# Different TP for encoder and foundation
+model.image_encoder.tensor-model-parallel-size=1
+model.foundation.tensor-model-parallel-size=4
# Freeze components independently
+model.image_encoder.freeze=True
+model.foundation.freeze=True
Weight loading – LoongForge can ingest HuggingFace checkpoints directly and export back to HF format after training:
TRAINING_ARGS=(
--load $CHECKPOINT_PATH # HF directory
--save $CHECKPOINT_PATH # saves high‑performance checkpoint
--save-interval 40
--save-hf true # export HF weights after training
--save-hf-path /path/to/output
...
)
Data preprocessing – a one-line command converts raw multimodal data to WebDataset format:
python tools/data_preprocess/vlm/convert_to_webdataset.py \
--output_dir /workspace/wds_data/ \
--json_file tests/datasets/vlm/mllm_demo.json \
--image_dir tests/datasets/vlm/ \
--video_dir tests/datasets/vlm/ \
--media mix \
--columns_messages messages \
--maxcount 10000 \
--maxsize 3000000000 \
--sample_type multi_mix_qa
Training is launched with a single command using the YAML configurations under configs/models/ and the example scripts under examples/. The full list of supported models is documented at https://loongforge.readthedocs.io/en/latest/get_started/support_model.html.
8. Roadmap
Expand model ecosystem – add Kimi K2.6, DeepSeek V4 and deeper support for embodied models.
Long‑sequence training – enable million‑token contexts with lower memory overhead.
Training performance – continue to improve parallelism, operator fusion, memory management and communication scheduling.
Train‑to‑inference co‑optimization – provide MTP best practices for faster decoding.
Usability – enrich toolchain, reduce onboarding friction, and automate more of the model‑adaptation workflow.
9. Conclusion
LoongForge, released under the Apache 2.0 license, offers a unified, high‑performance, and easy‑to‑use training stack that can become the public foundation for the multimodal era, enabling more ideas to be validated quickly.
Baidu Intelligent Cloud Tech Hub