How VeOmni Revolutionizes Multimodal Model Training with 40% Speed Gains

VeOmni, ByteDance’s open‑source unified multimodal training framework, tackles fragmented training pipelines by integrating LoRA fine‑tuning, FSDP, Ulysses, and Expert Parallel, delivering up to 40% higher throughput, up to 55% memory savings, and streamlined one‑click deployment for LLM, VLM, and video models.


Training Pain Points in the Multimodal Era Finally Get a “Special Remedy”

When large models evolve from language-only to text + image + video multimodality, algorithm engineers face a fragmented training workflow:

Simultaneously iterating DiT, LLM, and VLM makes switching between codebases difficult.

Model‑type changes require extensive manual rewrites of parallel composition and memory scheduling.

DiT model distillation consumes massive resources without efficient training infrastructure.

ByteDance engineers ran into these issues early and built VeOmni, a unified multimodal training framework validated over thousands of GPU-hours and used to train models such as UI-TARS 1.5. To serve more users, ByteDance open-sourced VeOmni, and Volcano Engine added video-model support.

What Is VeOmni? One Framework for All Multimodal Training

VeOmni is a joint effort of ByteDance’s Seed team, the Volcano Machine Learning Platform, and the IaaS heterogeneous computing team. It unifies three layers: modality, parallel strategy, and the compute base.

Through a unified API it embeds lightweight LoRA fine-tuning, FSDP, Ulysses, Expert Parallel, and automatic parallel-search capabilities. Whether training a hundred-billion-parameter LLM, a cross-modal VLM, or a 480P/720P text-to-video (T2V) or image-to-video (I2V) generator, developers can launch training with a single workflow.
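
What that single workflow looks like depends on the VeOmni release; purely as an illustration of the knobs described above, a launch configuration might resemble the Python sketch below (every field name here is hypothetical, not VeOmni’s real schema):

# Hypothetical configuration sketch only; field names are illustrative,
# not VeOmni's actual API or config schema.
train_config = {
    "model": "wan2.1-t2v-1.3b",              # LLM, VLM, or video model alike
    "finetune": {"method": "lora", "rank": 64},
    "parallel": {
        "fsdp": "full_shard",                # shard params, grads, optimizer state
        "ulysses_degree": "auto",            # sequence parallelism for long inputs
        "expert_parallel": "auto",           # used only by MoE models
        "auto_search": True,                 # let the framework pick the composition
    },
    "checkpoint": {"async_save": True, "interval": 500},
}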

The framework automatically partitions weight tensors, optimizes communication topology, reclaims dynamic memory, and performs asynchronous checkpointing on thousand‑GPU clusters. Real‑world tests on open‑source Wan 2.1 models show >40% higher throughput and significant reductions in memory usage and inter‑node bandwidth.
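
Asynchronous checkpointing in general works by taking a fast device-to-host snapshot and then writing it to disk off the critical path; a minimal sketch of that idea (a generic illustration, not VeOmni’s implementation) looks like this:

# Minimal sketch of asynchronous checkpointing (generic idea, not VeOmni's code):
# snapshot parameters to CPU quickly, then serialize to disk in a background
# thread so the training loop does not wait on I/O.
import threading
import torch

def async_save(model: torch.nn.Module, path: str) -> threading.Thread:
    # On the critical path: copy the state dict to CPU memory.
    cpu_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # make sure all device-to-host copies finished
    # Off the critical path: write to disk in the background.
    t = threading.Thread(target=torch.save, args=(cpu_state, path), daemon=True)
    t.start()
    return t  # join() before the next save or before exiting

# handle = async_save(model, "checkpoints/global_step_100/model.pt")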

VeOmni lets ByteDance hit three goals: the fastest rollout of new model architectures, maximal utilization of massive compute, and minimal changes to business logic, filling gaps that existing open-source frameworks leave across LLM, VLM, and video generation.

Volcano Engine users can leverage VeOmni’s capabilities directly in the Machine Learning Platform.

Five Core Advantages that Break Training Efficiency Bottlenecks

Memory‑Compute Dual Optimization: Minimal Extra Compute for Maximal Memory Savings

Traditional coarse-grained recompute either disables or enables recomputation for whole layers, often paying 10%-20% extra compute for a modest memory gain. VeOmni instead computes an ROI (memory saved vs. compute cost) for each forward tensor, ranks operators by ROI, and selects only the most cost-effective ones to recompute (e.g., gate1_mul frees 40 MB for 180 µs while down_proj frees the same 40 MB for 4,000 µs, a 22× cost difference). This meets the memory constraint while keeping the extra compute overhead minimal.
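
The selection itself is essentially a greedy knapsack over per-operator ROI; a minimal sketch of that logic (illustrative, not VeOmni’s actual planner) is:

# Illustrative sketch of ROI-driven recompute selection (not VeOmni's planner).
# ROI = activation memory freed per microsecond of recompute cost; take the
# cheapest operators first until the memory budget is met.
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    mem_saved_mb: float   # activation memory freed if this op is recomputed
    recompute_us: float   # extra time paid to recompute it in the backward pass

    @property
    def roi(self) -> float:
        return self.mem_saved_mb / self.recompute_us

def plan_recompute(ops, mem_to_free_mb: float):
    """Greedily take the highest-ROI operators until enough memory is freed."""
    chosen, freed = [], 0.0
    for op in sorted(ops, key=lambda o: o.roi, reverse=True):
        if freed >= mem_to_free_mb:
            break
        chosen.append(op)
        freed += op.mem_saved_mb
    return chosen

# Numbers from the example above: equal memory saving, ~22x cost difference.
ops = [Op("gate1_mul", 40, 180), Op("down_proj", 40, 4000)]
print([o.name for o in plan_recompute(ops, mem_to_free_mb=40)])  # ['gate1_mul']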

Result: when memory allows, the recompute ratio drops from 60% to 30%, markedly improving training speed for DiT 720P video and long-sequence LLM workloads.

Mixed Parallel “Combo Punch”: One‑Click Optimal Compute Partition, 55% Peak Memory Reduction

VeOmni embeds a multi‑dimensional parallel system supporting FSDP, Ulysses, and Expert Parallel (EP). A single launch script performs Cartesian composition of these primitives and automatically searches for the optimal compute‑partition plan.

FSDP shards parameters, gradients, and optimizer states across GPUs, breaking memory bottlenecks and scaling batch size reliably.

Ulysses Parallel decomposes attention along the head dimension for long‑sequence tasks, easing per‑GPU memory pressure.

Expert Parallel efficiently trains massive MoE expert networks.

Applied to ByteDance’s internal models, this system reduces peak memory for 480P/720P T2V/I2V tasks to ~45% of the baseline.
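
Conceptually, the automatic search enumerates compositions whose parallel degrees multiply to the world size and scores each against a memory and step-time model. The toy sketch below shows the shape of that search; the candidate degrees and cost estimators are made up, not VeOmni’s algorithm:

# Toy sketch of searching over FSDP x Ulysses x Expert-Parallel compositions
# (candidate degrees and cost estimators are placeholders, not VeOmni's planner).
from itertools import product

def candidate_plans(world_size: int):
    degrees = (1, 2, 4, 8)
    for fsdp, ulysses, ep in product(degrees, repeat=3):
        if fsdp * ulysses * ep == world_size:
            yield {"fsdp": fsdp, "ulysses": ulysses, "ep": ep}

def score(plan, est_step_time, est_peak_mem_gb, mem_limit_gb):
    """Lower is better; plans that blow the memory budget are rejected."""
    if est_peak_mem_gb(plan) > mem_limit_gb:
        return float("inf")
    return est_step_time(plan)

def est_time(p):
    return 1.0 / p["fsdp"] + 0.1 * p["ulysses"]   # fake step-time model

def est_mem(p):
    return 80 / (p["fsdp"] * p["ulysses"])        # fake peak-memory model (GB)

# Example with dummy cost models on an 8-GPU node.
best = min(candidate_plans(8), key=lambda p: score(p, est_time, est_mem, mem_limit_gb=64))
print(best)  # e.g. {'fsdp': 8, 'ulysses': 1, 'ep': 1}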

Operator‑Level Performance Deep Dive: Small‑Kernel Fusion, Hundreds‑Fold Memory Access Reduction

For the many small kernels in DiT that cause memory thrashing, VeOmni rewrites the attention-FFN-residual chain into a single fused kernel, dramatically cutting memory fragmentation and reducing the number of memory accesses by hundreds of times.
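
VeOmni’s fused kernels are hand-written; as a rough stand-in for the same idea using stock PyTorch, torch.compile can fuse the small pointwise kernels around the FFN/residual chain into far fewer launches. This only illustrates the principle, not VeOmni’s kernels:

# Rough stand-in for the fusion idea using stock PyTorch: torch.compile fuses
# the small pointwise kernels around the FFN/residual chain into fewer launches.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiTBlockTail(nn.Module):
    """Residual add -> LayerNorm -> FFN -> residual add, as one compiled region."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, hidden)
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, attn_out):
        x = x + attn_out                              # residual add after attention
        return x + self.fc2(F.gelu(self.fc1(self.norm(x))))

block = torch.compile(DiTBlockTail(1024, 4096))       # fuse the elementwise ops
x = torch.randn(2, 256, 1024)
out = block(x, torch.randn_like(x))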

Cross‑Model Coverage: LLM / VLM / Video Generation All Handled by One Framework

DiT training memory usage halved.

LLM long‑context training works out of the box, with memory partitioning handled transparently.

VLM dual‑tower/single‑tower architectures scale linearly in Ring mode without code changes.

By combining operator‑level recompute, mixed parallelism, and kernel fusion, VeOmni removes the scalability bottlenecks that plague open‑source frameworks, offering plug‑and‑play, high‑efficiency compute for partners.

Real‑World Performance: 40% Faster Than Open‑Source Solutions Across Scenarios

Benchmarks on Wan 2.1‑14B (LoRA) show:

Compute-optimized GPUs: I2V 720P +48% speed, T2V 720P +44.4% speed.

Memory-optimized GPUs: I2V 720P +59.5% speed, T2V 720P +57.4% speed.

Small-parameter models (Wan 2.1-1.3B): T2V 480P +51% speed.

Getting Started: One‑Click Training on Volcano Platform with Visual Performance Analysis

Create Training Task: select the model, configure instance specs and the output path.

View Training Details: after creation, inspect logs under “Task Details – Logs”.

GPU Performance Analysis: navigate to “Custom Task > Task Details”, click “Create Performance Analysis”, then view flame graphs in Perfetto.

From Training to Inference: Full‑Link Integration

The Volcano Engine ML platform provides a LoRA dataset for “flight-level” effect training; users may also supply custom datasets.

Exporting Model Checkpoints

checkpoints/
├── global_step_xxx/           # weight snapshot per save
│   ├── extra_state/            # training state (sharded by rank)
│   │   └── extra_state_rank_*.pt
│   ├── hf_ckpt/                # HuggingFace‑compatible format
│   │   ├── config.json
│   │   └── diffusion_pytorch_model.safetensors
│   ├── model/                  # model parameter shards
│   │   └── __*_*.distcp
│   └── optimizer/              # optimizer state shards
│       └── __*_*.distcp
└── latest_checkpointed_iteration.txt  # latest step record
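
Based on this layout, a small helper can locate the latest HuggingFace-format weights to feed into the conversion script below (a sketch that assumes latest_checkpointed_iteration.txt stores the step number):

# Sketch: locate the latest HuggingFace-format weights from the layout above,
# assuming latest_checkpointed_iteration.txt stores the step number.
from pathlib import Path

def latest_hf_ckpt(root: str) -> Path:
    root = Path(root)
    step = (root / "latest_checkpointed_iteration.txt").read_text().strip()
    ckpt = root / f"global_step_{step}" / "hf_ckpt" / "diffusion_pytorch_model.safetensors"
    if not ckpt.exists():
        raise FileNotFoundError(ckpt)
    return ckpt

# Example: print(latest_hf_ckpt("yourpath/checkpoints")), then pass the path to convert.py.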

Weight Format Conversion Script

#!/usr/bin/env python
# convert.py: rename "blocks.…default.weight" keys to "diffusion_model.blocks.…weight"

from pathlib import Path
from safetensors.torch import load_file, save_file
import sys

if len(sys.argv) != 2:
    sys.exit(f"Usage: python {Path(__file__).name} <input.safetensors>")

inp = Path(sys.argv[1]).expanduser().resolve()
out = inp.with_name(inp.stem + "_styleB.safetensors")

tensors = load_file(str(inp))
converted = {}

for k, v in tensors.items():
    # Add the "diffusion_model." prefix if it is missing
    if not k.startswith("diffusion_model."):
        k = "diffusion_model." + k
    # Drop the ".default." segment
    k = k.replace(".default.", ".")
    converted[k] = v

save_file(converted, str(out))
print(f"✓ Saved: {out}")

Run the script with the checkpoint path to obtain the converted weights:

python convert.py yourpath/checkpoints/global_step_xxx/hf_ckpt/diffusion_pytorch_model.safetensors

Inference with veFuser

Trained models can be served with veFuser, Volcano Engine’s diffusion-model serving framework, which optimizes LoRA and fully fine-tuned models for ultra-low-latency video generation, completing the end-to-end workflow from training to deployment.
