VeRL-Omni: A Universal RL Post‑Training Framework for Diffusion and Multimodal Generation Models
VeRL-Omni introduces a universal reinforcement‑learning post‑training framework that extends the verl and vLLM‑Omni stacks to support diffusion transformers, hybrid AR‑DiT, and unified understanding‑generation models, offering high‑throughput multimodal rollout, flexible reward engines, modular trainers, and broad hardware compatibility.
Overview
VeRL-Omni is a universal reinforcement‑learning (RL) post‑training framework for multimodal generative models. It builds on the verl and vLLM‑Omni stacks and supports diffusion transformers (e.g., Qwen‑Image), hybrid AR‑DiT architectures (e.g., Qwen‑Omni), and unified understanding‑plus‑generation models (e.g., BAGEL, HunyuanImage‑3.0).
Motivation
Multimodal RL—covering image, video, and audio generation—faces three critical gaps:
Diffusion & multimodal extension: flexible, high‑performance RL training must be extended to diffusion transformers, hybrid AR‑DiT, and unified models.
Heterogeneous rollout pipelines: rollouts traverse latent denoising trajectories and may invoke multiple model components (text encoder → DiT → VAE) in a single step.
Complex load scheduling: reward functions themselves are multimodal models (VLM judges, OCR scorers) and multimodal rollouts consume far higher peak memory than text‑only generation, making orchestration difficult.
Key Features
Efficient multimodal rollout: integrates vLLM‑Omni's asynchronous, high‑throughput serving; output accuracy matches diffusers, while step‑wise continuous batching and embedding caching further improve throughput.
Flexible reward engine: supports rule‑based and model‑based rewards (e.g., VLM‑as‑judge for OCR); vLLM accelerates VLM/LLM reward inference; reward computation overlaps with rollout and training to cut end‑to‑end latency.
Modular training back‑ends: provides multiple trainers (DiffusersFSDP, Megatron, VeOmni) with built‑in optimizations for diffusion and multimodal models; compatible with parallel strategies such as FSDP, USP, and TP.
Broad hardware support: runs on NVIDIA GPUs and Ascend NPU, allowing seamless switching between hardware back‑ends.
End‑to‑end training recipes and benchmarks: includes reference performance results that demonstrate high training throughput.
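The overlap between reward computation and rollout listed above can be pictured with a small scheduling sketch. This is not VeRL-Omni's implementation: `generate_rollout`, `score_samples`, and `run_overlapped` are hypothetical stand-ins, and a single worker thread plays the role of a reward model running on separate hardware.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def generate_rollout(batch_id: int) -> list[str]:
    """Hypothetical stand-in for a rollout call; returns fake sample ids."""
    time.sleep(0.05)  # pretend generation takes time
    return [f"sample-{batch_id}-{i}" for i in range(4)]


def score_samples(samples: list[str]) -> list[float]:
    """Hypothetical stand-in for a reward model; scores by string length."""
    time.sleep(0.05)  # pretend reward inference takes time
    return [float(len(s)) for s in samples]


def run_overlapped(num_batches: int) -> list[list[float]]:
    """Score batch k in a worker thread while batch k+1 is being generated."""
    rewards = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = None
        for batch_id in range(num_batches):
            samples = generate_rollout(batch_id)  # runs while `pending` scores
            if pending is not None:
                rewards.append(pending.result())
            pending = pool.submit(score_samples, samples)
        if pending is not None:
            rewards.append(pending.result())
    return rewards
```

With both stages taking similar time, the scoring of one batch is hidden behind the generation of the next, so only the last batch's scoring adds to the total wall-clock time.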
FlowGRPO Algorithm
FlowGRPO is an online RL method for flow‑matching models. It uses a diffusion policy model to sample multiple steps of a stochastic differential equation (SDE), which provides the randomness needed for RL exploration, and evaluates the generated samples with a model‑based reward.
Rollout generation: the diffusion policy generates rollout samples, collecting log probabilities and image trajectories.
Reward scoring: a reward model assigns a score to each sample, from which a per‑trajectory advantage is computed.
Policy optimization: a PPO‑style clipped loss updates the policy based on the computed advantage.
Weight synchronization: trainer weights are periodically synced to rollout workers so that samples reflect the latest policy.
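The reward-scoring and policy-optimization steps above can be sketched in a few lines. This is a minimal illustration, assuming GRPO-style group normalization of rewards and a PPO-style clipped surrogate over one log probability per sample. The function names, the `clip_eps=0.2` default, and the omission of any KL term or per-denoising-step summation are simplifications made here, not details taken from VeRL-Omni.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one prompt's group of samples."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """PPO-style clipped surrogate, averaged over samples (to be minimized)."""
    terms = []
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)  # probability ratio new / old
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the clipped and unclipped objectives.
        terms.append(min(ratio * adv, clipped * adv))
    return -mean(terms)
```

Samples that score above their group's mean receive a positive advantage and are reinforced, while clipping keeps a single update from moving the policy too far from the one that generated the rollouts.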
Performance Highlights
On NVIDIA H800 GPUs, placing the reward model on a separate GPU and overlapping it with policy training reduces per‑step wall‑clock time by roughly 14%.
Full‑model fine‑tuning of Qwen‑Image for OCR on four NVIDIA H200 GPUs achieves 0.510 images / GPU / s, with each training step taking about 250 s. After only 120 steps, rendered text quality in generated images shows a noticeable improvement, and both critic‑reward and validation‑reward curves converge stably.
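As a rough sanity check, the reported figures can be combined to estimate how many images each step processes. The calculation below assumes the throughput number is averaged over the full step wall-clock time, which the source does not state explicitly.

```python
images_per_gpu_per_s = 0.510  # reported throughput
num_gpus = 4                  # reported H200 count
step_seconds = 250            # reported approximate step time
steps = 120                   # reported number of steps

images_per_step = images_per_gpu_per_s * num_gpus * step_seconds
total_images = images_per_step * steps
print(round(images_per_step), round(total_images))  # prints: 510 61200
```

Under that assumption, each step covers roughly 510 generated images, or about 61,200 images over the 120 steps.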
Getting Started
Code repository: https://github.com/verl-project/verl-omni
Documentation: https://verl-omni.readthedocs.io/en/latest/start/install.html
Examples directory (starter scripts for image, audio, and video RL trainers, with wandb tracking): https://github.com/verl-project/verl-omni/tree/main/examples
The FlowGRPO demo trains Qwen‑Image using an OCR reward model based on Qwen3‑VL‑8B‑Instruct, which reads the rendered text in generated images and compares it with ground‑truth captions.
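One plausible way to turn that comparison into a scalar reward is a normalized edit distance between the text the VLM reads and the target text. The sketch below is an assumption for illustration only: the source does not say which metric the demo uses, `ocr_reward` and `edit_distance` are hypothetical names, and the VLM call that produces `recognized` is left out.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def ocr_reward(recognized: str, target: str) -> float:
    """Score in [0, 1]: 1.0 for an exact match, lower as edits accumulate."""
    # Normalize case and whitespace so trivial differences are not penalized.
    rec = " ".join(recognized.lower().split())
    tgt = " ".join(target.lower().split())
    if not tgt:
        return 1.0 if not rec else 0.0
    return max(0.0, 1.0 - edit_distance(rec, tgt) / len(tgt))
```

For example, `ocr_reward("helo", "hello")` gives about 0.8, since one edit is needed against a five-character target, while an image with no readable text scores 0.0.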
Roadmap
Expand model support to emerging diffusion and multimodal architectures for image, video, audio, and unified tasks.
Integrate additional RL algorithms such as DiffusionNFT.
Develop a fully asynchronous RL pipeline that tightly couples actors, rollouts, and rewards to further boost throughput and hardware utilization.
Deepen integration with vLLM‑Omni (parallelism, quantization, batching, scheduling optimizations) to accelerate rollout generation.
Release more highly optimized trainers for multimodal and diffusion models built on Megatron‑core and VeOmni.
Broaden hardware support, refining the Ascend NPU path and enabling community‑built hardware plugins.
Machine Learning Algorithms & Natural Language Processing
Focused on frontier AI technologies, empowering AI researchers' progress.
