VeRL-Omni: A Universal RL Post‑Training Framework for Diffusion and Multimodal Generation Models

VeRL-Omni introduces a universal reinforcement‑learning post‑training framework that extends the verl and vLLM‑Omni stacks to support diffusion transformers, hybrid AR‑DiT, and unified understanding‑generation models, offering high‑throughput multimodal rollout, flexible reward engines, modular trainers, and broad hardware compatibility.

Machine Learning Algorithms & Natural Language Processing

Overview

VeRL-Omni is a universal reinforcement‑learning (RL) post‑training framework for multimodal generative models. It builds on the verl and vLLM‑Omni stacks and supports diffusion transformers (e.g., Qwen‑Image), hybrid AR‑DiT architectures (e.g., Qwen‑Omni), and unified understanding‑plus‑generation models (e.g., BAGEL, HunyuanImage‑3.0).

Motivation

Multimodal RL, which covers image, video, and audio generation, faces three critical gaps:

Diffusion & multimodal extension: flexible, high‑performance RL training needs to be extended to diffusion transformers, hybrid AR‑DiT, and unified models.

Heterogeneous rollout pipelines: rollouts traverse latent denoising trajectories and may invoke multiple model components (text encoder → DiT → VAE) in a single step.

Complex load scheduling: reward functions are themselves multimodal models (VLM judges, OCR scorers), and multimodal rollouts consume far more peak memory than text‑only generation, which makes orchestration difficult.

Key Features

Efficient multimodal rollout: integrates vLLM‑Omni’s asynchronous, high‑throughput serving. Output accuracy matches that of diffusers, while step‑wise continuous batching and embedding caching further improve throughput.

Flexible reward engine: supports rule‑based and model‑based rewards (e.g., VLM‑as‑judge for OCR); vLLM accelerates VLM/LLM reward inference; reward computation overlaps with rollout and training to cut end‑to‑end latency.

Modular training back‑ends: provides multiple trainers (DiffusersFSDP, Megatron, VeOmni) with built‑in optimizations for diffusion and multimodal models, compatible with parallel strategies such as FSDP, USP, and TP.

Broad hardware support: runs on NVIDIA GPUs and Ascend NPU, allowing seamless switching between hardware back‑ends.

End‑to‑end training recipes and benchmarks: includes reference performance results that demonstrate high training throughput.
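The overlap between rollout and reward computation mentioned under the flexible reward engine can be pictured with a small pipelining sketch. This is not VeRL-Omni's scheduler: `generate_rollout`, `score_batch`, and `run_pipeline` are hypothetical stand-ins, and a single background thread plays the role of a reward worker. The point is only that scoring of batch N runs while batch N+1 is being generated.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def generate_rollout(step):
    """Hypothetical stand-in for a rollout worker producing a batch of samples."""
    time.sleep(0.02)  # pretend generation takes time
    return [f"sample-{step}-{i}" for i in range(4)]


def score_batch(batch):
    """Hypothetical stand-in for a reward model; the score is just the string length."""
    time.sleep(0.02)  # pretend reward inference takes time
    return [float(len(sample)) for sample in batch]


def run_pipeline(num_steps):
    """Generate batches in the main thread while the previous batch is scored in the background."""
    scored = []
    pending = None  # future for the batch currently being scored
    with ThreadPoolExecutor(max_workers=1) as reward_pool:
        for step in range(num_steps):
            batch = generate_rollout(step)  # overlaps with scoring of the previous batch
            if pending is not None:
                scored.append(pending.result())
            pending = reward_pool.submit(score_batch, batch)
        if pending is not None:
            scored.append(pending.result())  # drain the last batch
    return scored


rewards = run_pipeline(3)
print(len(rewards), rewards[0])
```

In a real system the reward worker would typically be a separate process or service on its own device rather than a thread, but the submit-then-collect-later pattern is the same.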

FlowGRPO Algorithm

FlowGRPO is an online RL method for flow‑matching models. During rollout, the diffusion policy model samples its denoising steps from a stochastic differential equation (SDE) rather than a deterministic sampler, which provides the randomness needed for RL exploration and a log probability for every step. The generated samples are then evaluated with a model‑based reward.
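To see why stochastic sampling matters for RL, note that a discretized SDE step is a Gaussian transition, so it has a well-defined log probability, whereas a deterministic step does not. The sketch below illustrates only that generic fact. The mean and standard deviation are hypothetical placeholders: in FlowGRPO they would come from the model's velocity prediction and the method's noise schedule, which are not reproduced here.

```python
import math
import random


def gaussian_log_prob(x, mean, std):
    """Log density of x under independent Gaussians N(mean_i, std^2)."""
    var = std * std
    return sum(
        -0.5 * (xi - mi) ** 2 / var - math.log(std) - 0.5 * math.log(2 * math.pi)
        for xi, mi in zip(x, mean)
    )


def stochastic_step(mean, std, rng):
    """Sample x_next ~ N(mean, std^2 I) and return it together with its log probability."""
    x_next = [mi + std * rng.gauss(0.0, 1.0) for mi in mean]
    return x_next, gaussian_log_prob(x_next, mean, std)


rng = random.Random(0)
mean = [0.1, -0.3, 0.7]  # hypothetical drift-updated latent with 3 dimensions
x_next, logp = stochastic_step(mean, std=0.5, rng=rng)
print(x_next, logp)
```

Summing or averaging such per-step log probabilities along a denoising trajectory gives the quantities a policy-gradient update needs.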

Rollout generation: the diffusion policy generates rollout samples, collecting log probabilities and image trajectories.

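Reward scoring: a reward model assigns a score to each sample, and the scores are converted into an advantage for each trajectory. In GRPO‑style methods this is done by normalizing rewards within the group of samples generated for the same prompt.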

Policy optimization: a clipped (PPO‑style) surrogate loss updates the policy using the computed advantages.

Weight synchronization: trainer weights are periodically synced to rollout workers so that samples reflect the latest policy.
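The reward scoring and policy optimization stages above can be sketched in a few lines. This is a minimal illustration of a generic GRPO-style update, not VeRL-Omni's trainer code: the function names, the `clip_range` default of 0.2, and the toy numbers are assumptions, and any KL regularization term is omitted.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group (samples generated for the same prompt)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_loss(logp_new, logp_old, advantages, clip_range=0.2):
    """PPO-style clipped surrogate loss, averaged over the given log-prob entries."""
    terms = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # importance ratio pi_new / pi_old
        clipped = min(max(ratio, 1.0 - clip_range), 1.0 + clip_range)
        terms.append(min(ratio * adv, clipped * adv))
    return -sum(terms) / len(terms)  # negate because the surrogate is maximized


rewards = [0.2, 0.4, 0.6, 0.8]  # toy reward scores for four samples of one prompt
adv = group_advantages(rewards)
logp_old = [-1.0, -1.2, -0.9, -1.1]  # toy log probs recorded at rollout time
loss = clipped_loss(logp_old, logp_old, adv)  # new policy equals old policy here
print(adv, loss)
```

Clipping limits how far a single update can move the policy away from the one that produced the rollouts, which is also why the weight synchronization stage matters: the recorded log probabilities are only meaningful relative to a reasonably fresh policy.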

Performance Highlights

On NVIDIA H800 GPUs, placing the reward model on a separate GPU and overlapping reward computation with policy training reduces per‑step wall‑clock time by roughly 14%.

Full‑model fine‑tuning of Qwen‑Image for OCR on four NVIDIA H200 GPUs achieves 0.510 images/GPU/s, with each training step taking about 250 s. After only 120 steps, the quality of rendered text in generated images improves noticeably, and both the critic‑reward and validation‑reward curves converge stably.
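As a rough sanity check on these figures, the reported throughput and step time imply a per-step sample count. The calculation assumes the 0.510 images/GPU/s figure is averaged over the whole 250 s training step on all four GPUs. The source does not state this, so the result is an estimate rather than a reported batch size.

```python
throughput_per_gpu = 0.510  # images / GPU / s, from the text
num_gpus = 4                # from the text
step_seconds = 250          # approximate step time, from the text

# Assumption: throughput is measured end to end over a full training step.
images_per_step = throughput_per_gpu * num_gpus * step_seconds
images_for_120_steps = images_per_step * 120

print(round(images_per_step), round(images_for_120_steps))
```

Under that assumption, each step processes about 510 images, and the 120 steps mentioned above correspond to roughly 61,000 generated images.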

Getting Started

Code repository: https://github.com/verl-project/verl-omni

Documentation: https://verl-omni.readthedocs.io/en/latest/start/install.html

Examples directory (starter scripts for image, audio, and video RL trainers, with wandb tracking): https://github.com/verl-project/verl-omni/tree/main/examples

Demo: the FlowGRPO example trains Qwen‑Image with an OCR reward model based on Qwen3‑VL‑8B‑Instruct, which reads the rendered text in generated images and compares it with the ground‑truth captions.
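The comparison step of such an OCR reward might look like the sketch below. The source does not say which metric the demo uses, so the normalized edit distance here is an assumption, and `ocr_reward` and `edit_distance` are hypothetical names. The call to the vision-language model is not shown: `recognized` stands for whatever text the model reads from the image.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (dynamic programming, one row at a time)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def ocr_reward(recognized, target):
    """Score in [0, 1]: 1.0 for an exact match after normalization, lower as edits accumulate."""
    recognized = " ".join(recognized.lower().split())  # lowercase, collapse whitespace
    target = " ".join(target.lower().split())
    if not target:
        return 1.0 if not recognized else 0.0
    return max(0.0, 1.0 - edit_distance(recognized, target) / len(target))


print(ocr_reward("Hello  World", "hello world"))  # exact match after normalization
print(ocr_reward("helo world", "hello world"))    # one missing character
```

A graded score like this gives the policy a learning signal even when the rendered text is only partly correct, which an exact-match reward would not.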

Roadmap

Expand model support to emerging diffusion and multimodal architectures for image, video, audio, and unified tasks.

Integrate additional RL algorithms such as DiffusionNFT.

Develop a fully asynchronous RL pipeline that tightly couples actors, rollouts, and rewards to further boost throughput and hardware utilization.

Deepen integration with vLLM‑Omni (parallelism, quantization, batching, scheduling optimizations) to accelerate rollout generation.

Release more highly optimized trainers for multimodal and diffusion models built on Megatron‑core and VeOmni.

Broaden hardware support, refining the Ascend NPU path and enabling community‑built hardware plugins.

