VeRL-Omni: Universal RL Post‑Training for Diffusion and Multimodal Models

VeRL-Omni is an open‑source RL post‑training framework built on verl and vLLM‑Omni. It provides high‑throughput rollout and flexible reward computation for diffusion, AR‑DiT, and unified multimodal generation models, with modular trainers and support for multiple hardware backends. In the reported benchmarks, overlapping reward inference with training reduces per‑step wall‑clock time by approximately 14%.

Machine Heart

VeRL‑Omni Overview

VeRL‑Omni is a universal reinforcement‑learning (RL) post‑training framework for multimodal generation models, built on the verl library and vLLM‑Omni. It supports diffusion transformers such as Qwen‑Image, hybrid AR‑DiT architectures such as Qwen‑Omni, and unified understanding‑plus‑generation models (e.g., BAGEL, HunyuanImage‑3.0).

Motivation

Diffusion & multimodal extension: bring verl's flexibility and performance to non‑autoregressive RL for diffusion and omni‑modal models.

Heterogeneous rollout pipelines: a rollout traverses a latent denoising trajectory and may invoke several components in sequence (text encoder → DiT → VAE).

Complex workload scheduling: reward functions are often multimodal models themselves (VLM judges, OCR scorers), and multimodal rollouts consume far more peak memory than text‑only generation, which makes orchestration non‑trivial.

Key Technical Features

Efficient multimodal rollout: integrates vLLM‑Omni for asynchronous, high‑throughput serving. Output accuracy matches diffusers, while step‑wise continuous batching and embedding caching improve rollout efficiency.

Flexible reward engine: supports rule‑based and model‑based rewards (e.g., VLM‑as‑judge for OCR). Reward inference runs on vLLM and overlaps with rollout and training to reduce end‑to‑end latency.

Modular training backend: provides multiple trainers (DiffusersFSDP, Megatron, VeOmni) with built‑in optimizations for diffusion and multimodal models, compatible with parallel strategies such as FSDP, USP, and TP.

Broad hardware compatibility: runs on NVIDIA GPUs and Ascend NPUs, allowing flexible backend switching.

End‑to‑end training recipes and benchmarks: includes reference performance results that demonstrate high training throughput.
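To illustrate why step‑wise continuous batching can raise rollout throughput, here is a toy model. It is not vLLM‑Omni's scheduler: it assumes each request needs a known, positive number of denoising steps and that one forward pass advances every request in the batch by one step. The function names and the request mix are hypothetical.

```python
from collections import deque


def static_batching_steps(step_counts, batch_size):
    """Forward passes needed when each batch runs until its longest request ends."""
    total = 0
    for i in range(0, len(step_counts), batch_size):
        total += max(step_counts[i:i + batch_size])
    return total


def continuous_batching_steps(step_counts, batch_size):
    """Forward passes needed when a finished request frees its slot after every step."""
    waiting = deque(step_counts)
    running = []  # remaining denoising steps for each in-flight request
    total = 0
    while waiting or running:
        # Refill free slots before each denoising step.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One batched forward pass advances every in-flight request by one step;
        # requests that reach zero remaining steps leave the batch.
        running = [r - 1 for r in running if r - 1 > 0]
        total += 1
    return total


# Hypothetical mix of long (50-step) and short (10-step) requests.
requests = [50, 10, 10, 50, 10, 10]
print(static_batching_steps(requests, batch_size=2))      # 110
print(continuous_batching_steps(requests, batch_size=2))  # 70
```

In this toy setting the static schedule leaves a slot idle whenever a short request is paired with a long one, while the step‑wise schedule keeps both slots busy until the queue drains.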

Algorithm Support

The framework includes FlowGRPO, an online RL algorithm for flow‑matching diffusion models.

Getting Started

Installation instructions:

https://verl-omni.readthedocs.io/en/latest/start/install.html

Example scripts for image, audio, and video RL are in the repository's examples directory: https://github.com/verl-project/verl-omni/tree/main/examples

Demo: Qwen‑Image FlowGRPO training uses an OCR reward model (Qwen3‑VL‑8B‑Instruct) that reads the text rendered in each generated image and scores it against the ground truth.
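The scoring half of such an OCR reward could look like the sketch below. It assumes the recognized text has already been extracted from the image (for example by the VLM) and scores it with a normalized edit distance. The function names, the normalization (lowercasing, dropping whitespace), and the exact formula are assumptions for illustration, not taken from the VeRL‑Omni recipe.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def ocr_reward(recognized: str, target: str) -> float:
    """Return a score in [0, 1]; 1.0 means the rendered text matches the target."""
    # Assumed normalization: ignore case and whitespace differences.
    rec = "".join(recognized.lower().split())
    tgt = "".join(target.lower().split())
    if not tgt:
        return 1.0 if not rec else 0.0
    return max(0.0, 1.0 - edit_distance(rec, tgt) / len(tgt))


print(ocr_reward("Hello World", "hello world"))  # 1.0
print(ocr_reward("", "abc"))                     # 0.0
```

A graded score like this gives the policy a learning signal for partially correct text, which a strict exact‑match reward would not.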

FlowGRPO Algorithm Details

Rollout generation: the diffusion policy generates samples, recording log probabilities and each image's denoising trajectory.

Reward scoring: a reward model scores each sample, and a per‑trajectory advantage is computed from those scores.

Policy optimization: a clipped (PPO‑style) surrogate loss updates the policy using the computed advantage.

Weight synchronization: trainer weights are periodically synchronized to the rollout workers so that new samples reflect the latest policy.
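The scoring and optimization steps can be sketched as follows, assuming the standard GRPO recipe: advantages are rewards normalized within the group of samples generated for one prompt, and the loss is a PPO‑style clipped surrogate. This is a simplified illustration, not the framework's implementation: it uses a single log probability per sample (a diffusion policy would have one per denoising step), omits any KL regularization, and the function names, `eps`, and `clip_range` values are assumptions.

```python
import math
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate_loss(new_logps, old_logps, advantages, clip_range=0.2):
    """PPO-style clipped objective, negated so that lower is better."""
    losses = []
    for new_lp, old_lp, adv in zip(new_logps, old_logps, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_new / pi_old for this sample
        clipped = min(max(ratio, 1.0 - clip_range), 1.0 + clip_range)
        # Take the more pessimistic of the unclipped and clipped terms.
        losses.append(-min(ratio * adv, clipped * adv))
    return mean(losses)


# Four samples for one prompt with hypothetical reward scores.
rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_advantages(rewards)
old_logps = [-1.0, -1.2, -0.9, -1.1]
new_logps = [-0.9, -1.3, -0.8, -1.2]
print(clipped_surrogate_loss(new_logps, old_logps, advantages))
```

Clipping the probability ratio limits how far a single update can move the policy away from the one that produced the rollouts, which is also why the weights are synchronized back to the rollout workers.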

Experimental Results

LoRA Fine‑tuning

On an NVIDIA H800 GPU, training throughput reaches the reported level. Placing the reward model on a separate GPU and overlapping its inference with policy training reduces per‑step wall‑clock time by approximately 14%.
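The overlap pattern can be sketched with a background worker: reward scoring for later groups proceeds while the main thread trains on groups whose scores are already available. This is a minimal illustration under assumed names (`score_group`, `train_on_group`, `run_step` are hypothetical stand‑ins), not VeRL‑Omni's scheduler. Python threads only overlap in practice when the stages release the GIL, as GPU kernels and remote calls typically do.

```python
from concurrent.futures import ThreadPoolExecutor


def score_group(images):
    """Stand-in for reward-model inference running on a separate device."""
    return [float(len(img)) for img in images]


def train_on_group(images, rewards):
    """Stand-in for a policy update; returns the group's mean reward."""
    return sum(rewards) / len(rewards)


def run_step(groups):
    stats = []
    with ThreadPoolExecutor(max_workers=1) as reward_worker:
        # Queue every group for scoring up front; the worker scores them in order.
        pending = [reward_worker.submit(score_group, g) for g in groups]
        for group, future in zip(groups, pending):
            rewards = future.result()  # wait only for this group's scores
            # While this update runs, the worker keeps scoring later groups.
            stats.append(train_on_group(group, rewards))
    return stats


# Strings stand in for generated images in this toy example.
print(run_step([["ab", "abcd"], ["a", "abc"]]))  # [3.0, 2.0]
```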

Full‑model Fine‑tuning

Full‑model OCR training without classifier‑free guidance (CFG) on four NVIDIA H200 GPUs achieves 0.510 images/GPU/s, with each training step taking about 250 s. After only 120 steps, generated images show a noticeable improvement in text‑rendering quality. Training curves indicate that both the critic reward and the validation reward converge stably.
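As a rough sanity check, the reported figures imply the number of images processed per step. The calculation below assumes the throughput is averaged over the whole 250 s step; the source does not state this, so the derived numbers are indicative only.

```python
images_per_gpu_per_s = 0.510  # reported throughput
num_gpus = 4                  # reported GPU count
step_seconds = 250            # reported approximate step time
num_steps = 120               # steps after which improvement was observed

# Assumption: throughput is measured over the full training step.
images_per_step = images_per_gpu_per_s * num_gpus * step_seconds
print(round(images_per_step))              # 510
print(round(images_per_step * num_steps))  # 61200
```

Under that assumption, the visible improvement after 120 steps corresponds to roughly 61,000 generated images.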

Roadmap

Expand model support to emerging diffusion and multimodal models for image, video, and audio generation, as well as unified tasks.

Integrate additional advanced RL algorithms such as DiffusionNFT.

Develop fully asynchronous RL pipelines that further increase rollout throughput and hardware utilization.

Deepen co‑optimization with vLLM‑Omni (parallelism, quantization, batching, scheduling).

Release more highly optimized trainer engines beyond DiffusersFSDPTrainer.

Broaden hardware support, continuing work on Ascend NPU paths and inviting community‑built hardware plugins.
