UniRL: Tencent Hunyuan’s Open‑Source Framework Unifying Multimodal RL Training
UniRL is an open-source, distributed reinforcement-learning post-training framework that consolidates the fragmented pipelines used for image, video, and vision-language models. It offers a unified rollout-reward-advantage-train-sync contract, broad model support, built-in algorithms, and multimodal reward components, with the aim of lowering engineering barriers in AIGC research.
Challenges in multimodal reinforcement learning
Different generation processes – LLM RL optimizes discrete token sequences, while image/video diffusion operates on continuous latent trajectories. A single rollout that mixes autoregressive tokens and latent denoising makes credit assignment, log‑probability computation, and policy updates more complex.
Unstable system loops – Rollout, log‑prob replay, and policy updates span multiple models and back‑ends. Exact reproduction of sampling conditions (noise, timestep, conditioning) is required; otherwise a training‑inference mismatch introduces bias in the policy gradient.
Heavyweight reward pipelines – Multimodal rewards depend on VLMs, OCR, aesthetic models, video understanding models, or multi‑turn agents, forming costly evaluation chains rather than lightweight text‑only metrics.
High‑dimensional trajectory storage – Intermediate products are high‑dimensional latents, noise tensors, timesteps, and condition states. Storing and transmitting these for log‑prob replay inflates memory usage, especially for high‑resolution video generation.
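The training-inference mismatch described under "Unstable system loops" can be illustrated with a toy, one-dimensional denoising step. This is a minimal sketch under stated assumptions, not UniRL code: `policy_mean`, `rollout_step`, and `replay_logprob` are hypothetical names, and the linear "denoiser" stands in for a real model. It shows that replaying a stored transition with the same weights gives an importance ratio of exactly 1 only when the timestep used at replay matches the one used at rollout.

```python
import math
import random

def policy_mean(x, t, theta):
    # Toy stand-in for a denoiser (hypothetical): shrink x in proportion to t.
    return x - theta * t * x

def gaussian_logprob(x_next, mean, sigma):
    # Log-density of x_next under N(mean, sigma^2).
    z = (x_next - mean) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)

def rollout_step(x, t, theta, sigma, rng):
    # Sample one stochastic denoising step and record its log-prob.
    eps = rng.gauss(0.0, 1.0)
    mean = policy_mean(x, t, theta)
    x_next = mean + sigma * eps
    return x_next, gaussian_logprob(x_next, mean, sigma)

def replay_logprob(x, t, x_next, theta, sigma):
    # Re-evaluate the stored transition (x, t) -> x_next under the policy.
    return gaussian_logprob(x_next, policy_mean(x, t, theta), sigma)

rng = random.Random(0)
x, t, theta, sigma = 1.0, 0.8, 0.5, 0.3
x_next, logp_old = rollout_step(x, t, theta, sigma, rng)

# Same weights, same stored (x, t, x_next): the ratio is exactly 1.
ratio_same = math.exp(replay_logprob(x, t, x_next, theta, sigma) - logp_old)

# Same weights, but the replay uses a slightly different timestep:
# the ratio drifts away from 1 even though the policy has not changed.
ratio_bad = math.exp(replay_logprob(x, 0.7, x_next, theta, sigma) - logp_old)

print(ratio_same, ratio_bad)
```

In a real pipeline the same reasoning applies to noise and conditioning: whatever entered the rollout-time mean and variance has to be stored or reproducible at replay time.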
UniRL: a unified distributed RL post‑training framework
UniRL is not tied to any specific model family, algorithm, or training stack. Its core skeleton combines a Ray worker group, a flat Hydra recipe, composable training back-ends, and a pluggable rollout engine, which together abstract a unified RL loop:
rollout → reward → advantage → train → weight-sync
The framework introduces a typed rollout data model called a track. Each generation stage is represented as a TextSegment (autoregressive phase) or a LatentSegment (diffusion phase). Parent-child relationships link tracks, enabling pipelines such as Bagel or HunyuanImage 3.0 that first generate textual thoughts and then produce images via DiT.
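The track model described above can be sketched with plain dataclasses. Only the names TextSegment, LatentSegment, and track come from the description; every field name, the `Track` class layout, and the `lineage` helper are assumptions made for illustration and may not match UniRL's actual definitions.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union

@dataclass
class TextSegment:
    """Autoregressive phase: sampled tokens and their per-token log-probs."""
    token_ids: List[int]
    logprobs: List[float]

@dataclass
class LatentSegment:
    """Diffusion phase: timesteps visited and handles to stored latents."""
    timesteps: List[float]
    latent_refs: List[str]  # hypothetical storage keys, not real tensors

@dataclass
class Track:
    track_id: str
    segment: Union[TextSegment, LatentSegment]
    parent_id: Optional[str] = None  # None marks a root track

def lineage(tracks: Dict[str, Track], track_id: str) -> List[str]:
    """Return track ids from the root ancestor down to `track_id`."""
    chain: List[str] = []
    current: Optional[str] = track_id
    while current is not None:
        chain.append(current)
        current = tracks[current].parent_id
    return chain[::-1]

# A "think, then draw" rollout: a text track followed by a child image track.
think = Track("think-0", TextSegment(token_ids=[101, 7, 42],
                                     logprobs=[-0.2, -1.1, -0.4]))
image = Track("img-0",
              LatentSegment(timesteps=[1.0, 0.5, 0.0],
                            latent_refs=["lat/0", "lat/1", "lat/2"]),
              parent_id="think-0")
tracks = {t.track_id: t for t in (think, image)}

print(lineage(tracks, "img-0"))
```

Keeping the parent link on the child lets a reward computed on the final image be traced back to the text track that conditioned it, which is what credit assignment across the two phases needs.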
Supported multimodal models
Image generators: SD3/3.5, Qwen‑Image, Z‑Image, FLUX.2‑Klein
Video generators: HunyuanVideo 1.0 & 1.5, WAN series
Large language models: Qwen‑3 series
Vision‑language models: Qwen‑VL series
Native unified multimodal models: HunyuanImage 3.0, Bagel
Compositional architectures: LLM/VLM + Diffusion prompt‑enhancer
Integrated RL algorithms
Policy‑gradient families: FlowGRPO, DanceGRPO, MixGRPO, LLM/VLM GRPO
Forward‑process family: DiffusionNFT (efficient training without full SDE rollout)
Flow‑DPPO – replaces PPO ratio clipping with a step‑wise KL‑based proximal constraint and applies an asymmetric divergence mask that blocks updates only when the new policy moves farther from the old one beyond a threshold, improving stability for flow/diffusion image and video models. Paper: https://arxiv.org/pdf/2606.11025
DRPO – uses advantage‑weighted smooth policy‑shift regularization instead of hard importance‑ratio clipping, providing continuous gradient corrections beyond the trust‑region boundary for more stable LLM RL. Paper: https://arxiv.org/pdf/2606.09821
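The GRPO-style methods listed above share one ingredient that is easy to show in isolation: advantages are computed relative to a group of samples drawn for the same prompt, rather than from a learned value function. The following is a minimal sketch of that normalization only. The function name and the `eps` default are assumptions, the population standard deviation is one of several conventions in use, and the individual variants (FlowGRPO, DanceGRPO, MixGRPO) differ in details not covered here.

```python
from statistics import fmean, pstdev
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one prompt's sample group.

    A_i = (r_i - mean(r)) / (std(r) + eps)
    """
    mu = fmean(rewards)
    sd = pstdev(rewards)  # population std; implementations vary on this choice
    return [(r - mu) / (sd + eps) for r in rewards]

# Four samples generated for the same prompt, scored by some reward model.
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
print(adv)

# If every sample in the group gets the same reward, no sample is preferred.
flat = group_relative_advantages([2.0, 2.0])
print(flat)
```

Samples scoring above the group mean receive positive advantages and those below receive negative ones, so a group with identical rewards contributes no gradient signal.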
Reward and evaluation components
Similarity / rule metrics: CLIPScore, GOT‑OCR‑2.0
Aesthetic / preference models: PickScore, HPSv2 / HPSv3, ImageReward
VLM‑as‑judge: UnifiedReward, GenEval2, WISE
Video‑specific evaluators: VideoPickScore, VideoAlign
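Several of these components are typically combined into a single scalar reward per sample. The sketch below shows one simple way to do that, a weighted sum over named scorers. It is an illustration under stated assumptions: `combine_rewards` is a hypothetical helper, and the two toy scorers are stand-ins that do not call CLIPScore, GOT-OCR-2.0, or any other real model.

```python
from typing import Callable, Dict, Tuple

Scorer = Callable[[str, dict], float]

def toy_alignment(prompt: str, sample: dict) -> float:
    # Stand-in for a similarity metric: fraction of prompt words in the caption.
    words = prompt.lower().split()
    caption = set(sample["caption"].lower().split())
    return sum(w in caption for w in words) / len(words)

def toy_text_match(prompt: str, sample: dict) -> float:
    # Stand-in for an OCR rule: 1.0 if the recognized text appears in the prompt.
    text = sample["ocr_text"]
    return 1.0 if text and text in prompt else 0.0

def combine_rewards(scorers: Dict[str, Tuple[Scorer, float]],
                    prompt: str, sample: dict) -> Tuple[float, Dict[str, float]]:
    """Return the weighted total and the per-component scores."""
    parts = {name: fn(prompt, sample) for name, (fn, _) in scorers.items()}
    total = sum(weight * parts[name] for name, (_, weight) in scorers.items())
    return total, parts

scorers = {
    "alignment": (toy_alignment, 0.5),
    "ocr": (toy_text_match, 0.5),
}
prompt = "a red sign that says OPEN"
sample = {"caption": "a red sign", "ocr_text": "OPEN"}
total, parts = combine_rewards(scorers, prompt, sample)
print(total, parts)
```

Returning the per-component scores alongside the total makes it easier to see which evaluator is driving a reward change during training.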
Trajectory handling and memory management
UniRL manages high‑dimensional intermediate states through batch forward processing, sparse tracks, offload mechanisms, and sleep/wake cycles, preventing large tensors from aggregating on the driver and reducing peak GPU memory pressure.
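A back-of-the-envelope calculation shows why these measures matter for video. The sketch below assumes that a sparse track keeps only a subset of denoising transitions for replay, which is one plausible reading of the term rather than a confirmed description of UniRL. The latent shape, step count, and stride are illustrative values, and both function names are hypothetical.

```python
from math import prod
from typing import List, Tuple

def kept_step_indices(num_steps: int, stride: int) -> List[int]:
    """Indices of the denoising transitions kept for log-prob replay."""
    return list(range(0, num_steps, stride))

def stored_bytes(latent_shape: Tuple[int, ...], kept_steps: int,
                 bytes_per_elem: int = 2) -> int:
    """Bytes to store the latent before and after each kept transition."""
    per_tensor = prod(latent_shape) * bytes_per_elem
    return 2 * per_tensor * kept_steps

# Illustrative video latent: 16 channels, 21 latent frames, 90 x 160 spatial,
# stored in a 2-byte format such as bf16. These numbers are assumptions.
video_latent = (16, 21, 90, 160)
num_steps = 50

dense = stored_bytes(video_latent, len(kept_step_indices(num_steps, 1)))
sparse = stored_bytes(video_latent, len(kept_step_indices(num_steps, 5)))

print(f"dense:  {dense / 2**20:.1f} MiB per sample")
print(f"sparse: {sparse / 2**20:.1f} MiB per sample")
```

Under these assumptions a single dense trajectory approaches 1 GiB, so gathering a batch of them on one driver process quickly becomes impractical, while keeping every fifth transition cuts the footprint by the same factor of five.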
Code and documentation
GitHub repository: https://github.com/Tencent-Hunyuan/UniRL
Official documentation: https://unirl-project.github.io/unirl/
Machine Learning Algorithms & Natural Language Processing
Focused on frontier AI technologies, empowering AI researchers' progress.
