UniRL: Tencent Hunyuan’s Open‑Source Framework Unifying Multimodal RL Training

UniRL is an open‑source, distributed reinforcement‑learning post‑training framework that consolidates fragmented pipelines for image, video, and language‑vision models, offering a unified rollout‑reward‑advantage‑train‑sync contract, extensive model support, built‑in algorithms, and multi‑modal reward components to lower engineering barriers in AIGC research.

Machine Learning Algorithms & Natural Language Processing

Challenges in multimodal reinforcement learning

Different generation processes – LLM RL optimizes discrete token sequences, while image/video diffusion operates on continuous latent trajectories. A single rollout that mixes autoregressive tokens and latent denoising makes credit assignment, log‑probability computation, and policy updates more complex.

Unstable system loops – Rollout, log‑prob replay, and policy updates span multiple models and back‑ends. Exact reproduction of sampling conditions (noise, timestep, conditioning) is required; otherwise a training‑inference mismatch introduces bias in the policy gradient.

Heavyweight reward pipelines – Multimodal rewards depend on VLMs, OCR, aesthetic models, video understanding models, or multi‑turn agents, forming costly evaluation chains rather than lightweight text‑only metrics.

High‑dimensional trajectory storage – Intermediate products are high‑dimensional latents, noise tensors, timesteps, and condition states. Storing and transmitting these for log‑prob replay inflates memory usage, especially for high‑resolution video generation.
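The training‑inference mismatch described under "Unstable system loops" can be illustrated with a toy stochastic denoising step. The sketch below is not UniRL code: `toy_mean`, `rollout_step`, and `replay_log_prob` are hypothetical names, and the "denoiser" is a one‑line stand‑in. It only shows that replaying a stored transition with the exact recorded timestep reproduces the rollout log‑probability, while a mismatched timestep yields an importance ratio different from 1 even though the weights never changed.

```python
import math
import random


def toy_mean(x, t):
    # Hypothetical stand-in for a denoiser's predicted mean at timestep t.
    return [xi * (1.0 - 0.1 * t) for xi in x]


def gaussian_log_prob(x_next, mean, sigma):
    # Log-density of a diagonal Gaussian N(mean, sigma^2 I) evaluated at x_next.
    return sum(
        -0.5 * ((a - m) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)
        for a, m in zip(x_next, mean)
    )


def rollout_step(x, t, sigma, rng):
    # Sample one stochastic step and record everything a later replay needs.
    noise = [rng.gauss(0.0, 1.0) for _ in x]
    mean = toy_mean(x, t)
    x_next = [m + sigma * n for m, n in zip(mean, noise)]
    record = {"x": x, "t": t, "sigma": sigma, "x_next": x_next}
    return record, gaussian_log_prob(x_next, mean, sigma)


def replay_log_prob(record, t=None):
    # Recompute the log-prob from the record; overriding t simulates a mismatch.
    t = record["t"] if t is None else t
    mean = toy_mean(record["x"], t)
    return gaussian_log_prob(record["x_next"], mean, record["sigma"])


rng = random.Random(0)
record, logp_rollout = rollout_step([0.5, -1.0, 2.0], t=3, sigma=0.2, rng=rng)

# Same weights in both cases; only the replayed conditions differ.
ratio_exact = math.exp(replay_log_prob(record) - logp_rollout)
ratio_wrong_t = math.exp(replay_log_prob(record, t=4) - logp_rollout)
```

With identical weights the importance ratio should be exactly 1, so any deviation in `ratio_wrong_t` is pure replay error that would leak into the policy gradient.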

UniRL: a unified distributed RL post‑training framework

UniRL is not tied to any specific model family, algorithm, or training stack. Its core skeleton combines a Ray worker group, a Hydra flat recipe, composable training back‑ends, and a pluggable rollout engine, and abstracts them into a unified RL loop:

rollout → reward → advantage → train → weight‑sync
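As a minimal sketch of the five‑stage loop above, the toy below runs each stage in order on a one‑parameter Gaussian policy. Everything in it is an assumption made for illustration: `ToyRolloutEngine`, `ToyTrainer`, `reward_fn`, and `advantage_fn` are hypothetical names and do not reflect UniRL's actual interfaces, and the update rule is plain REINFORCE with a mean baseline rather than any algorithm shipped with the framework.

```python
import random


class ToyRolloutEngine:
    """Stand-in for an inference engine holding its own copy of the weights."""

    def __init__(self, mean):
        self.mean = mean

    def generate(self, n, rng):
        # Sample from a unit-variance Gaussian policy centred on self.mean.
        return [self.mean + rng.gauss(0.0, 1.0) for _ in range(n)]


class ToyTrainer:
    """Stand-in for a training back-end with a single learnable parameter."""

    def __init__(self, mean, lr):
        self.mean = mean
        self.lr = lr

    def update(self, samples, advantages):
        # REINFORCE: for a unit-variance Gaussian, d/dmean log p(x) = x - mean.
        grad = sum(a * (x - self.mean) for x, a in zip(samples, advantages))
        self.mean += self.lr * grad / len(samples)


def reward_fn(samples, target=3.0):
    # Higher reward the closer a sample lands to the target.
    return [-(x - target) ** 2 for x in samples]


def advantage_fn(rewards):
    # Subtract the batch mean as a simple baseline.
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def run(num_steps=200, batch_size=64, seed=0):
    rng = random.Random(seed)
    engine = ToyRolloutEngine(mean=0.0)
    trainer = ToyTrainer(mean=0.0, lr=0.05)
    for _ in range(num_steps):
        samples = engine.generate(batch_size, rng)  # rollout
        rewards = reward_fn(samples)                # reward
        advantages = advantage_fn(rewards)          # advantage
        trainer.update(samples, advantages)         # train
        engine.mean = trainer.mean                  # weight-sync
    return engine, trainer


engine, trainer = run()
```

Keeping the rollout engine and the trainer as separate objects mirrors why a weight‑sync stage exists at all: if the copy is skipped, the next rollout samples from stale weights.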

The framework introduces a typed rollout data model called track. Each generation stage is represented as a TextSegment (autoregressive phase) or a LatentSegment (diffusion phase). Parent‑child relationships link tracks, enabling pipelines such as Bagel or HunyuanImage 3.0 that first generate textual thoughts and then produce images via DiT.
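One way to picture the track model is as a small set of typed records. The sketch below reuses the names TextSegment and LatentSegment from the description above, but every field (`token_ids`, `log_probs`, `timesteps`, `latent_shape`, `parent_id`) and the `lineage` helper are assumptions introduced for illustration, not the framework's actual schema.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple, Union


@dataclass
class TextSegment:
    """Autoregressive phase: sampled tokens and their per-token log-probs."""
    token_ids: List[int]
    log_probs: List[float]


@dataclass
class LatentSegment:
    """Diffusion phase: the denoising schedule and the stored latent shape."""
    timesteps: List[float]
    latent_shape: Tuple[int, ...]


@dataclass
class Track:
    """One generation stage, optionally linked to the stage that produced it."""
    track_id: str
    segment: Union[TextSegment, LatentSegment]
    parent_id: Optional[str] = None


def lineage(tracks: Dict[str, Track], track_id: str) -> List[str]:
    """Return track ids from the root ancestor down to `track_id`."""
    chain: List[str] = []
    current: Optional[str] = track_id
    while current is not None:
        chain.append(current)
        current = tracks[current].parent_id
    return chain[::-1]


# A two-stage pipeline: textual "thought" first, then an image via diffusion.
thought = Track("thought", TextSegment([101, 2023, 102], [-0.1, -0.7, -0.2]))
image = Track("image", LatentSegment([1.0, 0.5, 0.0], (16, 64, 64)), parent_id="thought")
tracks = {t.track_id: t for t in (thought, image)}

stages = [type(tracks[i].segment).__name__ for i in lineage(tracks, "image")]
```

Walking the parent links recovers the stage order, which is what lets a reward on the final image be attributed back to the text stage that conditioned it.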

Supported multimodal models

Image generators: SD3/3.5, Qwen‑Image, Z‑Image, FLUX.2‑Klein

Video generators: HunyuanVideo 1.0 & 1.5, WAN series

Large language models: Qwen‑3 series

Vision‑language models: Qwen‑VL series

Native unified multimodal models: HunyuanImage 3.0, Bagel

Compositional architectures: LLM/VLM + Diffusion prompt‑enhancer

Integrated RL algorithms

Policy‑gradient families: FlowGRPO, DanceGRPO, MixGRPO, LLM/VLM GRPO

Forward‑process family: DiffusionNFT (efficient training without full SDE rollout)

Flow‑DPPO – replaces PPO ratio clipping with a step‑wise KL‑based proximal constraint and applies an asymmetric divergence mask that blocks updates only when the new policy moves farther from the old one beyond a threshold, improving stability for flow/diffusion image and video models. Paper: https://arxiv.org/pdf/2606.11025

DRPO – uses advantage‑weighted smooth policy‑shift regularization instead of hard importance‑ratio clipping, providing continuous gradient corrections beyond the trust‑region boundary for more stable LLM RL. Paper: https://arxiv.org/pdf/2606.09821
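For context on the list above, the sketch below shows two standard building blocks: the group‑relative advantage that gives the GRPO family its name, and the PPO‑style hard ratio clipping that Flow‑DPPO and DRPO are described as replacing. It is a generic textbook illustration, not code from UniRL or from either paper; the function names are hypothetical, and the use of the population standard deviation and `eps` are assumptions, since implementations differ on both.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    # Normalise each reward against the other samples drawn for the same prompt.
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    # PPO objective term: min(r * A, clip(r, 1 - eps, 1 + eps) * A).
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


def clipped_surrogate_grad(ratio, advantage, clip_eps=0.2):
    # Derivative with respect to the ratio: A while the unclipped term is
    # the active branch of the min, and exactly 0 once clipping takes over.
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    unclipped = ratio * advantage
    clipped = clipped_ratio * advantage
    return advantage if unclipped <= clipped else 0.0


# Four samples for one prompt, two rewarded and two not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Inside the trust region the gradient passes through; outside it vanishes.
grad_inside = clipped_surrogate_grad(ratio=1.1, advantage=1.0)
grad_outside = clipped_surrogate_grad(ratio=1.5, advantage=1.0)
```

The vanishing gradient in `grad_outside` is the hard cutoff at the trust‑region boundary; the DRPO description above contrasts this with continuous gradient corrections beyond that boundary.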

Reward and evaluation components

Similarity / rule metrics: CLIPScore, GOT‑OCR‑2.0

Aesthetic / preference models: PickScore, HPSv2 / HPSv3, ImageReward

VLM‑as‑judge: UnifiedReward, GenEval2, WISE

Video‑specific evaluators: VideoPickScore, VideoAlign

Trajectory handling and memory management

UniRL manages high‑dimensional intermediate states through batch forward processing, sparse tracks, offload mechanisms, and sleep/wake cycles, preventing large tensors from aggregating on the driver and reducing peak GPU memory pressure.
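The idea of keeping large tensors off the driver can be sketched with lightweight handles: workers hold the payloads, and the driver only ever sees small references. The code below is a pure‑Python stand‑in under stated assumptions; `WorkerStore` and `LatentHandle` are hypothetical names, byte strings stand in for latent tensors, and nothing here reflects how UniRL or Ray actually implement this.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class LatentHandle:
    """Small reference the driver can hold instead of the payload itself."""
    worker_id: int
    key: str
    num_bytes: int


class WorkerStore:
    """Per-worker storage for large intermediate states."""

    def __init__(self, worker_id: int):
        self.worker_id = worker_id
        self._blobs: Dict[str, bytes] = {}

    def put(self, key: str, blob: bytes) -> LatentHandle:
        self._blobs[key] = blob
        return LatentHandle(self.worker_id, key, len(blob))

    def get(self, handle: LatentHandle) -> bytes:
        return self._blobs[handle.key]

    def free(self, handle: LatentHandle) -> None:
        self._blobs.pop(handle.key, None)

    def resident_bytes(self) -> int:
        return sum(len(b) for b in self._blobs.values())


workers = [WorkerStore(i) for i in range(2)]
driver_view: List[LatentHandle] = []

# Rollout: each worker keeps its trajectories local and returns handles only.
for worker in workers:
    for j in range(3):
        blob = bytes(1_000_000)  # stand-in for a 1 MB latent trajectory
        driver_view.append(worker.put(f"traj-{j}", blob))

bytes_held_by_workers = sum(w.resident_bytes() for w in workers)

# Training: the owning worker resolves each handle, uses it, then frees it.
for handle in driver_view:
    payload = workers[handle.worker_id].get(handle)
    assert len(payload) == handle.num_bytes
    workers[handle.worker_id].free(handle)

bytes_after_training = sum(w.resident_bytes() for w in workers)
```

The driver's list grows with the number of trajectories but not with their size, which is the property the paragraph above attributes to UniRL's trajectory handling.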

Code and documentation

GitHub repository: https://github.com/Tencent-Hunyuan/UniRL

Official documentation: https://unirl-project.github.io/unirl/
