How BranchGRPO Accelerates and Stabilizes Diffusion Model Alignment

BranchGRPO combines tree-structured branching, reward fusion, and lightweight pruning to speed up reinforcement-learning alignment of diffusion and flow models while providing denser, more stable reward signals. On image and video generation benchmarks it reports higher alignment scores and per-iteration training up to 4.7× faster than DanceGRPO.

Data Party THU

Background and Challenge

Diffusion and flow-matching models achieve high-fidelity image and video generation, but large-scale pre-training alone does not guarantee alignment with human preferences. Reinforcement learning from human feedback (RLHF) is used to close this gap, yet standard Group Relative Policy Optimization (GRPO) hits two bottlenecks when applied to diffusion and flow models:

Low efficiency: sequential rollout requires O(N×T) sampling, where N is the number of sampled trajectories in a group and T is the number of denoising steps. Every trajectory is sampled independently from start to finish, so much of the computation is redundant.

Sparse rewards: only a single terminal reward is computed per trajectory and assigned uniformly to every denoising step, leading to high-variance gradients and unstable training.
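The sparse-reward bottleneck can be made concrete with a minimal sketch of group-relative advantages. The reward values, the `eps` constant, and both function names are illustrative assumptions, not taken from the paper's code.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize terminal rewards within one group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def broadcast_over_steps(advantages, num_steps):
    """Copy each rollout's single advantage to all of its denoising steps."""
    return [[a] * num_steps for a in advantages]

# Hypothetical terminal rewards for a group of N = 4 rollouts.
rewards = [0.2, 0.5, 0.8, 0.5]
adv = group_relative_advantages(rewards)
per_step = broadcast_over_steps(adv, num_steps=3)
```

Every step of a rollout receives the same number, so the signal cannot distinguish a helpful early step from a harmful late one.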

BranchGRPO Overview

BranchGRPO restructures the sampling process into a tree‑shaped expansion, introduces reward fusion, and applies lightweight pruning. The design yields simultaneous improvements in sampling speed, training stability, and alignment quality.

Tree‑Structured Branching in Diffusion

At predefined diffusion steps the trajectory splits into multiple child paths that share the computed prefix, avoiding redundant forward passes.

Branching

Each selected step can generate several sub‑paths; shared prefix calculations reduce overall sampling cost while preserving the full diffusion dynamics.
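A rough way to see the saving from shared prefixes is to count denoiser calls. The sketch below assumes one call per live path per step and that a split reuses the parent's prediction for all of its children; the step count, split schedule, and branching factor are hypothetical.

```python
def sequential_cost(num_samples, num_steps):
    """Denoiser calls when every sample is rolled out independently."""
    return num_samples * num_steps

def tree_cost(num_steps, branch_steps, branch_factor):
    """Denoiser calls and leaf count for a single tree rollout.

    Assumes one call per live path at each step; children created at a
    split share the parent's call and differ only in their sampled noise.
    """
    paths = 1
    evals = 0
    for t in range(num_steps):
        evals += paths
        if t in branch_steps:
            paths *= branch_factor
    return evals, paths

# Hypothetical schedule: 20 steps, binary splits at steps 0, 5, 10, 15.
evals, leaves = tree_cost(20, {0, 5, 10, 15}, 2)
baseline = sequential_cost(leaves, 20)
```

For this toy schedule the tree yields 16 leaves from 135 calls, against 320 calls for 16 independent rollouts. Later splits save more, because a longer prefix is shared.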

Reward Fusion & Depth‑wise Advantage

Leaf‑node rewards are propagated upward and normalized at each depth, producing dense advantage signals for every step.
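The upward propagation and per-depth normalization can be sketched on a small tree. This is a minimal illustration under stated assumptions: a uniform mean over children stands in for whatever fusion weighting the authors use, and the `Node` class and both function names are hypothetical.

```python
from dataclasses import dataclass, field
from statistics import mean, pstdev

@dataclass
class Node:
    depth: int
    reward: float = 0.0
    children: list = field(default_factory=list)
    advantage: float = 0.0

def fuse_rewards(node):
    """Bottom-up: an internal node's reward is the mean of its children's."""
    if node.children:
        node.reward = mean([fuse_rewards(c) for c in node.children])
    return node.reward

def depthwise_advantages(root, eps=1e-8):
    """Standardize rewards among all nodes that sit at the same depth."""
    by_depth = {}
    stack = [root]
    while stack:
        n = stack.pop()
        by_depth.setdefault(n.depth, []).append(n)
        stack.extend(n.children)
    for nodes in by_depth.values():
        rs = [n.reward for n in nodes]
        mu, sigma = mean(rs), pstdev(rs)
        for n in nodes:
            n.advantage = (n.reward - mu) / (sigma + eps)
    return by_depth

# Hypothetical binary tree with four leaf rewards.
leaves = [Node(depth=2, reward=r) for r in (0.1, 0.3, 0.6, 1.0)]
left = Node(depth=1, children=leaves[:2])
right = Node(depth=1, children=leaves[2:])
root = Node(depth=0, children=[left, right])
fuse_rewards(root)
depthwise_advantages(root)
```

After this pass every node, and therefore every denoising segment leading into it, carries its own advantage instead of a single value copied along the whole trajectory.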

Pruning

Width pruning: retain only key leaf nodes for back-propagation, cutting gradient computation.

Depth pruning: skip back-propagation through selected layers while keeping forward passes and reward evaluation, further lowering overhead.
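Both pruning modes amount to choosing where gradients are computed. In the sketch below, the rule for picking "key" leaves (largest deviation from the mean reward) and the set of skipped steps are assumptions made for illustration; the article does not specify either criterion.

```python
from statistics import mean

def width_prune(leaf_rewards, keep):
    """Indices of leaves kept for back-propagation.

    Assumed rule: keep the `keep` leaves whose reward deviates most
    from the group mean, since they carry the strongest signal.
    """
    mu = mean(leaf_rewards)
    ranked = sorted(
        range(len(leaf_rewards)),
        key=lambda i: abs(leaf_rewards[i] - mu),
        reverse=True,
    )
    return sorted(ranked[:keep])

def depth_prune(num_steps, skipped_steps):
    """Per-step mask: True where gradients are still computed."""
    return [t not in skipped_steps for t in range(num_steps)]

# Hypothetical rewards for six leaves and a six-step schedule.
kept = width_prune([0.1, 0.45, 0.5, 0.55, 0.9, 0.5], keep=2)
grad_mask = depth_prune(6, {1, 3})
```

In both cases the forward pass and reward evaluation still cover the full tree; only the backward pass is restricted.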

Experimental Results

Image Alignment (HPDv2.1)

BranchGRPO achieves higher alignment scores (HPS-v2.1 0.363–0.369, ImageReward 1.319). It also cuts per-iteration time from 698 s for DanceGRPO to 493 s (full), 314 s (pruned), and 148 s (Mix variant), a speed-up of up to 4.7×.
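The speed-up factors follow directly from the reported iteration times. The short calculation below uses only the numbers quoted above; the variable names are illustrative.

```python
# Per-iteration times in seconds, as reported in the article.
baseline_s = 698.0  # DanceGRPO
variants = {"full": 493.0, "pruned": 314.0, "Mix": 148.0}

# Speed-up of each BranchGRPO variant relative to DanceGRPO.
speedups = {name: baseline_s / t for name, t in variants.items()}

for name, s in speedups.items():
    print(f"{name}: {s:.2f}x")
```

The full, pruned, and Mix variants come out at roughly 1.4×, 2.2×, and 4.7× respectively.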

Video Generation (WanX‑1.3B)

Compared with baseline models that exhibit flickering and deformation, BranchGRPO produces sharper frames with richer details and consistent temporal coherence. An iteration takes ≈ 8 min versus ≈ 20 min for DanceGRPO, more than doubling training efficiency.

Ablation Studies

A moderate branching factor and dense early splitting accelerate reward improvement. Weighted reward fusion stabilizes training, depth pruning yields the best final performance, and mixed ODE-SDE scheduling gives the fastest training while preserving stability.

Diversity Preservation

The squared Maximum Mean Discrepancy (MMD²) is ≈ 0.019, comparable to that of sequential sampling, indicating that branching does not compromise sample diversity.
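For readers unfamiliar with the metric, MMD² compares two sample sets through a kernel and is zero when the sets coincide. The sketch below is a generic biased estimator with a Gaussian kernel; the kernel choice, the bandwidth, and the toy 2-D points are assumptions, and the article does not say which features or kernel the authors used.

```python
import math

def rbf(x, y, bandwidth):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * bandwidth ** 2))

def mean_kernel(xs, ys, bandwidth):
    """Average kernel value over all pairs drawn from xs and ys."""
    total = sum(rbf(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))

def mmd2(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    return (
        mean_kernel(xs, xs, bandwidth)
        + mean_kernel(ys, ys, bandwidth)
        - 2.0 * mean_kernel(xs, ys, bandwidth)
    )

# Toy feature vectors standing in for embeddings of generated samples.
a = [(0.0, 0.0), (1.0, 0.0)]
b = [(5.0, 5.0), (6.0, 5.0)]
same = mmd2(a, a)
far = mmd2(a, b)
```

A value near zero, as in the reported result, means the two sets of samples are hard to tell apart under the chosen kernel.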

Scalability (Scaling Law)

Increasing the branching factor or depth steadily improves performance. For a batch of 81 samples, DanceGRPO requires 2400 s per iteration, whereas BranchGRPO finishes in 680 s, roughly 3.5× faster.

Conclusion and Outlook

BranchGRPO converts sparse terminal rewards into dense, step-level signals, achieving per-iteration training up to 4.7× faster and higher alignment quality on both image and video tasks. Future work may explore adaptive branching and pruning, as well as extensions to multimodal or larger-scale generation, which could make BranchGRPO a core technique for RLHF in diffusion and flow models.

@article{li2025branchgrpo,
  title={BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models},
  author={Li, Yuming and Wang, Yikai and Zhu, Yuying and Zhao, Zhongyu and Lu, Ming and She, Qi and Zhang, Shanghang},
  journal={arXiv preprint arXiv:2509.06040},
  year={2025}
}

Paper: https://arxiv.org/pdf/2509.06040

Project page: https://fredreic1849.github.io/BranchGRPO-Webpage/

Code: https://github.com/Fredreic1849/BranchGRPO

PKU HMI Lab: https://pku-hmi-lab.github.io/HMI-Web/index.html

Source: 机器之心 (original article: about 2,400 words, suggested reading time 10 minutes)
Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
