How BranchGRPO Accelerates and Stabilizes Diffusion Model Alignment
BranchGRPO combines tree-structured branching, reward fusion, and lightweight pruning to speed up reinforcement-learning alignment of diffusion and flow models while delivering denser, more stable reward signals. On image and video generation benchmarks it reports up to five-fold faster convergence and higher alignment scores.
Background and Challenge
Diffusion and flow-matching models achieve high-fidelity image and video generation, but large-scale pre-training alone does not guarantee alignment with human preferences. Reinforcement learning from human feedback (RLHF) is used to close this gap, yet standard Group Relative Policy Optimization (GRPO) suffers from two bottlenecks when applied to diffusion and flow models:
Low efficiency: sequential roll‑out requires O(N×T) sampling, where N is the batch size and T the number of diffusion steps, causing redundant computation.
Sparse rewards: only a single terminal reward is computed and uniformly back‑propagated, leading to high‑variance gradients and unstable training.
BranchGRPO Overview
BranchGRPO restructures the sampling process into a tree‑shaped expansion, introduces reward fusion, and applies lightweight pruning. The design yields simultaneous improvements in sampling speed, training stability, and alignment quality.
Tree‑Structured Branching in Diffusion
At predefined diffusion steps the trajectory splits into multiple child paths that share the computed prefix, avoiding redundant forward passes.
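The splitting described above can be illustrated with a toy rollout. This is only a sketch under stated assumptions: `model_drift` is a hypothetical stand-in for the expensive network call, the state is a single float rather than a latent tensor, and the step count, split steps, and noise scale are made up for illustration. None of these names come from the BranchGRPO code.

```python
import random


def model_drift(x, t):
    # Hypothetical stand-in for the denoising network (the expensive call).
    return 0.9 * x


def branch_rollout(x0, num_steps, split_steps, k, sigma=0.1, seed=0):
    """Roll out a tree of trajectories; at each split step every path forks into k children."""
    rng = random.Random(seed)
    paths = [[x0]]  # each path is the list of states visited so far
    for t in range(num_steps):
        fanout = k if t in split_steps else 1
        new_paths = []
        for path in paths:
            mean = model_drift(path[-1], t)  # one "network" call per active path
            for _ in range(fanout):
                # Children differ only in the noise drawn at this step.
                new_paths.append(path + [mean + sigma * rng.gauss(0.0, 1.0)])
        paths = new_paths
    return paths


paths = branch_rollout(x0=1.0, num_steps=5, split_steps={1, 3}, k=2)
print(len(paths))  # 4 leaves from two binary splits
```

All four leaves share the states computed before the first split, and sibling leaves share everything up to the second split, so the prefix is computed once rather than once per leaf.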
Branching
Each selected step can generate several sub‑paths; shared prefix calculations reduce overall sampling cost while preserving the full diffusion dynamics.
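To see where the saving comes from, one can count denoiser evaluations for a tree versus independent sequential rollouts that produce the same number of final samples. The sketch below assumes one evaluation per active path per step and a split that takes effect after the step's evaluation; the step count and split schedule are illustrative values, not settings from the paper.

```python
def sequential_evals(num_samples, num_steps):
    # Every sample runs the full chain on its own.
    return num_samples * num_steps


def tree_evals(num_steps, split_steps, k):
    """Return (total evaluations, number of leaves) for a k-ary branching schedule."""
    active, total = 1, 0
    for t in range(num_steps):
        total += active          # one evaluation per currently active path
        if t in split_steps:
            active *= k          # children share everything computed so far
    return total, active


total, leaves = tree_evals(num_steps=20, split_steps={0, 3, 6, 9}, k=2)
print(total, leaves, sequential_evals(leaves, 20))  # 203 16 320
```

In this made-up configuration the tree produces 16 samples with 203 evaluations instead of 320. The saving grows when splits are placed later in the chain, at the cost of less diversity among siblings.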
Reward Fusion & Depth‑wise Advantage
Leaf‑node rewards are propagated upward and normalized at each depth, producing dense advantage signals for every step.
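A minimal sketch of this idea, assuming a complete k-ary tree and a plain unweighted average when fusing child values into a parent. The averaging rule is an assumption made here for simplicity; the ablations mention a weighted fusion, whose exact form is not given in this summary. The function names are hypothetical.

```python
from statistics import mean, pstdev


def fuse_rewards(leaf_rewards, k, depth):
    """Propagate leaf rewards upward; levels[d] holds node values at depth d (root = 0)."""
    if len(leaf_rewards) != k ** depth:
        raise ValueError("expected k**depth leaf rewards for a complete tree")
    levels = [list(leaf_rewards)]
    for _ in range(depth):
        prev = levels[0]
        # Assumption: a parent's value is the unweighted mean of its k children.
        parents = [mean(prev[i:i + k]) for i in range(0, len(prev), k)]
        levels.insert(0, parents)
    return levels


def depthwise_advantages(levels, eps=1e-8):
    """Standardize node values within each depth to obtain per-step advantages."""
    out = []
    for values in levels:
        mu, sd = mean(values), pstdev(values)
        out.append([(v - mu) / (sd + eps) for v in values])
    return out


levels = fuse_rewards([1.0, 3.0, 5.0, 7.0], k=2, depth=2)
print(levels)  # [[4.0], [2.0, 6.0], [1.0, 3.0, 5.0, 7.0]]
print(depthwise_advantages(levels)[1])
```

Every depth now carries its own signal: the two depth-1 nodes receive advantages of roughly -1 and +1 even though rewards were only measured at the leaves.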
Pruning
Width pruning: retain only key leaf nodes for back-propagation, cutting gradient computation.
Depth pruning: skip back-propagation through selected layers while keeping forward passes and reward evaluation, further lowering overhead.
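The two pruning modes can be pictured as masks over which tree nodes contribute gradients. The sketch below is hypothetical: this summary does not say how "key" leaves are chosen, so it assumes the leaves with the largest absolute advantage are kept, and it treats depth pruning as a set of depths excluded from the backward pass.

```python
def width_prune(leaf_advantages, keep):
    """Return sorted indices of the leaves kept for back-propagation.

    Assumption: "key" leaves are those with the largest |advantage|.
    """
    order = sorted(range(len(leaf_advantages)),
                   key=lambda i: abs(leaf_advantages[i]),
                   reverse=True)
    return sorted(order[:keep])


def depth_prune(num_depths, skipped_depths):
    """Return the depths that still receive gradients; the forward pass is unchanged."""
    return [d for d in range(num_depths) if d not in skipped_depths]


kept_leaves = width_prune([0.1, -2.0, 0.5, 1.5], keep=2)
kept_depths = depth_prune(num_depths=4, skipped_depths={1, 2})
print(kept_leaves, kept_depths)  # [1, 3] [0, 3]
```

Rewards are still evaluated on every leaf in both modes; only the set of terms entering the backward pass shrinks.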
Experimental Results
Image Alignment (HPDv2.1)
BranchGRPO achieves higher alignment scores (HPS‑v2.1 0.363–0.369, ImageReward 1.319) and reduces iteration time from 698 s (DanceGRPO) to 493 s (full), 314 s (pruned) and 148 s (Mix variant), a speed‑up of up to 4.7×.
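The per-variant speed-ups follow directly from the reported iteration times. The short calculation below uses only the numbers quoted above; the dictionary keys are labels chosen here for readability.

```python
baseline_s = 698  # DanceGRPO iteration time in seconds

branchgrpo_s = {"full": 493, "pruned": 314, "mix": 148}

# Speed-up = baseline time / variant time
speedups = {name: baseline_s / t for name, t in branchgrpo_s.items()}

for name, s in speedups.items():
    print(f"{name}: {s:.1f}x")
# full: 1.4x
# pruned: 2.2x
# mix: 4.7x
```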
Video Generation (WanX‑1.3B)
Compared with baseline models that exhibit flickering and deformation, BranchGRPO produces sharper frames with richer details and consistent temporal coherence. An iteration takes ≈ 8 min versus ≈ 20 min for DanceGRPO, more than doubling training efficiency.
Ablation Studies
Moderate branching factor and early dense splitting accelerate reward improvement; weighted reward fusion stabilizes training; depth pruning yields the best final performance; mixed ODE‑SDE scheduling provides the fastest training while preserving stability.
Diversity Preservation
Maximum Mean Discrepancy (MMD²) ≈ 0.019, comparable to sequential sampling, indicating that branching does not compromise sample diversity.
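For readers unfamiliar with the metric, a squared MMD between two sample sets can be estimated as below. This sketch uses the biased estimator with an RBF kernel on small tuples of floats; the kernel, the bandwidth `gamma`, and the feature space are assumptions, since this summary does not state how the reported value was computed.

```python
import math


def rbf(a, b, gamma):
    # Gaussian (RBF) kernel between two feature vectors.
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def mmd2(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    def avg(us, vs):
        return sum(rbf(u, v, gamma) for u in us for v in vs) / (len(us) * len(vs))
    return avg(xs, xs) + avg(ys, ys) - 2.0 * avg(xs, ys)


same = [(0.0, 0.0), (1.0, 0.0)]
far = [(5.0, 5.0), (6.0, 5.0)]
print(mmd2(same, same))  # 0.0 for identical sets
print(mmd2(same, far))   # much larger for well-separated sets
```

A value near zero means the two sample sets are hard to tell apart under the chosen kernel, which is the sense in which a small MMD² indicates preserved diversity.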
Scalability (Scaling Law)
Increasing the branch factor or depth consistently improves performance. For a batch of 81 samples, DanceGRPO requires 2400 s per iteration, whereas BranchGRPO finishes in 680 s, roughly 3.5× faster.
Conclusion and Outlook
BranchGRPO converts sparse terminal rewards into dense, step-level signals, achieving up to five-fold faster convergence and higher alignment quality on both image and video tasks. Future work may explore adaptive branching and pruning, as well as extensions to multimodal or larger-scale generation, which could make BranchGRPO a core technique for RLHF in diffusion and flow models.
@article{li2025branchgrpo,
  title={BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models},
  author={Li, Yuming and Wang, Yikai and Zhu, Yuying and Zhao, Zhongyu and Lu, Ming and She, Qi and Zhang, Shanghang},
  journal={arXiv preprint arXiv:2509.06040},
  year={2025}
}
Paper: https://arxiv.org/pdf/2509.06040
Project page: https://fredreic1849.github.io/BranchGRPO-Webpage/
Code: https://github.com/Fredreic1849/BranchGRPO
PKU HMI Lab: https://pku-hmi-lab.github.io/HMI-Web/index.html
Source: Synced (机器之心)
BranchGRPO combines efficiency and stability through tree-structured branching, reward fusion, and lightweight pruning.
