How Flow‑GRPO Boosts Image Generation Accuracy to 95% with Online Reinforcement Learning
Flow‑GRPO introduces online reinforcement learning into flow‑matching models by converting deterministic ODE sampling to stochastic SDE sampling and reducing the number of denoising steps during RL training, raising SD‑3.5‑Medium's GenEval accuracy from 63% to 95%—surpassing GPT‑4o—and demonstrating strong gains in complex composition, text rendering, and human‑preference alignment across multiple generative tasks.
Core Idea
Flow‑GRPO integrates online reinforcement learning (RL) with flow‑matching generative models. By converting the deterministic ODE sampling of flow models into an equivalent stochastic differential equation (SDE) and by reducing the number of denoising steps during RL data collection, the method enables efficient Group Relative Policy Optimization (GRPO) while preserving inference quality at the original step count.
ODE‑to‑SDE Conversion
Flow‑matching models generate samples by solving the ODE
dx_t = v(x_t, t) dt, with x_0 ~ p_data and x_T ~ N(0, I), where v(x, t) is the learned velocity field. To introduce the stochastic exploration required by GRPO, the authors derive an equivalent SDE, dx_t = [v(x_t, t) + (g(t)²/2) ∇ log p_t(x_t)] dt + g(t) dW_t, whose drift gains a score‑correction term and whose diffusion coefficient g(t) controls the injected noise, chosen so that the marginal distribution of the SDE at every time t exactly matches that of the original ODE. The derivation (Appendix A of the paper) shows that the SDE's time‑marginals coincide with those of the deterministic flow; because the score term can be written in terms of the predicted velocity, the same pretrained model can be reused for RL without retraining.
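As a concrete illustration, here is a minimal PyTorch sketch of one deterministic Euler step versus one stochastic Euler–Maruyama step. It assumes the SD3‑style rectified‑flow interpolation x_t = (1 − t)·x_0 + t·ε, under which the marginal score can be written in terms of the predicted velocity; `v_model` and the noise schedule `g` are placeholders, and the paper's exact schedule may differ. Because sampling runs backward from noise (t = 1) to data (t ≈ 0), the score correction enters the drift with the opposite sign.

```python
import torch

def ode_step(v_model, x, t, dt):
    # Deterministic Euler step of the probability-flow ODE: dx = v(x, t) dt.
    # Sampling runs from t = 1 (noise) toward t = 0 (data), so dt < 0.
    return x + v_model(x, t) * dt

def sde_step(v_model, x, t, dt, g):
    # Euler-Maruyama step of an SDE whose time-marginals match the ODE above.
    # Assumes the rectified-flow interpolation x_t = (1 - t) * x0 + t * eps,
    # under which the marginal score is  -(x + (1 - t) * v) / t.
    v = v_model(x, t)
    score = -(x + (1.0 - t) * v) / t
    drift = v - 0.5 * g(t) ** 2 * score        # reverse-time drift correction
    noise = g(t) * (abs(dt) ** 0.5) * torch.randn_like(x)
    return x + drift * dt + noise
```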
Denoising‑Step Reduction
Online RL requires many roll‑outs, making full‑step sampling (e.g., 40 denoising steps) prohibitively expensive. Flow‑GRPO therefore adopts a “lightweight training, full‑capacity inference” regime: during policy learning the number of integration steps is reduced (e.g., from 40 to 10), which speeds up trajectory generation by roughly four times. At test time the original step count is restored, so image quality and diversity are preserved.
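The sketch below, reusing the `sde_step` sketch from above with a hypothetical latent shape and a placeholder noise schedule `g`, illustrates this split: the same stochastic sampler is simply called with fewer integration steps when collecting RL trajectories.

```python
import torch

def rollout(v_model, g, shape, num_steps, t_min=1e-3):
    # Integrate the SDE from t = 1 (pure noise) down to t ~ 0 in `num_steps`
    # uniform steps, recording every intermediate state; each stochastic
    # transition is one RL "action" whose log-probability GRPO needs later.
    x = torch.randn(shape)
    ts = torch.linspace(1.0, t_min, num_steps + 1)
    traj = [x]
    for t, t_next in zip(ts[:-1], ts[1:]):
        x = sde_step(v_model, x, t.item(), (t_next - t).item(), g)
        traj.append(x)
    return traj

# Hypothetical usage: 10 cheap steps for RL data collection, 40 at test time.
# train_traj = rollout(v_model, g, (4, 16, 64, 64), num_steps=10)
# eval_traj  = rollout(v_model, g, (4, 16, 64, 64), num_steps=40)
```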
Training Procedure
Initialize a pretrained flow‑matching model (e.g., SD‑3.5‑Medium, FLUX‑1‑Dev).
Replace the ODE sampler with the equivalent SDE sampler.
Collect a batch of trajectories by sampling with the reduced‑step SDE.
Compute rewards (e.g., GenEval accuracy, text‑rendering fidelity, human‑preference scores).
Apply Group Relative Policy Optimization (GRPO): compute group‑relative advantages from the rewards and update the flow model, whose per‑step stochastic transitions constitute the policy (a minimal sketch of the objective follows this list).
Periodically evaluate with the full‑step sampler to monitor image quality.
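For readers who want the shape of the update, here is a minimal PyTorch sketch of a group‑relative clipped objective. It assumes the per‑trajectory log‑probabilities of the stochastic SDE transitions have already been summed, and it omits extras such as KL regularization toward the reference model; it is an illustration of the technique, not the paper's exact loss.

```python
import torch

def grpo_loss(new_logp, old_logp, rewards, clip_eps=0.2):
    # new_logp / old_logp: summed log-probabilities of each trajectory's SDE
    # transitions under the current / behavior policy, shape (G,) for a group
    # of G images generated from the same prompt.
    # rewards: shape (G,), e.g. GenEval, OCR, or preference scores.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    ratio = torch.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()  # maximize clipped surrogate
```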
Experimental Results
Complex composition (GenEval): accuracy improves from 63% to 95%, surpassing GPT‑4o.
Text rendering: accuracy rises from 59% to 92% on a dedicated OCR‑based benchmark.
Human‑preference alignment: preference scores increase, indicating better aesthetic alignment.
Reward‑hacking mitigation: stochastic SDE sampling prevents degenerate policies that exploit deterministic trajectories while maintaining diversity.
All experiments start from the same pretrained weights and use the original full‑step sampler at inference, confirming that the gains stem from RL fine‑tuning rather than from larger models.
Resources
Paper: https://www.arxiv.org/pdf/2505.05470
Code repository: https://github.com/yifan123/flow_grpo
Project page: https://gongyeliu.github.io/Flow-GRPO/
Conclusion
Flow‑GRPO shows that online RL can be applied to deterministic flow‑matching generators by an exact ODE‑to‑SDE conversion and by reducing training‑time denoising steps. The approach yields large performance gains on multimodal generation tasks without sacrificing image quality, and the open‑source implementation supports a range of state‑of‑the‑art models (SD‑3.5, FLUX‑1‑Dev, Qwen‑Image, Wan‑2.1, Bagel, etc.), opening a new research direction for controllable, compositional, and reasoning‑enhanced generative AI.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.