How Flow‑GRPO Boosts Image Generation Accuracy to 95% with Online Reinforcement Learning
Flow‑GRPO introduces online reinforcement learning into flow‑matching models by converting deterministic ODE sampling to stochastic SDE sampling and reducing the number of denoising steps during RL training, raising SD‑3.5‑Medium's GenEval accuracy from 63% to 95%—surpassing GPT‑4o—and demonstrating strong gains in complex composition, text rendering, and human‑preference alignment across multiple generative tasks.
Core Idea
Flow‑GRPO integrates online reinforcement learning (RL) with flow‑matching generative models. By converting the deterministic ODE sampling of flow models into an equivalent stochastic differential equation (SDE) and by reducing the number of denoising steps during RL data collection, the method enables efficient Group Relative Policy Optimization (GRPO) while preserving inference quality at the original step count.
ODE‑to‑SDE Conversion
Flow‑matching models generate samples by solving the ODE
dx_t = v(x_t, t) dt, with x_0 ~ p_data and x_T ~ N(0, I), where v(x, t) is the learned velocity field. To introduce the stochastic exploration required by GRPO, the authors derive an equivalent SDE, dx_t = [v(x_t, t) + (g(t)²/2) ∇ log p_t(x_t)] dt + g(t) dW_t, whose drift gains a score‑correction term and whose diffusion coefficient g(t) controls the injected noise, chosen so that the marginal distribution of the SDE at every time t exactly matches that of the original ODE. The derivation (Appendix A of the paper) shows that the SDE's time‑marginals coincide with those of the deterministic flow; because the score term can be written in terms of the predicted velocity, the same pretrained model can be reused for RL without retraining.
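As a concrete illustration, here is a minimal PyTorch sketch of one deterministic Euler step versus one stochastic Euler–Maruyama step. It assumes the SD3‑style rectified‑flow interpolation x_t = (1 − t)·x_0 + t·ε, under which the marginal score can be written in terms of the predicted velocity; `v_model` and the noise schedule `g` are placeholders, and the paper's exact schedule may differ. Because sampling runs backward from noise (t = 1) to data (t ≈ 0), the score correction enters the drift with the opposite sign.

```python
import torch

def ode_step(v_model, x, t, dt):
    # Deterministic Euler step of the probability-flow ODE: dx = v(x, t) dt.
    # Sampling runs from t = 1 (noise) toward t = 0 (data), so dt < 0.
    return x + v_model(x, t) * dt

def sde_step(v_model, x, t, dt, g):
    # Euler-Maruyama step of an SDE whose time-marginals match the ODE above.
    # Assumes the rectified-flow interpolation x_t = (1 - t) * x0 + t * eps,
    # under which the marginal score is  -(x + (1 - t) * v) / t.
    v = v_model(x, t)
    score = -(x + (1.0 - t) * v) / t
    drift = v - 0.5 * g(t) ** 2 * score        # reverse-time drift correction
    noise = g(t) * (abs(dt) ** 0.5) * torch.randn_like(x)
    return x + drift * dt + noise
```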
Denoising‑Step Reduction
Online RL requires many roll‑outs, making full‑step sampling (e.g., 40 denoising steps) prohibitively expensive. Flow‑GRPO therefore adopts a “lightweight training, full‑capacity inference” regime: during policy learning the number of integration steps is reduced (e.g., from 40 to 10), which speeds up trajectory generation by roughly four times. At test time the original step count is restored, so image quality and diversity are preserved.
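The sketch below, reusing the `sde_step` sketch from above with a hypothetical latent shape and a placeholder noise schedule `g`, illustrates this split: the same stochastic sampler is simply called with fewer integration steps when collecting RL trajectories.

```python
import torch

def rollout(v_model, g, shape, num_steps, t_min=1e-3):
    # Integrate the SDE from t = 1 (pure noise) down to t ~ 0 in `num_steps`
    # uniform steps, recording every intermediate state; each stochastic
    # transition is one RL "action" whose log-probability GRPO needs later.
    x = torch.randn(shape)
    ts = torch.linspace(1.0, t_min, num_steps + 1)
    traj = [x]
    for t, t_next in zip(ts[:-1], ts[1:]):
        x = sde_step(v_model, x, t.item(), (t_next - t).item(), g)
        traj.append(x)
    return traj

# Hypothetical usage: 10 cheap steps for RL data collection, 40 at test time.
# train_traj = rollout(v_model, g, (4, 16, 64, 64), num_steps=10)
# eval_traj  = rollout(v_model, g, (4, 16, 64, 64), num_steps=40)
```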
Training Procedure
Initialize a pretrained flow‑matching model (e.g., SD‑3.5‑Medium, FLUX‑1‑Dev).
Replace the ODE sampler with the equivalent SDE sampler.
Collect a batch of trajectories by sampling with the reduced‑step SDE.
Compute rewards (e.g., GenEval accuracy, text‑rendering fidelity, human‑preference scores).
Apply Group Relative Policy Optimization (GRPO): compute group‑relative advantages from the rewards and update the flow model, whose per‑step stochastic transitions constitute the policy (a minimal sketch of the objective follows this list).
Periodically evaluate with the full‑step sampler to monitor image quality.
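For readers who want the shape of the update, here is a minimal PyTorch sketch of a group‑relative clipped objective. It assumes the per‑trajectory log‑probabilities of the stochastic SDE transitions have already been summed, and it omits extras such as KL regularization toward the reference model; it is an illustration of the technique, not the paper's exact loss.

```python
import torch

def grpo_loss(new_logp, old_logp, rewards, clip_eps=0.2):
    # new_logp / old_logp: summed log-probabilities of each trajectory's SDE
    # transitions under the current / behavior policy, shape (G,) for a group
    # of G images generated from the same prompt.
    # rewards: shape (G,), e.g. GenEval, OCR, or preference scores.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    ratio = torch.exp(new_logp - old_logp)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()  # maximize clipped surrogate
```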
Experimental Results
Complex composition (GenEval): accuracy improves from 63% to 95%, surpassing GPT‑4o.
Text rendering: accuracy rises from 59% to 92% on a dedicated OCR‑based benchmark.
Human‑preference alignment: preference scores increase, indicating better aesthetic alignment.
Reward‑hacking mitigation: stochastic SDE sampling prevents degenerate policies that exploit deterministic trajectories while maintaining diversity.
All experiments start from the same pretrained weights and use the original full‑step sampler at inference, confirming that the gains stem from RL fine‑tuning rather than from larger models.
Resources
Paper: https://www.arxiv.org/pdf/2505.05470
Code repository: https://github.com/yifan123/flow_grpo
Project page: https://gongyeliu.github.io/Flow-GRPO/
Conclusion
Flow‑GRPO shows that online RL can be applied to deterministic flow‑matching generators by an exact ODE‑to‑SDE conversion and by reducing training‑time denoising steps. The approach yields large performance gains on multimodal generation tasks without sacrificing image quality, and the open‑source implementation supports a range of state‑of‑the‑art models (SD‑3.5, FLUX‑1‑Dev, Qwen‑Image, Wan‑2.1, Bagel, etc.), opening a new research direction for controllable, compositional, and reasoning‑enhanced generative AI.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.