DrPO: Ranking‑Only Rewards Boost One‑Step Text‑to‑Image Preference Optimization by 3.51×

DrPO introduces a ranking‑only reward that builds a drift field from on‑policy image samples to fine‑tune one‑step text‑to‑image models, achieving up to 3.51× faster training on large multimodal rewards, supporting non‑differentiable signals, and demonstrating superior quality across multiple benchmarks.

Machine Heart

Background

One-step text-to-image generation models have improved steadily, and their training methods have moved away from distilling a pretrained diffusion model. As a result, signals such as denoising trajectories and policy likelihoods are no longer readily available, which makes existing preference-optimization techniques difficult to apply.

Drifting Preference Optimization (DrPO)

Building on the drifting model proposed earlier this year by Kaiming He's team, authors from Westlake University and The Chinese University of Hong Kong (Shenzhen) introduce Drifting Preference Optimization (DrPO). DrPO brings the "drift field" into reinforcement-learning post-training for one-step text-to-image models. In DrPO the reward is used only to rank candidate images; no gradients are back-propagated through it.

For a given text prompt, the current model samples a set of candidate images on‑policy. High‑scoring images generate an attractive force, low‑scoring images generate a repulsive force, and a reference model provides an additional constraint. The drift field thus gives an update direction that pushes the generation distribution toward the real data distribution without requiring explicit denoising trajectories.
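The ranking step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `split_by_rank` and the parameters `num_pos` and `num_neg` are hypothetical, and the scores stand in for whatever reward model is used.

```python
import numpy as np

def split_by_rank(scores, num_pos, num_neg):
    """Return indices of the top `num_pos` and bottom `num_neg` candidates.

    Only the ordering of `scores` matters; their scale is never used.
    """
    scores = np.asarray(scores, dtype=float)
    if num_pos + num_neg > scores.size:
        raise ValueError("num_pos + num_neg exceeds the number of candidates")
    order = np.argsort(-scores, kind="stable")  # best candidate first
    return order[:num_pos], order[scores.size - num_neg:]

# Hypothetical reward scores for six on-policy candidates of one prompt.
scores = [0.2, 0.9, 0.5, 0.1, 0.7, 0.4]
pos_idx, neg_idx = split_by_rank(scores, num_pos=2, num_neg=2)
print("attract toward:", pos_idx, "repel from:", neg_idx)
```

Because only the ordering is consumed, any monotone rescaling of the reward gives the same positive and negative sets.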

Formalization

Let Z denote the feature of the current sample. Reward ranking yields a set of high-score image features and a set of low-score image features, and a kernel measures the similarity between features. The authors write this online-constructed positive-negative relationship as a dipole reward function.

The local potential field is built directly from the batch ranking results: the closer a current sample is to a high-score image, the larger the positive term; the closer it is to a low-score image, the larger the negative term. The update direction is the gradient of this function.

This gradient retains the attraction/repulsion structure of the drifting model: positive samples contribute attraction, negative samples contribute repulsion, and the kernel similarity scales the influence.
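The original formulas are not reproduced in this summary, so the following is only one plausible form consistent with the description, not the paper's exact definition. The symbols $z_i^{+}$, $z_j^{-}$, $N_{+}$, $N_{-}$, the Gaussian kernel, and its bandwidth $\tau$ are all assumptions made for illustration.

```latex
% Assumed notation: z is the current sample's feature,
% z_i^+ are high-score features, z_j^- are low-score features.
k(z, z') = \exp\!\left(-\frac{\lVert z - z' \rVert^{2}}{2\tau^{2}}\right)

R(z) = \frac{1}{N_{+}} \sum_{i=1}^{N_{+}} k(z, z_i^{+})
     - \frac{1}{N_{-}} \sum_{j=1}^{N_{-}} k(z, z_j^{-})

\nabla_{z} R(z)
  = \frac{1}{\tau^{2} N_{+}} \sum_{i=1}^{N_{+}} k(z, z_i^{+})\,(z_i^{+} - z)
  - \frac{1}{\tau^{2} N_{-}} \sum_{j=1}^{N_{-}} k(z, z_j^{-})\,(z_j^{-} - z)
```

Under these assumptions the first sum points from the sample toward nearby high-score features (attraction) and the second, with its minus sign, points away from nearby low-score features (repulsion), with each contribution weighted by kernel similarity.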

DrPO also incorporates a KL-divergence constraint to prevent the model from drifting too far from the base distribution. The full update direction combines the preference drift (derived from the reward ranking) and the reference drift (derived from the reference model).

After obtaining the drift direction, DrPO turns it into a regression target for the current sample.

The hyperparameter λ controls the intensity of the drift field, while a stop-gradient flag determines whether gradients flow through the drift estimation.
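To make the combination of drifts and the regression target concrete, here is a small NumPy sketch. It assumes a Gaussian kernel, a hypothetical weight `beta` on the reference drift, and a squared-error loss; none of these choices is confirmed by the summary, and `lam` plays the role of λ. NumPy has no autograd, so the stop-gradient is represented simply by treating the target as a constant.

```python
import numpy as np

def gaussian_kernel(z, samples, tau):
    """k(z, s) = exp(-||z - s||^2 / (2 tau^2)) for each row s of `samples`."""
    sq_dist = np.sum((samples - z) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * tau ** 2))

def kernel_pull(z, samples, tau):
    """Kernel-weighted mean of (s - z): a vector pointing from z toward `samples`."""
    w = gaussian_kernel(z, samples, tau)
    return (w[:, None] * (samples - z)).mean(axis=0) / tau ** 2

def drift_target(z, pos, neg, ref, tau=1.0, beta=0.1, lam=1.0):
    """Regression target z + lam * (preference drift + beta * reference drift)."""
    pref_drift = kernel_pull(z, pos, tau) - kernel_pull(z, neg, tau)
    ref_drift = kernel_pull(z, ref, tau)  # reference samples act as extra positives
    return z + lam * (pref_drift + beta * ref_drift)

def regression_loss(z, target):
    """Squared error to the target; the target is held fixed (stop-gradient)."""
    return float(np.sum((z - target) ** 2))

z = np.array([0.0, 0.0])            # current sample feature
pos = np.array([[1.0, 0.0]])        # high-score feature
neg = np.array([[-1.0, 0.0]])       # low-score feature
ref = np.array([[0.0, 0.0]])        # reference-model feature
target = drift_target(z, pos, neg, ref)
print(target, regression_loss(z, target))
```

In this toy setup the positive sits to the right and the negative to the left, so both terms push the target to the right of `z`; the reference sample coincides with `z` and contributes no drift.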

Experimental Validation

The authors first verify that the drift direction constructed by DrPO consistently improves one-step text-to-image models. They fine-tune SD-Turbo and SDXL-Turbo online using prompts from Pick-a-Pic v2 and evaluate on both the Pick-a-Pic v2 test set and Parti-Prompts.

Beyond scalar metrics such as PickScore, Aesthetic Score, and ImageReward, the paper employs Qwen3‑VL for pairwise preference comparison, assessing semantic fidelity, overall coherence, image defects, and aesthetic quality. Across both benchmark suites, DrPO achieves higher win rates than multiple one‑step baselines.

Quantitative results on SD-Turbo and SDXL-Turbo also show that DrPO improves PickScore, Aesthetic Score, and ImageReward compared with other methods that do not rely on reward gradients. Qualitatively, DrPO produces images that follow the prompt more closely and are of higher visual quality.

Training Speedup with Large Reward Models

Using the large multimodal reward model HPSv3, DrPO requires only forward scoring and ranking, whereas DRaFT back‑propagates through HPSv3. Under the same effective batch size, DRaFT takes 21.62 s per update, while DrPO takes 6.17 s, a 3.51× speedup.

The difference stems from the gradient path: DRaFT must back‑propagate the reward gradient through HPSv3, while DrPO estimates the drift in feature space and updates the generator via a regression loss, eliminating the heavy backward pass through the reward model.
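The per-update timings quoted above can be checked with a line of arithmetic. With the rounded figures as printed, the ratio comes out at roughly 3.50, which agrees with the reported 3.51× up to rounding of the published timings (an assumption about where the small gap comes from).

```python
draft_seconds = 21.62  # DRaFT time per update, as quoted
drpo_seconds = 6.17    # DrPO time per update, as quoted

speedup = draft_seconds / drpo_seconds
saved_fraction = 1.0 - drpo_seconds / draft_seconds

print(f"speedup ~ {speedup:.2f}x, time saved per update ~ {saved_fraction:.1%}")
```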

Handling Non‑Differentiable Rewards

Because the reward only participates in ranking, DrPO can also incorporate non‑differentiable evaluation signals. The authors use GenEval scores, which assess object count, color, position, and attribute binding, as rewards. Fine‑tuning SD‑Turbo on each GenEval sub‑task leads to improvements on the corresponding categories, confirming that DrPO can integrate rule‑based or programmatic scores.
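A rule-based reward of this kind can be as simple as a pass/fail check. The sketch below is illustrative only: `count_reward` and `rank_candidates` are hypothetical names, and the label lists stand in for the output of an object detector rather than GenEval's actual pipeline.

```python
from collections import Counter

def count_reward(detected_labels, target_label, target_count):
    """Return 1.0 if exactly `target_count` objects of `target_label` were detected, else 0.0."""
    return 1.0 if Counter(detected_labels)[target_label] == target_count else 0.0

def rank_candidates(rewards):
    """Indices from highest to lowest reward; ties keep their sampling order."""
    return sorted(range(len(rewards)), key=lambda i: -rewards[i])

# Hypothetical detector outputs for four candidates of the prompt "two dogs".
detections = [["dog", "dog"], ["dog"], ["dog", "dog", "cat"], ["cat"]]
rewards = [count_reward(d, "dog", 2) for d in detections]
ranking = rank_candidates(rewards)
print(rewards, ranking)
```

The reward here is a step function with no useful gradient, yet it still orders the candidates, which is all a ranking-only method needs.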

Ablation Studies

Ablations highlight the role of the feature space. The drift direction is estimated from similarity in feature space rather than directly from the reward model; thus, the choice of feature extractor matters. Experiments show that latent‑MAE features outperform using the generator’s own features. Additional findings:

Insufficiently expressive features (e.g., lacking count, layout, text, fine‑grained identity) can make the drift unreliable.

Increasing the number of candidate samples improves performance.

The method is relatively insensitive to the choice of kernel function.

Reference drift (using samples from a reference model as positive samples) helps keep the fine-tuned model close to the base distribution.

Offline Variant

The paper also explores an offline version of DrPO that constructs the drift field from a static preference dataset rather than from on-policy samples. This variant converges faster than a DPO-style baseline for one-step models, but it suffers from distribution shift: offline image pairs may lie far from the current model's distribution, which leads to coarse drift estimates and potential training collapse over long fine-tuning runs.

Paper Information

Paper title: Drifting Preference Optimization for One‑Step Generative Models

Project page: https://ugvly.github.io/DrPO/

ArXiv link: https://arxiv.org/abs/2606.02521

Code: https://github.com/UGVly/DrPO

Conclusion

DrPO brings the drift-field estimation of drifting models into reinforcement-learning post-training for one-step text-to-image generation. At each step the model samples candidates, the reward ranks them, and the resulting high- and low-score samples form a preference drift; a reference model provides a reference drift for KL regularization. The model then regresses toward the combined drift target. Experiments show that DrPO improves generation quality on SD-Turbo and SDXL-Turbo, achieves a 3.51× training speedup when using large multimodal rewards, and can incorporate non-differentiable rewards such as GenEval.
