
Rank‑Only Rewards Accelerate One‑Step Text‑to‑Image Preference Optimization 3.5×

DrPO introduces a drift‑field‑based, rank‑only reward mechanism for one‑step text‑to‑image models, enabling reinforcement‑learning post‑training without back‑propagating reward gradients. It trains 3.51× faster than DRaFT, works with non‑differentiable rewards, and improves generation quality on SD‑Turbo and SDXL‑Turbo.

Machine Learning Algorithms & Natural Language Processing

Jun 21, 2026


Recent advances in one‑step generative models have reduced reliance on diffusion‑model distillation, but because these models have no denoising trajectory, many preference‑optimization methods are difficult to apply to them. Kaiming He’s drifting model introduced a "drift field" that provides an update direction for the current generation distribution, guiding it toward the real data distribution without needing explicit trajectory signals.

Building on this idea, DrPO (Drifting Preference Optimization) applies the drift field to reinforcement‑learning post‑training of one‑step text‑to‑image models. In each training step, the current model samples a set of candidate images for a given text prompt, and a target reward function ranks them. High‑scoring images become positive samples and low‑scoring images become negative samples; together they define a locally constructed preference drift. The drift is estimated in feature space using a kernel similarity function, preserving the attraction/repulsion structure of the original drifting model.

The update direction combines this preference drift with a reference drift (derived from a reference model) that enforces a KL‑based distribution constraint. The model then regresses its current samples toward the drift target, with a hyper‑parameter \(\lambda\) controlling drift strength.
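The training step described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the summary does not give the exact drift formula, so the sketch uses a generic kernel‑weighted attraction toward positives minus attraction toward negatives, treats the reference drift as a precomputed vector, and simply adds the two drifts before scaling by \(\lambda\). The names `rank_split`, `kernel_mean`, `preference_drift`, `drift_target`, and the bandwidth `tau` are all hypothetical.

```python
import math

def rank_split(rewards, k):
    """Indices of the k highest- and k lowest-reward candidates."""
    order = sorted(range(len(rewards)), key=lambda i: rewards[i])
    return order[-k:], order[:k]

def kernel_mean(x, points, tau):
    """Kernel-weighted mean of points, weights proportional to exp(-||x - p||^2 / tau)."""
    logits = [-sum((a - b) ** 2 for a, b in zip(x, p)) / tau for p in points]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(len(x))]

def preference_drift(x, positives, negatives, tau=1.0):
    """Assumed form: attraction toward positives minus attraction toward negatives."""
    pos = kernel_mean(x, positives, tau)
    neg = kernel_mean(x, negatives, tau)
    return [p - n for p, n in zip(pos, neg)]

def drift_target(x, pref, ref, lam):
    """Regression target: the sample moved along the combined drift (assumed sum)."""
    return [xi + lam * (p + r) for xi, p, r in zip(x, pref, ref)]

# Four candidate features for one prompt, scored by a black-box reward.
features = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rewards = [0.1, 0.9, 0.2, 0.8]

pos_idx, neg_idx = rank_split(rewards, k=2)
positives = [features[i] for i in pos_idx]
negatives = [features[i] for i in neg_idx]

x = [0.5, 0.5]
pref = preference_drift(x, positives, negatives)
target = drift_target(x, pref, ref=[0.0, 0.0], lam=0.5)
print(pref, target)
```

In this toy setup the two high‑reward candidates sit at larger first coordinates, so the drift at `x` points along the first axis and the target moves there. In a real training loop the model output would be regressed toward `target` with the target held fixed, so no gradient flows through the reward.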

Because the reward is used only for ranking, DrPO never back‑propagates through large multimodal reward models. With the HPSv3 reward, a DrPO update takes 6.17 seconds per batch versus 21.62 seconds for DRaFT, a reported 3.51× speed‑up that comes from removing the reward‑gradient back‑propagation path.
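One consequence of a rank‑only reward is that any strictly increasing rescaling of the scores leaves the positive/negative split unchanged. The short check below illustrates this with made‑up scores; `rank_split` is a hypothetical helper, not part of any DrPO code release.

```python
import math

def rank_split(rewards, k):
    """Sets of indices for the k highest- and k lowest-reward candidates."""
    order = sorted(range(len(rewards)), key=lambda i: rewards[i])
    return set(order[-k:]), set(order[:k])

# Made-up reward scores for six candidates of one prompt.
rewards = [0.3, 2.1, -0.7, 1.4, 0.9, -1.2]

# A strictly increasing transform changes the values but not their order.
rescaled = [math.exp(3.0 * r) for r in rewards]

same_split = rank_split(rewards, k=2) == rank_split(rescaled, k=2)
print(same_split)
```

Since only the ordering reaches the update, the reward model can be run in inference mode, with no need to keep its activations for a backward pass.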

DrPO also supports non‑differentiable rewards. Experiments replace the reward with GenEval scores, which assess object count, color, position, and attribute constraints. Even though GenEval is non‑differentiable, DrPO can incorporate its scores for online fine‑tuning, and the fine‑tuned models show improvements on the corresponding sub‑tasks.
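To see why a non‑differentiable reward is no obstacle, consider a rule‑based checker whose score is a step function of the image. The sketch below is purely illustrative: `constraint_score` and the detection dictionaries are hypothetical stand‑ins for a detector‑based evaluator, not the GenEval API.

```python
def constraint_score(detections, required):
    """Fraction of required object counts that are met exactly."""
    met = sum(1 for name, count in required.items()
              if detections.get(name, 0) == count)
    return met / len(required)

# Hypothetical constraints for a prompt such as "two dogs and a ball".
required = {"dog": 2, "ball": 1}

# Hypothetical detector outputs for four candidate images.
candidates = [
    {"dog": 2, "ball": 1},
    {"dog": 1, "ball": 1},
    {"dog": 3},
    {"dog": 2, "ball": 1, "cat": 1},
]

scores = [constraint_score(d, required) for d in candidates]

# Rank candidates by score; ties keep their original order (sorted is stable).
order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
positives, negatives = order[:2], order[-2:]
print(scores, positives, negatives)
```

The score has no useful gradient with respect to the image, yet it still orders the candidates, which is all a rank‑based update needs.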

Extensive experiments on SD‑Turbo and SDXL‑Turbo (using prompts from Pick‑a‑Pic v2 and Parti‑Prompts) evaluate scalar metrics (PickScore, Aesthetic Score, ImageReward) and pairwise preference judgments via Qwen3‑VL. DrPO consistently achieves higher win rates and better quantitative scores than competing one‑step baselines, and qualitative results demonstrate more stable instruction following and visual quality.

Ablation studies reveal that the choice of feature extractor is critical: latent‑MAE features outperform raw pretrained features. Increasing the number of candidate samples improves performance, while the method is relatively insensitive to the specific kernel scale. The reference drift effectively limits deviation from the base model distribution.

An offline variant of DrPO constructs the drift field from pre‑collected image pairs instead of online sampling. This version converges faster than a DPO‑style offline baseline but suffers from distribution shift and potential instability over long training periods.

In summary, DrPO brings the drifting model's drift estimation to reinforcement‑learning post‑training of one‑step text‑to‑image generators. It trains about 3.5× faster than DRaFT when the reward model is large, accommodates non‑differentiable rewards, and improves generation quality across multiple benchmarks.
