Can Diffusion Models Be Their Own Reward Model? Latent Reward Modeling & Step-Level Preference Optimization
This article presents the Latent Reward Model (LRM) and Latent Preference Optimization (LPO), a paradigm that repurposes a diffusion model as its own noise-aware latent reward model for step-level preference optimization. The approach addresses the shortcomings of pixel-level reward models, introduces multi-preference consistent filtering, and demonstrates significant performance and efficiency gains on metrics and benchmarks such as PickScore and T2I-CompBench++.
Research Background
Preference optimization for text‑to‑image diffusion models aims to align generated images with human aesthetic and semantic judgments. Existing pipelines typically treat a visual‑language model (e.g., CLIP, PickScore) as a pixel‑level reward model and fine‑tune the diffusion generator via reinforcement learning or contrastive learning. Pixel‑level rewards require decoding the latent into an image at every diffusion step, which incurs large computational overhead and degrades performance on high‑noise steps because the reconstructed image is heavily blurred.
Core Idea
The authors propose two tightly coupled components:
Latent Reward Model (LRM) : a noise‑aware reward model that operates directly on the diffusion latent space, thus avoiding image reconstruction and inheriting the diffusion model’s intrinsic timestep awareness.
Latent Preference Optimization (LPO) : a step‑wise preference‑guided sampling procedure that uses LRM scores to keep high‑quality latent samples throughout the entire denoising trajectory (t ∈ [0, 950]).
Method Framework
1. Latent Reward Model (LRM)
LRM extracts visual features from the diffusion model's backbone (U-Net or DiT) and concatenates them with text embeddings from a language encoder (e.g., T5 or CLIP-text). A Visual Feature Enhancement (VFE) module, trained with a classifier-free guidance signal, amplifies text-relevant visual channels. The joint representation is projected to a scalar preference score via a dot-product layer. Training uses a Bradley-Terry (BT) loss on pairwise image comparisons:
loss = -log(sigmoid(s_i - s_j))  # s_i, s_j are LRM scores for the winning and losing image
Because the score is computed on noisy latents, the model naturally conditions on the diffusion timestep, providing built-in noise robustness.
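As an illustration only (not the authors' implementation), here is a minimal PyTorch sketch of such a scoring head; the tensor shapes, the gating used to stand in for the VFE module, and the timestep-embedding input are all assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentRewardHead(nn.Module):
    """Toy scoring head: fuses pooled backbone features of a noisy latent with a
    text embedding and projects to a scalar preference score. Shapes are illustrative."""
    def __init__(self, vis_dim=1280, txt_dim=768, hid_dim=512):
        super().__init__()
        # Stand-in for Visual Feature Enhancement: gate visual channels by text relevance
        self.vfe_gate = nn.Linear(txt_dim, vis_dim)
        self.vis_proj = nn.Linear(vis_dim, hid_dim)
        self.txt_proj = nn.Linear(txt_dim, hid_dim)

    def forward(self, vis_feats, txt_emb, t_emb):
        # vis_feats: (B, vis_dim) pooled backbone features at diffusion step t
        # txt_emb:   (B, txt_dim) pooled text embedding
        # t_emb:     (B, vis_dim) timestep embedding, giving the score noise awareness
        gated = vis_feats * torch.sigmoid(self.vfe_gate(txt_emb)) + t_emb
        v = F.normalize(self.vis_proj(gated), dim=-1)
        c = F.normalize(self.txt_proj(txt_emb), dim=-1)
        return (v * c).sum(dim=-1)  # dot-product score per sample

def bt_loss(score_win, score_lose):
    """Bradley-Terry pairwise loss: -log sigmoid(s_win - s_lose)."""
    return -F.logsigmoid(score_win - score_lose).mean()

# Smoke test with random tensors
head = LatentRewardHead()
B = 4
s_w = head(torch.randn(B, 1280), torch.randn(B, 768), torch.randn(B, 1280))
s_l = head(torch.randn(B, 1280), torch.randn(B, 768), torch.randn(B, 1280))
print(bt_loss(s_w, s_l))
```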
2. Multi‑Preference Consistent Filtering (MPCF)
High‑quality supervision pairs are essential for stable BT training. MPCF filters public preference datasets (e.g., Pick‑a‑Pic) by enforcing consistency across several metrics:
Aesthetic score (e.g., PickScore).
CLIP image‑text similarity.
Visual Question Answering (VQA) relevance.
Only pairs where the winner outperforms the loser on **all** metrics are retained, eliminating contradictory signals that would otherwise confuse the reward model.
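A minimal sketch of this consistency filter, assuming each pair stores per-metric scores for the winner and the loser under placeholder field names:

```python
def mpcf_filter(pairs, metrics=("aesthetic", "clip_sim", "vqa")):
    """Keep only pairs where the labeled winner beats the loser on every metric."""
    kept = []
    for p in pairs:
        if all(p["winner"][m] > p["loser"][m] for m in metrics):
            kept.append(p)
    return kept

# Example: the second pair is dropped because its "loser" wins on VQA relevance
pairs = [
    {"winner": {"aesthetic": 0.61, "clip_sim": 0.32, "vqa": 0.9},
     "loser":  {"aesthetic": 0.55, "clip_sim": 0.28, "vqa": 0.7}},
    {"winner": {"aesthetic": 0.64, "clip_sim": 0.30, "vqa": 0.6},
     "loser":  {"aesthetic": 0.58, "clip_sim": 0.25, "vqa": 0.8}},
]
print(len(mpcf_filter(pairs)))  # -> 1
```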
3. Latent Preference Optimization (LPO)
During generation, LPO performs online sampling in the latent space. For each diffusion step t, a batch of latent candidates is sampled from the model’s prior. The LRM evaluates each candidate; the top‑k (or those above a score threshold) are kept, while the rest are discarded. The retained latents are then fed into the next denoising step. This loop runs for the full schedule t ∈ [0, 950], unlike prior step‑wise methods such as SPO that stop at t ≈ 750. Pseudocode:
# Step-level preference-guided sampling (pseudocode)
for t in timesteps[::-1]:                           # reverse order, from high noise to low noise
    latents = sampler.sample_step(prev_latents, t)  # propose a batch of candidate latents
    scores = LRM(latents, text, t)                  # noise-aware preference scores from the LRM
    mask = scores > threshold                       # keep candidates above the score threshold (or top-k)
    prev_latents = latents[mask]
The procedure yields a denoised image that has been continuously steered toward higher human preference, even under heavy noise.
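For the top-k variant mentioned above, the threshold mask can be replaced by an explicit top-k pick. A small sketch with a hypothetical helper, select_top_k, operating on a batch of candidate latents:

```python
import torch

def select_top_k(candidates, scores, k=2):
    """Keep the k highest-scoring candidate latents at the current step."""
    # candidates: (N, C, H, W) latent candidates, scores: (N,) LRM scores
    top = torch.topk(scores, k=min(k, scores.numel())).indices
    return candidates[top]

# Smoke test with random tensors
latents = torch.randn(8, 4, 64, 64)
scores = torch.randn(8)
print(select_top_k(latents, scores).shape)  # torch.Size([2, 4, 64, 64])
```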
Experimental Results
Evaluation covers standard preference metrics (PickScore, ImageReward, HPSv2, Aesthetic Score) and the challenging T2I-CompBench++ suite. LPO consistently outperforms prior methods:
Human‑preference score improvements of >10 % on color and texture generation in T2I‑CompBench++.
Higher CLIP‑based text‑image alignment and aesthetic ratings across all standard metrics.
Training efficiency is dramatically increased. On the SDXL backbone:
LPO requires 92 GPU hours of training.
Diffusion-DPO needs ≈ 2,560 GPU hours (≈ 28× longer).
SPO needs ≈ 234 GPU hours (≈ 2.5× longer).
These gains stem from eliminating per‑step VAE decoding and from the lightweight BT loss on latent scores.
Conclusion and Outlook
The paper introduces a systematic pipeline that converts a diffusion model into a latent‑space reward model (LRM) and leverages it for step‑aware preference optimization (LPO). The approach removes the bottleneck of pixel‑level reconstruction, provides intrinsic timestep awareness, and scales efficiently to large diffusion backbones (e.g., SDXL, DiT, Flow‑matching SD3). Future work includes extending LRM/LPO to video diffusion models and integrating them with other reinforcement‑learning algorithms for broader generative AI alignment.
Paper: https://arxiv.org/abs/2502.01051
Code repository: https://github.com/Kwai-Kolors/LPO/tree/main
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.