Designing Safe, Sample-Efficient, and Robust Reinforcement Learning for Ranking and Diffusion Models

This paper proposes a reinforcement‑learning framework that aims to ensure safety, sample efficiency, and robustness at the same time. It takes a contextual‑bandit perspective on ranking/recommendation systems and text‑to‑image diffusion models, and introduces algorithms for safe deployment, variance‑reduced off‑policy estimation, and generative RL (the LOOP method).

Data Party THU

Overview

This work studies reinforcement‑learning (RL) algorithms that simultaneously satisfy three desiderata: safety (no performance degradation relative to a baseline policy), sample efficiency, and robustness to model misspecification. All results are derived under a contextual‑bandit formulation and are illustrated on two concrete domains—ranking/recommendation systems and text‑to‑image diffusion models.

Safe deployment in ranking and recommendation systems

The authors first propose a theoretical framework for safe RL in ranking. They derive an exposure‑based generalisation bound that quantifies how the expected utility of a learned policy deviates from that of the logging policy under limited feedback. Using this bound they construct a counterfactual risk‑minimisation (CRM) objective that, even when click‑through data are sparse, guarantees the learned policy’s expected reward is no worse than the logging policy’s reward.
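The idea of optimising a pessimistic estimate of utility can be sketched in a few lines. The example below is a minimal illustration and makes two loud assumptions: it uses a single‑action setting rather than full rankings, and it swaps in the generic variance‑penalised CRM form (an IPS estimate minus a confidence penalty) in place of the exposure‑based bound described above. The function names, the `lam` parameter, and the logged data are hypothetical.

```python
import numpy as np

def ips_weights(new_probs, log_probs):
    """Importance weights pi(a|x) / pi_0(a|x) for the logged actions."""
    return np.asarray(new_probs, dtype=float) / np.asarray(log_probs, dtype=float)

def crm_objective(rewards, new_probs, log_probs, lam=1.0):
    """Return (penalised objective, plain IPS estimate).

    The penalty lam * sqrt(sample variance / n) is a generic CRM-style
    confidence term, NOT the exposure-based term from the paper.
    """
    rewards = np.asarray(rewards, dtype=float)
    per_sample = ips_weights(new_probs, log_probs) * rewards
    n = per_sample.size
    estimate = per_sample.mean()
    penalty = lam * np.sqrt(per_sample.var(ddof=1) / n)
    return float(estimate - penalty), float(estimate)

# Hypothetical logged data: 4 interactions under a uniform logging policy.
rewards = np.array([1.0, 0.0, 1.0, 0.0])
log_probs = np.array([0.5, 0.5, 0.5, 0.5])
new_probs = np.array([0.8, 0.2, 0.8, 0.2])

objective, estimate = crm_objective(rewards, new_probs, log_probs, lam=1.0)
print(f"IPS estimate: {estimate:.3f}, penalised objective: {objective:.3f}")
```

With few samples the penalty is large, so a policy that deviates strongly from the logging policy is only preferred when the data support it; as `n` grows the penalty shrinks and the objective approaches the plain IPS estimate.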

To handle adversarial user behaviour or misspecified user models, the CRM objective is extended with doubly‑robust estimators. These estimators combine direct‑model predictions with importance‑weighted observed outcomes, providing unbiased risk estimates and preserving the safety guarantee under a broader set of conditions. Moreover, the framework introduces an explicit control parameter that bounds the permissible utility loss, allowing practitioners to trade off safety margin against potential performance gains.
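A doubly‑robust estimate of this kind can be sketched as follows. This is the textbook single‑action form, assumed here for illustration; the ranking‑specific estimator in the paper may differ. The function name, argument names, and data are hypothetical.

```python
import numpy as np

def doubly_robust_value(rewards, weights, q_logged, q_target):
    """Textbook doubly-robust value estimate (single-action setting).

    rewards  : observed rewards for the logged actions
    weights  : importance weights pi(a|x) / pi_0(a|x) for the logged actions
    q_logged : reward-model prediction for each logged action
    q_target : reward-model prediction averaged under the target policy
    """
    rewards, weights, q_logged, q_target = (
        np.asarray(v, dtype=float) for v in (rewards, weights, q_logged, q_target)
    )
    # Model prediction plus an importance-weighted correction on the residual.
    return float(np.mean(q_target + weights * (rewards - q_logged)))

# Hypothetical logged data.
rewards = np.array([1.0, 0.0, 1.0, 0.0])
weights = np.array([1.6, 0.4, 1.6, 0.4])

# With an all-zero reward model the estimate reduces to plain IPS.
dr_zero_model = doubly_robust_value(rewards, weights, np.zeros(4), np.zeros(4))

# With a model that predicts the logged rewards exactly, the correction
# term vanishes and only the model-based part remains.
dr_perfect_model = doubly_robust_value(rewards, weights, rewards, np.full(4, 0.8))
```

The two limiting cases show why the estimator is called doubly robust: it falls back on importance weighting when the reward model is uninformative, and on the reward model when its predictions match the observed outcomes.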

Unified variance reduction for single‑action contextual bandits

In the single‑action (one‑slot) bandit setting, the paper unifies several off‑policy estimators—such as importance sampling, self‑normalised importance sampling, and doubly‑robust methods—under a common formulation. Building on this unification, the authors derive a closed‑form optimal baseline that simultaneously minimises the variance of the off‑policy value estimator and the variance of the policy‑gradient estimator. The optimal baseline is expressed as a weighted combination of the estimated reward and the propensity scores, and its use leads to markedly more stable off‑policy learning and faster convergence.
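The effect of a baseline on estimator variance can be illustrated with a small sketch. It assumes the baseline‑corrected form V(beta) = mean(w * (r - beta)) + beta and uses the standard control‑variate coefficient Cov(w r, w) / Var(w), estimated from the same samples, as a stand‑in for the closed‑form baseline; whether this matches the paper's expression exactly is an assumption. All names and data are hypothetical.

```python
import numpy as np

def per_sample_terms(rewards, weights, beta):
    """Per-sample terms of V(beta) = mean(w * (r - beta)) + beta."""
    return weights * (rewards - beta) + beta

def variance_minimising_beta(rewards, weights):
    """Control-variate coefficient Cov(w*r, w) / Var(w), estimated from samples."""
    wr = weights * rewards
    cov = np.mean((wr - wr.mean()) * (weights - weights.mean()))
    return float(cov / np.var(weights))

# Hypothetical logged data; the importance weights average to 1.
rewards = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0])
weights = np.array([1.6, 0.4, 0.5, 2.0, 0.4, 1.1])

beta_star = variance_minimising_beta(rewards, weights)

value_ips = float(per_sample_terms(rewards, weights, 0.0).mean())
value_opt = float(per_sample_terms(rewards, weights, beta_star).mean())
var_ips = float(per_sample_terms(rewards, weights, 0.0).var())
var_opt = float(per_sample_terms(rewards, weights, beta_star).var())

print(f"beta*: {beta_star:.3f}")
print(f"value (beta=0): {value_ips:.3f}, value (beta*): {value_opt:.3f}")
print(f"variance (beta=0): {var_ips:.3f}, variance (beta*): {var_opt:.3f}")
```

Setting `beta = 0` recovers plain importance sampling. Because the weights average to 1 in this toy data, changing `beta` leaves the estimate itself unchanged and only affects the spread of the per‑sample terms.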

Efficiency–effectiveness trade‑off in generative RL

The final part investigates the tension between sample efficiency and generation quality in generative RL for text‑to‑image diffusion models. A systematic comparison of Proximal Policy Optimization (PPO) and REINFORCE reveals that PPO is highly sample‑efficient but can produce images that diverge from the textual description, whereas REINFORCE yields better semantic alignment at the cost of high variance.

To combine the strengths of both methods, the authors introduce Leave‑One‑Out PPO (LOOP). LOOP augments the standard PPO clipped objective with:

Multiple diffusion trajectories sampled for each training example, providing richer Monte‑Carlo estimates.

A REINFORCE‑style baseline that is computed by leaving out the current trajectory when estimating the expected return, thereby reducing variance without bias.

The resulting algorithm retains PPO’s sample efficiency while improving the semantic fidelity of generated images, as measured by alignment with textual attributes.
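The two ingredients listed above can be sketched with NumPy. This is a simplified illustration, not the authors' implementation: it assumes one scalar reward and one probability ratio per trajectory, whereas a diffusion model would have a ratio per denoising step, and it omits the gradient computation entirely. The function names, `eps`, and the reward values are hypothetical.

```python
import numpy as np

def leave_one_out_advantages(rewards):
    """Advantage of each trajectory against the mean reward of its siblings.

    rewards: array of shape (num_prompts, K) with K >= 2 trajectories per prompt.
    """
    rewards = np.asarray(rewards, dtype=float)
    k = rewards.shape[1]
    # Baseline for trajectory i: average reward of the other K - 1 trajectories.
    baseline = (rewards.sum(axis=1, keepdims=True) - rewards) / (k - 1)
    return rewards - baseline

def clipped_surrogate(ratio, advantages, eps=0.2):
    """PPO clipped objective (to be maximised), averaged over all samples."""
    ratio = np.asarray(ratio, dtype=float)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return float(np.mean(np.minimum(unclipped, clipped)))

# Hypothetical rewards: 2 prompts, 4 sampled trajectories each.
rewards = np.array([[1.0, 0.0, 0.5, 0.5],
                    [0.2, 0.4, 0.6, 0.8]])
advantages = leave_one_out_advantages(rewards)

# Before any update the new and old policies coincide, so every ratio is 1.
surrogate_at_start = clipped_surrogate(np.ones_like(rewards), advantages)
```

Because the baseline for a trajectory never uses that trajectory's own reward, it does not bias the gradient estimate, which is the property the leave‑one‑out construction is meant to preserve.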

Full paper: https://hdl.handle.net/11245.1/669b7ddf-8c57-44c1-917d-9160ae14c04e


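Source: Zhuanzhi (专知). This article is about 1,000 words, with a suggested reading time of 5 minutes. The thesis studies how to design reinforcement learning (RL) methods.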