Designing Safe, Sample-Efficient, and Robust Reinforcement Learning for Ranking and Diffusion Models
This paper proposes a reinforcement‑learning framework that targets safety, sample efficiency, and robustness together. It takes a contextual‑bandit perspective on ranking/recommendation systems and text‑to‑image diffusion models, and introduces methods for safe deployment, variance‑reduced off‑policy estimation, and generative RL (Leave‑One‑Out PPO, LOOP).
Overview
This work studies reinforcement‑learning (RL) algorithms that simultaneously satisfy three desiderata: safety (no performance degradation relative to a baseline policy), sample efficiency, and robustness to model misspecification. All results are derived under a contextual‑bandit formulation and are illustrated on two concrete domains—ranking/recommendation systems and text‑to‑image diffusion models.
Safe deployment in ranking and recommendation systems
The authors first propose a theoretical framework for safe RL in ranking. They derive an exposure‑based generalisation bound that quantifies how the expected utility of a learned policy deviates from that of the logging policy under limited feedback. Using this bound they construct a counterfactual risk‑minimisation (CRM) objective that, even when click‑through data are sparse, guarantees the learned policy’s expected reward is no worse than the logging policy’s reward.
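A penalised objective of this kind can be sketched as follows. This is only an illustration of the general idea (estimated utility minus a penalty that grows as the new policy's exposure moves away from the logging policy's). The function names, the chi-square-style divergence, and the form of the penalty weight are all assumptions made for the example, not the bound derived in the paper.

```python
def exposure_divergence(new_exposure, logging_exposure):
    """Chi-square-style divergence between two per-document exposure vectors.

    Hypothetical choice of divergence; it is zero when both policies give
    every document the same exposure.
    """
    return sum((n - l) ** 2 / l for n, l in zip(new_exposure, logging_exposure))


def safe_objective(ips_utility, new_exposure, logging_exposure, n_logs, delta=0.05):
    """Estimated utility minus an exposure-based penalty (illustrative only).

    ips_utility: counterfactual (importance-weighted) utility estimate.
    n_logs:      number of logged interactions.
    delta:       assumed confidence parameter; smaller means a larger penalty.
    """
    # Assumed weight: shrinks as more logged data becomes available.
    weight = ((1.0 - delta) / (delta * n_logs)) ** 0.5
    return ips_utility - weight * exposure_divergence(new_exposure, logging_exposure)


logging = [0.5, 0.5]
shifted = [0.8, 0.2]
few_logs = safe_objective(0.4, shifted, logging, n_logs=100)
many_logs = safe_objective(0.4, shifted, logging, n_logs=10_000)
```

With little data the penalty dominates, so an optimiser stays close to the logging policy. As data accumulates the penalty shrinks and the learned policy is allowed to deviate further.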
To handle adversarial user behaviour or misspecified user models, the CRM objective is extended with doubly‑robust estimators. These estimators combine direct‑model predictions with importance‑weighted observed outcomes, providing unbiased risk estimates and preserving the safety guarantee under a broader set of conditions. Moreover, the framework introduces an explicit control parameter that bounds the permissible utility loss, allowing practitioners to trade off safety margin against potential performance gains.
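To make the doubly‑robust combination concrete, here is a minimal sketch for the simpler single‑action bandit case. The ranking version in the paper works with exposure rather than single‑action propensities. The function signature and the toy data are assumptions made for illustration.

```python
def dr_estimate(logs, target_prob, reward_model, actions):
    """Doubly-robust estimate of a target policy's value from logged data.

    logs:         list of (context, action, reward, logging_propensity).
    target_prob:  target_prob(context, action) -> probability under the new policy.
    reward_model: reward_model(context, action) -> predicted reward.
    actions:      all actions available in every context (assumed fixed).
    """
    total = 0.0
    for x, a, r, p0 in logs:
        # Direct-method term: model-predicted value of the target policy.
        dm = sum(target_prob(x, b) * reward_model(x, b) for b in actions)
        # Importance-weighted correction, applied to the logged action only.
        w = target_prob(x, a) / p0
        total += dm + w * (r - reward_model(x, a))
    return total / len(logs)


# Toy data: two actions, uniform logging policy, action 0 always pays 1.
actions = [0, 1]
logs = [("u", 0, 1.0, 0.5), ("u", 1, 0.0, 0.5)]
target = lambda x, a: 0.8 if a == 0 else 0.2

with_perfect_model = dr_estimate(logs, target, lambda x, a: 1.0 if a == 0 else 0.0, actions)
with_zero_model = dr_estimate(logs, target, lambda x, a: 0.0, actions)
```

With a zero reward model the estimator reduces to plain importance sampling, and with a perfect model the correction term vanishes. In this toy example both give the same value, which shows how the two components back each other up.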
Unified variance reduction for single‑action contextual bandits
In the single‑action (one‑slot) bandit setting, the paper unifies several off‑policy estimators—such as importance sampling, self‑normalised importance sampling, and doubly‑robust methods—under a common formulation. Building on this unification, the authors derive a closed‑form optimal baseline that simultaneously minimises the variance of the off‑policy value estimator and the variance of the policy‑gradient estimator. The optimal baseline is expressed as a weighted combination of the estimated reward and the propensity scores, and its use leads to markedly more stable off‑policy learning and faster convergence.
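The effect of a baseline on estimator variance can be sketched with a standard control‑variate argument. The estimator form below, and the coefficient Cov(w·r, w) / Var(w), are textbook choices assumed for illustration. They rely on the importance weights having mean 1 and are not necessarily the exact expression derived in the paper.

```python
from statistics import pvariance


def per_sample_terms(weights, rewards, beta):
    """Per-sample terms of the baseline-corrected estimator: w * (r - beta) + beta."""
    return [w * (r - beta) + beta for w, r in zip(weights, rewards)]


def baseline_corrected_estimate(weights, rewards, beta):
    """Off-policy value estimate; beta = 0 recovers plain importance sampling."""
    terms = per_sample_terms(weights, rewards, beta)
    return sum(terms) / len(terms)


def variance_minimising_beta(weights, rewards):
    """Control-variate coefficient Cov(w*r, w) / Var(w), assuming E[w] = 1."""
    num = sum((w - 1.0) * w * r for w, r in zip(weights, rewards))
    den = sum((w - 1.0) ** 2 for w in weights)
    return num / den if den > 0 else 0.0


# Toy logged data: importance weights (mean exactly 1) and observed rewards.
weights = [0.5, 1.5, 0.5, 1.5]
rewards = [1.0, 1.0, 0.0, 1.0]

beta = variance_minimising_beta(weights, rewards)
var_ips = pvariance(per_sample_terms(weights, rewards, 0.0))
var_opt = pvariance(per_sample_terms(weights, rewards, beta))
```

On this toy data the estimate itself is unchanged by the baseline, while the per‑sample variance drops substantially. That reduction is what a variance‑minimising baseline is meant to deliver.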
Efficiency–effectiveness trade‑off in generative RL
The final part investigates the tension between sample efficiency and generation quality in generative RL for text‑to‑image diffusion models. A systematic comparison of Proximal Policy Optimization (PPO) and REINFORCE reveals that PPO is highly sample‑efficient but can produce images that diverge from the textual description, whereas REINFORCE yields better semantic alignment at the cost of high variance.
To combine the strengths of both methods, the authors introduce Leave‑One‑Out PPO (LOOP). LOOP augments the standard PPO clipped objective with:
Multiple diffusion trajectories sampled for each training example, providing richer Monte‑Carlo estimates.
A REINFORCE‑style baseline that is computed by leaving out the current trajectory when estimating the expected return, thereby reducing variance without bias.
The resulting algorithm retains PPO’s sample efficiency while improving the semantic fidelity of generated images, as measured by alignment with textual attributes.
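The two ingredients listed above can be sketched in a few lines. This is a simplified illustration, not the authors' implementation. It treats each sampled trajectory as a single unit with one probability ratio, whereas a diffusion model would have one ratio per denoising step. The function names and the clipping value `eps=0.2` are assumptions.

```python
def leave_one_out_advantages(rewards):
    """Reward of each sample minus the mean reward of the other samples.

    rewards: rewards of K >= 2 trajectories generated for the same prompt.
    """
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO clipped objective for one sample (a quantity to maximise)."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def loop_objective(ratios, rewards, eps=0.2):
    """Average clipped surrogate over K samples, using leave-one-out advantages.

    ratios: new-policy / old-policy probability ratio for each trajectory.
    """
    advantages = leave_one_out_advantages(rewards)
    terms = [clipped_surrogate(p, a, eps) for p, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)


# Three trajectories for one prompt, scored by a reward model.
advantages = leave_one_out_advantages([1.0, 2.0, 3.0])
objective_at_old_policy = loop_objective([1.0, 1.0, 1.0], [1.0, 2.0, 3.0])
```

Because the baseline for each trajectory is computed only from the other trajectories, it does not depend on the trajectory it is subtracted from. That independence is why the correction reduces variance without introducing bias.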
Full paper: https://hdl.handle.net/11245.1/669b7ddf-8c57-44c1-917d-9160ae14c04e
Source: Zhuanzhi (专知)
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
