
StableReinforce and R1-Reward: Enhancing Multimodal Reward Models with Reinforcement Learning

This article presents StableReinforce and the R1-Reward model, demonstrating how reinforcement learning techniques can stabilize training and significantly improve the performance of multimodal reward models for large language models across several benchmarks.

Kuaishou Tech

Multimodal Reward Models (MRMs) are crucial for improving multimodal large language models (MLLMs) by providing stable rewards during training and selecting better samples during evaluation.

Recent work by teams from Kuaishou, the Chinese Academy of Sciences, Tsinghua University and Nanjing University shows that directly applying existing RL algorithms such as Reinforce++ to train MRMs leads to instability and crashes.

The paper introduces MM‑RLHF (ICML 2025) and a new algorithm called StableReinforce, which adds several stabilizing techniques: Pre‑Clip, which bounds the probability ratio; an Advantage Filter, which discards extreme advantage values; and a Consistency Reward, which uses a separate large model as a judge to verify that the model's analysis matches its final answer.
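The first two stabilizers can be sketched in a few lines. This is a minimal illustration of the idea only; the exact formulations, thresholds, and where they sit in the loss are defined in the paper, so the function shapes and default constants below are assumptions.

```python
import math
from statistics import mean, stdev

def pre_clip_ratio(logp_new: float, logp_old: float, delta: float = 3.0) -> float:
    """Pre-Clip (assumed form): clamp the log probability ratio to
    [-delta, delta] BEFORE exponentiating, so exp() cannot overflow
    when the new and old policies diverge sharply."""
    log_ratio = max(-delta, min(delta, logp_new - logp_old))
    return math.exp(log_ratio)

def advantage_filter(advantages: list[float], k: float = 1.0) -> list[float]:
    """Advantage Filter (assumed form): zero out advantages that lie
    more than k standard deviations from the batch mean, limiting the
    influence of extreme outliers on the policy update."""
    mu, sigma = mean(advantages), stdev(advantages)
    return [a if abs(a - mu) <= k * sigma else 0.0 for a in advantages]
```

For example, `advantage_filter([1.0, 1.0, 1.0, 100.0])` suppresses the outlier 100.0 while leaving the typical values untouched.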

Training is performed in two stages. First, the reward-modeling task is reformulated as a rule-based RL problem, and a 200k preference dataset (R1‑Reward‑200k) is built from GPT‑4o‑generated reasoning-and-answer pairs; then a progressive‑difficulty curriculum selects hard samples for RL fine‑tuning.
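A progressive-difficulty curriculum of this kind typically ranks samples by how often the current model already solves them. The concrete selection criterion below (keeping samples whose pass rate falls in a low band) is an assumption for illustration, not the paper's exact rule.

```python
def select_hard_samples(samples: list, pass_rates: list[float],
                        low: float = 0.0, high: float = 0.5) -> list:
    """Assumed curriculum step: keep samples the current model answers
    correctly at most `high` of the time (pass rate measured over
    repeated rollouts), so RL fine-tuning focuses on hard cases."""
    return [s for s, p in zip(samples, pass_rates) if low <= p <= high]
```

Raising `high` over successive training rounds would gradually admit easier samples, giving the "progressive" schedule.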

Experiments on multiple multimodal reward benchmarks (VL‑Reward‑Bench, Multimodal Reward Bench, MM‑RLHF‑Reward Bench) show that the R1‑Reward model outperforms the previous SOTA by 5–15%, and improves further when multiple inference-time samples are aggregated by voting (e.g., voting over 5 samples raises accuracy from ~71% to ~85%). With test‑time scaling and "any‑correct" voting, accuracy approaches 100% at 15 samples.
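The two aggregation schemes mentioned above are straightforward to sketch. Here each judgment is the reward model's preferred response label (e.g., 'A' or 'B'); the function names are illustrative, not from the released code.

```python
from collections import Counter

def majority_vote(judgments: list[str]) -> str:
    """Majority voting: run the reward model K times on the same
    comparison and return the most frequent preference."""
    return Counter(judgments).most_common(1)[0][0]

def any_correct(judgments: list[str], gold: str) -> bool:
    """'Any-correct' voting: count the comparison as solved if at
    least one of the K sampled judgments matches the gold preference,
    an upper bound used for test-time scaling analysis."""
    return gold in judgments
```

Majority voting measures what a deployed K-sample ensemble would actually return, while any-correct measures the ceiling achievable if a perfect selector picked among the K samples, which is why the latter approaches 100%.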

Additional findings include a ~15 % reduction in average output length after RL training, indicating more efficient reasoning, and emergent self‑correction behavior when the model detects inconsistencies in its own analysis.

The authors release the training code ( https://github.com/yfzhang114/r1_reward ) and the pretrained model ( https://huggingface.co/yifanzhang114/R1-Reward ), and discuss future directions such as more sophisticated test‑time aggregation methods.

Tags: AI, LLM, reinforcement learning, multimodal reward model, R1-Reward, StableReinforce
Written by

Kuaishou Tech

Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.
