Decoding OpenAI’s o1: How RL and Process‑Supervised Reward Models Might Power the Next LLM

The author speculates on OpenAI’s o1 architecture, proposing that it relies on reinforcement learning guided by a generalizable, process‑supervised reward model, and outlines data collection, multi‑model generation, and training tweaks needed to realize such a system.


Reward Model Design

The only publicly confirmed detail about OpenAI's o1 model is that it uses reinforcement learning (RL) guided by a reward model. Two design principles are emphasized:

Generality: The reward model should not be task‑specific (e.g., only math or code) but should reward outputs that are longer, more diverse, and exhibit broadly useful capabilities.

Process supervision: Instead of scoring only final outcomes (outcome‑reward model, ORM), the model should predict a score for each reasoning step. This aligns with the process‑supervised reward model (PRM) described in https://arxiv.org/abs/2305.20050, where each intermediate step receives a numeric rating that the reward model learns to predict.
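To make the per‑step scoring concrete, here is a minimal PyTorch sketch. The stub encoder, feature shapes, and labels are illustrative assumptions, not the paper's (or o1's) actual architecture; the point is simply that the PRM emits one score per reasoning step and is fitted with a per‑step binary cross‑entropy loss:

```python
import torch
import torch.nn as nn

class ProcessRewardModel(nn.Module):
    """Toy PRM: an encoder (a stub here, a pretrained LM in practice)
    plus a linear head mapping each reasoning step to a score in [0, 1]."""

    def __init__(self, hidden_size: int = 768):
        super().__init__()
        # Stand-in for a pretrained language-model encoder, so the
        # sketch runs end to end without downloading weights.
        self.encoder = nn.Sequential(nn.LazyLinear(hidden_size), nn.Tanh())
        self.score_head = nn.Linear(hidden_size, 1)

    def forward(self, step_features: torch.Tensor) -> torch.Tensor:
        # step_features: (num_steps, feature_dim), one row per step.
        h = self.encoder(step_features)
        return torch.sigmoid(self.score_head(h)).squeeze(-1)  # (num_steps,)

# Per-step labels (here: the trace goes wrong at step 3) are fitted
# with binary cross-entropy, one prediction per intermediate step.
prm = ProcessRewardModel()
features = torch.randn(5, 128)                 # 5 steps, stand-in features
labels = torch.tensor([1.0, 1.0, 0.0, 0.0, 0.0])
loss = nn.functional.binary_cross_entropy(prm(features), labels)
loss.backward()
```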

Data Collection for PRM

Math problems are ideal because they are “hard‑to‑generate, easy‑to‑verify.” The pipeline would:

Generate a solution trace for a math problem.

Record each intermediate reasoning step.

Assign a score to each step reflecting the probability that the step leads to a correct final answer.

Automatic annotation can follow methods such as Math‑Shepherd (https://arxiv.org/abs/2312.08935) and OmegaPRM (https://arxiv.org/abs/2406.06592), which estimate each step's usefulness by sampling completions from that step and measuring how often they reach the correct final answer.
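The shared Monte‑Carlo idea behind those methods fits in a few lines. In this sketch, `sample_completion` and `extract_answer` are hypothetical callables standing in for an LLM sampling call and an answer parser:

```python
def estimate_step_score(prefix_steps, gold_answer, sample_completion,
                        extract_answer, n_rollouts=8):
    """Monte-Carlo label for one step: continue the partial solution
    n_rollouts times and score the step by the fraction of rollouts
    that reach the correct final answer."""
    hits = sum(
        extract_answer(sample_completion(prefix_steps)) == gold_answer
        for _ in range(n_rollouts)
    )
    return hits / n_rollouts

def annotate_trace(steps, gold_answer, sample_completion, extract_answer):
    """Score every prefix of a solution trace; the resulting
    (step, score) pairs become PRM training data."""
    return [
        (step,
         estimate_step_score(steps[: i + 1], gold_answer,
                             sample_completion, extract_answer))
        for i, step in enumerate(steps)
    ]
```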

Generating Long Chain‑of‑Thought (CoT) Data

Single‑model prompting struggles to produce sufficiently long, high‑quality CoT sequences (see https://arxiv.org/abs/2408.07055). A more scalable approach is to use multiple agents (or the same model steered by different system messages) that specialize in planning, elaboration, critique, and verification. This multi‑stage generation resembles the framework explored in https://arxiv.org/abs/2409.12917, though current implementations use only a few iterations.
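A minimal sketch of that recipe, assuming only a hypothetical `chat(system_message, user_message)` wrapper around any chat‑completion API; the role prompts and stopping rule are illustrative:

```python
ROLES = {
    "planner":  "Break the problem into a numbered plan. Do not solve it.",
    "solver":   "Carry out the plan step by step, showing all work.",
    "critic":   "Find the first flawed step in the reasoning, if any.",
    "verifier": "Independently check the final answer; reply PASS or FAIL.",
}

def generate_long_cot(problem, chat, max_rounds=3):
    """Build one long CoT sample by letting the same base model play
    several roles, each steered by a different system message."""
    transcript = [f"Problem: {problem}"]
    transcript.append("[plan]\n" + chat(ROLES["planner"], problem))
    for _ in range(max_rounds):
        context = "\n\n".join(transcript)
        transcript.append("[solution]\n" + chat(ROLES["solver"], context))
        transcript.append("[critique]\n" + chat(ROLES["critic"],
                                                "\n\n".join(transcript)))
        verdict = chat(ROLES["verifier"], "\n\n".join(transcript))
        transcript.append("[verdict]\n" + verdict)
        if verdict.strip().upper().startswith("PASS"):
            break  # current implementations stop after few iterations
    return "\n\n".join(transcript)  # one long CoT training sample
```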

Training Procedure

Pure self‑play (as in AlphaZero) is unlikely to suffice: unlike a board game with a built‑in win condition, language modeling has no self‑contained environment, so training must stay anchored to human‑generated data. A practical training loop could be:

Fine‑tune the base model on high‑scoring traces from the collected PRM dataset (effectively a form of rejection‑sampling SFT).

Apply RL using a modified PPO algorithm that emphasizes sampling efficiency (inspired by https://arxiv.org/abs/2410.01679).

Introduce regularization terms or penalties that control reasoning length to prevent runaway CoT generation (see the sketch below).
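One simple way to implement such a length penalty is to shape the scalar reward used in RL; the budget and coefficient below are assumptions for the sketch, not known o1 values:

```python
def shaped_reward(step_scores, num_tokens, target_len=2048, length_coef=1e-4):
    """Average the PRM's per-step scores, then subtract a soft penalty
    once the chain of thought exceeds a token budget, so RL cannot
    inflate reward simply by generating ever-longer reasoning."""
    prm_score = sum(step_scores) / max(len(step_scores), 1)
    overflow = max(0, num_tokens - target_len)
    return prm_score - length_coef * overflow

# A mostly-correct trace that ran 1000 tokens over budget:
print(shaped_reward([0.9, 0.8, 0.95], num_tokens=3048))  # ~0.7833
```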

Inference Hypothesis

The author speculates that o1 will generate a single, long CoT rather than relying on multiple agents or Monte‑Carlo Tree Search (MCTS) at inference time. MCTS‑like techniques might still be useful during data generation or RL to improve sample efficiency.

Open‑Source References

Relevant implementation work includes:

Reward‑model support added to vLLM (Qwen2.5‑math‑72B‑rm example): https://github.com/vllm-project/vllm/pull/8896

PRM training integrated into OpenRLHF: https://github.com/OpenRLHF/OpenRLHF/pull/442

Author: 朱小霖 (Zhu Xiaolin)

Original link: https://zhuanlan.zhihu.com/p/839732117
