How Does OpenAI’s o1 Achieve Self‑Correction? A Deep Dive into MCTS and SCoRe

This article examines the self‑correction behavior of OpenAI’s o1 model. It connects test‑time scaling, MCTS‑style reasoning, and DeepMind’s SCoRe reinforcement‑learning framework, walks through a step‑by‑step demo, hypothesizes about the internal judgment mechanism, and outlines a training pipeline that combines self‑generated data with post‑training RL.

Baobao Algorithm Notes

Oct 11, 2024

Observed self‑correction in o1 demos

Expanding the “thought for … seconds” view on the o1 demo page (https://openai.com/index/learning-to-reason-with-llms/) shows that the model frequently inserts phrases such as “but wait” and “alternatively”. The pattern is consistent: a first line of reasoning, a self‑validation cue (“but wait”), and then an alternative approach.

User_question: …
o1: …, But wait, …
Alternatively, …, hmm but, …
Alternatively, …

Link to Monte‑Carlo Tree Search (MCTS)

In an MCTS‑style view of inference, each token can be seen as expanding a node. Under this reading, when the confidence at a node is insufficient, the model emits “but wait” and creates a child node that represents an alternative line of reasoning, rather than back‑tracking to an earlier node.
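The forward‑only expansion described above can be sketched in a few lines of Python. This is an illustration of the hypothesis, not o1's actual mechanism: the `Node` class, the `propose` callback, the confidence scores, and the threshold are all hypothetical names and values introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Node:
    """One reasoning step with a (hypothetical) confidence score in [0, 1]."""
    text: str
    confidence: float
    children: List["Node"] = field(default_factory=list)


def reason_forward(
    first: Node,
    propose: Callable[[List[Node]], Tuple[str, float]],
    threshold: float = 0.7,
    max_corrections: int = 3,
) -> List[Node]:
    """Extend the chain with 'But wait' children until confidence is high enough.

    The path only ever grows: a weak node gets a child holding an
    alternative, and no earlier node is revisited (no backtracking).
    """
    path = [first]
    while path[-1].confidence < threshold and len(path) <= max_corrections:
        text, conf = propose(path)  # the whole path so far is the context
        child = Node("But wait, " + text, conf)
        path[-1].children.append(child)
        path.append(child)
    return path


def toy_propose(path: List[Node]) -> Tuple[str, float]:
    """Stand-in for the model: each alternative is a bit more confident."""
    conf = min(1.0, path[-1].confidence + 0.3)
    return f"alternative approach {len(path)}", conf


path = reason_forward(Node("first line of reasoning", 0.2), toy_propose)
for node in path:
    print(f"{node.confidence:.1f}  {node.text}")
```

With the toy proposer, the chain stops as soon as a step clears the threshold, so the result is a single root‑to‑leaf path in which every correction is a child of the step it corrects.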

Step‑by‑step example (pH of 0.1 M NH₄F)

Rephrase: Restate the problem and list given quantities.

Propose subquestions: Ask “What ions are present?” and “How do they affect pH?”.

Provide subanswers: Identify NH₄⁺ and F⁻, note NH₄⁺ is a weak acid and F⁻ a weak base.

Emphasize: Re‑emphasize the original task.

First subanswer: Compute a pH estimate but label it “insufficient”.

Self‑corrected subanswer: Generate an alternative calculation after the model judges the first as inadequate.

Final answer: Combine the corrected reasoning chain into the user‑facing solution.
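To make the chemistry concrete, here is a small Python sketch that computes the pH in two ways: a closed‑form estimate for a salt of a weak acid and a weak base, and a numerical cross‑check that solves the charge balance. It illustrates the idea of a first estimate followed by an alternative calculation and is not a reproduction of o1's reasoning chain. The equilibrium constants are assumed textbook values (Ka(HF) ≈ 6.8e‑4, Kb(NH₃) ≈ 1.8e‑5 at 25 °C) and do not come from the article.

```python
import math

# Assumed textbook constants at 25 degrees C (not taken from the article).
KW = 1.0e-14            # ion product of water
KA_HF = 6.8e-4          # acid dissociation constant of HF
KB_NH3 = 1.8e-5         # base dissociation constant of NH3
KA_NH4 = KW / KB_NH3    # acid dissociation constant of NH4+


def ph_closed_form() -> float:
    """First estimate: pH = (pKa(NH4+) + pKa(HF)) / 2, independent of concentration."""
    return 0.5 * (-math.log10(KA_NH4) - math.log10(KA_HF))


def charge_imbalance(h: float, conc: float) -> float:
    """[H+] + [NH4+] - [OH-] - [F-] for a trial [H+] = h; zero at equilibrium."""
    nh4 = conc * h / (h + KA_NH4)
    fluoride = conc * KA_HF / (h + KA_HF)
    return h + nh4 - KW / h - fluoride


def ph_numeric(conc: float = 0.1, iterations: int = 200) -> float:
    """Alternative calculation: bisection on [H+] in log space."""
    lo, hi = 1e-14, 1.0  # the imbalance is negative at lo and positive at hi
    for _ in range(iterations):
        mid = math.sqrt(lo * hi)
        if charge_imbalance(mid, conc) > 0:
            hi = mid
        else:
            lo = mid
    return -math.log10(math.sqrt(lo * hi))


print(f"closed form: {ph_closed_form():.2f}")
print(f"numeric:     {ph_numeric():.2f}")
```

Under these assumed constants both routes give a pH of roughly 6.2, slightly acidic because NH₄⁺ is a stronger acid than F⁻ is a base.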

Hypothesis on the judgment mechanism

The model reaches an MCTS node where the available action is “propose a self‑corrected subanswer”.

All preceding nodes provide context for generating the corrected answer.

“Correction” may be a refinement or a restatement, depending on the model’s internal confidence assessment.

This allows the search to continue without explicit backtracking.
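The hypothesis can be sketched as a small action space in which self‑correction is one more action, and the prompt for that action is assembled from every preceding node. The `Action` names mirror the step labels used in the example; the `correction_mode` rule and its threshold are assumptions made for illustration, since the article does not specify how confidence maps to a refinement or a restatement.

```python
from enum import Enum
from typing import List, Tuple


class Action(Enum):
    """Hypothetical action space, named after the steps in the pH example."""
    REPHRASE = "Rephrase"
    PROPOSE_SUBQUESTION = "Propose subquestion"
    SUBANSWER = "Subanswer"
    EMPHASIZE = "Emphasize"
    SELF_CORRECT = "Self-corrected subanswer"


Step = Tuple[Action, str]


def build_context(path: List[Step]) -> str:
    """Concatenate all preceding nodes; the correction is generated from this."""
    return "\n".join(f"{action.value}: {text}" for action, text in path)


def correction_mode(confidence: float, threshold: float = 0.5) -> str:
    """Assumed rule: a confident first answer is restated, a weak one is refined."""
    return "restatement" if confidence >= threshold else "refinement"


path = [
    (Action.REPHRASE, "Find the pH of 0.1 M NH4F."),
    (Action.SUBANSWER, "First estimate, judged insufficient."),
]
print(build_context(path))
print(correction_mode(0.2))
```

Because the corrected answer is appended as a new step rather than replacing an old one, the earlier attempt stays in the context and the search never has to rewind.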

SCoRe: Self‑Correction via Reinforcement Learning

DeepMind’s SCoRe paper (https://arxiv.org/abs/2409.12917) proposes a reinforcement‑learning framework that trains a model to generate its own correction data and perform a single correction round.

Training data are entirely self‑generated (on‑policy).

Each training instance consists of

problem + first answer + instruction template + second answer + final answer
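One way to picture such an instance is as a small record whose fields are concatenated in the order listed above. The class name, field names, and sample strings below are hypothetical, and the instruction text is a placeholder rather than the wording used in the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorrectionTrace:
    """One self-generated training instance (hypothetical field names)."""
    problem: str
    first_answer: str
    instruction: str
    second_answer: str
    final_answer: str

    def as_sequence(self) -> str:
        """Join the parts in the order: problem, first answer, instruction,
        second answer, final answer."""
        return "\n".join(
            [
                self.problem,
                self.first_answer,
                self.instruction,
                self.second_answer,
                self.final_answer,
            ]
        )


trace = CorrectionTrace(
    problem="What is 17 * 6?",
    first_answer="17 * 6 = 112",
    instruction="Review the previous answer and correct it if needed.",  # placeholder
    second_answer="17 * 6 = 102",
    final_answer="102",
)
print(trace.as_sequence())
```

Both answers in a trace are sampled from the model being trained, which is what makes the data on‑policy.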

Stage 1

The base generator produces a first‑round answer x1, from which the numeric answer y1 is extracted.

An instruction template p1 prompts a second‑round answer x2 with numeric part y2.

The objective maximizes the second‑round reward (reward₂) while a KL‑divergence term, scaled by a hyper‑parameter, keeps the distribution of x1 close to the base model. In the SCoRe paper, the shaped bonus (reward₂ – reward₁) enters only in Stage 2.
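The KL‑regularized part of this objective can be sketched per sample as a scalar reward minus a scaled KL estimate over the first‑round tokens. This is a minimal sketch under stated assumptions: the function names and the default `beta` are hypothetical, and the KL term uses the simple Monte Carlo estimate (the sum of log‑probability differences on tokens sampled from the current policy), not necessarily the estimator used in the paper.

```python
from typing import Sequence


def kl_estimate(logp_policy: Sequence[float], logp_ref: Sequence[float]) -> float:
    """Monte Carlo estimate of KL(policy || reference) for one sampled sequence.

    Both inputs hold per-token log-probabilities of the same tokens,
    which were sampled from the current policy.
    """
    if len(logp_policy) != len(logp_ref):
        raise ValueError("log-probability sequences must have the same length")
    return sum(p - r for p, r in zip(logp_policy, logp_ref))


def regularized_objective(
    reward: float,
    logp_policy_first: Sequence[float],
    logp_ref_first: Sequence[float],
    beta: float = 0.1,  # hypothetical KL coefficient
) -> float:
    """Reward minus beta times the KL estimate on the first-round answer x1."""
    return reward - beta * kl_estimate(logp_policy_first, logp_ref_first)


# The policy assigns more probability to its own x1 tokens than the base model does,
# so the KL estimate is positive and the objective is pulled below the raw reward.
value = regularized_objective(1.0, [-0.5, -0.5], [-1.0, -1.0], beta=0.1)
print(value)
```

A larger `beta` pins the first‑round answer more tightly to the base model, leaving the second round as the main place where the policy can change.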

Stage 2

Using the updated model, repeat the two‑round process.

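Compute rewards for both rounds, add them, and apply KL‑regularization against the base model.

A shaped bonus, (reward₂ – reward₁) scaled by a hyper‑parameter, is added to the second‑round reward. It rewards a second answer that improves on the first and penalizes one that turns a correct first answer into an incorrect one.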
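The effect of the shaped term is easiest to see with binary correctness rewards. The sketch below adds a bonus of alpha · (reward₂ – reward₁) to the second‑round reward; the function name and the value of `alpha` are illustrative assumptions, and the KL term is left out to keep the example short.

```python
from typing import Tuple


def shaped_rewards(reward_1: float, reward_2: float, alpha: float = 2.0) -> Tuple[float, float]:
    """Return per-round rewards with the progress bonus added to round two.

    alpha is a hypothetical value for the scaling hyper-parameter.
    """
    bonus = alpha * (reward_2 - reward_1)
    return reward_1, reward_2 + bonus


# Binary rewards: 1.0 for a correct answer, 0.0 for an incorrect one.
cases = {
    "incorrect -> correct": shaped_rewards(0.0, 1.0),
    "correct -> incorrect": shaped_rewards(1.0, 0.0),
    "correct -> correct": shaped_rewards(1.0, 1.0),
    "incorrect -> incorrect": shaped_rewards(0.0, 0.0),
}
for name, (r1, r2) in cases.items():
    print(f"{name}: total = {r1 + r2}")
```

Without the bonus, a fix and a regression both total 1.0. With it, the fix is worth more and the regression is pushed below zero, while a trajectory that does not change correctness is unaffected.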

Overall training loop

Iterate Stage 1 and Stage 2 multiple times until a stopping criterion (e.g., validation performance) is met. The resulting model learns to decide when to invoke a self‑corrected answer internally, without external verification.

Conclusion

The “but wait” / “alternatively” tokens observed in o1 suggest an implicit MCTS‑style self‑validation step. Reinforcement‑learning approaches such as SCoRe offer a concrete way to train this capability: they use self‑generated data and a reward that favors improvement, so that a large language model can learn to repair its own answers during inference.
