How MCTS Powers Inference in OpenAI’s o1: A Deep Dive with rStar

This article explains how the inference component of OpenAI’s o1 model can be implemented using Monte‑Carlo Tree Search, detailing the action space, rollout process, UCT scoring, and best‑path selection, with a concrete walkthrough of Microsoft’s open‑source rStar code.

Baobao Algorithm Notes

In the overall o1 framework, the inference stage improves logical reasoning by allocating more compute at inference time. This piece zooms in on the inference block of that abstract framework, which serves two main purposes: (1) direct inference-time optimization (e.g., PRM + search, MCTS) and (2) selecting data for post-training.

Why rStar?

rStar is an open-source project (https://github.com/zhentingqi/rStar) that implements inference-only optimization with MCTS rather than PRM-based methods, making it well suited to small language models, for which PRM labeling costs are high.

Action Space (A1‑A5)

A1 (propose a one-step thought): generate step-by-step reasoning in which each step ends with a partial answer; the final step ends with "The answer is …".

A2 (propose the remaining thought steps): perform single-shot reasoning that directly yields the final answer.

A3 (propose the next sub-question and answer): decompose the original question into sub-questions and answer them one at a time.

A4 (re-answer a sub-question): regenerate the answer for a previously produced sub-question, typically using the A2 template.

A5 (rephrase the question): extract the essential conditions from a long question and restate it concisely.

Prompts for each action are stored in rStar/prompts. The inference module can be assembled from these actions to form a search tree.
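As a rough sketch, the five actions can be modeled as an enum that maps to prompt templates. The identifiers and file names below are illustrative assumptions, not the actual layout of rStar/prompts:

```python
from enum import Enum

class Action(Enum):
    """The five rStar actions; values are descriptive labels, not the repo's identifiers."""
    A1 = "propose one-step thought"
    A2 = "propose remaining thought steps"
    A3 = "propose next sub-question and answer"
    A4 = "re-answer sub-question"
    A5 = "rephrase the question"

def prompt_path(action: Action) -> str:
    # Hypothetical lookup; the real file names under rStar/prompts differ.
    return f"rStar/prompts/{action.name.lower()}.txt"
```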

Search Tree Construction with MCTS

The model builds a tree from the user question (root) using the standard MCTS phases:

Select: traverse from the root, at each level picking an unexplored child if one exists, otherwise the child with the highest UCT value.

Expand: generate all permissible child actions according to the dependency rules (e.g., A1 can spawn A1/A2, A3 can spawn A1-A4, etc.).

Simulate: repeatedly sample a random child until a leaf node (terminal A2, A3, or final A1) or the maximum depth is reached.

Backpropagate: update the reward and visit count (freq) of every node on the visited path.

A single iteration of these four phases is called a rollout. Multiple rollouts (16 by default) gradually expand a near-complete tree.
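The select and backpropagate bookkeeping can be sketched in a few lines of Python. This is a minimal, self-contained illustration, not rStar's actual code; the node fields and function names are assumptions:

```python
import math
import random

class Node:
    """Minimal tree node; rStar's real nodes also carry the action type and text."""
    def __init__(self, parent=None):
        self.parent = parent
        self.children = []
        self.reward = 0.0   # cumulative reward Q
        self.visits = 0     # visit count N (the article's `freq`)

def uct(node, c):
    return node.reward / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def select(node, c):
    """Walk down from the root: prefer an unexplored child, else the max-UCT child."""
    while node.children:
        unexplored = [ch for ch in node.children if ch.visits == 0]
        if unexplored:
            return random.choice(unexplored)
        node = max(node.children, key=lambda ch: uct(ch, c))
    return node

def backpropagate(leaf, reward):
    """Update reward and visit count for every node on the path back to the root."""
    node = leaf
    while node is not None:
        node.reward += reward
        node.visits += 1
        node = node.parent
```

Expansion and simulation would plug in between: `select` returns a frontier node, the permitted child actions are generated for it, and `backpropagate` propagates the simulated reward upward.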

UCT Value

The UCT score for a node is computed as:

UCT = Q / N + c · sqrt( ln(N_parent) / N )

where Q is cumulative reward, N is visit count, N_parent is the parent’s visit count, and c balances exploration vs. exploitation. Early rollouts use a larger c (more exploration); later rollouts reduce c (more exploitation).
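In code, the score and an annealed exploration constant might look like the following sketch; the linear annealing schedule and the constants are illustrative assumptions, not rStar's actual values:

```python
import math

def uct(q, n, n_parent, c):
    """UCT = exploitation term (Q/N) + exploration bonus."""
    if n == 0:
        return float("inf")  # unvisited nodes are always explored first
    return q / n + c * math.sqrt(math.log(n_parent) / n)

def exploration_constant(rollout, total_rollouts, c_start=2.0, c_end=0.5):
    """Anneal c from exploration-heavy (early rollouts) to exploitation-heavy.

    The linear schedule and endpoint values here are assumptions for
    illustration only.
    """
    frac = rollout / max(total_rollouts - 1, 1)
    return c_start + frac * (c_end - c_start)
```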

Selecting the Best Path

After the tree is built, solution nodes are identified as terminal A2, terminal A3, or final A1 nodes. Solution nodes are grouped by answer, and a vote-based score (answer frequency / total number of solutions) is computed for each group. In addition, a prior weight is calculated for each node by multiplying depth-wise consistency scores, reflecting how well each intermediate step aligns with the final answer.

The final score for a solution node is vote_score × prior_weight. Two selection strategies are possible:

Pick the node with the highest final score (default).

Sample nodes proportionally to their scores (implemented but not used).
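The default scoring can be sketched as follows, assuming each solution node has already been reduced to an (answer, prior_weight) pair; the helper name and interface are hypothetical:

```python
from collections import Counter

def score_solutions(solutions):
    """Score solution nodes as vote_score * prior_weight.

    solutions: list of (answer, prior_weight) pairs, one per solution node.
    Returns the best-scoring (answer, score) pair and the full score list.
    """
    votes = Counter(ans for ans, _ in solutions)   # answer frequency
    total = len(solutions)
    scores = [(ans, (votes[ans] / total) * prior) for ans, prior in solutions]
    best = max(scores, key=lambda pair: pair[1])   # default: pick the top score
    return best, scores
```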

Using a Discriminator

rStar also offers a GAN-like option: the generated tree acts as the generator, and a separate small model (the discriminator) checks consistency. For each candidate path, the reasoning text after a randomly chosen step is masked, and the discriminator completes the remainder and produces an answer. If the discriminator's answer matches the generator's, the path is considered trustworthy.
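A minimal sketch of this mask-and-complete check, assuming the discriminator is any callable from partial reasoning text to an answer string (this interface is an assumption, not rStar's API):

```python
import random

def discriminator_agrees(path_steps, final_answer, discriminator, rng=random):
    """Mask the path after a randomly chosen step, let a second model finish it,
    and trust the path only if the completed answer matches the generator's."""
    cut = rng.randrange(1, len(path_steps))     # keep at least the first step
    partial = "\n".join(path_steps[:cut])       # visible prefix of the reasoning
    predicted = discriminator(partial)
    return predicted.strip() == final_answer.strip()
```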

Conclusion

The article demonstrates, through the rStar codebase, how MCTS can be applied to LLM inference for logical reasoning tasks, detailing the action space, tree‑building algorithm, UCT scoring, and path selection mechanisms. Future posts will extend the framework with additional modules.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by Baobao Algorithm Notes, author of the BaiMian large model, offering technology and industry insights.