How to Replicate OpenAI’s o1: A Detailed Step‑by‑Step Guide

This article breaks down the replication of OpenAI’s o1 model into four phases—assessment, journey‑learning foundation, component implementation, and training—while highlighting key challenges such as building scalable long‑thought data, reward models, and policy reasoning trees, and discusses the broader impact of o1’s reasoning abilities.

Fighter's World
Fighter's World
Fighter's World
How to Replicate OpenAI’s o1: A Detailed Step‑by‑Step Guide

After OpenAI unveiled the o1 series on September 12, 2024, the community quickly began extending scaling laws from training to inference-time compute. The papers "O1 Replication Journey" (Parts 1 and 2) from Shanghai Jiao Tong University’s GAIR Lab serve as the primary sources for this guide.

Step 1 – Initial Assessment and Understanding of o1’s Thought Structure

1.1 Evaluate o1’s Performance: Conduct a comprehensive benchmark on tasks such as mathematical reasoning using OlympicArena and high-school exam datasets to identify strengths and gaps.

1.2 Analyze o1’s Thought Process: Study OpenAI-released reasoning examples, measuring token count, line count, and keyword frequency to uncover patterns in how o1 tackles increasing difficulty (a small analysis sketch follows this step).

1.3 Consult Experts: Invite mathematicians and domain specialists to review complex problem solutions and extract structured representations of the reasoning chains.

1.4 Define Long-Thought Characteristics: Based on the analysis, specify the desired attributes of "long thoughts"—iterative problem solving, key reasoning indicators, alternative path exploration, and detailed step-by-step explanations.
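
To make 1.2 concrete, here is a minimal Python sketch of that kind of surface analysis. The keyword list and whitespace tokenization are illustrative assumptions, not the paper’s exact methodology.

```python
# A small sketch of the surface analysis in 1.2: counting tokens, lines,
# and reflection keywords in a reasoning trace. The keyword list and
# whitespace tokenization are illustrative choices only.
REASONING_KEYWORDS = ["wait", "alternatively", "let me check", "hmm", "actually"]

def profile_thought(text: str) -> dict:
    """Compute simple surface statistics for one reasoning trace."""
    lower = text.lower()
    return {
        "n_tokens": len(text.split()),       # rough whitespace tokenization
        "n_lines": len(text.splitlines()),
        "keywords": {k: lower.count(k) for k in REASONING_KEYWORDS},
    }

example = ("Let me check the base case first.\n"
           "Hmm, that fails for n = 1.\n"
           "Alternatively, try strong induction.")
print(profile_thought(example))
```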

Step 2 – Building the Foundation: Journey Learning and Long‑Thought Construction

2.1 Embrace Journey Learning: Shift from the traditional next-token shortcut learning to a paradigm that encourages models to experience the full exploration process, including trial-and-error, reflection, and backtracking.

2.2 Construct Long Thoughts:

Complete Human Thought Process Annotation: Experts record every reasoning step, reflection, and backtrack to generate high-quality long-thought data, which is currently scarce.

Multi-Agent Debate: Deploy multiple agents representing different reasoning strategies; their interaction yields diverse solution paths ("three cobblers make a Zhuge Liang," i.e., several ordinary minds together can rival a mastermind).

Propose-Critique Loop: One agent proposes a step, another critiques and suggests corrections, forming an iterative feedback loop that builds a comprehensive reasoning tree.

Tree Search with LLM and Reward: Use a reward-guided tree-search algorithm where each node corresponds to a reasoning step, allowing the model to incorporate reflection and backtracking based on reward evaluations (see the sketch below).
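
As a hypothetical illustration of the tree-search option, the sketch below uses a best-first frontier over partial reasoning chains; `propose_steps` and `score_step` are toy stand-ins for the policy and reward models, and popping the globally best node is what provides backtracking.

```python
# A hypothetical sketch of reward-guided tree search: each node is a partial
# chain of reasoning steps, a policy proposes candidate next steps, and a
# reward model scores them. The toy functions below stand in for model calls.
import heapq
import random

def propose_steps(path: list[str], k: int = 3) -> list[str]:
    # Toy policy: in practice, sample k candidate next steps from the LLM.
    return [f"step {len(path)}.{i}" for i in range(k)]

def score_step(path: list[str], step: str) -> float:
    # Toy reward: in practice, query the process reward model.
    return random.random()

def reward_guided_search(problem: str, max_expansions: int = 50) -> list[str]:
    frontier = [(0.0, [problem])]   # (negated cumulative reward, path)
    best_reward, best_path = float("-inf"), [problem]
    for _ in range(max_expansions):
        if not frontier:
            break
        # Popping the globally best node gives backtracking for free: when a
        # branch scores poorly, search resumes from a more promising node.
        neg_r, path = heapq.heappop(frontier)
        for step in propose_steps(path):
            r = -neg_r + score_step(path, step)
            heapq.heappush(frontier, (-r, path + [step]))
            if r > best_reward:
                best_reward, best_path = r, path + [step]
    return best_path

print(reward_guided_search("Prove that sqrt(2) is irrational."))
```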

2.3 Select the Best Approach: Choose a construction method that balances computational cost, data availability, and the need for expert input.

Step 3 – Implementing Key Components: Reward Models, Policy Models, and Reasoning Trees

3.1 Develop Process-Level Reward Models: Build models that assess the correctness of individual reasoning steps, not just final answers. Experiments on MR-GSM8K and PRM800K compare open-source options (e.g., Math-Shepherd) with proprietary ones (e.g., o1-mini).
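
As a hedged illustration of 3.1, the sketch below scores every prefix of a solution with a process reward model and aggregates with min; `prm_score` is a toy stand-in for a real PRM such as Math-Shepherd, and min-aggregation is one common convention rather than the paper’s confirmed choice.

```python
# A sketch of process-level reward scoring: the PRM scores every step of a
# solution, and a solution-level score aggregates them. `prm_score` is a
# toy stand-in for a real process reward model such as Math-Shepherd.
import random

def prm_score(question: str, steps_so_far: list[str]) -> float:
    # Toy stand-in: return P(latest step is correct) in [0, 1].
    return random.random()

def solution_score(question: str, steps: list[str]) -> float:
    # A chain is only as strong as its weakest step, hence min-aggregation.
    scores = [prm_score(question, steps[: i + 1]) for i in range(len(steps))]
    return min(scores, default=0.0)

# Example use: rerank candidate solutions (best-of-n) by process-level score.
question = "What is 17 * 24?"
candidates = [
    ["17*24 = 17*20 + 17*4", "= 340 + 68", "= 408"],
    ["17*24 = 17*25 - 17", "= 425 - 17", "= 408"],
]
print(max(candidates, key=lambda s: solution_score(question, s)))
```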

3.2 Construct On-Policy Reasoning Trees: Use a policy model (π), fine-tuned on datasets with clear step-by-step solutions so that it can perform single-step inference, then apply beam search and step segmentation to build the trees efficiently.
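
A minimal sketch of step-level beam search for 3.2 follows, assuming the policy can emit one segmented step at a time; `sample_steps` and `step_reward` are toy placeholders for the policy model and the reward model.

```python
# A minimal sketch of step-level beam search for building an on-policy
# reasoning tree. Step segmentation is assumed to happen inside the policy
# (e.g., stopping generation at a step delimiter).
import random

def sample_steps(path: list[str], n: int) -> list[str]:
    # Toy policy: in practice, sample n single-step continuations from pi.
    return [f"step {len(path)}.{i}" for i in range(n)]

def step_reward(path: list[str], step: str) -> float:
    # Toy reward: in practice, query the process reward model.
    return random.random()

def build_tree_beam(problem: str, beam_width: int = 2,
                    samples_per_beam: int = 3, max_depth: int = 4):
    tree = {}                      # parent path -> list of child steps
    beams = [([problem], 0.0)]     # (path, cumulative step reward)
    for _ in range(max_depth):
        candidates = []
        for path, score in beams:
            for step in sample_steps(path, samples_per_beam):
                tree.setdefault(tuple(path), []).append(step)
                candidates.append((path + [step], score + step_reward(path, step)))
        # Keep only the top-scoring partial solutions for further expansion.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
    return tree, beams

tree, beams = build_tree_beam("Solve x^2 - 5x + 6 = 0.")
print(len(tree), "expanded nodes; best path:", beams[0][0])
```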

3.3 Derive Long Thoughts from Reasoning Trees: Traverse the trees with depth-first search (DFS) and optionally polish the generated long thoughts with GPT-4o or a similar model for fluency.
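
The traversal in 3.3 might look like the sketch below: a depth-first walk that narrates dead ends and backtracking into one long thought. The tree schema and connective phrases are illustrative assumptions; the result would then be smoothed by a model such as GPT-4o.

```python
# An illustrative sketch of turning a reasoning tree into a single long
# thought via DFS. The node schema and the connective phrase are
# assumptions for illustration, not the paper's exact format.

def tree_to_long_thought(node: dict) -> list[str]:
    """node = {"text": str, "correct": bool, "children": [node, ...]}"""
    lines = [node["text"]]
    for child in node["children"]:
        lines.extend(tree_to_long_thought(child))
        if not child["correct"]:
            # Narrate the backtrack so the model can learn to self-correct.
            lines.append("Wait, that doesn't work. Let me go back and try another approach.")
    return lines

tree = {
    "text": "Try factoring the quadratic x^2 - 5x + 6.",
    "correct": True,
    "children": [
        {"text": "(x - 1)(x - 6)? That expands to x^2 - 7x + 6.",
         "correct": False, "children": []},
        {"text": "(x - 2)(x - 3) expands to x^2 - 5x + 6, so the roots are 2 and 3.",
         "correct": True, "children": []},
    ],
}
print("\n".join(tree_to_long_thought(tree)))
```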

Step 4 – Training and Evaluation: Fine‑Tuning, Preference Learning, and Human‑AI Collaboration

4.1 Supervised Fine-Tuning (SFT):

Journey Learning Phase: Fine-tune on the constructed long-thought data to improve error handling, reflection, and backtracking.

Shortcut Learning Phase: Further fine-tune on datasets such as Abel and PRM800K that contain only correct steps and final answers.
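
Both SFT phases use the standard next-token cross-entropy objective. A minimal PyTorch sketch follows, with the loss masked to response tokens via the usual -100 ignore index; the model, tokenizer, and data pipeline are omitted.

```python
# A minimal PyTorch sketch of the SFT objective: next-token cross-entropy
# with prompt positions masked out, so the model is only trained to
# reproduce the long thought / answer, not the prompt.
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # logits: (batch, seq, vocab); labels: (batch, seq), with prompt
    # positions set to -100 so cross-entropy ignores them.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )

# Toy shapes: batch of 2, sequence of 8, vocabulary of 100.
logits = torch.randn(2, 8, 100)
labels = torch.randint(0, 100, (2, 8))
labels[:, :3] = -100  # mask the prompt tokens
print(sft_loss(logits, labels).item())
```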

4.2 Direct Preference Optimization (DPO): Apply the DPO loss on pairs of correct and incorrect responses (e.g., from the MATH Train dataset) to teach the model to distinguish good from bad solutions.
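
For reference, here is a sketch of the standard DPO loss on (chosen, rejected) pairs in PyTorch; the beta value and batch shapes are illustrative, not the paper’s settings.

```python
# The standard DPO loss on per-sequence log-probabilities: implicit rewards
# are beta-scaled policy/reference log-prob ratios, and the loss maximizes
# the margin between chosen (correct) and rejected (incorrect) responses.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy per-sequence log-probabilities for a batch of 4 preference pairs.
batch = 4
loss = dpo_loss(torch.randn(batch), torch.randn(batch),
                torch.randn(batch), torch.randn(batch))
print(loss.item())
```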

4.3 Visual Data Analysis Platform: Build a visualization tool for long thoughts, reasoning trees, and model outputs to enable efficient human feedback.

4.4 Human-AI Collaboration in Annotation: Leverage expert annotations and AI-driven data augmentation to scale high-quality training data.

4.5 Iterative Refinement: Continuously evaluate on benchmark suites and human reviews, feeding the results back into reward-model and long-thought improvements.

Key Difficulties

Difficulty 1 – Scaling Long-Thought Data, Reward Models, and Policy Trees: The effort required to build a universal, extensible repository of long-thought data grows exponentially; a single complex reasoning tree can contain dozens of branches, each with dozens of nodes (a branching factor of 3 sustained to depth 10 already yields 3^10 = 59,049 nodes). Constructing complete trees, deciding when to backtrack, and defining step-level rewards for domains lacking clear quantitative metrics are major challenges.

Domain‑specific knowledge, especially the data describing how experts solve complex problems, largely resides in experts’ minds and is difficult to record comprehensively. Public internet data used in LLM pre‑training lacks sufficient scale of such high‑quality data, making it hard for current LLMs to acquire these abilities. Moreover, because this data often involves security concerns, it is unlikely to be learned through pre‑training in the future. —David Luan, former CEO of Adept (pre‑acquisition)

Difficulty 2 – Low-Barrier Deployment in Professional Scenarios: Even with high-quality long-thought data, most organizations cannot pre-train or conduct extensive continual training (CT), limiting the ability to embed specialized knowledge. Approaches such as SFT/CT, Retrieval-Augmented Generation, or full pre-training have not yet delivered the expected ROI, largely due to limited data scale and granularity.

Difficulty 3 – Efficient Expert Participation: While experts provide indispensable high-quality data, coordinating large-scale expert involvement remains costly. Effective human-AI collaboration methods are needed to capture fine-grained reasoning without overwhelming experts.

Impact of o1‑Style Models

Backtracking: The model can recognize dead ends during inference and revert to earlier steps to explore alternative paths.

Self-Correction: Continuous self-evaluation allows the model to adjust its reasoning trajectory toward correct solutions.

Emergent Abilities: These capabilities arise naturally from the model’s architecture and training rather than being hard-coded.

Current limitations include weaker performance in highly specialized or closed domains and an inability to fully replace human expertise in complex tasks such as software engineering.

Overall, o1 demonstrates a significant step toward more general, human‑like reasoning, offering longer “thinking” times, error learning, and interpretable reasoning chains, which fuels anticipation for future models like GPT‑5.

References

O1 Replication Journey: A Strategic Progress Report – Part 1

O1 Replication Journey – Part 2: Surpassing O1‑preview through Simple Distillation – Big Progress or Bitter Lesson?

Fighter's World
Written by

Fighter's World

Live in the future, then build what's missing

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.