Reproducing OpenAI o1: Steiner Model’s Reasoning, Training, and Evaluation

This report details the design, data synthesis, three‑stage training pipeline, and benchmark evaluation of the open‑source Steiner reasoning model, which aims to emulate OpenAI o1’s inference‑time scaling while highlighting current performance gaps and future research challenges.

Baobao Algorithm Notes

TL;DR

Steiner is a reasoning model that explores multiple inference paths autoregressively and can backtrack and verify when needed. It is trained in three steps: DAG‑based data generation, sampling of backtracking paths, and reinforcement learning with heuristic rewards. It gains +5.56 on GPQA‑Diamond but does not yet reproduce o1’s inference‑time scaling.

Introduction

Steiner is a series of reinforcement‑learning‑trained reasoning models that generate and verify multiple reasoning paths within a single context, effectively performing a linear traversal of an implicit search tree.

Background

OpenAI o1 introduced reasoning tokens that enable inference‑time scaling, improving performance by allocating more compute during inference. Unlike tree‑search or agentic frameworks, o1 appears to be a single model that still performs a form of search, prompting curiosity about its underlying mechanisms.

Method

Data synthesis had to work around a gap in existing data: most reasoning datasets contain only forward ("shortcut") CoT steps and no genuine backtracking nodes. Two augmentation strategies were used to address this:

Randomly truncate shortcut datasets, hide the correct answer, and let a strong LLM continue reasoning from the prefix, then provide the answer to create backtracking samples.

Cluster the generated steps, assign unique IDs, construct a directed acyclic graph (DAG) for each question, and randomly sample from the DAG to obtain many reasoning‑path examples.

From these processes, 10K DAGs were built and 50K backtracking‑enabled reasoning paths were sampled. Samples average about 1,600 reasoning tokens each; only samples with at most 4,096 reasoning tokens and at most 8,192 tokens in total were kept.
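The sampling step can be illustrated with a minimal sketch. It assumes the DAG is a plain dictionary mapping a step ID to its successor IDs, and it uses a random depth‑first walk that records a backtrack whenever it hits a dead end. The names `sample_path` and `within_budget`, the toy graph, and the event format are hypothetical and are not taken from the Steiner code; only the 4,096 and 8,192 token limits come from the text above.

```python
import random

def sample_path(dag, start, answer, rng):
    """Random depth-first walk from start to answer, recording backtracks.

    Returns a list of ("visit" | "backtrack", node) events, or None if the
    answer cannot be reached from start.
    """
    trace = [("visit", start)]
    stack = [start]          # current chain of steps from the question
    visited = {start}
    while stack:
        node = stack[-1]
        if node == answer:
            return trace
        options = [n for n in dag.get(node, []) if n not in visited]
        if not options:
            # Dead end: give up on this step and return to its parent.
            stack.pop()
            if stack:
                trace.append(("backtrack", stack[-1]))
            continue
        nxt = rng.choice(options)
        visited.add(nxt)
        stack.append(nxt)
        trace.append(("visit", nxt))
    return None

def within_budget(reasoning_tokens, total_tokens,
                  max_reasoning=4096, max_total=8192):
    """Length filter using the limits quoted in the text."""
    return reasoning_tokens <= max_reasoning and total_tokens <= max_total

# Toy DAG (hypothetical): "Q" is the question, "A" the answer,
# and "dead" is a step that leads nowhere.
dag = {
    "Q": ["s1", "s2"],
    "s1": ["dead"],
    "s2": ["s3"],
    "s3": ["A"],
    "dead": [],
    "A": [],
}

if __name__ == "__main__":
    for seed in range(3):
        print(sample_path(dag, "Q", "A", random.Random(seed)))
```

Depending on the random choices, a trace either goes straight from `Q` through `s2` and `s3` to `A`, or first wanders into `s1` and `dead` and then backtracks, which is the kind of path the shortcut datasets lack.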

Training proceeded in three stages:

Continual Pre‑Training (CPT): Mixed regular text and reasoning paths to teach the model long‑range reasoning and embed 14 special tokens.

Supervised Fine‑Tuning (SFT): Used a chat template to teach the model to name each step, output a full thought, summarize it, reflect, and decide whether to proceed, backtrack, or finish.

Reinforcement Learning with Step‑Level Reward (RL): Designed heuristic rewards based on each DAG node’s in‑degree, out‑degree, distance from the question, and distance to the answer, encouraging balanced exploration depth and breadth.
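The four node features named in the RL stage can be computed with two breadth‑first searches. The sketch below is only an illustration: the source does not give the reward formula, so `step_reward`, its weights, and the dead‑end penalty are assumptions made up for this example, and the toy graph is hypothetical.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest hop count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def node_features(dag, question, answer):
    """In-degree, out-degree, and distances for every node in the DAG."""
    nodes = set(dag) | {n for succ in dag.values() for n in succ}
    reverse = {n: [] for n in nodes}
    for node, succ in dag.items():
        for nxt in succ:
            reverse[nxt].append(node)
    from_q = bfs_distances(dag, question)
    to_a = bfs_distances(reverse, answer)   # search backwards from the answer
    return {
        n: {
            "in_degree": len(reverse[n]),
            "out_degree": len(dag.get(n, [])),
            "dist_from_question": from_q.get(n),  # None if unreachable
            "dist_to_answer": to_a.get(n),        # None if it cannot reach the answer
        }
        for n in nodes
    }

def step_reward(feat, w_progress=1.0, w_branch=0.1, dead_end_penalty=-1.0):
    """Hypothetical heuristic: reward closeness to the answer plus connectivity."""
    if feat["dist_to_answer"] is None:
        return dead_end_penalty
    progress = 1.0 / (1 + feat["dist_to_answer"])
    branching = feat["in_degree"] + feat["out_degree"]
    return w_progress * progress + w_branch * branching

# Toy DAG (hypothetical): "dead" and "s1" cannot reach the answer "A".
dag = {"Q": ["s1", "s2"], "s1": ["dead"], "s2": ["s3"], "s3": ["A"], "dead": [], "A": []}
features = node_features(dag, "Q", "A")

if __name__ == "__main__":
    for name in sorted(features):
        print(name, features[name], step_reward(features[name]))
```

Any real reward would need to balance these terms against each other, since weighting progress too heavily discourages exploration and weighting branching too heavily discourages finishing.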

Evaluation

Performance was measured on the GPQA‑Diamond benchmark (no CoT prompting). Adding the RL stage improved scores by +3.53, and applying a specialized logits processor yielded an additional +5.56 gain.

GPQA‑Diamond was chosen because o1‑mini shows a large improvement on it and its contamination level is low. On MMLU and other datasets, however, Steiner performs about the same as the baselines, likely because the 32B model has limited world knowledge.
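The source does not describe what the logits processor used in the evaluation actually does. As general background only, a logits processor is a function that edits the next‑token scores before sampling. The sketch below shows one common form, masking every token outside an allowed set, in plain Python; `mask_logits`, the toy vocabulary size, and the allowed IDs are hypothetical and are not Steiner's implementation.

```python
import math

def mask_logits(logits, allowed_ids):
    """Set the score of every token outside allowed_ids to -inf."""
    allowed = set(allowed_ids)
    return [x if i in allowed else -math.inf for i, x in enumerate(logits)]

def softmax(logits):
    """Convert scores to probabilities (assumes at least one finite score)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example: a 5-token vocabulary where only tokens 1 and 3 are allowed.
raw = [2.0, 1.0, 0.5, 1.0, 3.0]
probs = softmax(mask_logits(raw, allowed_ids=[1, 3]))

if __name__ == "__main__":
    print(probs)
```

Masked tokens end up with probability zero, so the model can only choose among the allowed ones. A mask that is too restrictive removes options the model would legitimately need.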

Limitations

Current post‑training data lacks multi‑turn dialogue samples; the best Steiner‑preview model (based on Qwen2.5‑32B) does not support multi‑turn conversations.

Custom system prompts or temperature changes are discouraged, as the model was not trained for diverse prompts and may produce malformed reasoning tokens.

Training data is ~90 % English; reasoning tokens are therefore predominantly English, even if the final answer may contain some Chinese.

Inference‑time scaling experiments did not show improvements; possible causes include insufficient CPT/SFT for long outputs, suboptimal RL rewards, context‑driven backtrack errors, and overly aggressive logits modifications.

References

@misc{ji2024steiner,
    title = {A Small Step Towards Reproducing OpenAI o1: Progress Report on the Steiner Open Source Models},
    url = {https://medium.com/@peakji/b9a756a00855},
    author = {Yichao Ji},
    month = {October},
    year = {2024}
}