Reproducing OpenAI o1: Steiner Model’s Reasoning, Training, and Evaluation
This report details the design, data synthesis, three‑stage training pipeline, and benchmark evaluation of the open‑source Steiner reasoning model, which aims to emulate OpenAI o1’s inference‑time scaling while highlighting current performance gaps and future research challenges.
TL;DR
Steiner is a reasoning model that explores multiple inference paths autoregressively and can backtrack and verify when needed. It is trained in three steps: DAG-based data generation, sampling of backtracking paths, and reinforcement learning with heuristic rewards. It gains +5.56 on GPQA-Diamond but does not yet reproduce o1's inference-time scaling.
Introduction
Steiner is a series of reinforcement‑learning‑trained reasoning models that generate and verify multiple reasoning paths within a single context, effectively performing a linear traversal of an implicit search tree.
Background
OpenAI o1 introduced reasoning tokens that enable inference‑time scaling, improving performance by allocating more compute during inference. Unlike tree‑search or agentic frameworks, o1 appears to be a single model that still performs a form of search, prompting curiosity about its underlying mechanisms.
Method
Data synthesis faced one central challenge: most existing reasoning datasets contain only forward CoT steps ("shortcuts" that lead straight to the answer) and no genuine backtracking nodes. To address this, two augmentation strategies were employed:
Randomly truncate shortcut datasets, hide the correct answer, and let a strong LLM continue reasoning from the prefix, then provide the answer to create backtracking samples.
Cluster the generated steps, assign unique IDs, construct a directed acyclic graph (DAG) for each question, and randomly sample from the DAG to obtain many reasoning‑path examples.
From these processes, 10K DAGs were built and 50K backtracking-enabled reasoning paths were sampled. Samples average about 1,600 reasoning tokens each and were filtered to at most 4,096 reasoning tokens and at most 8,192 total tokens.
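The DAG construction and path sampling described above can be sketched roughly as follows. This is a minimal illustration, not the author's pipeline: it assumes each generated trajectory has already been reduced to a list of step-cluster IDs, and the function names `build_dag` and `sample_path`, the node labels, and the `max_events` cap are all hypothetical.

```python
import random
from collections import defaultdict

def build_dag(trajectories):
    """Merge trajectories (lists of step-cluster IDs) into one adjacency map."""
    edges = defaultdict(set)
    for steps in trajectories:
        for src, dst in zip(steps, steps[1:]):
            edges[src].add(dst)
    return edges

def sample_path(edges, start, goal, rng, max_events=50):
    """Random depth-first walk from start to goal that records backtracks.

    Returns a list of ("visit", node) / ("backtrack", node) events,
    or None if the goal is not reached within max_events.
    """
    trace = [("visit", start)]
    if start == goal:
        return trace
    stack = [start]
    visited = {start}
    while stack and len(trace) < max_events:
        node = stack[-1]
        choices = sorted(n for n in edges.get(node, ()) if n not in visited)
        if choices:
            nxt = rng.choice(choices)
            visited.add(nxt)
            stack.append(nxt)
            trace.append(("visit", nxt))
            if nxt == goal:
                return trace
        else:
            # Dead end: return to the parent step and try another branch.
            stack.pop()
            if stack:
                trace.append(("backtrack", stack[-1]))
    return None

# Hypothetical example: three trajectories for one question, one a dead end.
trajectories = [
    ["Q", "s1", "s2", "ANS"],
    ["Q", "s1", "s3"],        # truncated attempt that never reaches the answer
    ["Q", "s4", "ANS"],
]
dag = build_dag(trajectories)
for event in sample_path(dag, "Q", "ANS", random.Random(0)):
    print(event)
```

Sampling with different seeds from the same DAG yields different linear traces, some with a detour through the dead-end step and an explicit backtrack, which is the kind of variety the sampled reasoning paths are meant to provide.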
Training proceeded in three stages:
Continual Pre-Training (CPT): Mixed regular text and reasoning paths to teach the model long-range reasoning and embed 14 special tokens.
Supervised Fine-Tuning (SFT): Used a chat template to teach the model to name each step, output a full thought, summarize it, reflect, and decide whether to proceed, backtrack, or finish.
Reinforcement Learning with Step-Level Reward (RL): Designed heuristic rewards based on each DAG node's in-degree, out-degree, distance from the question, and distance to the answer, encouraging balanced exploration depth and breadth.
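To make the step-level reward idea more concrete, the sketch below computes the four graph features named above for every node and combines them into a scalar. The feature extraction follows directly from the description, but the combining formula, the weights `w_progress` and `w_hub`, and the `dead_end_penalty` are purely illustrative assumptions; the source does not specify the actual reward function.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest hop count from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def node_features(edges, question, answer):
    """In-degree, out-degree, and distances for each node of the DAG."""
    nodes = set(edges) | {n for outs in edges.values() for n in outs}
    reverse = {n: set() for n in nodes}
    for src, outs in edges.items():
        for dst in outs:
            reverse[dst].add(src)
    from_q = bfs_distances(edges, question)
    to_a = bfs_distances(reverse, answer)  # search backwards from the answer
    return {
        n: {
            "in_degree": len(reverse[n]),
            "out_degree": len(edges.get(n, ())),
            "dist_from_question": from_q.get(n),  # None if unreachable
            "dist_to_answer": to_a.get(n),        # None if a dead end
        }
        for n in nodes
    }

def step_reward(f, w_progress=1.0, w_hub=0.1, dead_end_penalty=-1.0):
    """Hypothetical heuristic: reward closeness to the answer and connectivity."""
    if f["dist_to_answer"] is None:
        return dead_end_penalty
    progress = w_progress / (1 + f["dist_to_answer"])
    hub = w_hub * (f["in_degree"] + f["out_degree"])
    return progress + hub

# Hypothetical DAG for one question; "s3" is a dead end.
edges = {"Q": {"s1", "s4"}, "s1": {"s2", "s3"}, "s2": {"ANS"}, "s4": {"ANS"}}
features = node_features(edges, "Q", "ANS")
for name in sorted(features):
    print(name, features[name], round(step_reward(features[name]), 3))
```

In this toy scoring, steps that cannot reach the answer are penalized and steps closer to the answer score higher; a real reward would also need to balance depth against breadth, which is where the degree and distance-from-question terms would come into play.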
Evaluation
Performance was measured on the GPQA‑Diamond benchmark (no CoT prompting). Adding the RL stage improved scores by +3.53, and applying a specialized logits processor yielded an additional +5.56 gain.
GPQA-Diamond was chosen because o1-mini shows a large improvement on it and the benchmark's contamination level is low. However, Steiner's performance on MMLU and other datasets is comparable to baselines, likely due to the limited world knowledge of the 32B model.
Limitations
Current post‑training data lacks multi‑turn dialogue samples; the best Steiner‑preview model (based on Qwen2.5‑32B) does not support multi‑turn conversations.
Custom system prompts or temperature changes are discouraged, as the model was not trained for diverse prompts and may produce malformed reasoning tokens.
Training data is ~90 % English; reasoning tokens are therefore predominantly English, even if the final answer may contain some Chinese.
Inference‑time scaling experiments did not show improvements; possible causes include insufficient CPT/SFT for long outputs, suboptimal RL rewards, context‑driven backtrack errors, and overly aggressive logits modifications.
References
@misc{ji2024steiner,
title = {A Small Step Towards Reproducing OpenAI o1: Progress Report on the Steiner Open Source Models},
url = {https://medium.com/@peakji/b9a756a00855},
author = {Yichao Ji},
month = {October},
year = {2024}
}
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
