How OpenAI’s o1 Uses Self‑Play RL to Achieve Breakthrough Reasoning
This article provides an in‑depth technical analysis of OpenAI’s new multimodal model o1, explaining its self‑play reinforcement‑learning pipeline, novel train‑time and test‑time scaling laws, inference‑time thinking process, and possible architectural variants, while also discussing broader implications for large‑language‑model research.
OpenAI recently released a preview of a new multimodal model called o1, which differs from the GPT‑4 series by employing a self‑play reinforcement‑learning (RL) pipeline. The model achieves strong performance on mathematical reasoning benchmarks and introduces two new scaling laws: one for train‑time compute and another for test‑time compute.
Key Characteristics of o1
o1 is a multimodal model (the "o" in its name stands for "omni"). Its official name emphasizes that it follows a distinct technical path from the GPT‑4 family. The model combines a generator and a verifier that interact through long‑thinking inference, allowing the system to propose hypotheses, test them, and refine answers without human supervision.
As OpenAI reports: "We have found that the performance of o1 consistently improves with more reinforcement learning (train‑time compute) and with more time spent thinking (test‑time compute)."
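OpenAI's published accuracy plots use a logarithmic compute axis and look roughly linear on it, so one schematic way to state the two laws is shown below; the log‑linear form and the constants a, b are an illustrative reading of those plots, not a functional form OpenAI has disclosed:

pass@1 ≈ a_train + b_train · log(C_train)        pass@1 ≈ a_test + b_test · log(C_test)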
The reasoning ability emerges from a prolonged inference phase in which the model iteratively explores possible solutions, similar to chain‑of‑thought (CoT) prompting but with much deeper search. This long‑thinking phase is hidden from the user in the public ChatGPT client, though its length is reflected in the reasoning‑token count reported by the API; in the cipher example below it spans roughly 2,950 tokens.
Decoding Example (Strawberry Cipher)
The article demonstrates o1's reasoning by solving a cipher whose hidden message states how many "r" letters appear in the word "strawberry". The worked example shows that each plaintext letter is encoded by a pair of ciphertext letters whose average alphabet position yields the target letter (e.g., (o=15 + y=25)/2 = 20 → T). Applying this rule to the full ciphertext decodes the message "THERE ARE THREE R'S IN STRAWBERRY".
oyfjdnisdr rtqwainr acxz mynzbhhx -> Think step by step
THERE ARE THREE R'S IN STRAWBERRY
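A minimal Python sketch of the decoding rule described above; the pairing‑and‑averaging scheme comes from the article's worked example, while the helper names are mine:

```python
# Decode a ciphertext in which each pair of letters averages to the alphabet
# position of one plaintext letter, e.g. (o=15 + y=25) / 2 = 20 -> T.

def letter_pos(c: str) -> int:
    """1-based alphabet position of a letter."""
    return ord(c.lower()) - ord('a') + 1

def decode_word(word: str) -> str:
    # Take ciphertext letters two at a time and average their positions.
    assert len(word) % 2 == 0, "each plaintext letter maps to a letter pair"
    out = []
    for a, b in zip(word[::2], word[1::2]):
        avg = (letter_pos(a) + letter_pos(b)) / 2
        out.append(chr(ord('A') + int(avg) - 1))
    return "".join(out)

def decode_message(ciphertext: str) -> str:
    return " ".join(decode_word(w) for w in ciphertext.split())

print(decode_word("oy"))                                    # T  (the pair from the example)
print(decode_message("oyfjdnisdr rtqwainr acxz mynzbhhx"))  # THINK STEP BY STEP
```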
Self‑Play RL in Large Language Models
The article reviews the evolution of learning strategies for LLMs. After the success of RLHF, researchers are revisiting self‑play, which was pioneered in game AI (e.g., AlphaGo). Self‑play promises higher absolute performance and the ability to generate its own training data, potentially overcoming the limits of behavior‑cloning and supervised fine‑tuning.
Behavior cloning : mimics human expert data, limited by the quality and bias of that data.
RLHF : aligns the model with human preferences, but is costly to label and prone to reward hacking.
Self‑play : can surpass human experts, but requires strong generators and verifiers and incurs high compute costs.
For self‑play to be effective, both the generator and verifier must be sufficiently strong. In games, the generator produces actions while the verifier evaluates them; in language, the verifier must judge the quality of generated text, which is a harder problem.
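As a rough illustration of that loop for a language task, here is a generic sketch, not OpenAI's pipeline; `generate`, `verify`, and `update_policy` are hypothetical stand‑ins for a sampler, a verifier/reward model, and an RL update such as PPO:

```python
# One self-play round: the generator proposes candidate solutions, the verifier
# scores them, and the scores become the reward signal for the RL update.
# All three callables are hypothetical placeholders.

def self_play_round(problem, generate, verify, update_policy, n_candidates=8):
    # 1. Generator proposes several reasoning traces / answers for the problem.
    candidates = [generate(problem) for _ in range(n_candidates)]

    # 2. Verifier scores each candidate (scalar reward, no human labels needed).
    rewards = [verify(problem, c) for c in candidates]

    # 3. The (candidate, reward) pairs drive a policy-gradient style update,
    #    so the generator's own outputs become its training data.
    update_policy(problem, candidates, rewards)

    # Return the best candidate as this round's self-generated "label".
    return max(zip(candidates, rewards), key=lambda x: x[1])
```

The loop only works as well as `verify` does, which is exactly the point made above: a weak verifier gives the generator a noisy or exploitable reward signal.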
Scaling of Reward Models
Recent work shows that generative reward models (RMs), which output a natural‑language critique alongside a scalar score, scale better than traditional discriminative RMs. This richer feedback improves policy learning and makes negative examples usable, boosting data‑utilisation efficiency by up to eight times compared with training on positive examples alone.
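To make the distinction concrete, the sketch below contrasts the two interfaces; the function names, prompt wording, and "Rating: x/10" parsing convention are illustrative assumptions, not drawn from any specific paper:

```python
import re
from typing import Callable

def discriminative_rm(score_head: Callable[[str, str], float],
                      prompt: str, response: str) -> float:
    # Maps (prompt, response) directly to a scalar; no explanation is produced.
    return score_head(prompt, response)

def generative_rm(llm: Callable[[str], str],
                  prompt: str, response: str) -> tuple[str, float]:
    # First writes a natural-language critique, then extracts a scalar from it.
    # The critique gives the policy richer feedback than a bare number,
    # including usable signal from negative examples.
    critique = llm(f"Critique this answer step by step, then give 'Rating: x/10'.\n"
                   f"Question: {prompt}\nAnswer: {response}")
    match = re.search(r"Rating:\s*(\d+(?:\.\d+)?)/10", critique)
    score = float(match.group(1)) / 10 if match else 0.0
    return critique, score
```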
Test‑Time Inference Scaling
Two main test‑time scaling strategies are discussed:
Best‑of‑N (BoN) search : generate multiple candidates in parallel and select the highest‑scoring one according to the RM. This expands width but remains shallow in depth.
Depth‑wise iterative refinement : akin to guided search or Monte‑Carlo Tree Search (MCTS), where the model repeatedly refines its answer using the verifier’s feedback, allowing both width and depth expansion.
Increasing the inference compute budget (e.g., allowing more thinking steps) consistently improves accuracy, suggesting that o1's test‑time scaling likely follows the latter, more depth‑oriented approach; both strategies are sketched below.
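A compact sketch of the two strategies, again using hypothetical callables (`generate` and `refine` are sampling calls, and `verify` is assumed to return a (score, feedback) pair); it is an illustration of the general idea, not o1's actual search procedure:

```python
# Width: Best-of-N -- sample N candidates in parallel and keep the highest-scoring one.
def best_of_n(problem, generate, verify, n=16):
    candidates = [generate(problem) for _ in range(n)]
    return max(candidates, key=lambda c: verify(problem, c)[0])

# Depth: iterative refinement -- keep revising one answer with the verifier's
# feedback until the budget is spent or the verifier is satisfied.
def iterative_refinement(problem, generate, verify, refine, steps=16, threshold=0.9):
    answer = generate(problem)
    for _ in range(steps):
        score, feedback = verify(problem, answer)
        if score >= threshold:
            break
        answer = refine(problem, answer, feedback)  # re-generate conditioned on the critique
    return answer
```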
Possible Architectural Variants
The article sketches two plausible RL pipelines for o1:
Actor‑Critic with Separate Generator and Verifier : two models interact in a self‑play loop, with a reward model providing scalar feedback. This yields high performance but requires deploying both models.
Unified Model with Integrated Verification : start with a generator, add step‑wise verification capability, and eventually merge generator and verifier into a single model, reducing deployment complexity.
Both designs rely on actor‑critic updates and TD‑error credit assignment, and may incorporate curriculum learning to gradually increase task difficulty; a minimal form of that update is sketched below.
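For readers unfamiliar with the RL machinery named above, here is the generic textbook actor‑critic update with TD‑error credit assignment, written in PyTorch; the `policy` and `critic` interfaces are placeholders, and this is not o1's training code:

```python
import torch

def actor_critic_step(policy, critic, optimizer, states, actions, rewards, gamma=0.99):
    """One actor-critic update on a single trajectory, using the TD error as advantage.

    `policy.log_prob(states, actions)` and `critic(states)` are assumed interfaces:
    the former returns log pi(a_t | s_t), the latter returns state values V(s_t).
    """
    values = critic(states)                                      # V(s_t), shape [T]
    next_values = torch.cat([values[1:], values.new_zeros(1)])   # V(s_{t+1}), 0 at episode end

    # TD error assigns per-step credit: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
    td_error = rewards + gamma * next_values.detach() - values

    log_probs = policy.log_prob(states, actions)
    actor_loss = -(td_error.detach() * log_probs).mean()         # policy gradient with TD advantage
    critic_loss = td_error.pow(2).mean()                         # value regression toward the TD target

    optimizer.zero_grad()
    (actor_loss + critic_loss).backward()
    optimizer.step()
```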
Implications and Outlook
o1 demonstrates that post‑training RL (self‑play) can break the pre‑training scaling ceiling, achieving strong reasoning without massive supervised data. However, the approach demands substantial compute (estimated ~100× rollout cost) and sophisticated inference infrastructure (large KV‑cache, long‑thinking management). Future LLMs are likely to adopt similar self‑play pipelines, with research focusing on improving verifier strength, reducing compute overhead, and extending the technique beyond math to broader domains.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
