How Does OpenAI’s o1 Model Leverage Self‑Play RL and New Scaling Laws?
The article provides an in‑depth technical analysis of OpenAI’s multimodal o1 model, explaining its self‑play reinforcement‑learning pipeline, the novel train‑time and test‑time compute scaling laws, its long‑think reasoning abilities demonstrated through a cipher example, and speculative architectures for generator‑verifier systems.
OpenAI o1 Model Overview
OpenAI released o1, a multimodal large language model trained with a self‑play reinforcement‑learning (RL) pipeline. Unlike previous GPT‑series models, o1 does not rely on human‑feedback fine‑tuning; it improves through RL at training time (policy learning) and extended "thinking" at inference time (a long deliberation phase before answering). The model introduces two new scaling laws: one for train‑time compute and one for test‑time compute.
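As a rough intuition (not a measurement from the source), both axes are often described as roughly log‑linear: performance improves with the logarithm of compute, whether that compute goes into training or into test‑time thinking. A toy Python sketch with entirely made‑up coefficients, only to illustrate the shape of the claim:

```python
import math

def toy_accuracy(train_flops: float, think_tokens: int,
                 base: float = 0.2, a: float = 0.02, b: float = 0.04) -> float:
    """Toy log-linear model of the two scaling axes.

    The coefficients are illustrative placeholders, not o1 measurements.
    """
    acc = base + a * math.log10(train_flops) + b * math.log10(think_tokens)
    return min(acc, 1.0)

# Holding train-time compute fixed, more test-time "thinking" tokens still help:
for tokens in (10, 100, 1_000, 10_000):
    print(f"{tokens:>6} thinking tokens -> toy accuracy {toy_accuracy(1e24, tokens):.2f}")
```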
Reasoning Example
o1 demonstrates step‑by‑step logical deduction on a cipher‑decoding puzzle (the “strawberry” problem). By pairing ciphertext letters, averaging their alphabetic positions, and verifying each hypothesis, the model produces the answer “THERE ARE THREE R'S IN STRAWBERRY”.
Speculated Training Architecture
A plausible architecture consists of a Generator and a Verifier that interact in a self‑play loop, updated by an actor‑critic algorithm with TD‑error. The Reward Model (RM) can be a scalar evaluator or a generative RM that returns natural‑language feedback, enabling richer supervision.
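A heavily simplified Python sketch of what such a loop could look like. Every component name and method below is a placeholder invented for illustration; the actual o1 training pipeline is not public.

```python
# Hypothetical generator-verifier self-play step with an actor-critic (TD-error) update.
# Generator, Verifier, and Critic are assumed to expose the methods used below.

def self_play_step(generator, verifier, critic, prompt, gamma=0.99):
    # 1. The Generator (the policy) rolls out a chain of reasoning steps.
    steps = generator.generate_steps(prompt)

    # 2. The Verifier / reward model scores each prefix of the trajectory.
    #    A scalar RM returns a number; a generative RM would return natural-language
    #    feedback that is then mapped to a score.
    rewards = [verifier.score(prompt, steps[: t + 1]) for t in range(len(steps))]

    # 3. The Critic estimates state values; the TD error is
    #    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).
    values = [critic.value(prompt, steps[: t + 1]) for t in range(len(steps))] + [0.0]
    td_errors = [rewards[t] + gamma * values[t + 1] - values[t]
                 for t in range(len(steps))]

    # 4. Policy-gradient update on the Generator, value-regression update on the Critic.
    generator.update(prompt, steps, advantages=td_errors)
    critic.update(prompt, steps,
                  targets=[r + gamma * v for r, v in zip(rewards, values[1:])])
    return steps, rewards, td_errors
```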
System Designs
Dual‑agent system: Separate Generator, Verifier, and RM models. Provides strong adversarial training but requires three models and higher training cost.
Unified model: A single model that incorporates step‑wise verification internally, reducing deployment complexity while preserving self‑play dynamics.
Both designs aim to keep the Verifier as capable as the Generator to mitigate reward hacking.
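One hypothetical way to express the two designs as interfaces; all names are invented for illustration and do not come from the source:

```python
from typing import Protocol

class DualAgentSystem(Protocol):
    """Separate Generator, Verifier, and RM: stronger adversarial pressure, three models to train."""
    def generate(self, prompt: str) -> list[str]: ...           # Generator: propose reasoning steps
    def verify_step(self, prompt: str, step: str) -> bool: ...  # Verifier: check a single step
    def reward(self, prompt: str, answer: str) -> float: ...    # RM: score the final answer

class UnifiedModel(Protocol):
    """One model that interleaves reasoning and step-wise self-verification internally."""
    def think_and_answer(self, prompt: str) -> tuple[list[str], str]:
        """Return (internally verified reasoning steps, final answer)."""
        ...
```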
Test‑time Scaling Strategies
Scaling at inference time can be achieved with several techniques that trade off width (parallel candidates) and depth (iterative reasoning):
Best‑of‑N (BoN) search: Generate N candidates in parallel and select the one with the highest reward score (see the sketch after this list).
Chain‑of‑thought (CoT) prompting: Guide the model through a sequence of reasoning steps.
Reflection tuning: Iteratively refine the answer based on model‑generated feedback.
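A minimal Best‑of‑N sketch, assuming a `model.generate` and a `reward_model.score` call with the signatures shown (both are placeholders):

```python
def best_of_n(model, reward_model, prompt: str, n: int = 16) -> str:
    """Width-wise test-time scaling: sample N candidates, keep the best-scored one."""
    candidates = [model.generate(prompt) for _ in range(n)]  # parallelizable
    return max(candidates, key=lambda ans: reward_model.score(prompt, ans))
```

Depth‑wise strategies such as reflection tuning would instead feed the scored answer back into the model for another round of refinement.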
Learning‑Strategy Comparison
Behavior cloning (expert imitation) learning: Produces human‑like behavior, can be trained with a single agent, and achieves perfect performance in the limit of infinite data; however, it inherits data bias, cannot explore beyond human behavior, and cannot leverage erroneous data.
RLHF: Aligns with human preferences, utilizes erroneous data, and is data‑efficient; but it is costly to train, difficult to model preferences, and prone to reward hacking.
Self‑play RL: Offers higher absolute strength, can surpass top humans, and achieves optimal zero‑sum strategies; the downside is extremely high training and inference cost and occasional difficulty in understanding humans.
Cipher‑decoding Code Example
The example pair given to the model is "oyfjdnisdr rtqwainr acxz mynzbhhx -> Think step by step". The decoding method takes each pair of ciphertext letters, averages their numeric values (A=1, B=2, …), and maps the result back to a letter; applied to the puzzle ciphertext, it yields the phrase "THERE ARE THREE R'S IN STRAWBERRY".
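A minimal Python sketch of that decoding rule, assuming non‑overlapping letter pairs whose position averages are whole numbers; it reproduces the example mapping above:

```python
def decode(ciphertext: str) -> str:
    """Average the alphabetic positions (A=1 ... Z=26) of each letter pair."""
    words = []
    for word in ciphertext.lower().split():
        letters = []
        for i in range(0, len(word), 2):            # non-overlapping pairs
            a, b = ord(word[i]) - 96, ord(word[i + 1]) - 96
            letters.append(chr((a + b) // 2 + 64))  # map back to an uppercase letter
        words.append("".join(letters))
    return " ".join(words)

print(decode("oyfjdnisdr rtqwainr acxz mynzbhhx"))  # THINK STEP BY STEP
```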
References
https://www.youtube.com/watch?v=06VsbwJkrIo
https://arxiv.org/pdf/2408.15240
https://arxiv.org/pdf/2406.14532
https://arxiv.org/pdf/2408.03314v1
https://openai.com/index/finding-gpt4s-mistakes-with-gpt-4/