Can Scaling Reinforcement Learning Turn AI Models into Real Thinkers? Insights from Dan Roberts' AI Ascent Talk
In a recent AI Ascent presentation, OpenAI researcher Dan Roberts explained how scaling laws for both pre‑training and reinforcement learning reveal a new test‑time dimension of model performance. He showcased the capabilities of the o1 and o3 models and outlined a massive compute‑scaling strategy aimed at creating AI systems that can reason for years, as Einstein did.
Scaling Laws for Training and Test‑Time Compute
Recent experiments show that model performance improves not only with increased training compute (training time or FLOPs) but also with additional compute allocated at inference time. This introduces a new scaling dimension, "thinking time": longer test‑time computation yields higher accuracy on mathematical reasoning benchmarks.
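One way to picture such a trend is as a straight line when accuracy is plotted against the logarithm of compute. The sketch below fits that line with ordinary least squares. The function name `fit_log_linear` is hypothetical, and the data points are made-up placeholders chosen only to show the mechanics; they are not measurements from the talk or from any OpenAI model.

```python
import math

def fit_log_linear(compute, accuracy):
    """Least-squares fit of accuracy = intercept + slope * log10(compute)."""
    xs = [math.log10(c) for c in compute]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(accuracy) / n
    # Standard closed-form solution for simple linear regression.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracy))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Placeholder data (NOT real benchmark results): relative test-time
# compute and the accuracy hypothetically observed at each level.
compute = [1, 10, 100, 1000]
accuracy = [0.20, 0.35, 0.50, 0.65]

slope, intercept = fit_log_linear(compute, accuracy)
print(f"accuracy gain per 10x compute: {slope:.2f}")
```

Under this reading, the slope is the accuracy gained for each tenfold increase in thinking time, which is the quantity a scaling study would try to estimate from real runs.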
Model Milestones: o1 and o3
OpenAI released the o1 model (Sept 2024), the first to demonstrate the test‑time scaling effect. The subsequent o3 model extends this capability: when given a sketch of a quantum electrodynamics (QED) problem, o3 solves textbook‑level physics calculations in roughly one minute, outperforming earlier models such as GPT‑4.5.
Einstein‑1907 Thought Experiment
To illustrate the new "thinking" ability, a prompt was constructed that mimics a 1907‑era Einstein exam question on general relativity. The model variant named Einstein‑v1907‑super‑hacks produced the correct answer within a minute, whereas GPT‑4.5 failed.
Reinforcement‑Learning (RL) as the Primary Driver
Analysis of training pipelines indicates that RL compute is the dominant factor behind recent performance gains. Earlier systems (e.g., GPT‑4o) relied almost entirely on pre‑training, while o1 and o3 incorporate a substantial RL component. The projection is that RL compute will eventually dominate training, with the "RL cherry" growing larger than the pre‑training "cake."
Scaling Strategy and Projections
OpenAI plans to expand compute infrastructure dramatically (e.g., multi‑billion‑dollar investments, new data‑center campuses in Texas).
Scaling‑science research will focus on quantifying how both training and test‑time compute affect performance.
Empirical trends suggest that the length of tasks AI agents can handle doubles approximately every seven months.
If the trend continues, the task horizon would grow from hours today to roughly eight years of continuous "thinking" by about 2034, a timescale comparable to Einstein’s development of general relativity.
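The 2034 figure can be sanity‑checked with a few lines of arithmetic. The sketch below assumes a starting task length of about one hour and a 2025 baseline year; neither number is stated in this summary, so treat both as illustrative inputs. The function names are hypothetical.

```python
import math

HOURS_PER_YEAR = 24 * 365

def doublings_needed(start_hours, target_hours):
    """Number of doublings to grow from start_hours to target_hours."""
    return math.log2(target_hours / start_hours)

def years_to_reach(start_hours, target_hours, doubling_months=7.0):
    """Calendar years required if task length doubles every doubling_months."""
    return doublings_needed(start_hours, target_hours) * doubling_months / 12.0

# Assumptions (not from the article): ~1-hour tasks today, baseline year 2025.
start_hours = 1.0
baseline_year = 2025
target_hours = 8 * HOURS_PER_YEAR  # eight years of continuous "thinking"

n = doublings_needed(start_hours, target_hours)
years = years_to_reach(start_hours, target_hours)
print(f"doublings: {n:.1f}, years: {years:.1f}, "
      f"projected year: {baseline_year + years:.0f}")
```

With these inputs the calculation gives about 16 doublings and a little over nine years, which is consistent with the 2034 estimate. The result is sensitive to the assumed starting point: each extra doubling in the gap adds another seven months.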
Implications
The combined scaling of RL training compute and test‑time compute points toward AI systems capable of extended, scientific‑level reasoning. Continued investment in both compute resources and scaling‑law research is presented as the most promising path to models that can discover new science.
