Why Reinforcement Learning Finally Works: The Second Half of AI
The article argues that AI has entered its second half, where reinforcement learning finally generalizes thanks to large‑scale language pretraining and reasoning, shifting focus from building ever better models to redefining problems, evaluation methods, and real‑world utility.
Background: First half of AI research
The early decades of artificial intelligence focused on developing new training algorithms and model architectures. Breakthroughs such as AlexNet, the Transformer, and GPT‑3 demonstrated that a single methodological advance can improve performance across many downstream tasks. Consequently, research effort was directed toward model‑centric progress, while benchmark tasks served mainly as validation tools.
Emerging solution – the “second half”
Recent work shows that reinforcement learning (RL) can achieve broad generalisation when combined with three key ingredients:
Massive language pre‑training: Large‑scale language models acquire world knowledge and common‑sense priors from internet‑scale text corpora.
Scale of data and compute: Sufficient compute enables the model to capture the statistical structure needed for downstream reasoning.
Reasoning as an action: Embedding explicit reasoning steps into the RL action space allows the agent to use language priors to plan and decide.
From an RL perspective, an agent consists of algorithm , environment , and prior knowledge . Historically, research emphasized algorithmic improvements (e.g., REINFORCE, DQN, PPO) while treating the environment and priors as fixed. OpenAI’s early platforms (Gym, Universe, World of Bits) attempted to turn the entire internet into a single RL environment, but they failed because the agents lacked strong priors. Language pre‑training supplies those priors, enabling fine‑tuning for specialised agents such as WebGPT or ChatGPT.
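The algorithm/environment/prior decomposition can be sketched as a minimal interface. Everything below (`Agent`, `CountdownEnv`, `run_episode`) is illustrative, not from any RL library:

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Agent:
    # The two agent-side ingredients named above; the environment is external.
    prior: Callable[[str], str]    # prior knowledge, e.g. a pretrained LM
    policy: Callable[[Any], Any]   # the algorithm: maps observation -> action

class CountdownEnv:
    """Toy environment: reach zero by decrementing a counter."""
    def reset(self):
        self.state = 3
        return self.state

    def step(self, action):
        self.state -= action
        done = self.state <= 0
        reward = 1.0 if done else 0.0
        return self.state, reward, done

def run_episode(agent, env, max_steps=10):
    """Roll out one episode; the environment supplies observations and rewards."""
    obs, total = env.reset(), 0.0
    for _ in range(max_steps):
        obs, reward, done = env.step(agent.policy(obs))
        total += reward
        if done:
            break
    return total
```

Historically, research iterated on `policy` (the algorithm) while `prior` stayed nearly empty; the article's claim is that filling in `prior` with a pretrained language model is what made generalisation possible.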
Why prior knowledge matters
Experiments such as CALM (a language‑model‑based agent for text‑based games) required millions of RL steps and did not transfer to new games, highlighting the gap between human zero‑shot generalisation and current agents. The missing component was a rich prior: a pretrained language model that encodes general knowledge. When this prior is combined with an environment that treats reasoning as an actionable step, agents can generalise across tasks that were previously intractable.
Reasoning as an action
Reasoning does not directly modify the external world, yet it expands the action space dramatically. The ReAct framework demonstrates how to interleave think (reasoning) and act (environment interaction) steps:
```python
# ReAct-style loop (sketch): alternate a reasoning step with an acting step.
observation, done = env.reset(), False
while not done:
    # The thought is free-form text; it changes no external state,
    # but it conditions the next action on the model's language priors.
    thought = model.generate(f"Observation: {observation}\nThink: ")
    action = model.generate(f"Thought: {thought}\nAct: ")
    observation, reward, done = env.step(action)
```

By treating the generated thought as part of the agent’s policy, the model can leverage its language‑model priors to select more informative actions, effectively turning abstract reasoning into concrete behaviour.
Limitations of current evaluation practices
Most existing benchmarks rely on two implicit assumptions:
Evaluations run automatically without human interaction.
Tasks are independent and identically distributed (i.i.d.).
Real‑world utility often requires continuous human‑in‑the‑loop interaction, long‑term memory, and sequential task dependencies. For example, a customer‑service chatbot must maintain context across multiple turns and adapt to user feedback, which static, i.i.d. test suites cannot capture.
Proposed direction for next‑generation evaluation
To measure genuine utility, new evaluation settings should:
Incorporate real humans (e.g., Chatbot Arena) or high‑fidelity user simulators (e.g., tau‑bench) to capture interactive dynamics.
Require agents to retain and retrieve information over long horizons, testing long‑term memory and continual learning.
Present non‑i.i.d. task sequences that reflect realistic workflows, such as progressive software‑engineering problems where familiarity with a codebase improves performance.
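One way to operationalise the third point is an evaluation harness that presents tasks in a fixed order and lets the agent carry memory between them, so that score depends on sequence, not just on task identity. The harness and the toy agent below are illustrative sketches, not an existing benchmark:

```python
def evaluate_sequential(agent, tasks):
    """Score an agent on a non-i.i.d. task sequence.

    Unlike an i.i.d. benchmark, memory persists across tasks, so
    familiarity gained on earlier tasks can improve later scores.
    """
    memory = []   # carried across tasks; i.i.d. suites reset this every time
    scores = []
    for task in tasks:
        answer = agent(task, memory)
        memory.append((task, answer))
        scores.append(1.0 if answer == task["expected"] else 0.0)
    return scores

def toy_agent(task, memory):
    # Succeeds on a task only if its prerequisite task was seen earlier,
    # mimicking "familiarity with a codebase improves performance".
    seen = {t["id"] for t, _ in memory}
    if task.get("requires") and task["requires"] not in seen:
        return None
    return task["expected"]
```

Running the same two tasks in different orders yields different scores, which is exactly the property a static i.i.d. test suite cannot measure.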
These benchmarks force researchers to develop components beyond the core RL algorithm—e.g., memory modules, curriculum‑aware policies, or novel ways of integrating language priors—because the “generic solution” (large‑scale language‑model + reasoning‑as‑action) alone will no longer dominate performance.
Implications for AI research
The shift from the first half (model‑centric breakthroughs) to the second half (problem definition and utility‑driven evaluation) changes the research agenda:
Instead of iterating on marginal algorithmic tweaks, researchers should ask what problems are worth solving and how to measure real impact.
Designing realistic environments and priors becomes as important as algorithmic innovation.
Success will be measured by the ability to build products that generate economic value, not merely by incremental benchmark scores.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.