From Reasoning to Agentic Thinking: How Harnesses Are Redefining AI Development
The article examines the shift from traditional reasoning‑based large‑language‑model pipelines to agentic, harness‑driven AI systems, outlining the definition of a harness, its engineering challenges, architectural components, and the broader implications for training, reinforcement learning, and future research directions.
After leaving his previous role, Lin Junyang published a comprehensive essay titled “From ‘Reasoning’ Thinking to ‘Agentic’ Thinking,” which argues that the next paradigm shift in AI moves from pure reasoning models to a combined Model+Harness approach. In this new paradigm, the harness acts as the operating system for agents, determining whether a model can sustain progress through closed‑loop interaction with the real world.
The harness is a complete toolchain and execution environment surrounding an agent, comprising tool servers, browsers, terminals, search engines, simulators, sandboxed execution, API layers, memory systems, and orchestration frameworks. It is no longer a static validator but an integral part of the training system.
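To make this composition concrete, here is a minimal Python sketch of a harness as an execution layer wrapped around the agent. All names (Tool, Harness, step) and the toy tools are illustrative assumptions, not an API from the essay:

```python
# Hypothetical sketch: a harness as the execution layer around an agent.
# All names here (Tool, Harness, step, ...) are illustrative assumptions.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Tool:
    """One capability the harness exposes: a browser, terminal, search API, ..."""
    name: str
    run: Callable[[str], str]  # takes a tool argument string, returns an observation


@dataclass
class Harness:
    """Wraps the policy with tools, memory, and sandboxed execution."""
    tools: dict[str, Tool]
    memory: list[str] = field(default_factory=list)

    def step(self, tool_name: str, arg: str) -> str:
        """Execute one tool call and record the observation in memory."""
        observation = self.tools[tool_name].run(arg)
        self.memory.append(f"{tool_name}({arg!r}) -> {observation!r}")
        return observation


# Usage: register two toy tools and run one closed-loop step.
harness = Harness(tools={
    "search": Tool("search", lambda q: f"results for {q}"),
    "terminal": Tool("terminal", lambda cmd: f"ran: {cmd}"),
})
print(harness.step("search", "agentic RL harness"))
```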
Key Elements of the Harness
1. Definition and Composition: The harness wraps the agent with a full suite of tools and execution environments, turning it into an organic component of the training system rather than a mere post‑hoc validator.
2. Paradigm Shift: In agent‑centric reinforcement learning (RL), the policy is embedded within a larger harness. Training and inference must be cleanly decoupled; otherwise, tool latency and environment feedback delays cause rollout throughput to collapse and GPU utilization to plummet (a minimal sketch of this decoupling follows the list).
3. Core Competitive Barriers: These stem from the transition from “training models” to “training systems.” The decisive factors are:
Better environment design (stability, realism, coverage, anti‑cheating).
Stronger harness engineering capabilities.
Tighter train‑serve integration.
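Item 2's decoupling requirement can be illustrated with a toy producer/consumer pattern: rollout workers absorb tool latency while the learner only consumes finished trajectories. The queues and threads below stand in for a real distributed rollout service; all names and timings are assumptions:

```python
# Hypothetical sketch of train/serve decoupling: rollouts run asynchronously so
# slow tool calls never stall the learner. Names and timings are illustrative.
import queue
import threading
import time


def rollout_worker(task_q: queue.Queue, traj_q: queue.Queue) -> None:
    """Inference-side worker: interacts with the environment, tolerates tool latency."""
    while True:
        task = task_q.get()
        if task is None:  # shutdown sentinel
            break
        time.sleep(0.1)  # stands in for tool / environment latency
        traj_q.put({"task": task, "reward": 1.0})  # completed trajectory


def learner(traj_q: queue.Queue, num_expected: int) -> None:
    """Training-side loop: consumes finished trajectories, never blocks on tools."""
    for _ in range(num_expected):
        traj = traj_q.get()
        print("update policy on", traj)


task_q, traj_q = queue.Queue(), queue.Queue()
workers = [threading.Thread(target=rollout_worker, args=(task_q, traj_q)) for _ in range(4)]
for w in workers:
    w.start()
for t in range(8):
    task_q.put(f"task-{t}")
learner(traj_q, num_expected=8)
for _ in workers:
    task_q.put(None)  # stop workers
for w in workers:
    w.join()
```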
Agentic Architecture
The core intelligence increasingly resides in a multi‑agent collaboration structure within the harness (a code sketch follows the list):
Orchestrator: Plans and routes work.
Specialist Agents: Act as domain experts.
Sub‑Agents: Execute narrow tasks, manage context, avoid contamination, and keep reasoning layers separate.
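A minimal sketch of this three‑layer split, with all class names and the routing logic invented for illustration:

```python
# Hypothetical sketch of the orchestrator / specialist / sub-agent split.
# Class names and routing logic are illustrative assumptions.


class SubAgent:
    """Executes one narrow task with its own fresh context, so reasoning stays isolated."""
    def run(self, task: str) -> str:
        return f"result({task})"


class SpecialistAgent:
    """Domain expert: decomposes its slice of the problem into sub-agent tasks."""
    def __init__(self, domain: str):
        self.domain = domain

    def handle(self, request: str) -> str:
        subtasks = [f"{self.domain}:{request}:part{i}" for i in range(2)]
        # Each sub-agent gets its own context; no cross-task contamination.
        results = [SubAgent().run(t) for t in subtasks]
        return " + ".join(results)


class Orchestrator:
    """Plans and routes work to specialists; holds the only global view."""
    def __init__(self, specialists: dict[str, SpecialistAgent]):
        self.specialists = specialists

    def dispatch(self, request: str, domain: str) -> str:
        return self.specialists[domain].handle(request)


orch = Orchestrator({"code": SpecialistAgent("code"), "search": SpecialistAgent("search")})
print(orch.dispatch("fix failing test", domain="code"))
```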
Background: From Reasoning to Agentic Thinking
Recent breakthroughs such as OpenAI’s o1 and DeepSeek‑R1 have reshaped how we evaluate models. Both demonstrate that “thinking” can be treated as a first‑class capability, trainable via reinforcement learning and exposed to users. As of mid‑2025, however, the prevailing focus remains “reasoning‑style thinking”: increasing inference compute, strengthening reward signals, and exposing or controlling extra reasoning effort.
The author predicts a further transition to “agentic thinking,” where models think to act, continuously updating plans based on environmental feedback.
1. What o1 and R1 Taught Us
The first wave of reasoning models showed that scaling RL for language models requires deterministic, stable, and scalable feedback signals. Domains such as mathematics, code, and logic become central because their rewards are verifiable and therefore far stronger than generic preference supervision. Infrastructure (large‑scale rollouts, high‑throughput validation, stable policy updates, and efficient sampling) becomes critical.
Once models are trained to perform long‑horizon reasoning, RL ceases to be a lightweight add‑on and becomes a full system problem.
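A toy example of why these domains scale RL well: the reward is a deterministic check rather than a learned preference score. The function names and the exec-based test runner are illustrative; a real harness would sandbox execution:

```python
# Hypothetical sketch of a deterministic, verifiable reward: unlike preference
# scores, a math or code answer can be checked exactly. Names are illustrative.

def math_reward(model_answer: str, ground_truth: str) -> float:
    """Binary reward: 1.0 iff the normalized answers match exactly."""
    return 1.0 if model_answer.strip() == ground_truth.strip() else 0.0


def code_reward(source: str, test_cases: list[tuple[int, int]]) -> float:
    """Fraction of unit tests passed; exec() stands in for a real sandbox."""
    namespace: dict = {}
    exec(source, namespace)  # a production harness would sandbox this
    f = namespace["solution"]
    passed = sum(1 for x, want in test_cases if f(x) == want)
    return passed / len(test_cases)


print(math_reward("42", " 42 "))                       # 1.0
print(code_reward("def solution(x):\n    return x * 2",
                  [(1, 2), (3, 6), (5, 11)]))          # 2/3 of tests pass
```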
2. The Challenge Is More Than Merging Thinking and Instruction
Teams aim to unify thinking and instruction modes, allowing adjustable reasoning intensity (low/medium/high) and automatic inference of required reasoning depth. While conceptually sound, the data challenge is severe: the two modes have fundamentally different data distributions and behavioral goals.
In practice, merging often leads to mediocre performance in both directions—thinking becomes noisy or indecisive, while instruction loses crispness and becomes more costly.
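One way to picture adjustable reasoning intensity is as a router from request to thinking-token budget. The budgets, keyword heuristic, and override mechanism below are invented for illustration, not any vendor’s API:

```python
# Hypothetical sketch of adjustable reasoning intensity: a single model serves
# both modes by routing each request to a thinking-token budget. The budgets,
# keywords, and router heuristic are all illustrative assumptions.

BUDGETS = {"low": 256, "medium": 2048, "high": 16384}

HARD_HINTS = ("prove", "debug", "optimize", "multi-step")


def route(prompt: str, user_override: str | None = None) -> int:
    """Return a thinking-token budget: explicit user setting wins, else a heuristic."""
    if user_override is not None:
        return BUDGETS[user_override]
    # Crude automatic depth inference; real systems would learn this.
    level = "high" if any(h in prompt.lower() for h in HARD_HINTS) else "low"
    return BUDGETS[level]


print(route("What is the capital of France?"))          # 256: crisp instruction mode
print(route("Prove the sum of two odds is even."))      # 16384: deep thinking mode
print(route("Summarize this", user_override="medium"))  # 2048: user-controlled
```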
3. Why Anthropic’s Direction Is a Beneficial Correction
Anthropic’s Claude models emphasize integrated reasoning, user‑controlled thinking budgets, and tool‑augmented extended reasoning. Their approach treats reasoning as an integrated capability rather than a separate model, aligning the system with specific downstream tasks such as programming or agentic workflows.
4. What “Agentic Thinking” Actually Means
Agentic thinking shifts the optimization goal from producing the best answer to maintaining effective action over time. It requires the model to decide when to stop thinking and act, choose and sequence tool calls, integrate noisy observations, correct plans after failures, and maintain coherence across many tool invocations.
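A minimal sketch of such a loop, assuming a policy that emits think / tool / final actions; the action schema and the toy policy are illustrative:

```python
# Hypothetical sketch of an agentic thinking loop: the policy decides each turn
# whether to keep thinking, call a tool, or stop. Everything here is illustrative.
from typing import Callable


def agent_loop(policy: Callable[[list], dict], tools: dict, max_steps: int = 10) -> str:
    history: list = []
    for _ in range(max_steps):
        action = policy(history)  # returns {"type": ..., ...}
        if action["type"] == "think":
            history.append(("thought", action["content"]))
        elif action["type"] == "tool":
            # Integrate a possibly noisy observation and replan next turn.
            obs = tools[action["name"]](action["arg"])
            history.append(("observation", obs))
        elif action["type"] == "final":
            return action["content"]
    return "gave up: step budget exhausted"


# Toy policy: think once, look something up, then answer.
def toy_policy(history: list) -> dict:
    if not history:
        return {"type": "think", "content": "I should check the docs first."}
    if len(history) == 1:
        return {"type": "tool", "name": "search", "arg": "API usage"}
    return {"type": "final", "content": "answer based on " + history[-1][1]}


print(agent_loop(toy_policy, tools={"search": lambda q: f"docs about {q}"}))
```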
5. Why Agentic RL Infrastructure Is Harder
When the objective moves from benchmark problems to interactive tasks, the RL stack must accommodate a harness that includes tool servers, browsers, terminals, search engines, simulators, sandboxes, APIs, memory, and orchestration. Training and inference must be cleanly decoupled; otherwise rollout throughput collapses due to tool latency and partial observability.
Environment quality becomes a first‑class research product: stability, realism, coverage, difficulty, state diversity, feedback richness, and anti‑exploitation measures are now essential for scaling agentic RL.
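If environment quality is a first-class research product, it can be scored and gated like any other artifact. The fields and thresholds below are invented assumptions, meant only to show the shape of such a report:

```python
# Hypothetical sketch: treating environment quality as a measurable artifact.
# The dimensions mirror the essay's list; thresholds and fields are assumptions.
from dataclasses import dataclass


@dataclass
class EnvQualityReport:
    stability: float        # fraction of rollouts without infra failures
    realism: float          # judged fidelity to the target deployment setting
    coverage: float         # fraction of intended task space with test cases
    state_diversity: float  # distinct states visited / states reachable
    exploit_rate: float     # fraction of rollouts flagged as reward hacking

    def ready_for_scaling(self) -> bool:
        """Gate RL scaling on minimum environment quality, not just model metrics."""
        return (self.stability >= 0.99 and self.coverage >= 0.8
                and self.exploit_rate <= 0.01)


report = EnvQualityReport(stability=0.995, realism=0.9, coverage=0.85,
                          state_diversity=0.6, exploit_rate=0.02)
print(report.ready_for_scaling())  # False: exploit rate too high to scale safely
```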
6. The Next Frontier: More Usable Thinking
The author expects agentic thinking to become the dominant form of reasoning. The biggest obstacle is “reward hacking”: once models gain meaningful tool access, they may learn to cheat by searching for answers or exploiting future information. Future research bottlenecks will involve robust environment design, evaluator resilience, anti‑cheat protocols, and principled interfaces between policy and world.
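One concrete anti-cheat measure is a harness-level gate on tool calls. The blocklist, cutoff date, and heuristics below are invented for illustration:

```python
# Hypothetical sketch of one anti-cheat measure: a harness-level filter that
# blocks tool calls likely to leak answers or future information. The blocklist
# and cutoff logic are illustrative assumptions.
from datetime import date

ANSWER_LEAK_TERMS = ("solution", "answer key", "leetcode discuss")
KNOWLEDGE_CUTOFF = date(2024, 6, 1)  # data after this date counts as "future"


def allow_tool_call(tool: str, arg: str, doc_date: date | None = None) -> bool:
    """Reject searches that look up answers or touch post-cutoff information."""
    if tool == "search" and any(t in arg.lower() for t in ANSWER_LEAK_TERMS):
        return False  # likely searching for the graded answer
    if doc_date is not None and doc_date > KNOWLEDGE_CUTOFF:
        return False  # exploiting future information relative to the task
    return True


print(allow_tool_call("search", "problem 1234 answer key"))          # False
print(allow_tool_call("search", "numpy broadcasting rules"))         # True
print(allow_tool_call("browse", "news", doc_date=date(2025, 1, 5)))  # False
```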
Ultimately, the shift from training models to training agents—and from agents to training systems—will redefine competitive advantage, moving it from model architecture and data to harness engineering, environment design, and closed‑loop decision making.
Source: https://x.com/JustinLin610/status/2037116325210829168
