Junyang Lin’s 10k‑Word Review: From Reasoning to Agentic Thinking in Large Models

In a detailed post‑departure analysis, Junyang Lin reviews two years of large‑model evolution, explains how o1 and DeepSeek‑R1 highlighted the limits of pure reasoning, and argues that the next breakthrough lies in agentic thinking that integrates environment interaction, tool use, and robust reinforcement‑learning infrastructure.

Machine Learning Algorithms & Natural Language Processing

Former Alibaba Qwen team lead Junyang Lin published a comprehensive 10,000‑word article titled “From ‘Reasoning’ Thinking to ‘Agentic’ Thinking,” reflecting on the past two years of large‑model progress and outlining the next research direction.

1. Lessons from o1 and DeepSeek‑R1

OpenAI’s o1 demonstrated that “thinking” can be a trainable core capability, while DeepSeek‑R1 showed that post‑training reasoning can be reproduced outside the original lab. Both models proved that deterministic, stable, and scalable feedback signals, especially in mathematics, code, and logic, are essential for extending reinforcement learning beyond simple preference supervision.

The emergence of reasoning models shifted the focus from scaling pre‑training to scaling post‑training for reasoning, requiring large‑scale deployment, high‑throughput validation, stable policy updates, and efficient sampling.

2. The Real Challenge: Merging Thinking and Instruction

Qwen 3 introduced a “mixed thinking mode” that supports both “thinking‑type” and “non‑thinking‑type” behaviors, controls a “thinking budget,” and places the fusion step after long‑chain‑of‑thought cold‑start and reasoning‑RL. However, the deeper obstacle is data: the distributions and objectives of the two modes differ fundamentally.

When data for both modes are not carefully curated, the resulting model exhibits diluted thinking behavior and reduced instruction reliability, increasing latency for enterprise users.
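A “thinking budget” of the kind Qwen 3 exposes can be enforced at decode time. The sketch below is purely illustrative (the function name, token strings, and truncation rule are invented, not Qwen’s implementation): thinking tokens are emitted until the budget is spent, then the sampler force‑closes the think block and drops the rest of the thought.

```python
# Hypothetical sketch of decode-time thinking-budget enforcement.
# Tokens between <think> and </think> count against the budget; once it
# is exhausted, the think block is force-closed and remaining thinking
# tokens are discarded. Real serving stacks differ in detail.
def decode_with_budget(stream, budget: int) -> list:
    out, used, in_think, truncated = [], 0, False, False
    for tok in stream:
        if tok == "<think>":
            in_think = True
            out.append(tok)
        elif tok == "</think>":
            in_think = False
            if not truncated:          # already force-closed: skip duplicate close
                out.append(tok)
            truncated = False
        elif in_think:
            if truncated:
                continue               # budget spent: drop the rest of the thought
            out.append(tok)
            used += 1
            if used >= budget:         # force-close the think block early
                out.append("</think>")
                truncated = True
        else:
            out.append(tok)            # normal (non-thinking) tokens pass through
    return out

toks = ["<think>", "a", "b", "c", "d", "</think>", "answer"]
print(decode_with_budget(toks, budget=2))  # ['<think>', 'a', 'b', '</think>', 'answer']
```

With budget 0 handled by the same rule, the mode degenerates to non‑thinking output, which is one way a single checkpoint can serve both behaviors.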

3. Anthropic’s Corrective Path

Anthropic positioned Claude 3.7 Sonnet as a controllable hybrid‑reasoning model with an “extended thinking” mode and a user‑set thinking budget. Claude 4 further integrated tool calling into long‑horizon reasoning, emphasizing that the thinking process should be shaped by the specific workload: code generation, agentic workflows, and so on.

4. Defining Agentic Thinking

Agentic thinking changes the optimization goal: instead of measuring the length or depth of internal reasoning, it evaluates whether the model’s thinking supports effective action in an environment. Key capabilities include deciding when to stop thinking, selecting and ordering tools, integrating noisy or partial observations, revising plans after failures, and maintaining logical consistency across multi‑turn interactions.
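The capabilities listed above can be made concrete as a minimal control loop. Everything in this sketch is hypothetical (the `AgentState` class, the `TOOLS` registry, the plan‑revision rule); it only illustrates the shape of an agentic policy: stop when the plan is exhausted, pick the next tool, fold partial observations into state, and repair the plan when a tool call fails.

```python
# Minimal, invented sketch of an agentic loop: stop decision, tool
# selection/ordering, observation integration, and plan revision.
from dataclasses import dataclass, field

@dataclass
class AgentState:
    observations: list = field(default_factory=list)
    plan: list = field(default_factory=list)   # list of (tool_name, argument)
    done: bool = False

def toy_search(query: str) -> str:
    # Stand-in for a real tool; returns a noisy/partial observation.
    return f"partial result for '{query}'"

TOOLS = {"search": toy_search}

def step(state: AgentState, max_steps: int = 5) -> AgentState:
    for _ in range(max_steps):
        if not state.plan:                       # decide when to stop thinking
            state.done = True
            break
        tool_name, arg = state.plan.pop(0)       # select and order tools
        tool = TOOLS.get(tool_name)
        if tool is None:                         # revise the plan after a failure
            state.plan.insert(0, ("search", arg))
            continue
        state.observations.append(tool(arg))     # integrate partial observations
    return state

state = step(AgentState(plan=[("lookup", "qwen3"), ("search", "agentic rl")]))
print(state.done, len(state.observations))  # True 2
```

The point of the sketch is the optimization target: reward attaches to whether this loop completes the task, not to how long the internal monologue was.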

5. Infrastructure Challenges for Agentic RL

Transitioning from benchmark‑oriented RL to interactive task‑oriented RL demands a new stack: environments become integral components, and training must be cleanly decoupled from inference. Without this decoupling, rollout throughput drops, leading to “training hunger” and inefficient GPU utilization.
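The decoupling argument can be sketched with a producer–consumer split: rollout workers interact with slow environments and push trajectories into a queue, while the learner consumes them asynchronously, so environment latency never stalls gradient updates. All names here are illustrative, not any real RL framework’s API.

```python
# Invented sketch of training/inference decoupling for agentic RL:
# rollout workers (inference side) fill a trajectory queue; the learner
# (training side) drains it, so neither blocks on the other's pace.
import queue
import threading
import time

trajectory_queue = queue.Ueue = queue.Queue(maxsize=8)

def rollout_worker(worker_id: int, n_episodes: int) -> None:
    for ep in range(n_episodes):
        time.sleep(0.01)  # stand-in for slow environment interaction
        trajectory_queue.put({"worker": worker_id, "episode": ep, "reward": 1.0})

def learner(n_updates: int) -> int:
    updates = 0
    while updates < n_updates:
        traj = trajectory_queue.get()   # blocks only until the next rollout lands
        # ... policy-gradient update from `traj` would go here ...
        updates += 1
    return updates

workers = [threading.Thread(target=rollout_worker, args=(i, 3)) for i in range(2)]
for w in workers:
    w.start()
done_updates = learner(n_updates=6)   # 2 workers x 3 episodes = 6 trajectories
for w in workers:
    w.join()
print(done_updates)  # 6
```

Without this split, the learner sits idle whenever an environment is slow, which is exactly the “training hunger” and wasted GPU time the article describes.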

Environment quality now rivals data diversity as a core research asset, requiring stability, realism, coverage, difficulty, state diversity, rich feedback, and resistance to exploitation.
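Those quality criteria imply a contract on the environment interface. The toy environment below is invented for illustration: a seeded RNG gives stability and state diversity across episodes, the dense reward gives rich feedback, and the type check on actions is a (trivial) example of resisting exploitation.

```python
# Hypothetical toy environment showing the reset/step contract agentic
# RL needs: reproducible state, diverse episodes, dense feedback, and
# basic resistance to degenerate actions.
import random

class ToyCodeEnv:
    def __init__(self, seed: int = 0):
        self.rng = random.Random(seed)   # seeded: stable, reproducible rollouts
        self.target = None

    def reset(self) -> dict:
        self.target = self.rng.randint(1, 100)   # state diversity across episodes
        return {"prompt": "guess the hidden number", "hint": self.target % 2}

    def step(self, action):
        if not isinstance(action, int):          # resist exploitation: reject malformed actions
            return {"error": "action must be int"}, -1.0, True
        distance = abs(action - self.target)
        reward = 1.0 if distance == 0 else -distance / 100   # dense, rich feedback
        return {"distance": distance}, reward, distance == 0

env = ToyCodeEnv(seed=42)
obs = env.reset()
_, reward, done = env.step(env.target)  # a perfect guess ends the episode
print(reward, done)  # 1.0 True
```

Realism, coverage, and difficulty are the hard parts a toy like this cannot show; they are exactly why the article treats environment engineering as a research asset in its own right.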

6. Future Outlook

Lin predicts that agentic thinking will dominate, replacing static, monologue‑style reasoning. Advanced systems will need to search, simulate, execute code, verify, and revise, while guarding against reward hacking and spurious optimization.

The next bottlenecks will be environment design, robust evaluators, anti‑cheating protocols, and principled interfaces between the policy and the world. Ultimately, the competitive edge will shift from model architecture to superior environment engineering, tight training‑serving integration, and “harness engineering” that orchestrates multiple specialized agents.
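“Harness engineering” in this sense can be sketched as a thin orchestrator that routes each sub‑task to the specialized agent best suited to it. The agent names and routing rule below are invented purely to illustrate the shape of such a harness.

```python
# Invented sketch of a multi-agent harness: route sub-tasks to
# specialized agents and aggregate their outputs. The agents here are
# stubs standing in for full model-backed workers.
def coder(task: str) -> str:
    return f"patch for: {task}"

def searcher(task: str) -> str:
    return f"sources for: {task}"

def verifier(task: str) -> str:
    return f"checked: {task}"

AGENTS = {"code": coder, "search": searcher, "verify": verifier}

def harness(tasks: list) -> list:
    results = []
    for kind, task in tasks:
        agent = AGENTS.get(kind, searcher)   # unknown task kinds fall back to search
        results.append(agent(task))
    return results

print(harness([("code", "fix flaky test"), ("verify", "patch builds")]))
```

In a real system the routing itself is learned or policy‑driven, and the verifier agent is where the anti‑cheating protocols mentioned above would live.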

Tags: large language models · model evaluation · AI infrastructure · agentic thinking