Why Future AI Agents Must Evolve Beyond Prompt‑Driven Workflows

The article argues that the next generation of AI agents should improve the underlying model itself, through reinforcement learning and explicit reasoning, rather than rely on pre-designed prompt-driven workflows. It surveys industry trends, technical challenges, and the shift toward treating models as products.

Model as Product

The author argues that future progress in AI agents will come from improving the underlying model rather than building more elaborate workflow pipelines. Prompt‑driven agents such as Manus can achieve short‑term results but hit hard limits on long‑term planning, multi‑step reasoning, and context retention.

Next‑Generation Model Forms

Recent work combines reinforcement learning (RL) with explicit reasoning to create agents that can autonomously plan, search, and select tools without external prompts. Two concrete examples are:

OpenAI DeepResearch: a purpose-built research-language model trained from scratch to perform end-to-end web search, document synthesis, and report generation. It learns browsing actions (click, scroll, query) via RL and produces long, structured outputs with traceable reasoning steps.

Anthropic Claude 3.7 Sonnet: described as an “agent” that dynamically decides its execution flow and tool usage, handling complex programming and reasoning tasks without a fixed workflow.

Key Observations

Scaling general‑purpose models shows diminishing returns; GPT‑4.5 demonstrated linear capability gains while compute costs grew exponentially.

Opinionated training (RL + reasoning) yields outsized performance gains, enabling tiny models to excel at math, coding, or even game playing (e.g., Pokémon).

Inference costs are falling rapidly, making high-throughput token usage economically viable (a back-of-envelope illustration follows).
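To make the economics concrete, here is a small calculation in Python. Every price and token count below is an illustrative assumption, not a quote from any provider.

```python
# Back-of-envelope economics of a long agentic research run.
# Every number below is an illustrative assumption, not a provider quote.

PRICE_PER_MTOK_USD = 2.00        # assumed blended price per million tokens
TOKENS_PER_TRAJECTORY = 200_000  # assumed long browse-and-synthesize trace
TRAJECTORIES_PER_REPORT = 4      # assumed best-of-n parallel attempts

cost_per_report = (PRICE_PER_MTOK_USD
                   * TOKENS_PER_TRAJECTORY / 1_000_000
                   * TRAJECTORIES_PER_REPORT)
print(f"Cost per report: ${cost_per_report:.2f}")  # -> Cost per report: $1.60
```

Even under these rough assumptions, a multi-hundred-thousand-token research trajectory costs on the order of a dollar, which is what makes token-hungry agentic workloads plausible at scale.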

Implications of the Shift

Complexity moves from deployment to training: models must be trained to handle a wide range of actions, edge cases, and tool integrations. Value creation will be captured by model providers, and the traditional API economy is expected to collapse within 2‑3 years as closed‑source vendors sell models directly rather than offering hosted APIs.

DeepResearch Technical Details

DeepResearch is not a standard LLM wrapper; it is a new research-language model (see the OpenAI system card: https://cdn.openai.com/deep-research-system-card.pdf). It learns core browsing primitives through RL (a schematic action space is sketched after this list), enabling it to:

Interpret a user query and decompose it into sub‑tasks.

Ask clarifying questions when the intent is ambiguous.

Choose between generic web search or specialized API calls.

Iteratively browse, refine queries, and discard unproductive paths.

Log each step, providing a degree of explainability.
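OpenAI has not published DeepResearch's action schema. As a rough illustration only, the browsing primitives above can be modeled as a small typed action space that the policy emits at each step; all names and fields here are hypothetical.

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical action space for an RL-trained browsing policy.
# DeepResearch's real schema is not public; these types only mirror
# the primitives listed above (query, click, scroll, final answer).

@dataclass
class Search:          # issue a web-search query
    query: str

@dataclass
class Click:           # follow a link on the current page
    link_id: int

@dataclass
class Scroll:          # move within the current page
    direction: str     # "up" or "down"

@dataclass
class Answer:          # terminate and emit the final report
    report: str

BrowserAction = Union[Search, Click, Scroll, Answer]
```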

Compared to “DeepSearch” products from other vendors (e.g., Perplexity, Google), DeepResearch’s performance gains stem from genuine model‑level training rather than minor fine‑tuning tricks.

Anthropic Agent Definition

Anthropic defines an agent as a system that “dynamically decides its execution flow and tool usage, fully controlling task completion.” This contrasts with many so‑called agent companies that merely stitch together predefined code paths (workflows) linking LLMs to external tools.
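The distinction is easiest to see in code. Below is a schematic contrast, not Anthropic's implementation: the workflow hard-codes its execution path, while the agent lets the model pick the next tool at every turn (the llm and tools interfaces are hypothetical).

```python
# Schematic contrast between a workflow and an agent (illustrative only).
# `llm` and `tools` are hypothetical interfaces, not any vendor's API.

def workflow(task, llm, search, summarize):
    # Workflow: the code path is fixed in advance; the model fills in slots.
    docs = search(task)
    notes = llm(f"Summarize these documents for '{task}': {docs}")
    return summarize(notes)

def agent(task, llm, tools):
    # Agent: the model itself decides the execution flow and tool usage.
    history = [f"Task: {task}"]
    while True:
        decision = llm("\n".join(history))   # model returns the next action
        if decision.tool == "finish":
            return decision.argument         # model controls task completion
        result = tools[decision.tool](decision.argument)
        history.append(f"{decision.tool}({decision.argument}) -> {result}")
```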

Training Challenges and Scaling RL‑Based Agents

Public datasets for search-oriented agents are scarce; most open datasets focus on mathematics. To scale, researchers propose the following (a sketch of the group-relative scoring appears after this list):

Simulated environments and synthetic data pipelines that generate browsing trajectories.

Reward functions based on GRPO (Group Relative Policy Optimization) and rubric engineering.

Massive parallel exploration (e.g., 16 concurrent trajectories per GPU) leading to billions of simulated web requests, shifting the bottleneck from compute to bandwidth.
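At the heart of GRPO is a group-relative advantage: each trajectory is scored against the other trajectories sampled for the same prompt, removing the need for a separate value network. A minimal, runnable sketch, assuming scalar verifier rewards and omitting the policy-gradient update and KL penalty:

```python
import statistics

def grpo_advantages(group_rewards: list[float]) -> list[float]:
    """Group Relative Policy Optimization scores each trajectory against
    the other trajectories sampled for the same prompt, so no separate
    value network is needed."""
    mean = statistics.mean(group_rewards)
    std = statistics.pstdev(group_rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in group_rewards]

# 16 trajectories sampled for one query, scored by a verifier in [0, 1]:
rewards = [1.0, 1.0, 0.5, 0.0] + [0.0] * 12
print(grpo_advantages(rewards))
```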

Real‑World Agent Search Process

User asks a question; the agent decomposes and infers intent.

If ambiguous, the agent asks clarifying questions.

The agent decides whether to perform a generic search or invoke a specialized API.

It iteratively browses, refines queries, and discards dead ends.

All actions are logged, providing traceability.

Because retrieval, filtering, and synthesis all happen inside one model-driven loop, this pipeline eliminates the need for separate data preprocessing stages.
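Read as code, the steps above collapse into a single loop in which the model both plans and acts. The sketch below is hypothetical; every method on model is an assumed interface, named only to mirror the numbered steps.

```python
# Hypothetical sketch of the search process above; every method on
# `model` is an assumed interface, not any product's actual API.

def research(question, model, web_search, call_api, log):
    plan = model.decompose(question)                   # decompose, infer intent
    if plan.ambiguous:
        question = model.ask_clarifying(question)      # clarify ambiguous intent
        plan = model.decompose(question)
    findings = []
    for subtask in plan.subtasks:
        results = (call_api(subtask) if subtask.has_known_source
                   else web_search(subtask.query))     # generic search vs. API
        while not model.sufficient(subtask, results):  # refine, discard dead ends
            subtask.query = model.refine(subtask, results)
            results = web_search(subtask.query)
        findings.append(results)
        log(subtask, results)                          # every action is logged
    return model.write_report(question, findings)
```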

Broader Applications

The same architecture can be applied to network configuration automation, log troubleshooting, or financial data standard conversion (e.g., ISO 20022 ↔ MT103). Only a few large labs currently possess the data and infrastructure to build such agents, creating a concentration of capability.

The “Bitter Lesson”

Hard‑coding knowledge (prompt engineering, rule‑based systems) yields short‑term gains but caps long‑term performance. The lasting breakthrough comes from massive computation and learning, not from manually crafted heuristics.

Scaling RL‑Based Agents – Practical Pipeline

A feasible training pipeline for a search-oriented agent could be as follows (a code skeleton of the verifier and RL stages appears after the list):

Create a fixed-size simulated web environment from a large corpus (e.g., Common Crawl), served as webpages generated on the fly.

Pre‑train with supervised fine‑tuning (SFT) on existing search logs or synthetic query‑answer pairs to “warm‑up” the model.

Design a set of verification functions (verifiers) for the target tasks (e.g., factual correctness, code execution). Open-source verifier libraries such as William Brown's verifiers can be adapted.

Apply GRPO or similar RL algorithms, running many parallel trajectories (e.g., 16 per GPU) where each step may involve up to 100 simulated page loads.

After the model learns effective search policies, perform a second SFT stage focused on high‑quality final report generation, using rubric‑engineered data splits.
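A compressed skeleton of the verifier and RL stages might look as follows. The verifier is a toy, and the policy and env interfaces are assumptions for illustration rather than the API of any particular library.

```python
import statistics

def exact_answer_verifier(final_report: str, gold_answer: str) -> float:
    """Toy verifier: reward 1.0 if the report contains the gold answer.
    Real verifiers would check factual claims or execute generated code."""
    return 1.0 if gold_answer.lower() in final_report.lower() else 0.0

def train_step(policy, env, queries, group_size=16):
    # `policy` and `env` are assumed interfaces, not a real library's API.
    for query in queries:
        # Massive parallel exploration: one group of trajectories per query.
        group = [env.rollout(policy, query) for _ in range(group_size)]
        rewards = [exact_answer_verifier(t.final_report, env.gold(query))
                   for t in group]
        mean = statistics.mean(rewards)
        std = statistics.pstdev(rewards) or 1.0
        advantages = [(r - mean) / std for r in rewards]  # group-relative, as in GRPO
        policy.update(group, advantages)  # policy-gradient step, details omitted
```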

Market Dynamics

Key signals:

Early “model‑as‑product” examples: Claude Code, DeepSearch (no public API).

Application-layer companies (Cursor, Windsurf, Perplexity) are quietly training their own small models to avoid displacement.

Wrapper firms face a binary choice: develop proprietary models or become obsolete.

Capital markets currently undervalue RL‑driven model training, despite its potential to dramatically reduce training costs and open new verticals.

Conclusion

To democratize agent development, the community should publish:

Open‑source verification tools for RL rewards.

Public GRPO‑style datasets and reward specifications.

Reusable simulation environments that emulate web browsing at scale.

These resources would enable researchers beyond the handful of large labs to build truly autonomous agents that operate without reliance on brittle prompt‑driven workflows.
