Why AI Agents Fail: 70% Failure Rate & How Interleaved Thinking Improves Reliability
Recent CMU and Salesforce studies reveal that top‑tier AI agents such as Gemini 2.5 Pro, Claude 3.7 Sonnet and GPT‑4o fail on 69‑70% of multi‑step tasks. MiniMax‑M2’s Interleaved Thinking reduces failures dramatically, suggesting that execution mechanisms, not model size, are the key to reliable AI agents.
Agent Failure Rates
Joint research by Carnegie Mellon University (CMU) and Salesforce shows that in realistic office‑scenario tests, top models such as Gemini 2.5 Pro, Claude 3.7 Sonnet and GPT‑4o have an agent failure rate of 69%‑70% on multi‑step tasks. On CRM‑specific professional tasks the success rate drops to just 55%, and on chains of six consecutive tasks only about 2.5 out of 10 attempts complete successfully.
MiniMax‑M2 Experiments
MiniMax ran a strict controlled experiment on the M2 model: keeping the reasoning state between turns improved performance on the Tau² (tool‑use) benchmark by 35.9% and on the BrowseComp (web‑browsing) benchmark by 40.1%, relative to an otherwise identical version that discards that state.
MiniMax engineering lead Skyler Miao noted that Anthropic introduced Interleaved Thinking five months earlier, but community support remains limited because the OpenAI Chat Completion API does not return reasoning content.
MiniMax‑M2 is the first open‑source model to fully support Interleaved Thinking, at only about 8% of Claude’s price.
One week after release, M2 ranked among the top‑3 models on the OpenRouter platform.
Execution Mechanism Defects
Agent tasks differ from ordinary Q&A: they involve multi‑step tool‑call chains. A typical workflow (e.g., researching MiniMax‑M2 and writing a technical report) requires searching documentation, community discussions, media reports, and code examples, then synthesizing the information. Each step depends on uncertain tool outputs.
The core issue is handling tool‑call failures. The ideal flow is:
Search official docs → success
Search community discussions → success
Search media reports → success
Generate report → success
In reality, a failure (e.g., the community search returns irrelevant results) forces the agent either to continue blindly, producing an incomplete report, or to adjust its strategy by changing keywords and retrying. The latter mirrors the human “plan → act → reflect” cycle.
Traditional agents follow a “plan → act → act → act” pattern and lack reflection.
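The retry‑and‑reflect behaviour described above can be sketched as a toy loop. Here `search()` and `is_relevant()` are hypothetical stand‑ins for real tool calls, and the keyword adjustment is a deliberately simplistic heuristic, not a real reflection strategy:

```python
# Minimal sketch of a "plan -> act -> reflect" agent loop.
# search() simulates a tool call against a tiny fake index;
# the "community" query fails on the first attempt on purpose.

def search(query: str) -> list[str]:
    fake_index = {
        "MiniMax-M2 docs": ["official documentation page"],
        "MiniMax-M2 community": [],  # simulated failed search
        "MiniMax-M2 agent community": ["forum thread on tool use"],
    }
    return fake_index.get(query, [])

def is_relevant(results: list[str]) -> bool:
    return len(results) > 0

def research(queries: dict[str, str], max_retries: int = 2) -> dict[str, list[str]]:
    gathered = {}
    for step, query in queries.items():
        results = search(query)                              # act
        retries = 0
        while not is_relevant(results) and retries < max_retries:
            # reflect: adjust the keywords and retry (toy heuristic)
            query = query.replace("community", "agent community")
            results = search(query)                          # act again
            retries += 1
        gathered[step] = results
    return gathered

report_sources = research({
    "docs": "MiniMax-M2 docs",
    "community": "MiniMax-M2 community",
})
```

A “plan → act → act → act” agent would stop at the empty community result and write the report anyway; the reflect step is what recovers the missing source.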
Thinking Paradigms
Human problem solving uses a “plan → act → reflect” loop. This concept was formalized as Chain of Thought (CoT) by Google in 2022, where the model thinks before answering. Extended Thinking (2024) allocates many tokens for deep reasoning before acting, but still follows a “think‑then‑act” sequence.
Interleaved Thinking, introduced with Claude Sonnet 4 (May 2025), allows the model to output both a thinking block (the reasoning process) and a text block (the final answer) in the same response, and to send the entire content back in the next turn, enabling dynamic adjustment and self‑correction.
```json
{
  "content": [
    {"type": "thinking", "thinking": "User asked me to search community discussions, but the results were poor. I need to change the keyword..."},
    {"type": "text", "text": "I found more relevant discussions after adjusting the query..."}
  ]
}
```

This interleaving of thinking and acting is analogous to AlphaGo’s Monte‑Carlo Tree Search, which re‑evaluates the plan after each move.
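In practice, the key detail is that the next request replays the full assistant content, thinking block included. A minimal sketch, using the Anthropic‑style block format with illustrative payloads:

```python
# An assistant turn in the Anthropic Messages block format:
# a thinking block (reasoning) followed by a text block (answer).
assistant_turn = {
    "role": "assistant",
    "content": [
        {"type": "thinking",
         "thinking": "The community search returned poor results; change the keyword..."},
        {"type": "text",
         "text": "I found more relevant discussions after adjusting the query..."},
    ],
}

# The next request replays the FULL assistant content, thinking included,
# so the model builds on its earlier reasoning instead of restarting.
next_request_messages = [
    {"role": "user", "content": "Search community discussions about MiniMax-M2"},
    assistant_turn,
    {"role": "user", "content": "Continue the analysis"},
]

# Filtering out the thinking block (as string-only APIs effectively do)
# discards the reasoning state and leaves only the final answer.
stripped = [b for b in assistant_turn["content"] if b["type"] != "thinking"]
```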
OpenAI API Design Flaw
The OpenAI Chat Completion API only allows a single content string in the assistant message, so reasoning steps cannot be returned to the model. By contrast, Anthropic’s Messages API accepts an array of blocks, enabling thinking and text blocks to be transmitted.
OpenAI Chat Completion request (assistant content is a single string):

```json
{
  "model": "gpt-4o",
  "messages": [
    {"role": "user", "content": "Help me search MiniMax-M2"},
    {"role": "assistant", "content": "I found the information..."},
    {"role": "user", "content": "Continue analysis"}
  ]
}
```

Anthropic Messages request (assistant content is an array of blocks):

```json
{
  "model": "claude-sonnet-4-20250514",
  "messages": [
    {"role": "user", "content": "Help me search MiniMax-M2"},
    {"role": "assistant", "content": [
      {"type": "thinking", "thinking": "I will first search official docs, then community discussions..."},
      {"type": "text", "text": "Here are the relevant results..."}
    ]}
  ]
}
```

Because millions of applications rely on the OpenAI API, this incompatibility hampers the adoption of Interleaved Thinking.
MiniMax‑M2 Ecosystem Push
MiniMax offers two APIs:
OpenAI‑compatible API adds a reasoning_details field that returns the full reasoning blocks, preserving backward compatibility.
Anthropic‑compatible API uses the native block format, allowing direct use of Claude‑style Interleaved Thinking.
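With the OpenAI‑compatible endpoint, the client’s job is to echo `reasoning_details` back on the next turn. A minimal sketch; the internal structure of `reasoning_details` shown here is an assumption for illustration, only the field name comes from the text above:

```python
# Sketch: rebuilding an assistant message for the next turn while keeping
# the reasoning returned in a reasoning_details field (as on MiniMax's
# OpenAI-compatible API). Payload shape inside the field is assumed.

def to_history_message(api_message: dict) -> dict:
    """Build the next-turn assistant message, preserving reasoning state."""
    message = {"role": "assistant", "content": api_message["content"]}
    if "reasoning_details" in api_message:
        # Echo the reasoning back so the model keeps its state across turns.
        message["reasoning_details"] = api_message["reasoning_details"]
    return message

# Illustrative response message from the API:
response_message = {
    "role": "assistant",
    "content": "Here are the relevant results...",
    "reasoning_details": [
        {"type": "reasoning.text",
         "text": "First search official docs, then community discussions..."}
    ],
}

history_entry = to_history_message(response_message)
```

Clients that copy only `content` into the history, as standard OpenAI‑style code does, silently drop the reasoning state this mechanism depends on.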
MiniMax‑M2 (230 B total parameters, 10 B activated, open‑source) achieves about 90% of Claude Sonnet 4’s performance at roughly 8% of the cost.
Benchmark Results (M2 vs. Baseline)
| Benchmark | Keep State | Drop State | Absolute ↑ | Relative ↑ | Task Type |
| --- | --- | --- | --- | --- | --- |
| SWE‑Bench Verified | 69.4 | 67.2 | +2.2 | +3.3% | Software engineering |
| Tau² (Tool Use) | 87.0 | 64.0 | +23.0 | +35.9% | Tool use |
| BrowseComp (Web) | 44.0 | 31.4 | +12.6 | +40.1% | Web browsing |
| GAIA | 75.7 | 67.9 | +7.8 | +11.5% | General agent |
| xBench | 72.0 | 66.0 | +6.0 | +9.1% | Comprehensive ability |

The gains come not from larger parameters or more data but from the mechanism of retaining reasoning state. Tasks that require strategy adjustments benefit the most.
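The relative gains in the table are simply the absolute gain over the drop‑state baseline, which is easy to verify:

```python
# Check of the table's relative gains: relative = (keep - drop) / drop.
benchmarks = {
    "SWE-Bench Verified": (69.4, 67.2),
    "Tau2 (Tool Use)":    (87.0, 64.0),
    "BrowseComp (Web)":   (44.0, 31.4),
    "GAIA":               (75.7, 67.9),
    "xBench":             (72.0, 66.0),
}

for name, (keep, drop) in benchmarks.items():
    rel = (keep - drop) / drop * 100
    print(f"{name}: +{keep - drop:.1f} absolute, +{rel:.1f}% relative")
```

Note how the same mechanism yields +3.3% on SWE‑Bench but +40.1% on BrowseComp: the lower the drop‑state baseline and the more mid‑task strategy changes the benchmark demands, the more retaining reasoning state pays off.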
Developer Recommendations
For the best performance, use Claude via Anthropic’s API with thinking mode (higher cost).
For a cost‑effective open‑source option, choose MiniMax‑M2, which supports both OpenAI‑compatible and Anthropic‑compatible endpoints.
Avoid the standard OpenAI Chat Completion API for Interleaved Thinking; use MiniMax’s OpenAI‑compatible endpoint that includes reasoning_details or the Anthropic‑compatible endpoint.
Always transmit the full content (both thinking and text blocks) to preserve the reasoning chain.
Expect Interleaved Thinking to become the default for agents within the next 1‑2 years, with more models (Gemini, Llama, etc.) adopting it.
Mechanism Over Scale
Increasing model size improves raw intelligence but does not guarantee reliability. The MiniMax‑M2 experiments demonstrate that simply changing the execution mechanism can boost performance by up to 40% on the same model, underscoring that better design beats larger parameters.
References
CMU TheAgentCompany Benchmark: https://superface.ai/blog/agent-reality-gap
Salesforce CRM AI Research: https://arxiv.org/abs/2411.02305
MiniMax‑M2 Official Blog: https://www.minimax.io/news/minimax-m2
Anthropic Extended Thinking: https://docs.anthropic.com/en/docs/build-with-claude/extended-thinking
Simon Willison Blog: https://simonwillison.net/2025/Oct/29/minimax-m2/