Which LLM Wins the Agent Benchmark? PinchBench Success, Speed, and Cost Rankings Revealed
PinchBench evaluates 32 mainstream large language models on success rate, execution speed, and cost for real‑world agent tasks, highlighting top performers such as Gemini‑3‑flash‑preview, MiniMax‑M2.1, and Kimi‑K2.5, and explaining why traditional AI benchmarks no longer predict agent effectiveness.
OpenClaw founder Peter Steinberger introduced PinchBench, a benchmark that ranks 32 mainstream large language models on three dimensions—success rate, execution speed, and cost—specifically for AI agent workloads.
Choosing the right model is the single highest‑leverage decision when building an agent system.
Success Rate Rankings
google/gemini-3-flash-preview – 95.1% success, ranking first.
minimax-m2.1 – 93.6% success, surpassing Claude Sonnet 4.5 (92.7%) and GPT‑4o (85.2%).
moonshotai/kimi-k2.5 – 93.4% success, third place; two of the top three are Chinese models.
The benchmark measures the percentage of tasks successfully completed in a standardized OpenClaw agent test. Tasks cover practical work such as scheduling meetings, prioritizing emails, writing code, and managing files, rather than trivia or math problems.
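As a rough illustration of what a metric like this reduces to, here is a minimal sketch; the `TaskResult` shape and its field names are assumptions for illustration, not PinchBench's actual schema or harness:

```typescript
// Hypothetical sketch only: PinchBench's real harness is not shown here,
// and TaskResult plus its field names are illustrative assumptions.
interface TaskResult {
  task: string;     // e.g. "schedule a meeting", "triage inbox"
  passed: boolean;  // did the agent finish the task end-to-end?
  seconds: number;  // wall-clock execution time
  usd: number;      // API spend for the run
}

// Success rate is the share of tasks the agent completed, as a percentage.
function successRate(results: TaskResult[]): number {
  const passed = results.filter((r) => r.passed).length;
  return (passed / results.length) * 100;
}
```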
Speed Rankings
minimax/minimax-m2.5
google/gemini-2.0-flash
meta-llama/llama-3.1-70b
Cost Rankings
openai/gpt-5-nano
google/gemini-2.5-flash-lite
mistralai/devstral-2512
Overall Assessment
MiniMax‑M2.1 and Kimi‑K2.5 offer the best balance of cost and performance, while the Claude Opus series is comparatively expensive. The per‑dimension ranks below make the trade‑off concrete, and a rough composite score is sketched after the list.
GPT‑5‑Nano: Cost #1, Success #9, Speed #16
Gemini‑2.5‑Flash‑Lite: Cost #2, Success #14, Speed #13
MiniMax‑M2.1: Cost #5, Success #2, Speed #22
Kimi‑K2.5: Cost #8, Success #3, Speed #27
Claude Opus 4.6: Cost #20, Success #7, Speed #30
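PinchBench publishes the three rankings separately; any composite score is our own assumption. Still, a weighted average of per‑dimension ranks (lower is better) shows how the trade‑off plays out under one arbitrary, success‑heavy weighting:

```typescript
// Assumption: PinchBench defines no composite score. The weights here are
// arbitrary and only illustrate how the three rank dimensions trade off.
interface ModelRanks {
  name: string;
  cost: number;    // cost rank (1 = cheapest)
  success: number; // success-rate rank (1 = best)
  speed: number;   // speed rank (1 = fastest)
}

const WEIGHTS = { cost: 0.3, success: 0.5, speed: 0.2 };

function compositeRank(m: ModelRanks): number {
  return WEIGHTS.cost * m.cost + WEIGHTS.success * m.success + WEIGHTS.speed * m.speed;
}

// Ranks taken from the list above.
const models: ModelRanks[] = [
  { name: "GPT-5-Nano",            cost: 1,  success: 9,  speed: 16 },
  { name: "Gemini-2.5-Flash-Lite", cost: 2,  success: 14, speed: 13 },
  { name: "MiniMax-M2.1",          cost: 5,  success: 2,  speed: 22 },
  { name: "Kimi-K2.5",             cost: 8,  success: 3,  speed: 27 },
  { name: "Claude Opus 4.6",       cost: 20, success: 7,  speed: 30 },
];

// Lowest weighted rank wins under these weights.
const best = [...models].sort((a, b) => compositeRank(a) - compositeRank(b))[0];
console.log(best.name, compositeRank(best).toFixed(1)); // MiniMax-M2.1 6.9
```

Under this particular weighting, MiniMax‑M2.1 comes out on top (6.9, versus 8.0 for GPT‑5‑Nano and 15.5 for Claude Opus 4.6), consistent with the article's cost‑performance verdict; different weights would of course reorder the field.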
Traditional "intelligence" benchmarks such as MMLU and HumanEval increasingly fail to predict which models can operate effectively as agents. Agent tasks demand multi‑step instruction following, tool invocation, handling of ambiguous scenarios, and error recovery, abilities that differ fundamentally from one‑shot question answering.
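To make concrete what "multi‑step instruction following, tool invocation, and error recovery" look like in code, here is a minimal agent loop; everything in it (the `Step` shape, `callModel`, the tool registry) is a hypothetical sketch, not OpenClaw's or any vendor's actual API:

```typescript
// Illustrative agent loop: the model proposes a tool call, the runtime
// executes it, and failures are fed back so the model can recover.
// Tool, Step, and callModel are hypothetical placeholders.
type Tool = (args: Record<string, string>) => Promise<string>;

interface Step {
  tool?: { name: string; args: Record<string, string> };
  finalAnswer?: string;
}

async function runAgent(
  task: string,
  tools: Record<string, Tool>,
  callModel: (transcript: string) => Promise<Step>,
  maxSteps = 10,
): Promise<string> {
  let transcript = `Task: ${task}`;
  for (let i = 0; i < maxSteps; i++) {
    const step = await callModel(transcript);
    if (step.finalAnswer) return step.finalAnswer; // multi-step plan finished
    if (!step.tool) continue;
    try {
      const result = await tools[step.tool.name](step.tool.args);
      transcript += `\n[${step.tool.name}] -> ${result}`;
    } catch (err) {
      // Error recovery: surface the failure instead of aborting, so the
      // model can retry with different arguments or a different tool.
      transcript += `\n[${step.tool.name}] ERROR: ${String(err)}`;
    }
  }
  throw new Error("Agent did not finish within the step budget");
}
```

A benchmark like PinchBench effectively measures how reliably a model drives a loop of this shape to completion, which is why its rankings can diverge sharply from single‑turn Q&A scores.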