Which LLM Wins the Agent Benchmark? PinchBench Success, Speed, and Cost Rankings Revealed

PinchBench evaluates 32 mainstream large language models on success rate, execution speed, and cost for real‑world agent tasks, highlighting top performers like Gemini‑3‑flash‑preview, MiniMax‑M2.1, and Kimi‑K2.5, and explains why traditional AI benchmarks no longer predict agent effectiveness.


OpenClaw founder Peter Steinberger introduced PinchBench, a benchmark that ranks 32 mainstream large language models on three dimensions—success rate, execution speed, and cost—specifically for AI agent workloads.

Choosing a model is the highest‑leverage decision for an agent system.

Success Rate Rankings

1. google/gemini-3-flash-preview – 95.1% success.
2. minimax-m2.1 – 93.6% success, surpassing Claude Sonnet 4.5 (92.7%) and GPT‑4o (85.2%).
3. moonshotai/kimi-k2.5 – 93.4% success; two of the top three are Chinese models.

The benchmark measures the percentage of tasks successfully completed in a standardized OpenClaw agent test. Tasks cover practical work such as scheduling meetings, prioritizing emails, writing code, and managing files, rather than trivia or math problems.
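The headline metric itself is straightforward to compute. Below is a minimal sketch of deriving a per-model success rate from run logs, assuming each log entry is a pass/fail record per task; the field names and task labels are illustrative, not PinchBench's actual schema.

```python
from collections import defaultdict

def success_rates(results):
    """Per-model success rate from pass/fail task records.

    `results` is assumed to be an iterable of dicts like
    {"model": ..., "task": ..., "passed": bool}; this shape is
    a hypothetical stand-in for PinchBench's real log format.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for r in results:
        total[r["model"]] += 1
        passed[r["model"]] += bool(r["passed"])  # True counts as 1
    return {m: passed[m] / total[m] for m in total}

# Tiny synthetic run covering two models:
runs = [
    {"model": "model-a", "task": "schedule_meeting", "passed": True},
    {"model": "model-a", "task": "triage_email",     "passed": False},
    {"model": "model-b", "task": "schedule_meeting", "passed": True},
]
print(success_rates(runs))  # {'model-a': 0.5, 'model-b': 1.0}
```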

Speed Rankings

The three fastest models, in order:

1. minimax/minimax-m2.5
2. google/gemini-2.0-flash
3. meta-llama/llama-3.1-70b

Cost Rankings

The three lowest-cost models, in order (a sketch of typical per-run cost accounting follows the list):

1. openai/gpt-5-nano
2. google/gemini-2.5-flash-lite
3. mistralai/devstral-2512
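The article does not spell out PinchBench's cost methodology, but agent run costs are commonly derived from token usage times per-token pricing. A minimal sketch of that accounting; the price table below is entirely hypothetical, not actual vendor pricing:

```python
# Per-million-token prices in USD. These numbers are assumptions
# for illustration only, not the models' real rates.
PRICES = {
    "gpt-5-nano":   {"input": 0.05, "output": 0.40},
    "minimax-m2.1": {"input": 0.30, "output": 1.20},
}

def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one agent run under the assumed price table."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: a run that consumed 120k input and 8k output tokens.
print(f"${run_cost('gpt-5-nano', 120_000, 8_000):.4f}")  # $0.0092
```

Agent workloads are input-heavy (long histories re-sent every step), which is why cheap-input models like GPT‑5‑Nano can dominate a cost ranking even with unremarkable speed.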

Overall Assessment

MiniMax‑M2.1 and Kimi‑K2.5 offer the best cost‑performance balance, while the Claude Opus series is comparatively expensive. The per-model rank positions below (lower is better on each axis) make the trade-offs concrete; one way to combine them is sketched after the list.

GPT‑5‑Nano: Cost #1, Success #9, Speed #16

Gemini‑2.5‑Flash‑Lite: Cost #2, Success #14, Speed #13

MiniMax‑M2.1: Cost #5, Success #2, Speed #22

Kimi‑K2.5: Cost #8, Success #3, Speed #27

Claude Opus 4.6: Cost #20, Success #7, Speed #30
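One simple way to reason about these trade-offs is a weighted combination of the three rank positions. A minimal sketch using the ranks above; the weights are illustrative assumptions, not part of PinchBench, which publishes the three rankings separately:

```python
# Rank positions from the article (lower is better on each axis).
ranks = {
    "gpt-5-nano":            {"cost": 1,  "success": 9,  "speed": 16},
    "gemini-2.5-flash-lite": {"cost": 2,  "success": 14, "speed": 13},
    "minimax-m2.1":          {"cost": 5,  "success": 2,  "speed": 22},
    "kimi-k2.5":             {"cost": 8,  "success": 3,  "speed": 27},
    "claude-opus-4.6":       {"cost": 20, "success": 7,  "speed": 30},
}

# Illustrative weights: success matters most for agent work,
# then cost, then speed. These are assumptions, not PinchBench's.
WEIGHTS = {"success": 0.5, "cost": 0.3, "speed": 0.2}

def composite(r):
    """Weighted average of rank positions; lower is better."""
    return sum(WEIGHTS[k] * r[k] for k in WEIGHTS)

for model, r in sorted(ranks.items(), key=lambda kv: composite(kv[1])):
    print(f"{model:24s} weighted rank {composite(r):5.1f}")
```

Under these weights MiniMax‑M2.1 comes out first and Claude Opus 4.6 last, consistent with the assessment above.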

[Figure: Success Rate vs. Cost]

Traditional "intelligence" benchmarks such as MMLU and HumanEval increasingly fail to predict which models can effectively operate as agents; agent tasks demand abilities like multi‑step instruction following, tool invocation, handling ambiguous scenarios, and error recovery, which differ fundamentally from simple question‑answering.
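To make that difference concrete, here is a minimal sketch of the plan-act-observe loop an agent benchmark exercises; the tool names and the canned call_llm stub are hypothetical stand-ins for a real LLM API and tool registry:

```python
import json

# Hypothetical tool registry: a real agent stack would define its
# own tools and wire call_llm to an actual model API.
TOOLS = {
    "list_files": lambda path=".": ["notes.md", "todo.txt"],
    "read_file":  lambda path="": f"(contents of {path})",
}

def call_llm(messages):
    """Stand-in for a real model call; always requests one tool here."""
    return json.dumps({"tool": "list_files", "args": {"path": "."}})

def agent_loop(task, max_steps=5):
    """What agent benchmarks exercise: plan, act, observe, recover."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_llm(messages)
        try:
            action = json.loads(reply)            # model picks the next step
            if action["tool"] == "finish":        # model judges task complete
                return messages
            result = TOOLS[action["tool"]](**action["args"])
        except (json.JSONDecodeError, KeyError, TypeError) as err:
            # Error recovery: report the failure back instead of crashing.
            messages.append({"role": "system", "content": f"tool error: {err}"})
            continue
        messages.append({"role": "tool", "content": str(result)})
    return messages

print(len(agent_loop("organize my files")))  # 6: the task plus 5 tool results
```

The parts a static QA benchmark never touches are exactly the loop body: choosing the next tool, parsing structured output, and recovering when a call fails.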

Tags: cost efficiency, success rate, agent AI, OpenClaw, PinchBench, LLM benchmark, execution speed
Written by PaperAgent. Daily updates, analyzing cutting-edge AI research papers.