PinchBench: Open‑Source Benchmark for Evaluating LLM‑Powered AI Agents like OpenClaw
PinchBench is an open‑source benchmark that measures the success rate, speed, and cost of large language models serving as the core of AI agents such as OpenClaw, across a suite of 23 real‑world tasks. It ships with concrete rankings, usage instructions, and a public GitHub repository for developers.
PinchBench Overview
PinchBench is an open‑source AI‑agent benchmark that evaluates large language models (LLMs) used as the core of OpenClaw agents. It runs each model on a fixed suite of real‑world tasks and reports three metrics: Success Rate (percentage of tasks completed successfully), Speed (wall‑clock time per task) and Cost (model inference cost).
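To make the three metrics concrete, here is a minimal Python sketch of how they could be aggregated from per‑task results. The result schema (task, success, seconds, usd) is a hypothetical stand‑in for illustration, not PinchBench's actual output format:
# Minimal sketch of aggregating the three PinchBench metrics.
# The per-task result schema below is a hypothetical assumption.
from statistics import mean

results = [
    {"task": "task_01_calendar", "success": True,  "seconds": 42.0, "usd": 0.031},
    {"task": "task_02_stock",    "success": False, "seconds": 95.5, "usd": 0.088},
]

success_rate = 100 * sum(r["success"] for r in results) / len(results)  # % of tasks passed
avg_speed = mean(r["seconds"] for r in results)                         # wall-clock time per task
total_cost = sum(r["usd"] for r in results)                             # summed inference cost

print(f"Success Rate: {success_rate:.1f}%  Speed: {avg_speed:.1f}s/task  Cost: ${total_cost:.3f}")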
Task Suite
The benchmark contains 23 tasks grouped into eight categories. For each category, typical examples and the evaluation focus are listed below (a sketch of one possible suite layout follows the list):
Productivity: calendar scheduling, daily summary – focus on time parsing and event creation.
Research: stock price lookup, meeting info, market analysis – focus on web search, data extraction and synthesis.
Writing: blog posts, emails, tone polishing – focus on tone control and formatting.
Programming: weather scripts, project scaffolding – focus on code generation and file operations.
Analysis: Excel processing, PDF summarisation – focus on data handling and document understanding.
Email: inbox triage, search filtering – focus on email management.
Memory: context recall, knowledge management – focus on long‑term memory capability.
Skills: ClawHub skill discovery and integration – focus on OpenClaw ecosystem integration.
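The sketch below shows one plausible way such a categorised suite could be declared in Python. The field names and declaration style are assumptions for illustration, not the repository's actual schema; the two task IDs match those used in the quick-start examples further down:
# Hypothetical declaration of part of the 23-task suite, grouped by category.
# Field names and structure are illustrative assumptions, not PinchBench's schema.
from dataclasses import dataclass

@dataclass
class Task:
    task_id: str   # stable identifier, e.g. the value passed to --suite
    category: str  # one of the eight categories
    focus: str     # what the evaluation checks

SUITE = [
    Task("task_01_calendar", "Productivity", "time parsing and event creation"),
    Task("task_02_stock", "Research", "web search and data extraction"),
    # ... remaining tasks in the 23-task suite
]

by_category: dict[str, list[Task]] = {}
for t in SUITE:
    by_category.setdefault(t.category, []).append(t)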
Each task is automatically checked by a script and then scored by an LLM judge (Claude Opus).
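That two‑stage scoring could look roughly like the sketch below: a deterministic script check first, then an LLM‑judge pass. The checker logic, prompt, and judge model name are assumptions; only the general pattern (scripted check plus a Claude Opus judge) comes from the article:
# Sketch of two-stage scoring: scripted check, then LLM judge.
# check_output() and the prompt are hypothetical; "claude-opus-4-1" is an assumed model name.
import anthropic

def check_output(transcript: str) -> bool:
    # Hypothetical deterministic check, e.g. "did the agent actually create the event?"
    return "event created" in transcript.lower()

def judge(transcript: str) -> bool:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    msg = client.messages.create(
        model="claude-opus-4-1",
        max_tokens=16,
        messages=[{"role": "user", "content": (
            "You are grading an AI agent's task transcript. "
            "Answer PASS or FAIL only.\n\n" + transcript)}],
    )
    return "PASS" in msg.content[0].text

def score(transcript: str) -> bool:
    # A task passes only if both the script check and the LLM judge agree.
    return check_output(transcript) and judge(transcript)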
Quick Start
Requirements: Python 3.10+, the uv package manager, and a running OpenClaw instance.
# Clone the repository
git clone https://github.com/pinchbench/skill.git
cd skill
# Run the benchmark with any model
./scripts/run.sh --model anthropic/claude-sonnet-4
# Run specific tasks
./scripts/run.sh --model openai/gpt-4o --suite task_01_calendar,task_02_stock
# Register results to the public leaderboard (provide your token)
./scripts/run.sh --register
Latest Rankings (as of article date)
Success Rate leaders: MiniMax‑m2.1 and kimi‑k2.5.
Speed leader: MiniMax‑m2.5.
Best cost‑effectiveness: gpt‑5‑nano (globally) and MiniMax‑m2.1 (among domestic Chinese models).
Full results and the live leaderboard are available at https://pinchbench.com/. The source code and benchmark suite are hosted at https://github.com/pinchbench/skill.
