WildClawBench: 60 Real-World Agent Tasks Reveal How Far AI “Lobsters” Have Come
WildClawBench, a 60‑task, Docker‑based benchmark from Shanghai AI Lab’s InternLM team, evaluates AI agents across six multimodal categories. It exposes a low ceiling even for top models such as Claude Opus 4.6, and highlights both cost‑performance trade‑offs and the rapid rise of Chinese models such as GLM 5.
WildClawBench is a comprehensive, open‑source benchmark released by the InternLM team at Shanghai AI Lab to assess the end‑to‑end capabilities of AI agents. It consists of 60 handcrafted tasks that run in isolated Docker containers, providing agents with a realistic OpenClaw environment equipped with a browser, terminal, file system, and calendar.
The evaluation framework injects ground‑truth data and scoring scripts only after an agent finishes execution, ensuring that the agent cannot cheat and that no data leakage occurs.
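To make that flow concrete, here is a minimal orchestration sketch in Python, assuming a hypothetical layout (agent_entry.py, graders/, ground_truth/, a JSON‑emitting grader) rather than WildClawBench’s actual code: the agent runs in a container that holds no answers, and the grader and ground truth are copied in only after the agent’s process has exited.

    import json
    import subprocess

    def run_task(task_id: str, agent_image: str) -> dict:
        container = f"wcb-{task_id}"

        # Keep the container alive independently of the agent process so
        # that files can still be injected after the agent exits.
        subprocess.run(["docker", "run", "-d", "--name", container,
                        agent_image, "sleep", "infinity"], check=True)

        # 1. Run the agent. At this stage no ground truth exists anywhere
        #    in the container's filesystem.
        subprocess.run(["docker", "exec", container, "python",
                        "/workspace/agent_entry.py", "--task", task_id],
                       check=True)

        # 2. Only after the agent has finished, copy the hidden ground
        #    truth and the task's scoring script into the container.
        subprocess.run(["docker", "cp", f"ground_truth/{task_id}",
                        f"{container}:/gt"], check=True)
        subprocess.run(["docker", "cp", f"graders/{task_id}.py",
                        f"{container}:/grade.py"], check=True)

        # 3. Score the agent's output against the injected ground truth;
        #    the grader prints a JSON result on stdout.
        result = subprocess.run(
            ["docker", "exec", container, "python", "/grade.py",
             "/workspace/output/answer.json", "/gt/answer.json"],
            capture_output=True, text=True, check=True)

        # Tear down the container so every task starts from a clean state.
        subprocess.run(["docker", "rm", "-f", container], check=True)
        return json.loads(result.stdout)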
Six task categories (60 tasks in total):
Productivity (10): e.g., crawling all arXiv papers in cs.CV for a given day, classifying them into six directions, extracting figures/tables, and generating personalized recommendations.
Code Intelligence (12): agents must read undocumented code repositories, install dependencies, and write runnable inference scripts such as a SAM3 inference pipeline.
Social Interaction (6): multi‑turn email negotiation to schedule meetings and extraction of to‑do items from chat logs.
Search Retrieval (11): cross‑checking contradictory online information, tracing original sources, and producing evidence‑based conclusions.
Creative Synthesis (11): end‑to‑end production of a polished PDF report from a product launch video, requiring accurate extraction of specs and aesthetic layout scoring.
Safety Alignment (10): detecting hidden malicious commands in seemingly benign documents and auditing large code histories for leaked API keys.
Each task is executed in its own Docker container, and scores are calculated by automated scripts that compare the agent’s output with the hidden ground truth.
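A per‑task grader in that spirit might look like the sketch below; the answer.json layout and the per‑field partial‑credit rule are illustrative assumptions, not the benchmark’s actual scoring logic.

    import json
    import sys

    def score(output_path: str, truth_path: str) -> float:
        # Load the agent's answer and the hidden ground truth, both
        # injected into the container only after the agent finished.
        with open(output_path) as f:
            output = json.load(f)
        with open(truth_path) as f:
            truth = json.load(f)

        # Award partial credit: one point per ground-truth field the
        # agent reproduced exactly, normalized to [0, 1].
        matched = sum(1 for key, expected in truth.items()
                      if output.get(key) == expected)
        return matched / len(truth) if truth else 0.0

    if __name__ == "__main__":
        # Usage: python grade.py /workspace/output/answer.json /gt/answer.json
        print(json.dumps({"score": score(sys.argv[1], sys.argv[2])}))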
As of 1 April 2026, WildClawBench had evaluated 14 frontier models. Claude Opus 4.6 achieved the highest score, 51.6%, indicating a low ceiling for current agents. Its average cost per run exceeds $80, while GPT‑5.4 costs about $20 per run and trails by only 1.3 percentage points, a stark cost‑performance gap. Chinese models performed strongly: GLM 5 took third place with 42.6% at $11.39 per run, and Xiaomi’s MiMo V2 Pro followed at 40.2%, surpassing Google DeepMind’s Gemini 3.1 Pro.
The benchmark also supports an “OpenClaw personal leaderboard,” allowing users to submit their customized agent workspaces (SOUL.md, MEMORY.md, custom skills) for evaluation on the same 60 tasks, helping the community understand which harnesses, skill combinations, persona settings, and memory strategies truly improve task success.
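As a practical illustration, a pre‑submission check along the following lines could catch a malformed workspace before upload; SOUL.md and MEMORY.md are named in the announcement, while the skills/ directory convention and this validator itself are assumptions.

    from pathlib import Path

    REQUIRED_FILES = ["SOUL.md", "MEMORY.md"]

    def validate_workspace(root: str) -> list[str]:
        # Collect human-readable problems instead of failing on the first.
        workspace = Path(root)
        problems = [f"missing {name}" for name in REQUIRED_FILES
                    if not (workspace / name).is_file()]

        # Custom skills are optional, but an empty skills/ directory is
        # probably a packaging mistake worth flagging.
        skills = workspace / "skills"
        if skills.is_dir() and not any(skills.iterdir()):
            problems.append("skills/ exists but is empty")
        return problems

    if __name__ == "__main__":
        issues = validate_workspace("my_openclaw_workspace")
        print("ready to submit" if not issues else "\n".join(issues))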
WildClawBench is released under the MIT license. All task definitions, scoring code, Docker images, and the dataset are publicly available on GitHub (github.com/InternLM/WildClawBench) and HuggingFace, and the project welcomes community contributions of new tasks following the provided template.