Building the First Real‑World CLI Workflow Benchmark from 80K Human Terminal Recordings
TerminalWorld draws on more than 80,000 developer‑recorded terminal sessions to automatically generate 1,530 verified CLI tasks across 18 workflow categories. Its evaluation of leading LLMs and agent frameworks shows modest success rates, uneven capabilities, and the limits of expert‑crafted benchmarks.
Background
AI agents are increasingly able to write code and fix bugs, but real software development also involves environment setup, dependency management, deployment, container orchestration, cloud resource handling and security policies, most of which happen in the terminal.
Motivation
Existing terminal‑agent benchmarks rely on expert‑crafted questions, which are often artificial and become outdated as tools evolve. The authors propose using the abundant human terminal recordings from asciinema as a natural source of evaluation tasks.
Data collection
From the public asciinema platform, the team collected 80,870 recordings that developers had shared voluntarily. After privacy filtering, CLI‑only selection, reproducibility and duration checks, and quality scoring by a large language model, 9,492 high‑quality recordings remained.
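To put the filtering in proportion, the short sketch below computes how much of the data survives each stage. It uses only the counts reported in this article (80,870 recordings, 9,492 after filtering, and the 5,035 and 1,530 figures given for the later pipeline stages); the stage labels and the function name are illustrative choices, not terms from the paper.

```python
# Stage counts as reported in this article; the labels are paraphrases.
FUNNEL = [
    ("recordings collected", 80_870),
    ("high-quality recordings after filtering", 9_492),
    ("tasks with a reproducible environment", 5_035),
    ("automatically verified tasks", 1_530),
]


def retention(funnel):
    """Return (stage, count, share_of_previous_stage, share_of_first_stage) rows."""
    first = funnel[0][1]
    prev = first
    rows = []
    for name, count in funnel:
        rows.append((name, count, count / prev, count / first))
        prev = count
    return rows


for name, count, step, overall in retention(FUNNEL):
    print(f"{name:42s} {count:>7,d}  step {step:6.1%}  overall {overall:6.1%}")
```

Read this way, roughly one recording in nine passes the initial filters, and just under 2 % of the collected recordings end up as verified tasks.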
Benchmark construction pipeline
The pipeline consists of four stages:
Step 1 – Collection & filtering: privacy sanitisation, removal of GUI‑based sessions, reproducibility checks, and model‑based quality scoring.
Step 2 – Task synthesis: from the transcript of each recording, an LLM‑based agent extracts a concise task description (the goal only) and a clean reference solution.
Step 3 – Environment recreation: another agent infers the required dependencies, builds a Docker image, runs the reference solution, and iteratively fixes failures until the script executes successfully. This yields reproducible environments for 5,035 tasks.
Step 4 – Test generation: file‑system snapshots taken before and after execution are used to create tests. Three validation checks (AllPassing, Nop, and Partial) are then applied, and only tasks that pass all three are kept, leaving 1,530 automatically verified tasks.
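The snapshot‑based test generation in Step 4 can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, hashing file contents is one possible way to compare snapshots, and the article does not define the three validation checks. The reading used here (the reference end state must pass, an untouched environment must fail, and a partially completed state must fail) is an assumption based on their names.

```python
import hashlib
import os
import tempfile


def snapshot(root):
    """Map each file under root (by relative path) to the SHA-256 of its content."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as fh:
                state[os.path.relpath(path, root)] = hashlib.sha256(fh.read()).hexdigest()
    return state


def diff(before, after):
    """Classify paths as created, deleted, or modified between two snapshots."""
    return {
        "created": sorted(set(after) - set(before)),
        "deleted": sorted(set(before) - set(after)),
        "modified": sorted(p for p in before.keys() & after.keys() if before[p] != after[p]),
    }


def make_test(before, after):
    """Build a check that a candidate snapshot reproduces the reference's changes."""
    delta = diff(before, after)
    expected = {p: after[p] for p in delta["created"] + delta["modified"]}

    def test(candidate):
        content_ok = all(candidate.get(p) == h for p, h in expected.items())
        deleted_ok = all(p not in candidate for p in delta["deleted"])
        return content_ok and deleted_ok

    return test


with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "config.ini"), "w") as fh:
        fh.write("debug=false\n")
    before = snapshot(root)
    # Stand-in for running the reference solution inside the environment.
    with open(os.path.join(root, "config.ini"), "w") as fh:
        fh.write("debug=true\n")
    with open(os.path.join(root, "out.log"), "w") as fh:
        fh.write("done\n")
    after = snapshot(root)

check = make_test(before, after)
# Hypothetical partial state: the new file exists, but config.ini was never edited.
partial = {**before, "out.log": after["out.log"]}
results = {
    "reference": check(after),    # assumed AllPassing-style check: should be True
    "untouched": check(before),   # assumed Nop-style check: should be False
    "partial": check(partial),    # assumed Partial-style check: should be False
}
print(results)
```

Under this reading, a task whose tests also pass on the untouched or partial state would be discarded, because its tests cannot tell a complete solution from an incomplete one.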
Benchmark characteristics
The final benchmark covers 18 real‑world workflow categories (system management, container orchestration, cloud infrastructure, CI/CD, security, etc.) and 1,280 distinct command‑line tools, 91 % of which are absent from the older Terminal‑Bench. Task complexity ranges from a few commands to workflows exceeding 50 steps.
A curated “Verified” subset of 200 high‑quality tasks is provided for rigorous evaluation of cutting‑edge models.
Evaluation results
The authors evaluated eight state‑of‑the‑art large language models and six mainstream agent frameworks on the Verified subset. Key findings:
Finding 1 – Modest overall performance: success rates span 49 %–62.5 % (average 54.8 %). Even the best model, Claude Opus 4.7, fails on more than one‑third of tasks. Open‑source models such as Kimi K2.6 and GLM 5.1 achieve comparable or better performance at a fraction of the cost.
Finding 2 – More compute does not help: higher token usage and more reasoning steps correlate negatively with success. Failed attempts consume 3.3× the tokens and 1.4× the time of successful ones; although they make up only 43 % of attempts, they account for 63 % of total cost.
Finding 3 – Capability gaps: models excel at environment configuration (87.5 % success) and build/testing (78.1 %) but perform poorly on performance optimisation (28.1 %), script automation (39.1 %) and debugging (39.3 %). No model is strong across all categories.
Finding 4 – Expert benchmarks overestimate ability: the correlation between scores on Terminal‑Bench and TerminalWorld is only 0.20. Models that score 57 %–82.7 % on Terminal‑Bench drop to 49 %–62.5 % on TerminalWorld, and their rankings are reshuffled.
Finding 5 – Divergent solution paths: the median overlap between agent‑generated command sequences and human‑recorded ones is 21.4 %. Agents often use entirely different commands to reach the same goal, yet still pass the verification tests.
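The article does not say how command overlap is measured. As one plausible illustration, the sketch below computes a Jaccard overlap on the set of program names used in two sessions. The metric, the function names, and the two example sessions are all assumptions made for illustration; they are not taken from the paper or its data.

```python
import shlex


def command_names(lines):
    """Return the program name (first token) of each shell line, skipping unparsable ones."""
    names = []
    for line in lines:
        try:
            tokens = shlex.split(line)
        except ValueError:
            continue  # e.g. an unbalanced quote in a recorded line
        if tokens:
            names.append(tokens[0])
    return names


def jaccard(a, b):
    """Size of the intersection over size of the union of two collections."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


# Hypothetical sessions that both end with a web server answering on localhost.
human = ["apt-get install -y nginx", "systemctl start nginx", "curl -I localhost"]
agent = ["docker run -d -p 80:80 nginx", "curl -I localhost"]

overlap = jaccard(command_names(human), command_names(agent))
print(f"overlap = {overlap:.2f}")
```

In this made‑up pair the two sessions share only `curl` out of four distinct programs, which shows how two sessions can reach the same end state while overlapping very little. This is why outcome‑based tests, rather than comparison against the recorded commands, are needed to judge success.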
Conclusion
TerminalWorld demonstrates that real‑world terminal recordings provide a rich, continuously growing source of evaluation tasks. The benchmark reveals substantial gaps in current AI agents’ ability to handle authentic software‑engineering workflows, highlighting the need for more realistic, up‑to‑date testing grounds.
All code, data and the benchmark are open‑source, encouraging community contributions to keep the benchmark alive.
Reference: Zhaoyang Chu et al., “TerminalWorld: Benchmarking Agents on Real‑World Terminal Tasks”, arXiv:2605.22535 (2026).
