Skill Graphs Reveal Why Training Diversity Beats Quantity for Terminal Agents
The paper shows that, instead of increasing the number of training tasks, controlling the diversity of scene‑skill combinations via a large‑scale Skill Graph dramatically improves terminal‑agent performance, with Qwen3‑32B surpassing a 480B model on the Terminal‑Bench 2.0 benchmark.
Why Quantity Misleads Agent Training
Training a command‑line terminal agent intuitively seems to benefit from more practice tasks, but the Tencent Hunyuan team discovered that the decisive factor is not the sheer number of tasks but the variety of scene‑skill combinations the AI experiences.
The Problem with Existing Synthetic Data
Current methods either let large language models generate taxonomies that often diverge from real usage, or reverse‑engineer tasks from GitHub repositories, limiting them to software‑engineering scenarios. Both approaches focus on generating many tasks without ensuring that agents encounter diverse scene × skill pairs.
Core Idea: SkillSynth Skill Graph
SkillSynth abstracts an agent’s operation into a directed graph where each node is a scene (e.g., “video file downloaded but not compressed”) and each edge is a skill (e.g., “compress video with ffmpeg”). The graph contains 82,073 scene nodes, 57,214 skill edges, and 185,529 LLM‑validated bridge relations. 85.6% of nodes belong to the largest connected component, meaning most skills can be chained into complete workflows.
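The scene-as-node, skill-as-edge structure can be sketched roughly as follows. This is an illustrative data model under assumptions, not the paper's implementation; the names `Scene`, `Skill`, and `SkillGraph` are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Skill:
    name: str        # e.g. "compress video with ffmpeg"
    pre_scene: str   # state required before the skill runs
    post_scene: str  # state produced after the skill runs

@dataclass
class SkillGraph:
    # adjacency list: scene -> list of (skill name, resulting scene)
    edges: dict = field(default_factory=dict)

    def add_skill(self, skill: Skill) -> None:
        # a skill becomes a directed edge from its pre-scene to its post-scene
        self.edges.setdefault(skill.pre_scene, []).append(
            (skill.name, skill.post_scene)
        )

    def chain(self, start: str, length: int) -> list:
        """Walk the graph from a start scene to build a multi-skill workflow."""
        path, scene = [], start
        for _ in range(length):
            options = self.edges.get(scene)
            if not options:
                break
            skill, scene = options[0]  # deterministic pick for this sketch
            path.append(skill)
        return path

g = SkillGraph()
g.add_skill(Skill("download video", "no video", "video downloaded"))
g.add_skill(Skill("compress video with ffmpeg",
                  "video downloaded", "video compressed"))
print(g.chain("no video", 2))
# → ['download video', 'compress video with ffmpeg']
```

Because most nodes sit in one connected component, walks like `chain` can usually continue across many skills, which is what makes complete workflows samplable.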
Graph Construction Process
The graph is built in five steps: filter skills from ClawHub and GitHub, let an LLM infer pre‑ and post‑scenes for each skill, deduplicate via clustering, align skills by matching a skill’s post‑scene with another’s pre‑scene, and finally merge and filter the results. Sampling uses inverse‑frequency weighting so rarely visited nodes and edges are prioritized, ensuring uniform coverage of the scene‑skill space.
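The inverse-frequency sampling step can be sketched as below. The paper only states that rarely visited nodes and edges are prioritized; the exact weighting here (1 / (1 + visits)) is an illustrative assumption.

```python
import random

def sample_edge(edges, visit_counts, rng=random):
    # weight each edge by 1 / (1 + visits): unseen edges dominate the draw
    weights = [1.0 / (1 + visit_counts.get(e, 0)) for e in edges]
    chosen = rng.choices(edges, weights=weights, k=1)[0]
    visit_counts[chosen] = visit_counts.get(chosen, 0) + 1  # record the visit
    return chosen

counts = {"compress": 50, "upload": 0, "transcode": 0}
picks = [sample_edge(list(counts), dict(counts)) for _ in range(100)]
# the heavily visited "compress" edge is drawn far less often than the
# unseen "upload" and "transcode" edges
```

Repeating this draw while updating the counts pushes coverage toward uniformity over the scene-skill space, which is the stated goal of the weighting.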
Automatic Task Generation Pipeline
Planner: converts sampled paths into structured sub‑goals and expected outputs.
Constructor: creates full task instances, including commands, filesystem snapshots, container environments, verification scripts, and reference solutions.
Dual Verification: runs a reference solution to confirm solvability and uses an LLM to score alignment between instructions and tests.
If verification fails, the task enters a repair loop of up to 3 rounds, each allowing at most 20 tool calls.
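The dual-verification and bounded repair loop described above can be sketched as follows. `run_reference`, `llm_alignment_score`, and `repair_step` are hypothetical placeholders for the oracle run, the LLM judge, and one tool-using repair action; only the budget constants come from the article.

```python
MAX_REPAIR_ROUNDS = 3   # stated repair budget
MAX_TOOL_CALLS = 20     # stated per-round tool-call budget

def verify_task(task, run_reference, llm_alignment_score, repair_step,
                threshold=0.8):
    """Return a verified task instance, or None if it cannot be repaired."""
    for round_idx in range(MAX_REPAIR_ROUNDS + 1):
        solvable = run_reference(task)                  # oracle pass check
        aligned = llm_alignment_score(task) >= threshold  # instruction/test match
        if solvable and aligned:
            return task                                 # verified instance
        if round_idx == MAX_REPAIR_ROUNDS:
            return None                                 # discard after 3 repairs
        for _ in range(MAX_TOOL_CALLS):                 # bounded repair round
            task, done = repair_step(task)
            if done:
                break
    return None
```

The `threshold` parameter is an assumption standing in for whatever alignment cutoff the LLM judge actually applies.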
Generation Results
From 3,721 sampled paths the system produced 3,560 verified task instances, achieving a 95.7% oracle pass rate at an average cost of $27.3 per task. The tasks are non‑trivial: Claude Opus 4.6 requires on average 37 steps to solve, and 121 tasks remain unsolved after three attempts.
Experimental Comparison: Diversity > Quantity
Key benchmark scores on Terminal‑Bench 2.0 (higher is better) illustrate the impact:
Qwen3‑8B + single skill: 8.7% (TB 1.0), 5.3% (TB 2.0)
Qwen3‑8B + random multi‑skill: 13.4% (TB 1.0), 11.6% (TB 2.0)
Qwen3‑8B + SkillSynth: 17.1% (TB 1.0), 13.5% (TB 2.0)
Qwen3‑32B + single skill: 25.4% (TB 1.0), 21.3% (TB 2.0)
Qwen3‑32B + random multi‑skill: 30.8% (TB 1.0), 25.8% (TB 2.0)
Qwen3‑32B + SkillSynth: 33.8% (TB 1.0), 29.6% (TB 2.0)
Qwen 3 Coder 480B (no SkillSynth): n/a (TB 1.0), 23.9% (TB 2.0)
SkillSynth improves over the single‑skill baseline by 8.4 points (TB 1.0) and over the random multi‑skill baseline by 3.0 points. Its trajectories cover 31% more unique scene‑skill pairs than single‑skill and 19% more than random multi‑skill.
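The diversity metric implied here (unique scene-skill pairs covered by a set of trajectories) is simple to compute. The trajectory format below is an assumption for illustration, not the paper's representation.

```python
def pair_coverage(trajectories):
    """Count distinct (scene, skill) pairs seen across all trajectories."""
    seen = set()
    for traj in trajectories:
        for scene, skill in traj:
            seen.add((scene, skill))
    return len(seen)

# repeating one pair adds no coverage; varied paths do
single = [[("s1", "compress"), ("s1", "compress"), ("s1", "compress")]]
diverse = [[("s1", "compress"), ("s2", "upload")],
           [("s3", "scan"), ("s1", "compress")]]
print(pair_coverage(single), pair_coverage(diverse))  # → 1 3
```

Under this metric, more tasks built from the same pairs leave coverage flat, which is exactly the quantity-versus-diversity distinction the results above illustrate.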
Ablation Insight
Randomly stitching skills without graph guidance harms workflow coherence: generated tasks contain many fragmented requirements but involve very few actual execution steps.
Practical Impact
The SkillSynth‑generated tasks have already been used to train the Hy3 Preview model, directly boosting agent capability in terminal scenarios. The graph continues to grow as the ClawHub community contributes new skills, now spanning coding, document processing, DevOps, security, audio‑speech, 3D simulation, IoT hardware, and other long‑tail domains.
Takeaway for AI Practitioners
The decisive factor for training effective terminal agents is not model size or task quantity but the diversity of training trajectories; controlling coverage of the scene‑skill space with a skill graph yields substantially higher performance.
Paper title: Toward Scalable Terminal Task Synthesis via Skill Graphs
Paper link: https://arxiv.org/abs/2604.25727v1