Skill Graphs Reveal Why Training Diversity Beats Quantity for Terminal Agents

The paper shows that controlling the diversity of scene‑skill combinations via a large‑scale Skill Graph, rather than simply increasing the number of training tasks, dramatically improves terminal‑agent performance: Qwen3‑32B surpasses a 480B model on the Terminal‑Bench 2.0 benchmark.

PaperAgent

Why Quantity Misleads Agent Training

Training a command‑line terminal agent intuitively seems to benefit from more practice tasks, but the Tencent Hunyuan team discovered that the decisive factor is not the sheer number of tasks but the variety of scene‑skill combinations the AI experiences.

The Problem with Existing Synthetic Data

Current methods either let large language models generate taxonomies that often diverge from real usage, or reverse‑engineer tasks from GitHub repositories, limiting them to software‑engineering scenarios. Both approaches focus on generating many tasks without ensuring that agents encounter diverse scene × skill pairs.

Core Idea: SkillSynth Skill Graph

SkillSynth abstracts an agent’s operation into a directed graph where each node is a scene (e.g., “video file downloaded but not compressed”) and each edge is a skill (e.g., “compress video with ffmpeg”). The graph contains 82,073 scene nodes, 57,214 skill edges, and 185,529 LLM‑validated bridge relations. Over 85.6% of nodes belong to the largest connected component, meaning most skills can be chained into complete workflows.
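The scene/skill abstraction can be sketched as a small directed graph in which walking edges chains skills into a workflow. The scene and skill names below are invented for illustration, not taken from the paper:

```python
from collections import defaultdict

# Directed skill graph: scenes are nodes, skills are labeled edges
# (pre-scene -> post-scene). All names here are illustrative examples.
edges = [
    ("video_downloaded", "video_compressed", "compress video with ffmpeg"),
    ("video_compressed", "video_uploaded", "upload via scp"),
    ("log_collected", "log_parsed", "parse logs with awk"),
]

graph = defaultdict(list)
for pre, post, skill in edges:
    graph[pre].append((post, skill))

def chain_skills(start, steps):
    """Walk the graph greedily from a start scene, chaining skills
    into a workflow until the step budget or a dead end is reached."""
    workflow, scene = [], start
    for _ in range(steps):
        if not graph[scene]:
            break
        scene, skill = graph[scene][0]
        workflow.append(skill)
    return workflow

print(chain_skills("video_downloaded", 2))
# -> ['compress video with ffmpeg', 'upload via scp']
```

Connectivity matters precisely because of this chaining: a node outside the largest connected component can only yield short, isolated workflows.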

Graph Construction Process

The graph is built in five steps: filter skills from ClawHub and GitHub, let an LLM infer pre‑ and post‑scenes for each skill, deduplicate via clustering, align skills by matching a skill’s post‑scene with another’s pre‑scene, and finally merge and filter the results. Sampling uses inverse‑frequency weighting so rarely visited nodes and edges are prioritized, ensuring uniform coverage of the scene‑skill space.
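The inverse‑frequency weighting described above can be sketched as follows; the visit counts and node names are made up for illustration, and the exact weighting function is an assumption:

```python
import random

def sample_node(visit_counts, rng=random.Random(0)):
    """Pick a node with probability inversely proportional to how
    often it has been visited, so rarely covered nodes are prioritized."""
    nodes = list(visit_counts)
    weights = [1.0 / (1 + visit_counts[n]) for n in nodes]
    return rng.choices(nodes, weights=weights, k=1)[0]

# Illustrative counts: 'rare_scene' has never been visited,
# so it carries by far the largest sampling weight.
counts = {"common_scene": 50, "mid_scene": 5, "rare_scene": 0}
picks = [sample_node(counts, random.Random(i)) for i in range(1000)]
print(picks.count("rare_scene") > picks.count("common_scene"))
```

Over many draws this pushes coverage toward uniformity across the scene‑skill space rather than concentrating on already popular nodes.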

Automatic Task Generation Pipeline

Planner: converts sampled paths into structured sub‑goals and expected outputs.

Constructor: creates full task instances, including commands, filesystem snapshots, container environments, verification scripts, and reference solutions.

Dual Verification: runs a reference solution to confirm solvability and uses an LLM to score alignment between instructions and tests.

If verification fails, the task enters a repair loop of up to 3 rounds, each allowing at most 20 tool calls.
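Put together, the generate‑verify‑repair flow roughly follows the control structure below. The construct, oracle, alignment, and repair functions are placeholder stand‑ins, not the paper's actual implementation:

```python
MAX_REPAIR_ROUNDS = 3   # from the paper: at most 3 repair rounds
MAX_TOOL_CALLS = 20     # each round allows at most 20 tool calls

def build_task(path, construct, run_oracle, score_alignment, repair):
    """Construct a task from a sampled graph path, then dual-verify it:
    the oracle run checks solvability, the LLM score checks that the
    instructions match the tests. Failing tasks get bounded repairs."""
    task = construct(path)
    for _ in range(1 + MAX_REPAIR_ROUNDS):  # initial try + repairs
        if run_oracle(task) and score_alignment(task):
            return task                      # verified task instance
        task = repair(task, budget=MAX_TOOL_CALLS)
    return None                              # discarded after 3 repairs

# Tiny stand-ins to show the control flow: the task "passes" only
# after one repair round bumps its quality above the threshold.
def make(path):          # stand-in Constructor
    return {"path": path, "quality": 0}
def oracle(task):        # stand-in reference-solution run
    return task["quality"] >= 1
def align(task):         # stand-in LLM alignment check
    return True
def fix(task, budget):   # stand-in repair round
    return {**task, "quality": task["quality"] + 1}

print(build_task(["download", "compress"], make, oracle, align, fix))
```

Tasks that still fail after the final repair round return None and are dropped, which is what keeps the oracle pass rate of the surviving set high.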

Generation Results

From 3,721 sampled paths the system produced 3,560 verified task instances, achieving a 95.7% oracle pass rate at an average cost of $27.3 per task. The tasks are non‑trivial: Claude Opus 4.6 requires on average 37 steps to solve, and 121 tasks remain unsolved after three attempts.

Experimental Comparison: Diversity > Quantity

Key benchmark scores on Terminal‑Bench 1.0 and 2.0 (higher is better) illustrate the impact:

Qwen3‑8B + single skill: 8.7% (TB 1.0), 5.3% (TB 2.0)

Qwen3‑8B + random multi‑skill: 13.4% (TB 1.0), 11.6% (TB 2.0)

Qwen3‑8B + SkillSynth: 17.1% (TB 1.0), 13.5% (TB 2.0)

Qwen3‑32B + single skill: 25.4% (TB 1.0), 21.3% (TB 2.0)

Qwen3‑32B + random multi‑skill: 30.8% (TB 1.0), 25.8% (TB 2.0)

Qwen3‑32B + SkillSynth: 33.8% (TB 1.0), 29.6% (TB 2.0)

Qwen3‑Coder 480B (no SkillSynth): – (TB 1.0), 23.9% (TB 2.0)

SkillSynth improves over the single‑skill baseline by 8.4 points (TB 1.0) and over the random multi‑skill baseline by 3.0 points. Its trajectories cover 31% more unique scene‑skill pairs than single‑skill and 19% more than random multi‑skill.
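The diversity metric behind these comparisons — unique (scene, skill) pairs covered across training trajectories — reduces to a simple set size. The trajectories below are invented examples:

```python
def coverage(trajectories):
    """Count unique (scene, skill) pairs across all trajectories."""
    return len({(scene, skill)
                for traj in trajectories
                for scene, skill in traj})

# Illustrative trajectories: each is a list of (scene, skill) steps.
single_skill = [[("s1", "compress")], [("s1", "compress")]]
graph_guided = [[("s1", "compress"), ("s2", "upload")],
                [("s3", "parse"), ("s4", "plot")]]
print(coverage(single_skill), coverage(graph_guided))  # 1 4
```

The same number of trajectories can thus yield very different coverage, which is the paper's core argument for diversity over quantity.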

Ablation Insight

Randomly stitching skills without graph guidance harms workflow coherence: generated tasks contain many fragmented requirements but involve very few actual execution steps.

Practical Impact

The SkillSynth‑generated tasks have already been used to train the Hy3 Preview model, directly boosting agent capability in terminal scenarios. The graph continues to grow as the ClawHub community contributes new skills, now spanning coding, document processing, DevOps, security, audio‑speech, 3D simulation, IoT hardware, and other long‑tail domains.

Takeaway for AI Practitioners

The decisive factor for training effective terminal agents is not model size or task quantity but the diversity of training trajectories; controlling coverage of the scene‑skill space with a skill graph yields substantially higher performance.

Paper title: Toward Scalable Terminal Task Synthesis via Skill Graphs
Paper link: https://arxiv.org/abs/2604.25727v1
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: LLM, Qwen3, Skill Graphs, SkillSynth, Terminal Agents, Training Diversity
Written by PaperAgent

Daily updates, analyzing cutting-edge AI research papers