Iterating Agent Skills with SkillRevise: Using Execution Traces for Continuous Improvement
SkillRevise tackles the overestimation of LLM‑authored agent skills by breaking down complex search tasks, attaching evidence to verifiable sources, and introducing trace‑conditioned revisions that let engineers pinpoint and fix failures across retrieval, reasoning, and presentation layers.
Paper Basic Information
Title: SkillRevise: Improving LLM‑Authored Agent Skills via Trace‑Conditioned Skill Revision
arXiv ID: 2606.01139
Focus: Agent Skill, Retrieval‑Augmented Generation (RAG), search, agent evaluation and engineering.
Problem Addressed
Agent skills are often judged only by how polished the final answer looks, ignoring where the task comes from, how evidence is verified, whether failures can be located, and whether the evaluation reflects real user needs. SkillRevise treats these as hard problems and proposes revising skills iteratively based on failure traces.
Key Contributions (Three Core Points)
Evaluation targets multi‑step information tasks that require search, reading, comparison, and synthesis rather than single‑turn QA.
Evaluation metrics must go beyond a single overall score; they need to expose coverage, factual reliability, and reasoning organization.
Explainable evaluation is more useful for engineering teams because it reveals where failures occur, guiding targeted fixes.
Method Decomposition
Task Layer: Break Down User Requirements
Real search tasks contain multiple constraints (time, region, object scope, comparison angle, output format, evidence freshness). A practical decomposition follows three steps:
Identify the primary question type (list, explain, compare, predict, decision‑assist).
Extract hard constraints (time window, geographic range, object set, exclusion criteria).
Define checkpoints: which facts must appear and which analyses need evidence support.
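The three steps above can be captured in a small task record. This is a minimal Python sketch, not the paper's implementation: the `TaskSpec` class, its field names, and the example query are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class QuestionType(Enum):
    """Primary question types named in step one."""
    LIST = "list"
    EXPLAIN = "explain"
    COMPARE = "compare"
    PREDICT = "predict"
    DECISION_ASSIST = "decision-assist"


@dataclass
class TaskSpec:
    """Hypothetical record for one decomposed user request."""
    query: str
    question_type: QuestionType
    # Hard constraints, e.g. time window, region, object set, exclusions.
    constraints: dict[str, str] = field(default_factory=dict)
    # Facts or analyses the answer must contain to count as complete.
    checkpoints: list[str] = field(default_factory=list)

    def missing_checkpoints(self, covered: set[str]) -> list[str]:
        """Return checkpoints the answer did not cover, in original order."""
        return [c for c in self.checkpoints if c not in covered]


# Illustrative task; the query and checkpoint names are made up.
task = TaskSpec(
    query="Compare EU and US battery recycling rules adopted since 2023",
    question_type=QuestionType.COMPARE,
    constraints={"time_window": "2023-present", "region": "EU, US"},
    checkpoints=["EU rule named", "US rule named", "differences supported by sources"],
)
print(task.missing_checkpoints({"EU rule named"}))
```

Keeping checkpoints as explicit strings makes it possible to label each one separately later, which is what the evaluation layer relies on.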
Evidence Layer: Bind Answers to Verifiable Materials
SkillRevise emphasizes a "conclusion‑evidence‑source" chain rather than merely feeding documents to the model. The evidence chain is split into four categories:
Original source: web page, paper, announcement, database record.
Evidence snippet: specific text or table supporting a sub‑conclusion.
Processing action: search query, filter, ranking, deduplication logic.
Generated conclusion: the final answer presented to the user.
If any of these four links is missing, production post‑mortems become difficult.
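One way to make the four-part chain checkable is to store it as a record and flag incomplete entries. The sketch below is an assumption about how a team might do this; `EvidenceLink` and its field names are illustrative and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvidenceLink:
    """One conclusion-evidence-source record (field names are illustrative)."""
    source: Optional[str]      # original source: URL, paper ID, database record
    snippet: Optional[str]     # text or table cell supporting the sub-conclusion
    action: Optional[str]      # how it was found: query, filter, ranking, dedup
    conclusion: Optional[str]  # the statement shown to the user

    def missing_parts(self) -> list[str]:
        """Names of the chain elements that are empty or absent."""
        parts = {
            "source": self.source,
            "snippet": self.snippet,
            "action": self.action,
            "conclusion": self.conclusion,
        }
        return [name for name, value in parts.items() if not value]


# Example record with the supporting snippet left out.
link = EvidenceLink(
    source="https://example.com/report",
    snippet=None,
    action="query: 'annual report 2024'",
    conclusion="Revenue grew year over year.",
)
print(link.missing_parts())  # ['snippet']
```

A record that reports missing parts can be rejected at logging time, before it becomes an unexplainable answer in a post-mortem.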
Evaluation Layer: Avoid a Single Aggregate Score
For search agents, a single score can hide critical issues. SkillRevise recommends splitting scores into three dimensions:
Instruction compliance – coverage of user‑specified scope, constraints, and format.
Factual reliability – whether key conclusions are supported by sources.
Reasoning organization – logical consistency of comparison, attribution, and recommendation.
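A short example shows why the split matters. The sketch below assumes each checkpoint has already been scored between 0 and 1 on the three dimensions; the dictionary keys and the scores are invented for illustration.

```python
from statistics import mean

# Dimension names follow the three-way split above; the keys are illustrative.
DIMENSIONS = ("instruction_compliance", "factual_reliability", "reasoning_organization")


def dimension_report(checkpoint_scores: list[dict[str, float]]) -> dict[str, float]:
    """Average each dimension separately across a task's checkpoints."""
    return {d: mean(cp[d] for cp in checkpoint_scores) for d in DIMENSIONS}


# Two checkpoints of one hypothetical task.
scores = [
    {"instruction_compliance": 1.0, "factual_reliability": 0.0, "reasoning_organization": 1.0},
    {"instruction_compliance": 1.0, "factual_reliability": 0.5, "reasoning_organization": 1.0},
]
report = dimension_report(scores)
overall = mean(report.values())

print(overall)                        # 0.75
print(report["factual_reliability"])  # 0.25
```

The aggregate of 0.75 looks acceptable, while the per-dimension view shows that factual reliability is the actual problem.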
Original Evaluation Comparison
The source material supports four verifiable facts about the paper:
It targets agent skills authored by LLMs.
It proposes trace‑conditioned revision.
It uses execution traces to revise skills.
It focuses on sustainable skill improvement.
Key reading guidelines for engineers:
Check whether the evaluation tasks mirror real user queries; overly synthetic tasks transfer poorly to production.
Assess whether rubrics explain *why* a skill fails; coarse rubrics only indicate "bad" while fine‑grained, cascaded rubrics reveal the failure reason.
Use trace analysis to attribute failures to specific system modules (retrieval, evidence validation, synthesis).
Implications for RAG, Search, and Agent Products
Search agents should prioritize answering completely, providing evidence, and stating limitations over merely sounding expert.
Evaluation sets must be continuously refreshed with hot topics, comments, and real user demands to avoid benchmark contamination.
Rubrics must guide engineering fixes:
Instruction‑compliance issues → improve query understanding, constraint extraction, task planning.
Factual errors → refine source selection, cross‑validation, citation checks.
Reasoning flaws → enhance evidence organization, conflict handling, answer structuring.
User‑preference gaps → adjust output prioritization, summary hierarchy, uncertainty expression.
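The mapping above can be kept as a plain lookup table so that every labeled failure arrives with a list of areas to investigate. This is a sketch under assumptions: the label strings and the `fixes_for` helper are hypothetical, and unknown labels are sent to manual triage as an illustrative design choice.

```python
# Hypothetical routing table: rubric failure label -> areas to investigate.
REMEDIATION = {
    "instruction_compliance": ["query understanding", "constraint extraction", "task planning"],
    "factual_error": ["source selection", "cross-validation", "citation checks"],
    "reasoning_flaw": ["evidence organization", "conflict handling", "answer structuring"],
    "user_preference_gap": ["output prioritization", "summary hierarchy", "uncertainty expression"],
}


def fixes_for(failure_labels: list[str]) -> list[str]:
    """Collect remediation areas for a sample's failure labels, without duplicates."""
    areas: list[str] = []
    for label in failure_labels:
        # Unknown labels are routed to manual triage rather than silently dropped.
        for area in REMEDIATION.get(label, ["manual triage"]):
            if area not in areas:
                areas.append(area)
    return areas


print(fixes_for(["factual_error", "unlabeled"]))
# ['source selection', 'cross-validation', 'citation checks', 'manual triage']
```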
Engineering Checklist
Sample 50–100 multi‑constraint tasks from real user questions.
Decompose each task into 3–8 sub‑checkpoints.
Label each checkpoint for instruction compliance, factual reliability, and reasoning organization.
Record the model's search trace, cited sources, and final answer.
Conduct weekly failure‑type retrospectives instead of only tracking overall accuracy.
For high‑risk conclusions, output an "insufficient to judge" flag rather than forcing a verdict.
Assign remediation ownership to modules: retrieval, re‑ranking, tool use, generation, or evaluation.
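The weekly retrospective can be as simple as counting failures by type and by owning module alongside the overall accuracy. The sketch below assumes each evaluated sample is stored as a dictionary with `passed`, `failure_type`, and `owner` keys; those key names and the sample data are hypothetical.

```python
from collections import Counter


def retrospective(records: list[dict]) -> dict:
    """Summarize one review period: accuracy plus failure counts by type and owner."""
    total = len(records)
    failures = [r for r in records if not r["passed"]]
    return {
        "accuracy": (total - len(failures)) / total if total else 0.0,
        "by_type": Counter(r["failure_type"] for r in failures),
        "by_owner": Counter(r["owner"] for r in failures),
    }


# Illustrative week of four evaluated samples.
week = [
    {"passed": True, "failure_type": None, "owner": None},
    {"passed": False, "failure_type": "wrong citation", "owner": "generation"},
    {"passed": False, "failure_type": "missed answer", "owner": "retrieval"},
    {"passed": False, "failure_type": "missed answer", "owner": "retrieval"},
]
summary = retrospective(week)
print(summary["accuracy"])                # 0.25
print(summary["by_type"].most_common(1))  # [('missed answer', 2)]
```

Accuracy alone says the week went badly. The counts show that most of the failures belong to retrieval.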
Limitations and Reading Scope
Even a detailed offline evaluation cannot fully replace real user feedback, and LLM‑as‑judge stability requires ongoing verification. The framework works best as part of an offline benchmark complemented by online click data, follow‑up questions, complaints, manual QA, and case reviews.
Core Takeaways
SkillRevise reminds us that the competitive edge of search agents lies not only in model capability but also in task construction, evidence chaining, cascaded evaluation, and failure localization. Teams that can decompose user needs into verifiable subtasks can more readily expose true shortcomings and iterate sustainably.
Turning Evaluation Results into Product Iteration
Archive failure samples: store original query, retrieval terms, adopted sources, and final answer; annotate failure type (missed answer, wrong citation, over‑inference, format mismatch, insufficient evidence).
Split module responsibility: assign query‑understanding failures to task parsing, recall gaps to retrieval and data sources, factual conflicts to evidence verification, and disorganized answers to answer composition.
Conduct small‑step regression tests: after each prompt, retriever, or tool change, re‑run the fixed failure set and monitor both average scores and the emergence or disappearance of specific error types.
Viewing the paper through this lens turns a benchmark into a diagnostic dashboard for the team.
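The regression step described above can be sketched as a comparison of failure labels between two runs of the same fixed failure set. This is an illustrative helper, not part of the paper: `error_type_diff`, the sample ids, and the input shape (sample id mapped to a list of failure labels) are assumptions.

```python
from collections import Counter


def error_type_diff(before: dict[str, list[str]], after: dict[str, list[str]]) -> dict[str, int]:
    """Change in count per failure type between two runs (positive = more failures)."""
    before_counts = Counter(label for labels in before.values() for label in labels)
    after_counts = Counter(label for labels in after.values() for label in labels)
    types = sorted(set(before_counts) | set(after_counts))
    # Only report types whose count actually changed.
    return {t: after_counts[t] - before_counts[t] for t in types if after_counts[t] != before_counts[t]}


# Same two samples evaluated before and after a hypothetical prompt change.
run_before = {"q1": ["wrong citation"], "q2": ["missed answer"]}
run_after = {"q1": [], "q2": ["missed answer", "over-inference"]}

print(error_type_diff(run_before, run_after))
# {'over-inference': 1, 'wrong citation': -1}
```

In this example the total number of failure labels is unchanged, so an average score could stay flat even though one error type disappeared and a new one emerged.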
Cost Considerations for Online Use
Fine‑grained evaluation incurs token, latency, and storage costs because each sub‑task requires a judge, each evidence piece needs verification, and every answer must retain its trace. A tiered approach mitigates cost:
Low‑risk: basic citation and format checks.
Medium‑risk: sub‑task decomposition, factual verification, and failure classification.
High‑risk: full trace preservation, manual sampling, and regression‑sample ingestion.
This layered strategy balances explainability with operational efficiency.
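The tiers can be expressed as configuration so that the evaluation pipeline looks up which checks to run per request. The check names below are hypothetical identifiers derived from the list above, and the `cumulative` option (higher tiers also running lower-tier checks) is an assumption the source does not state.

```python
# Hypothetical check identifiers for each risk tier.
TIER_CHECKS = {
    "low": ["citation_check", "format_check"],
    "medium": ["subtask_decomposition", "factual_verification", "failure_classification"],
    "high": ["full_trace_preservation", "manual_sampling", "regression_sample_ingestion"],
}
TIER_ORDER = ("low", "medium", "high")


def checks_for(risk: str, cumulative: bool = False) -> list[str]:
    """Return the checks for a risk tier, optionally including all lower tiers."""
    if risk not in TIER_CHECKS:
        raise ValueError(f"unknown risk tier: {risk!r}")
    if not cumulative:
        return list(TIER_CHECKS[risk])
    upto = TIER_ORDER.index(risk) + 1
    return [check for tier in TIER_ORDER[:upto] for check in TIER_CHECKS[tier]]


print(checks_for("low"))  # ['citation_check', 'format_check']
print(len(checks_for("high", cumulative=True)))  # 8
```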
FAQ
Is this evaluation suitable for ordinary RAG? It fits complex queries and research‑style QA; simple FAQs can start with a lightweight rubric.
What should be implemented first? Begin with sub‑task decomposition and evidence‑chain recording; without them, scores are hard to interpret.
Can we rely entirely on automatic judges? Automation scales but key samples still need manual review.
How does this relate to search products? It helps teams distinguish "retrieval missed" vs. "evidence unverified" vs. "answer poorly organized" failures.
