How Cloud Agent Harness Grows Skills from Real Tasks: A Three‑Stage Self‑Evolution Mechanism
The article analyzes Huawei Cloud Agent Harness's three‑stage skill self‑evolution framework, detailing how agents automatically extract, evolve, and validate reusable skills from execution traces to overcome manual authoring bottlenecks and ensure continuous improvement.
Agent capability is largely determined by the quantity and quality of its Skills, yet manual authoring cannot keep pace with rapidly changing business scenarios, causing valuable low‑frequency operational knowledge to be lost in logs.
Huawei Cloud Agent technology addresses this by integrating three technical streams of Skill self‑evolution: trajectory distillation, evolution iteration, and evaluation‑driven engineering optimization, each focusing on different data sources, optimization mechanisms, and quality controls.
Trajectory distillation treats an agent's historical execution trace as the foundation for Skill growth. Representative systems such as Hermes Agent, GenericAgent, and SkillX extract successful steps, failure traps, and correction processes into structured Skill files. Hermes Agent generates a SKILL.md via a Skill_manage tool and patches invalid descriptions; GenericAgent internalizes the path as an atomic callable Skill; SkillX abstracts the trace into planning, function, and atomic layers, so that even weak agents can benefit from the distilled output of strong agents. This approach yields high data efficiency, interpretability, and low runtime cost, but its effectiveness depends on trace coverage, and it does not proactively explore unknown domains.
Evolution iteration models Skill self-improvement as a mutation-evaluation-selection loop driven by LLM-based semantic mutation. Candidates are maintained in a population, mutated to produce new individuals, evaluated on a test set, and filtered via elitism or Pareto-front strategies. Examples include Hermes Agent's GEPA, which treats prompts as the individuals being evolved; EvoSkill, which mutates Skill folders only on task failure; HyperAgents, which evolves the entire agent code together with the meta-agent; and the Harness Evolution Loop, which meta-optimizes the inner evolution process. This method offers strong exploratory ability and cross-task generalization, but it incurs high computational cost and raises safety-control concerns.
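The mutation-evaluation-selection loop with elitism can be sketched in a few lines. This is a minimal illustration, not the implementation of GEPA, EvoSkill, or any other system named above. The names evolve, toy_mutate, toy_score, and REQUIRED_STEPS are hypothetical, and the toy mutation and scoring functions stand in for an LLM-based semantic mutation and a real test set.

```python
import random
from typing import Callable, List


def evolve(
    population: List[str],
    mutate: Callable[[str, random.Random], str],
    score: Callable[[str], float],
    generations: int = 5,
    elite_size: int = 2,
    seed: int = 0,
) -> List[str]:
    """Mutation-evaluation-selection loop with elitism (illustrative only)."""
    rng = random.Random(seed)
    for _ in range(generations):
        # Mutation: each surviving candidate produces one new individual.
        children = [mutate(parent, rng) for parent in population]
        # Evaluation and selection: parents compete with their children, so the
        # best score can never decrease from one generation to the next.
        pool = list(dict.fromkeys(population + children))  # de-duplicate, keep order
        population = sorted(pool, key=score, reverse=True)[:elite_size]
    return population


# Toy stand-ins (assumptions): a real system would mutate with an LLM and
# score each candidate by running it against a test set.
REQUIRED_STEPS = ["check quota", "create volume", "attach volume", "verify mount"]


def toy_mutate(skill: str, rng: random.Random) -> str:
    return skill + "; " + rng.choice(REQUIRED_STEPS)


def toy_score(skill: str) -> float:
    steps = [s.strip() for s in skill.split(";")]
    covered = sum(step in steps for step in REQUIRED_STEPS)
    return covered - 0.01 * len(steps)  # reward coverage, penalize length


best = evolve(["check quota"], toy_mutate, toy_score)
```

Keeping parents in the selection pool is what makes this elitist; a Pareto-front strategy would replace the single score with a comparison across several metrics.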
Evaluation-driven engineering optimization treats Skills and prompts as testable, versioned software modules. Pre-defined test cases and metrics are run for each Skill change in an isolated context; differences in output or prompt space guide improvements, and the resulting version is auditable. Frameworks such as DSPy recast prompt optimization as a declarative problem that can be compiled, while Anthropic Skill Creator 2.0 provides a visual evaluation UI. This approach ensures high quality and compliance, yet its autonomy is limited by the need for human-defined evaluation criteria.
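Huawei Cloud Agent Harness combines the three streams into a three-stage mechanism built from four components: the Task Reflection Engine, Skill Storage, the Evolution Factory, and the Evaluation Pipeline. The three stages correspond to reflection, evolution, and evaluation.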
Stage 1 – Task Reflection: After a task completes, the full execution trajectory is serialized, including the user request, each tool call with its inputs and outputs, branching decisions, errors, and manual interventions. An LLM analyzes the trace, extracts the essential steps, discards failed branches, and produces a concise Skill document. If the task modifies an existing Skill, only the relevant paragraph is patched, which preserves the original experience and avoids a full rewrite.
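The reflection stage can be pictured as serialize, distill, then patch. The sketch below is an assumption-laden illustration: the Step fields, the JSON layout, and the helper names serialize, distill, and patch_paragraph are all hypothetical, and the LLM analysis is replaced by a simple filter that keeps only successful steps.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Step:
    tool: str
    inputs: dict
    output: str
    ok: bool  # False marks a failed branch or an error


def serialize(request: str, steps: List[Step]) -> str:
    """Serialize the full trajectory so it can be handed to an analyzer."""
    return json.dumps({"request": request, "steps": [asdict(s) for s in steps]})


def distill(trajectory_json: str) -> List[str]:
    """Stand-in for the LLM analysis: keep successful steps, drop failed ones."""
    data = json.loads(trajectory_json)
    return [
        f"{s['tool']} {json.dumps(s['inputs'], sort_keys=True)}"
        for s in data["steps"]
        if s["ok"]
    ]


def patch_paragraph(skill_doc: str, heading: str, new_body: str) -> str:
    """Replace only the paragraph under `heading`; other paragraphs stay intact."""
    paragraphs = skill_doc.split("\n\n")
    patched = [
        f"{heading}\n{new_body}" if p.startswith(heading + "\n") else p
        for p in paragraphs
    ]
    return "\n\n".join(patched)


# Usage: distill a short trajectory and patch one paragraph of an existing Skill.
trace = serialize(
    "clean up bucket a",
    [
        Step("list_buckets", {}, "ok", True),
        Step("delete_bucket", {"name": "a"}, "403", False),
        Step("empty_bucket", {"name": "a"}, "ok", True),
    ],
)
steps_text = "\n".join(distill(trace))
updated = patch_paragraph("Overview\nBucket cleanup.\n\nSteps\nold steps", "Steps", steps_text)
```

Patching by paragraph keeps unrelated sections byte-for-byte identical, which is the property the text attributes to partial updates.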
Stage 2 – Evolution Factory operates in a low‑load window and consists of three steps:
Selection: Skills with high recent failure rates are prioritized; long-unused but subscribed Skills are secondary; stable Skills are processed last.
Mutation: For each selected Skill, recent failure logs are analyzed to generate targeted modifications (adjusting wording, step order, or parameter suggestions) while preserving the original intent. The number of mutants varies with call frequency and failure count.
Screening: All mutants are fed into the evaluation pipeline for quality inspection.
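The selection priority and the mutant budget described above could be expressed roughly as follows. The source gives no thresholds or formulas, so FAILURE_RATE_THRESHOLD, STALE_AFTER_DAYS, the SkillStats fields, and the mutant_budget rule are all assumptions chosen only to make the ordering concrete.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SkillStats:
    name: str
    recent_calls: int
    recent_failures: int
    days_since_last_use: int
    subscribed: bool


# Hypothetical thresholds; the source does not give concrete values.
FAILURE_RATE_THRESHOLD = 0.2
STALE_AFTER_DAYS = 30


def failure_rate(s: SkillStats) -> float:
    return s.recent_failures / s.recent_calls if s.recent_calls else 0.0


def tier(s: SkillStats) -> int:
    """0 = failing often, 1 = long unused but subscribed, 2 = stable."""
    if failure_rate(s) >= FAILURE_RATE_THRESHOLD:
        return 0
    if s.subscribed and s.days_since_last_use >= STALE_AFTER_DAYS:
        return 1
    return 2


def selection_order(skills: List[SkillStats]) -> List[SkillStats]:
    # Sort by tier first, then by failure rate (highest first) within a tier.
    return sorted(skills, key=lambda s: (tier(s), -failure_rate(s)))


def mutant_budget(s: SkillStats, max_mutants: int = 5) -> int:
    """Assumed rule: more mutants for busier, more failure-prone Skills."""
    raw = 1 + s.recent_failures // 5 + s.recent_calls // 100
    return min(max_mutants, raw)
```

A tiered sort key keeps the three priority classes strictly ordered while still ranking Skills inside each class.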
Stage 3 – Evaluation Pipeline performs three layers of checks:
Admission: Format validation, security scanning, and minimal functional tests ensure the Skill can be loaded and executed safely.
Effect Evaluation: In a sandbox, each candidate runs a suite of test cases covering normal, edge, adversarial, and high-value regression scenarios derived from the Skill's description and execution logs. Metrics such as accuracy, latency, token consumption, and safety compliance are recorded.
Target Screening: Candidates are compared against the original Skill; a mutant is retained if it is not worse on any metric and strictly better on at least one, while maintaining diversity. Optional human review can be enabled for final approval.
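The retention rule in Target Screening is a Pareto-dominance test against the original Skill. A minimal sketch follows, assuming four metrics named accuracy, latency_ms, tokens, and safety; the metric names, the DIRECTIONS table, and the functions dominates and screen are hypothetical. The diversity requirement and the optional human review are not modeled here.

```python
from typing import Dict, List

# Direction of each metric: +1 means higher is better, -1 means lower is better.
DIRECTIONS = {"accuracy": +1, "latency_ms": -1, "tokens": -1, "safety": +1}

Metrics = Dict[str, float]


def dominates(candidate: Metrics, baseline: Metrics) -> bool:
    """True if candidate is no worse on every metric and strictly better on one."""
    deltas = [DIRECTIONS[m] * (candidate[m] - baseline[m]) for m in DIRECTIONS]
    return all(d >= 0 for d in deltas) and any(d > 0 for d in deltas)


def screen(baseline: Metrics, mutants: Dict[str, Metrics]) -> List[str]:
    """Return the names of mutants that dominate the original Skill."""
    return [name for name, m in mutants.items() if dominates(m, baseline)]


# Usage: only the mutant that improves accuracy at no cost elsewhere is kept.
original = {"accuracy": 0.90, "latency_ms": 1200, "tokens": 3000, "safety": 1.0}
candidates = {
    "more_accurate": {"accuracy": 0.92, "latency_ms": 1200, "tokens": 3000, "safety": 1.0},
    "accurate_but_costly": {"accuracy": 0.95, "latency_ms": 1200, "tokens": 3500, "safety": 1.0},
    "unchanged": dict(original),
}
kept = screen(original, candidates)
```

Under this rule a mutant that trades one metric for another is never retained, which keeps the original Skill as a guaranteed floor on every measured dimension.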
Looking ahead, a growing ecosystem of open-source and academic projects is forming around Skill self-evolution (e.g., Fudan University's GenericAgent, with a core of about 3.3K lines, and SkillForge in cloud-tech scenarios). The field is shifting from "can we generate a Skill?" to "can we generate a good Skill?" Emphasis will move toward robust evaluation, high-quality memory management, token budgeting, and mandatory Skill safety certification to mitigate the uncertainties of autonomous evolution.
Huawei Cloud Developer Alliance
