How to Build, Evaluate, and Optimize AI Test Agents: A Practical Guide
This guide walks through creating AI‑powered test agents: defining success criteria, building evaluation datasets, drafting and refining system prompts with techniques such as chain‑of‑thought, XML‑style sectioning, few‑shot examples, and concise inputs, and scaling the workflow by splitting agents and managing prompt versions.
Background
Recent weeks saw the addition of speed‑up and attribution agents to the AITest platform. The first versions were built in a day or two, but real‑world evaluation exposed many shortcomings, prompting a systematic approach to agent development and prompt engineering.
Preparation Before Starting
Define Success Criteria
Clarify the agent’s purpose, expected outcomes, required inputs, and output format before any code is written.
What tasks must be performed?
What does a successful result look like?
What inputs are needed?
What should the agent output?
Prepare an Evaluation Dataset
Collect a measurable test set early (spreadsheets, docs, Excel, Langfuse, etc.). A stable, quantifiable dataset enables repeatable testing and objective problem detection.
Example (attribution): Goal – locate the exact failing test step from a report; Output – precise step and clear error reason; Inputs – test report, snapshot, test steps.
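A minimal sketch of such a dataset entry in Python (the field names and JSON file are illustrative, not the platform's actual schema):

```python
from dataclasses import dataclass, asdict
import json

# One labelled case for the attribution agent, mirroring the example above:
# the inputs (report, snapshot, steps) plus the gold answer.
@dataclass
class AttributionCase:
    report: str            # test report fed to the agent
    snapshot: str          # page snapshot reference
    steps: list            # the test steps that were executed
    expected_step: int     # exact failing step (gold label)
    expected_reason: str   # clear error reason (gold label)

cases = [
    AttributionCase("report A", "snap A", ["open page", "click buy"], 1,
                    "buy button missing"),
]

# Persist the set so every prompt version is measured against the same data.
with open("attribution_eval.json", "w", encoding="utf-8") as f:
    json.dump([asdict(c) for c in cases], f, ensure_ascii=False, indent=2)
```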
Typical evaluation pipeline:
Define the agent’s goal, expected effect, and required inputs.
Gather failing reports together with the expected step and error description.
Implement a first‑version agent that produces the desired output structure.
Run the test set through the agent, record results in a multidimensional table, and generate an initial evaluation report.
Analyze the report for alignment with expectations and identify root causes.
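Steps 4 and 5 of the pipeline, running the set through the agent and tabulating results, can be sketched as follows (the stub lambda stands in for a real LLM-backed agent):

```python
import csv, io

def evaluate(cases, run_agent):
    """Run every case, record results in a table, and compute accuracy."""
    rows = []
    for case in cases:
        got = run_agent(case["report"])
        rows.append({
            "report": case["report"],
            "expected": case["expected"],
            "got": got,
            "match": got == case["expected"],
        })
    # Result table as CSV; a spreadsheet or Langfuse would serve equally well.
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["report", "expected", "got", "match"])
    writer.writeheader()
    writer.writerows(rows)
    accuracy = sum(r["match"] for r in rows) / len(rows)
    return buf.getvalue(), accuracy

table, acc = evaluate(
    [{"report": "r1", "expected": "step 2"}, {"report": "r2", "expected": "step 4"}],
    lambda report: "step 2",  # stub agent that always answers "step 2"
)
print(acc)  # 0.5
```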
Quickly Initialise Prompt
Use an LLM to draft the initial system prompt by providing the high‑level goal. This speeds up quantitative evaluation compared with hand‑crafting the first prompt.
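One way to bootstrap, assuming a generic `call_llm` callable (a placeholder, not a specific SDK):

```python
# Hypothetical bootstrap meta-prompt: describe the agent's goal and let an
# LLM draft the first system prompt.
BOOTSTRAP_TEMPLATE = """You are a prompt engineer. Draft a system prompt for an agent with this goal:

Goal: {goal}
Inputs: {inputs}
Required output: {output}

Return only the system prompt text."""

def draft_system_prompt(goal, inputs, output, call_llm):
    """Fill the meta-prompt and ask the model for a first-draft system prompt."""
    return call_llm(BOOTSTRAP_TEMPLATE.format(goal=goal, inputs=inputs, output=output))

# Stubbed LLM for illustration:
draft = draft_system_prompt(
    "locate the exact failing test step",
    "test report, snapshot, test steps",
    "failing step index plus error reason, as JSON",
    call_llm=lambda prompt: "You are a test-failure analysis assistant...",
)
print(draft)
```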
Prompt Optimisation
Prompt‑Improvement Prompt
Leverage an LLM to iteratively improve prompts instead of manual tweaking, reducing human effort and exploiting machine speed.
Optimisation Prompt Template
When refining a prompt, repeatedly supply the model with four elements:
Current system prompt
Input content
Agent output
Desired optimisation goal
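Assuming plain string formatting, the four elements can be packaged into a single optimisation request with XML‑style tags:

```python
def build_optimisation_prompt(system_prompt, input_content, agent_output, issues):
    """Bundle the four elements so the optimiser LLM sees each unambiguously."""
    return (
        f"<input>{input_content}</input>\n"
        f"<output>{agent_output}</output>\n"
        f"<system-prompt>{system_prompt}</system-prompt>\n"
        f"<issues>{issues}</issues>"
    )

msg = build_optimisation_prompt(
    system_prompt="You are a test-failure analysis assistant...",
    input_content="failing report text",
    agent_output='{"step": 2}',
    issues="picked step 2, expected step 4",
)
print(msg)
```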
Template (place‑holders shown):
<input>…</input>
<output>…</output>
<system-prompt>…</system-prompt>
<issues>…</issues>
Practical Prompt‑Tuning Techniques
Structured Prompts
Goal / Role Definition
State the core objective clearly; a formal role is optional but can help focus the model.
Example user prompt (translated from Chinese): "I will give you the result of a test-step execution, along with snapshots taken before and after it (a step that opens a page may have only one snapshot). Please judge whether the step executed as expected."
Example system role (translated from Chinese): "You are a professional test-failure analysis assistant, responsible for analysing the root cause of automated test failures and producing a standardized, structured failure summary with improvement suggestions. The output must strictly follow the agreed JSON format."
Chain‑of‑Thought / Task Decomposition
Break the analysis into explicit steps to improve reliability.
1. Iterate over the analysis array in order.
2. For each step where passed=false, compare with the reference summary:
- If behaviour matches historical success, mark as passed.
- Otherwise, mark as the first failure.
3. Extract index, type, simpleFailReason, and summary.
4. Return a JSON object.
XML‑Style Prompt Sections
Wrap logical sections in XML‑like tags (<input>, <output>, <system-prompt>, <issues>) so ordering is irrelevant and partial updates are easy.
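A short sketch of why tagged sections make partial updates easy: one section can be swapped by tag name without touching the rest.

```python
import re

def replace_section(prompt, tag, new_body):
    """Swap the body of one <tag>…</tag> section; the others are untouched,
    so section order never matters."""
    pattern = re.compile(rf"<{tag}>.*?</{tag}>", re.DOTALL)
    return pattern.sub(f"<{tag}>{new_body}</{tag}>", prompt, count=1)

prompt = "<system-prompt>old rules</system-prompt>\n<issues>none</issues>"
updated = replace_section(prompt, "issues", "misses step 4 failures")
print(updated)
# <system-prompt>old rules</system-prompt>
# <issues>misses step 4 failures</issues>
```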
Few‑Shot Examples
Provide a small number of illustrative examples when the base prompt reaches ~60‑70 % accuracy but still fails on edge cases.
## Worked Examples
### Handling URL Operations
// Input: open URL: https://store.youzan.com/v4/channel/xhs/dashboard
// ⚠️ The aiAction type MUST be used
{
"flows": [
{
"type": "aiAction",
"prompt": "Open URL: https://store.youzan.com/v4/channel/xhs/dashboard",
"action": {"type": "url", "url": "https://store.youzan.com/v4/channel/xhs/dashboard", "weappPath": ""}
}
]
}
Keep Prompts Concise
Long prompts increase token usage and latency. Balance accuracy gains against response time.
Reduce Few‑Shot
Start with zero few‑shots; add examples only after the base prompt handles the majority of cases. Avoid domain‑specific nouns in examples to keep them general.
Trim Input
Feeding the entire test report overwhelms the model, causing hallucinations, incomplete analysis, and slow responses (>60 s). Supplying only the steps that need optimisation dramatically improves quality and reduces latency to ~6 s.
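A sketch of the trimming step (the report structure here is illustrative): keep only the failing steps plus a little surrounding context instead of the full report.

```python
def trim_report(report_steps, max_steps=5):
    """Keep failing steps and the step just before each one; the full
    report caused hallucinations and >60 s latency."""
    failing = [i for i, s in enumerate(report_steps) if not s["passed"]]
    keep = set()
    for i in failing:
        keep.update({max(i - 1, 0), i})  # failing step plus the step before it
    return [report_steps[i] for i in sorted(keep)][:max_steps]

steps = [
    {"name": "open page", "passed": True},
    {"name": "login", "passed": True},
    {"name": "click buy", "passed": False},
    {"name": "check cart", "passed": False},
]
trimmed = trim_report(steps)
print([s["name"] for s in trimmed])  # ['login', 'click buy', 'check cart']
```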
Split Into Smaller Agents
Decompose a complex task into dedicated agents, each with a narrow responsibility. Example composition for failure attribution:
Classifier: classifies the type of failure error.
Inspector: performs a detailed diff analysis of the failing steps.
Analyzer: analyses the execution results of the test steps.
Coordinator: aggregates the diff and execution analyses into a final failure summary.
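The composition can be sketched as a simple pipeline, where each `run_*` callable stands in for a narrowly‑prompted LLM call:

```python
def attribute_failure(report, run_classifier, run_inspector, run_analyzer,
                      run_coordinator):
    """Fan the report out to three narrow agents, then aggregate."""
    error_type = run_classifier(report)   # Classifier: error category
    diff = run_inspector(report)          # Inspector: diff of failing steps
    execution = run_analyzer(report)      # Analyzer: step execution results
    # Coordinator: final failure summary from the partial analyses.
    return run_coordinator(error_type, diff, execution)

# Stubbed agents for illustration:
summary = attribute_failure(
    "report text",
    run_classifier=lambda r: "element-not-found",
    run_inspector=lambda r: "buy button absent in after-snapshot",
    run_analyzer=lambda r: "step 3 failed",
    run_coordinator=lambda t, d, e: {"type": t, "diff": d, "execution": e},
)
print(summary["type"])  # element-not-found
```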
Manage Prompt Rollback
Each prompt change replaces the previous version entirely; keep a backup file of the last working prompt to revert when negative optimisation occurs.
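A minimal sketch of the backup‑and‑revert workflow using files (paths are illustrative):

```python
import shutil, pathlib

def save_prompt(path, new_prompt):
    """Back up the last working prompt before overwriting it, so a
    negative optimisation can be undone with rollback()."""
    p = pathlib.Path(path)
    if p.exists():
        shutil.copy(p, p.with_suffix(".bak"))
    p.write_text(new_prompt, encoding="utf-8")

def rollback(path):
    """Restore the previous prompt version from its backup."""
    p = pathlib.Path(path)
    shutil.copy(p.with_suffix(".bak"), p)

save_prompt("agent_prompt.txt", "v1: original rules")
save_prompt("agent_prompt.txt", "v2: regressed rules")   # negative optimisation
rollback("agent_prompt.txt")
print(pathlib.Path("agent_prompt.txt").read_text(encoding="utf-8"))  # v1: original rules
```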
Accept that agents rarely reach 100 % accuracy; beyond roughly 70 %, further tweaks often cause regressions elsewhere. Consider complementary techniques such as ReAct or Retrieval‑Augmented Generation.
N‑time best‑of validation: run the same prompt multiple times and compare outputs; inconsistencies may indicate hallucinations.
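Best‑of‑N validation can be sketched as a majority vote (the stub agent simulates one inconsistent run out of three):

```python
from collections import Counter

def best_of_n(run_agent, report, n=3):
    """Run the same prompt n times; return the majority answer and a
    consistency score. Low consistency hints at hallucination."""
    answers = [run_agent(report) for _ in range(n)]
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / n

# Stub that "hallucinates" on one of three runs:
outputs = iter(["step 3", "step 3", "step 7"])
answer, consistency = best_of_n(lambda r: next(outputs), "report", n=3)
print(answer, consistency)  # step 3 0.6666666666666666
```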
Conclusion
Prompt engineering evolves rapidly. Continuous testing, iteration, and optimisation are essential for building reliable agents.
Key takeaway: Good agents emerge from repeated quantitative evaluation and incremental prompt improvements rather than a single perfect design.
References
Define Your Success Criteria – Claude Docs: https://docs.claude.com/zh-CN/docs/test-and-evaluate/define-success
Prompt Improver – Claude Docs: https://docs.claude.com/zh-CN/docs/build-with-claude/prompt-engineering/prompt-improver
System Prompts – Claude Docs: https://docs.claude.com/zh-CN/docs/build-with-claude/prompt-engineering/system-prompts
Chain‑of‑Thought – Claude Docs: https://docs.claude.com/zh-CN/docs/build-with-claude/prompt-engineering/chain-of-thought
Multishot Prompting – Claude Docs: https://docs.claude.com/zh-CN/docs/build-with-claude/prompt-engineering/multishot-prompting
Youzan Coder
Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.