How to Build, Evaluate, and Optimize AI Test Agents: A Practical Guide

This guide walks you through creating AI‑powered test agents: defining success metrics, building evaluation datasets, drafting and refining system prompts with techniques such as chain‑of‑thought, XML‑style sections, few‑shot examples, and concise inputs, and scaling the workflow by splitting agents and managing prompt versions.

Youzan Coder

Background

In recent weeks, speed‑up and attribution agents were added to the AITest platform. The first versions were built in a day or two, but real‑world evaluation exposed many shortcomings, which prompted a more systematic approach to agent development and prompt engineering.

Preparation Before Starting

Define Success Criteria

Clarify the agent’s purpose, expected outcomes, required inputs, and output format before any code is written.

What tasks must be performed?

What does a successful result look like?

What inputs are needed?

What should the agent output?

Prepare an Evaluation Dataset

Collect a measurable test set early (spreadsheets, docs, Excel, Langfuse, etc.). A stable, quantifiable dataset enables repeatable testing and objective problem detection.

Example (attribution): Goal – locate the exact failing test step from a report; Output – precise step and clear error reason; Inputs – test report, snapshot, test steps.
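For the attribution agent, each dataset row pairs the raw inputs with the expected answer. A minimal sketch of the record shape in TypeScript (all field names here are illustrative, not the platform's actual schema):

import { readFileSync } from "node:fs";

interface AttributionCase {
  id: string;                 // unique case identifier
  report: string;             // raw failing test report
  snapshot: string;           // page snapshot captured at failure time
  steps: string[];            // ordered test steps
  expectedStepIndex: number;  // ground truth: the step that actually failed
  expectedReason: string;     // ground truth: human-written error reason
}

// The evaluation set is just an array of such cases, exported from a
// spreadsheet or a Langfuse dataset.
const dataset: AttributionCase[] = JSON.parse(
  readFileSync("attribution-cases.json", "utf8")
);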

Typical evaluation pipeline:

Define the agent’s goal, expected effect, and required inputs.

Gather failing reports together with the expected step and error description.

Implement a first‑version agent that produces the desired output structure.

Run the test set through the agent, record results in a multidimensional table, and generate an initial evaluation report (a minimal harness sketch follows this list).

Analyze the report for alignment with expectations and identify root causes.
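A minimal harness for the run‑and‑record step, assuming the AttributionCase shape sketched above and a hypothetical runAgent wrapper around the attribution agent; accuracy here is simply the fraction of cases where the predicted step matches the ground truth:

import { writeFileSync } from "node:fs";

// Hypothetical wrapper around the attribution agent.
declare function runAgent(input: {
  report: string; snapshot: string; steps: string[];
}): Promise<{ stepIndex: number; reason: string }>;

type Row = { id: string; expected: number; predicted: number; correct: boolean; reason: string };

async function evaluate(cases: AttributionCase[]): Promise<number> {
  const rows: Row[] = [];
  for (const c of cases) {
    const out = await runAgent({ report: c.report, snapshot: c.snapshot, steps: c.steps });
    rows.push({
      id: c.id,
      expected: c.expectedStepIndex,
      predicted: out.stepIndex,
      correct: out.stepIndex === c.expectedStepIndex,
      reason: out.reason,
    });
  }
  const accuracy = rows.filter(r => r.correct).length / rows.length;
  // Persist per-case rows so regressions stay traceable across prompt versions.
  writeFileSync("eval-report.json", JSON.stringify({ accuracy, rows }, null, 2));
  return accuracy;
}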

Quickly Initialise the Prompt

Use an LLM to draft the initial system prompt from the high‑level goal; this gets you to quantitative evaluation faster than hand‑crafting the first prompt.
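A sketch of this bootstrap step, assuming a hypothetical callLLM helper (any chat‑completion client would do):

// Hypothetical single-call helper around whichever LLM client is in use.
declare function callLLM(prompt: string): Promise<string>;

async function draftSystemPrompt(goal: string): Promise<string> {
  return callLLM(
    "You are a prompt engineer. Draft a system prompt for an agent with this goal:\n" +
    goal + "\n" +
    "The prompt must state the agent's role, its analysis steps, the inputs it " +
    "receives, and a strict JSON output format."
  );
}

// The draft becomes version 1 of the prompt; the quantitative evaluation
// loop then drives every later revision.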

Prompt Optimisation

Prompt‑Improvement Prompt

Leverage an LLM to iteratively improve prompts instead of manual tweaking, reducing human effort and exploiting machine speed.

Optimisation Prompt Template

When refining a prompt, repeatedly supply the model with four elements:

Current system prompt

Input content

Agent output

Desired optimisation goal

Template (place‑holders shown):

<input>…</input>
<output>…</output>
<system-prompt>…</system-prompt>
<issues>…</issues>
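Assembling the four elements into one optimisation request is mechanical; a sketch (the wording of the trailing instruction is illustrative):

// Build the optimisation request from the four elements above.
function buildOptimisationPrompt(opts: {
  systemPrompt: string;  // current system prompt
  input: string;         // input content fed to the agent
  output: string;        // what the agent actually produced
  issues: string;        // desired optimisation goal / observed problems
}): string {
  return [
    `<input>${opts.input}</input>`,
    `<output>${opts.output}</output>`,
    `<system-prompt>${opts.systemPrompt}</system-prompt>`,
    `<issues>${opts.issues}</issues>`,
    "Rewrite the system prompt so the listed issues no longer occur. " +
    "Return only the improved system prompt.",
  ].join("\n");
}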

Practical Prompt‑Tuning Techniques

Structured Prompts

Goal / Role Definition

State the core objective clearly; a formal role is optional but can help focus the model.

I will give you the result of a test step's execution, along with snapshots taken before and after it ran (a step that opens a page may have only one snapshot). Please judge whether the step executed as expected.
You are a professional test‑failure analysis assistant. Your job is to analyse the root cause of failed automated test runs and output a standardised, structured failure summary with improvement suggestions. The output must strictly follow the agreed JSON format.

Chain‑of‑Thought / Task Decomposition

Break the analysis into explicit steps to improve reliability.

1. Iterate over the analysis array in order.
2. For each step where passed=false, compare with the reference summary:
   - If behaviour matches historical success, mark as passed.
   - Otherwise, mark as the first failure.
3. Extract index, type, simpleFailReason, and summary.
4. Return a JSON object.

XML‑Style Prompt Sections

Wrap logical sections in XML‑like tags (<input>, <output>, <system-prompt>, <issues>) so ordering is irrelevant and partial updates are easy.

Few‑Shot Examples

Provide a small number of illustrative examples when the base prompt reaches ~60‑70 % accuracy but still fails on edge cases.

## Handling examples
### URL action handling
// Input: Open URL: https://store.youzan.com/v4/channel/xhs/dashboard
// ⚠️ The aiAction type must be used
{
  "flows": [
    {
      "type": "aiAction",
      "prompt": "Open URL: https://store.youzan.com/v4/channel/xhs/dashboard",
      "action": {"type": "url", "url": "https://store.youzan.com/v4/channel/xhs/dashboard", "weappPath": ""}
    }
  ]
}

Keep Prompts Concise

Long prompts increase token usage and latency. Balance accuracy gains against response time.

Before optimisation: ~20 s response, ~30k tokens
After optimisation: <10 s response, ~3k tokens

Reduce Few‑Shot

Start with zero few‑shots; add examples only after the base prompt handles the majority of cases. Avoid domain‑specific nouns in examples to keep them general.

Trim Input

Feeding the entire test report overwhelms the model, causing hallucinations, incomplete analysis, and slow responses (>60 s). Supplying only the steps that need optimisation dramatically improves quality and reduces latency to ~6 s.

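A sketch of the trimming idea: keep only the first failing step plus the step immediately before it for context, instead of the whole report (field names are hypothetical):

interface ReportStep {
  index: number;
  passed: boolean;
  description: string;
  log: string;
}

// Keep only what the model needs: the first failing step and its
// immediate predecessor, rather than the full report.
function trimReport(steps: ReportStep[]): ReportStep[] {
  const firstFail = steps.findIndex(s => !s.passed);
  if (firstFail < 0) return []; // nothing failed, nothing to analyse
  return steps.slice(Math.max(0, firstFail - 1), firstFail + 1);
}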

Split Into Smaller Agents

Decompose a complex task into dedicated agents, each with a narrow responsibility. Example composition for failure attribution (a wiring sketch follows the list):

Classifier: classifies failure error types.

Inspector: performs a detailed diff analysis of failing steps.

Analyzer: analyses the execution results of test steps.

Coordinator: aggregates the diff and execution analyses into a final failure summary.
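A minimal wiring sketch, with each sub‑agent as a hypothetical async function behind a typed signature:

// Each sub-agent is a narrow LLM call behind a plain function.
declare function classify(report: string): Promise<string>;   // failure error type
declare function inspect(report: string): Promise<string>;    // diff analysis of failing steps
declare function analyse(report: string): Promise<string>;    // execution-result analysis
declare function coordinate(parts: {
  errorType: string; diff: string; execution: string;
}): Promise<string>;                                          // final failure summary

async function attributeFailure(report: string): Promise<string> {
  const errorType = await classify(report);
  // Inspector and Analyzer are independent, so they can run in parallel.
  const [diff, execution] = await Promise.all([inspect(report), analyse(report)]);
  return coordinate({ errorType, diff, execution });
}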

Manage Prompt Rollback

Each prompt change replaces the previous version entirely, so keep a backup of the last working prompt and revert when a change makes results worse (negative optimisation).
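Even a crude scheme works: snapshot the live prompt to a timestamped file before overwriting it. A sketch:

import { copyFileSync, existsSync } from "node:fs";

// Back up the live prompt before writing a new version, so a
// negative optimisation can be reverted with a single file copy.
function backupPrompt(path: string): void {
  if (existsSync(path)) {
    copyFileSync(path, `${path}.${Date.now()}.bak`);
  }
}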

Accept that agents rarely reach 100 % accuracy; beyond ~70 % acceptance further tweaks often cause regressions. Consider complementary techniques such as ReAct or Retrieval‑Augmented Generation.

N‑time best‑of validation: run the same prompt multiple times and compare outputs; inconsistencies may indicate hallucinations.
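A sketch of the consistency check, assuming a hypothetical runAgent call: run the same input N times and trust the answer only when all runs agree.

declare function runAgent(input: string): Promise<{ stepIndex: number }>;

// Run the same input N times; unanimous answers are trusted,
// disagreement is flagged as a possible hallucination.
async function bestOfN(input: string, n = 3): Promise<number | null> {
  const runs = await Promise.all(Array.from({ length: n }, () => runAgent(input)));
  const answers = runs.map(r => r.stepIndex);
  return answers.every(a => a === answers[0]) ? answers[0] : null; // null => human review
}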

Conclusion

Prompt engineering evolves rapidly. Continuous testing, iteration, and optimisation are essential for building reliable agents.

Key takeaway: Good agents emerge from repeated quantitative evaluation and incremental prompt improvements rather than a single perfect design.

References

Define Your Success Criteria – Claude Docs: https://docs.claude.com/zh-CN/docs/test-and-evaluate/define-success

Prompt Improver – Claude Docs: https://docs.claude.com/zh-CN/docs/build-with-claude/prompt-engineering/prompt-improver

System Prompts – Claude Docs: https://docs.claude.com/zh-CN/docs/build-with-claude/prompt-engineering/system-prompts

Chain‑of‑Thought – Claude Docs: https://docs.claude.com/zh-CN/docs/build-with-claude/prompt-engineering/chain-of-thought

Multishot Prompting – Claude Docs: https://docs.claude.com/zh-CN/docs/build-with-claude/prompt-engineering/multishot-prompting
