Iterative Agent Evaluation Skill: Automating Bad‑Case Diagnosis with AI Pre‑Annotation

The article presents an end-to-end, eight-phase automated evaluation pipeline for large-model agents that replaces manual bad-case inspection with AI-assisted pre-annotation. It cuts analysis time for a 104-case test set from a full person-day to about 30 minutes, an efficiency gain of over 90%, and supports iterative refinement of the sub-agent knowledge base.

AntData

Background and Challenges

The rapid evolution of large models and agentic architectures shifts the focus from feasibility to preventing data contamination and ensuring stable delivery in real‑world scenarios. Traditional fully manual evaluation suffers from severe efficiency bottlenecks and cognitive bias, requiring at least one person‑day to analyse 104 test cases and produce a report.

Eight‑Phase Automated Evaluation Pipeline

The proposed solution wraps the entire evaluation workflow into a Skill that orchestrates eight distinct phases, transforming the process from "manual inspection + expert review" to "AI pre‑annotation + expert verification".

Phase 1: Config loading and environment check
Phase 2: Corpus preparation and data preprocessing
Phase 3: Batch evaluation submission
Phase 4: Intelligent progress monitoring (ETA estimation, deadlock detection, auto‑recovery)
Phase 5: Automatic retry of failed entries
Phase 6: Parallel AI analysis of non‑perfect cases (multi‑sub‑agent cooperation)
Phase 7: Result aggregation, validation and report generation
Phase 8: Documentation and archiving
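
The ordering of the eight phases can be sketched as a small sequential runner. This is only an illustration of "run in order, stop on the first failure": in the source the transitions are declared in SKILL.md rather than in Python, and the `Phase` names, the `handlers` mapping, and `run_pipeline` are all hypothetical.

```python
from enum import IntEnum


class Phase(IntEnum):
    """The eight phases listed above, in execution order (names are illustrative)."""
    CONFIG_CHECK = 1
    CORPUS_PREP = 2
    BATCH_SUBMIT = 3
    MONITOR = 4
    RETRY_FAILED = 5
    AI_ANALYSIS = 6
    AGGREGATE_REPORT = 7
    ARCHIVE = 8


def run_pipeline(handlers, state=None):
    """Run phases in order and stop at the first handler that returns False.

    `handlers` maps Phase -> callable(state) -> bool (a hypothetical interface).
    Returns the list of phases that completed successfully.
    """
    state = {} if state is None else state
    completed = []
    for phase in Phase:  # IntEnum iterates in definition order
        handler = handlers.get(phase)
        if handler is None:
            continue  # phase not wired up in this sketch
        if not handler(state):
            break
        completed.append(phase)
    return completed
```

Keeping the phase order in one place makes it easy to see which downstream phases are skipped when an early check fails.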

Three‑Layer Decoupled Architecture

The Skill follows a three‑layer design that separates control, configuration and execution. Deterministic engineering tasks reside in the scripts/ directory, while high‑level state transitions and boundary definitions are declared in SKILL.md. This clear separation eliminates logical coupling and enables safe insertion or removal of sub‑agents at any stage without risking pipeline integrity.

eval_skill/
├── SKILL.md      # control layer: defines state transitions for Phases 1‑8
├── config.yaml   # configuration layer: user ID, API keys, corpus paths, etc.
└── scripts/      # execution layer: atomic Python scripts & sub‑agent logic
    ├── check_dependencies.py   # Phase 1
    ├── split_batches.py        # Phase 2
    ├── monitor_batch.py        # Phase 4
    └── ...

Phases 1‑5: Deterministic Pre‑processing

Phase 1: Execute check_dependencies.py to abort early on missing dependencies or configuration errors.
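
A fail-fast check of this kind might look like the sketch below. The source does not show the contents of check_dependencies.py, so the required module list and the config key names (`user_id`, `api_key`, `corpus_path`) are assumptions chosen to mirror the config.yaml comment above.

```python
import importlib.util

REQUIRED_KEYS = ("user_id", "api_key", "corpus_path")  # hypothetical key names
REQUIRED_MODULES = ("json",)  # hypothetical; stdlib only so the sketch runs anywhere


def find_problems(config, modules=REQUIRED_MODULES, keys=REQUIRED_KEYS):
    """Return a list of human-readable problems; an empty list means "safe to continue"."""
    problems = []
    for name in modules:
        # find_spec returns None when a top-level module cannot be located
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing module: {name}")
    for key in keys:
        if not config.get(key):
            problems.append(f"missing or empty config key: {key}")
    return problems
```

A caller would typically print the problems and exit with a non-zero status when the list is not empty, so that later phases never start on a broken environment.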

Phase 2: Pull the latest test corpus from the designated branch and format it.

Phase 3: Assemble request parameters and submit batch evaluation jobs to the evaluation API.
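
As a rough illustration of batching and request assembly, the sketch below slices a corpus into fixed-size batches and wraps each one in a request payload. The payload fields (`user_id`, `batch_index`, `cases`) are invented for the example; the real evaluation API and its parameters are not described in the source.

```python
def split_batches(cases, batch_size):
    """Slice the case list into consecutive batches of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [cases[i:i + batch_size] for i in range(0, len(cases), batch_size)]


def build_requests(cases, batch_size, user_id):
    """Assemble one hypothetical request payload per batch."""
    return [
        {"user_id": user_id, "batch_index": index, "cases": batch}
        for index, batch in enumerate(split_batches(cases, batch_size))
    ]
```

With the 104-case corpus mentioned in the article and an assumed batch size of 25, this produces five requests, the last one carrying the remaining four cases.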

Phase 4: Use monitor_batch.py for ETA estimation, deadlock detection and automatic recovery.
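
The source does not say how monitor_batch.py computes its ETA or detects a deadlock. One simple possibility, shown here purely as an assumption, is a linear ETA from observed throughput plus a "no progress for N seconds" stall heuristic.

```python
def estimate_eta_seconds(done, total, elapsed_seconds):
    """Linear ETA: remaining items divided by the observed throughput."""
    if done <= 0 or elapsed_seconds <= 0:
        return None  # no throughput observed yet
    rate = done / elapsed_seconds
    return (total - done) / rate


def is_stalled(progress_samples, stall_seconds):
    """Return True if the done count has not changed for at least stall_seconds.

    progress_samples is a list of (timestamp_seconds, done_count), oldest first.
    """
    if not progress_samples:
        return False
    last_time, last_done = progress_samples[-1]
    first_same = last_time
    # Walk backwards to the earliest sample that already had the current count
    for timestamp, done in reversed(progress_samples):
        if done != last_done:
            break
        first_same = timestamp
    return last_time - first_same >= stall_seconds
```

A monitor loop could poll the job status, append a sample on each poll, and trigger the recovery path once `is_stalled` returns True.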

Phase 5: Automatically retry items that failed due to non‑logical reasons (e.g., network timeout) with an exponential back‑off strategy.
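
Exponential back-off for transient failures can be sketched as follows. `TransientError` is a hypothetical marker for non-logical failures such as network timeouts, and the attempt count and base delay are illustrative defaults rather than values from the source.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for non-logical failures such as network timeouts."""


def retry_with_backoff(call, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `call()` until it succeeds, doubling the wait after each transient failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: surface the failure
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ... by default
```

Only `TransientError` is retried, so a genuine logic failure propagates immediately instead of being masked by repeated attempts. The `sleep` parameter is injectable so the schedule can be tested without waiting.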

Phase 6: Parallel AI‑Driven Bad‑Case Attribution

This phase bridges deterministic scheduling with large‑model semantic reasoning. To avoid context overflow and hallucination, the pipeline isolates each sub‑agent in a sandbox and controls token consumption.

Scheduling layer: prepare_eval_data.py splits abnormal data into static JSON packages such as prepared_group_N.json and launches independent sub‑agents for each slice.
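
The slicing step might be sketched as below. Only the prepared_group_N.json naming comes from the source; the `score` field, the "below perfect score" filter, the group size, and the 1-based numbering are assumptions made for illustration.

```python
import json
from pathlib import Path


def select_non_perfect(cases, perfect_score=1.0):
    """Keep cases whose score is below the perfect score ("score" is a hypothetical field)."""
    return [case for case in cases if case["score"] < perfect_score]


def write_groups(cases, group_size, out_dir):
    """Write fixed-size slices to prepared_group_N.json and return the paths."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for start in range(0, len(cases), group_size):
        index = start // group_size + 1  # 1-based N is an assumption
        path = out_dir / f"prepared_group_{index}.json"
        payload = json.dumps(cases[start:start + group_size], ensure_ascii=False, indent=2)
        path.write_text(payload, encoding="utf-8")
        paths.append(path)
    return paths
```

Because each sub-agent receives only its own static file, the size of a group bounds the context that any single sub-agent has to read.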

Environment layer: Sub‑agents run in a stateless sandbox with read‑only access to the JSON files, preventing hallucinated output from contaminating the main system.

Logic layer: Prompts inject baseline rules; output is forced into three qualitative categories: Agent logic defect, Corpus annotation bias, Evaluation‑pipeline anomaly.
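
One way to force output into the three categories is to parse every sub-agent label through a closed enumeration, so that anything outside the list is rejected. The enum and parser below are a hypothetical sketch; the category strings follow the article, written with plain ASCII hyphens.

```python
from enum import Enum


class Attribution(Enum):
    """The three qualitative categories named in the article."""
    AGENT_LOGIC_DEFECT = "Agent logic defect"
    CORPUS_ANNOTATION_BIAS = "Corpus annotation bias"
    EVAL_PIPELINE_ANOMALY = "Evaluation-pipeline anomaly"


def parse_attribution(label):
    """Map a sub-agent's label to an Attribution; raise ValueError for anything else."""
    # Normalise whitespace and non-breaking hyphens before the lookup
    normalized = label.strip().replace("\u2011", "-")
    return Attribution(normalized)
```

Rejecting unknown labels at parse time keeps free-form model output from leaking into the aggregated results.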

Phase 7: Aggregation and Human Review

Results from all sub‑agents are schema‑validated, merged, and transformed into a structured Excel report that serves as a static snapshot. Experts then review the AI‑generated classifications, confirming correct diagnoses or correcting misclassifications.

Strict schema validation of JSON results (empty fields, enum mismatches, etc.).

Aggregation into a unified view and generation of an Excel report.

Snapshot isolation ensures that unverified AI hallucinations cannot affect final metrics.
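
The validation step can be illustrated with a small checker for empty fields and enum mismatches. The record fields (`case_id`, `category`, `reason`) are hypothetical, since the source does not publish the actual schema, and report generation is left out of the sketch.

```python
ALLOWED_CATEGORIES = {
    "Agent logic defect",
    "Corpus annotation bias",
    "Evaluation-pipeline anomaly",
}
REQUIRED_FIELDS = ("case_id", "category", "reason")  # hypothetical field names


def validate_record(record):
    """Return a list of validation errors for one sub-agent result."""
    errors = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or (isinstance(value, str) and not value.strip()):
            errors.append(f"empty field: {field}")
    category = record.get("category")
    if category and category not in ALLOWED_CATEGORIES:
        errors.append(f"invalid category: {category}")
    return errors


def partition(records):
    """Split records into (valid, rejected), where rejected holds (record, errors) pairs."""
    valid, rejected = [], []
    for record in records:
        errors = validate_record(record)
        if errors:
            rejected.append((record, errors))
        else:
            valid.append(record)
    return valid, rejected
```

Only the `valid` list would feed the merged report; rejected records can be logged or sent back for re-analysis.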

Phase 8: Documentation and Archiving

All non‑perfect case attribution records are summarised into the three categories, producing a defect distribution report that is automatically published to the team knowledge base via the Model Context Protocol (MCP).
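
The summarisation into a defect distribution could be as simple as counting categories. The sketch below assumes each record carries a `category` field (a hypothetical name) and does not attempt the MCP publishing step, whose details the source does not give.

```python
from collections import Counter


def defect_distribution(records):
    """Return {category: (count, percent)} ordered from most to least frequent."""
    counts = Counter(record["category"] for record in records)
    total = sum(counts.values())
    return {
        category: (count, round(100 * count / total, 1))
        for category, count in counts.most_common()
    }
```

An empty input yields an empty dictionary, so the report step does not need a special case for rounds without bad cases.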

Iterative Knowledge Back‑fill

After each evaluation round, the system extracts generalized rules from correction logs and writes them into sub_agent_prompt.md. These rules enrich the sub‑agent knowledge base, enabling the agents to recognise similar patterns automatically in subsequent runs. Example rules include:

When column names differ but query results match, treat the case as correct tool-parameter parsing rather than as a defect.

If the final conclusion is perfect but step scores are deducted, prefer labeling the issue as insufficient corpus coverage.
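
The write-back step might be sketched as an append that skips rules already present. How rules are extracted from correction logs is not described in the source and is not covered here; the bullet format inside sub_agent_prompt.md and the `append_rules` helper are assumptions.

```python
from pathlib import Path


def append_rules(prompt_path, new_rules):
    """Append rules that are not already listed; return the rules actually added."""
    path = Path(prompt_path)
    existing = path.read_text(encoding="utf-8") if path.exists() else ""
    # Treat every "- ..." line as an existing rule (an assumed file convention)
    known = {line[2:].strip() for line in existing.splitlines() if line.startswith("- ")}
    added = []
    for rule in new_rules:
        rule = rule.strip()
        if rule and rule not in known:
            known.add(rule)
            added.append(rule)
    if added:
        with path.open("a", encoding="utf-8") as handle:
            if existing and not existing.endswith("\n"):
                handle.write("\n")
            handle.writelines(f"- {rule}\n" for rule in added)
    return added
```

Deduplicating on write keeps the prompt from growing with repeated rules across evaluation rounds, which matters when token consumption is being controlled.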

Performance Gains

Manual analysis of 104 test cases previously required at least one person-day. The new pipeline reduces human-in-the-loop time to about 30 minutes, an efficiency improvement of more than 90%, and shifts most of the expert effort to low-cost LLM inference. Token-controlled sandbox execution keeps API costs bounded.
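
The "more than 90%" figure can be checked with simple arithmetic. The calculation below assumes an 8-hour person-day (480 minutes), which the source does not state explicitly.

```python
def time_reduction_percent(before_minutes, after_minutes):
    """Percentage of human time saved when a task drops from before to after."""
    return 100 * (before_minutes - after_minutes) / before_minutes


# Assumption: one person-day = 8 hours = 480 minutes
saving = time_reduction_percent(480, 30)
print(f"{saving:.2f}%")  # 93.75%
```

Under that assumption the saving is 93.75%, consistent with the article's claim; a longer assumed working day would only increase the figure.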

Future Directions

Planned enhancements include expanding the corpus for long‑tail scenarios, enriching sub‑agent domain knowledge, further automating rule extraction, and integrating a coding harness that allows agents to modify their own logic within an isolated sandbox, completing the transition from automated diagnosis to self‑evolving code.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

AntData

Ant Data leverages Ant Group's leading technological innovation in big data, databases, and multimedia, with years of industry practice. Through long-term technology planning and continuous innovation, we strive to build world-class data technology and products.
