How to Turn AI into a Reliable S‑Level Web‑Testing Engineer: 4 Hard‑Earned Lessons

The article explains why powerful AI models still fail complex web‑testing tasks without proper business SOPs, and presents four concrete engineering lessons—limited context, explicit step definitions, checklist‑driven validation, and iterative self‑revision—to train an AI into a dependable, self‑checking S‑grade testing employee.

Tencent Technical Engineering

Why AI Still Fails Complex Tasks

Large models are capable, but without a concrete business SOP they often produce unstable or incomplete results on real-world workflows. The author built a web-testing skill that can explore any website, generate a full test report, and even trigger automatic fixes, but discovered four recurring pitfalls along the way.

Four Hard‑Earned Lessons

AI has a limited context window. When a task grows long, the model forgets earlier constraints, skips steps, or compresses output, resulting in a "looks finished but isn’t" outcome.

Skills must describe not only *what* to do but *how* to do it. Vague prompts like “please check the page” are insufficient; the skill must encode exact actions, such as scrolling to the bottom, rescanning after a tab switch, or verifying each link.

Self‑validation is essential. Every stage needs a checklist and a gate rule that blocks progress until the checklist passes. This prevents the model from silently ignoring low‑visibility steps.
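A minimal sketch of such a gate in Python (the stage and check names are hypothetical, not from the original skill): each stage declares its checklist as executable predicates, and progress is blocked until all of them pass.

```python
import os
from typing import Callable

def run_stage_gate(stage: str, checklist: dict[str, Callable[[], bool]]) -> None:
    """Block progress until every checklist item for this stage passes."""
    failures = []
    for name, check in checklist.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a crashing check counts as a failure, not a pass
        if not ok:
            failures.append(name)
    if failures:
        raise RuntimeError(f"stage '{stage}' blocked, failed checks: {failures}")

# Hypothetical example: the scan stage cannot end until the sitemap
# artefact exists and is non-empty.
run_stage_gate("page-scan", {
    "sitemap.md exists and is non-empty":
        lambda: os.path.getsize("sitemap.md") > 0,
})
```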

Iterative, closed‑loop training beats one‑off prompt tweaking. Run the skill on real tasks, collect failure cases, let the AI analyse root causes, automatically generate revised skill fragments, and repeat until the quality stabilises.

Concrete Failure Cases and Fixes

1. Missed Page Discovery

The AI identified a tab and a blue link but failed to treat the page hidden behind them as a required recursive entry, leaving the site only half-explored. The fix was to add explicit rules such as the following (a crawl sketch appears after the list):

Scroll to the bottom of each page before ending the scan.

After any tab change, re‑enumerate all links.

When a table row contains a link, click it and verify the target.

Never rely on visual cues alone to decide importance.

Perform a recursive self‑check before finishing a stage.
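A minimal sketch of what these rules look like as code, assuming a Playwright-style browser driver (the crawler below is illustrative, not the article's implementation): scroll to the bottom, enumerate every link, and recurse into unvisited same-site pages rather than trusting visual prominence.

```python
# Illustrative exhaustive-discovery crawler (Playwright, sync API).
from urllib.parse import urljoin, urlparse
from playwright.sync_api import sync_playwright

def crawl(start_url: str) -> set[str]:
    visited: set[str] = set()
    queue = [start_url]
    site = urlparse(start_url).netloc
    with sync_playwright() as p:
        page = p.chromium.launch().new_page()
        while queue:
            url = queue.pop()
            if url in visited:
                continue
            visited.add(url)
            page.goto(url)
            page.mouse.wheel(0, 100_000)  # scroll to the bottom first
            # Re-enumerate ALL links, not just the visually prominent ones.
            hrefs = page.eval_on_selector_all(
                "a[href]", "els => els.map(e => e.href)")
            for href in hrefs:
                full = urljoin(url, href)
                if urlparse(full).netloc == site:  # stay on-site, recurse
                    queue.append(full)
    return visited  # recursive self-check: every reachable page was visited
```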

2. Output Prioritisation

When asked to produce three artefacts (sitemap.md, test-report.md, test-report.html), the model only generated the HTML because it appeared the most complex. The solution was to enforce a strict generation order and verify each file immediately (see the sketch after the list):

Generate sitemap.md first.

Then generate test-report.md.

Finally generate test-report.html.

After each file, check that it exists and its size > 0.

If any check fails, abort the stage.
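A sketch of that rule in Python, where generate() is a hypothetical stand-in for whatever prompts the model to write each file:

```python
import os
import sys

def generate(path: str) -> None:
    """Hypothetical stand-in: prompt the model to produce this artefact."""
    ...

# Strict order: simpler artefacts first, the HTML report last.
for path in ["sitemap.md", "test-report.md", "test-report.html"]:
    generate(path)
    # Gate: the file must exist and be non-empty before the next one starts.
    if not (os.path.isfile(path) and os.path.getsize(path) > 0):
        sys.exit(f"abort stage: {path} missing or empty")
```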

3. Engineering Constraints Ignored

A long python3 -c one-liner used to embed base64 screenshots exceeded the shell's command-length limit and failed. The rule added: avoid one-liner scripts for large payloads; write files to disk first and reference them instead of inlining.
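A sketch of the disk-first pattern (paths and names are illustrative): the base64 payload is produced by a script and written to a file, and the report references that file, so no oversized string ever passes through the shell.

```python
import base64
from pathlib import Path

def screenshot_to_fragment(png: Path, out_dir: Path) -> Path:
    """Encode a screenshot and write the HTML fragment to disk."""
    data = base64.b64encode(png.read_bytes()).decode("ascii")
    frag = out_dir / f"{png.stem}.html"
    frag.write_text(f'<img src="data:image/png;base64,{data}">')
    return frag  # the report pulls this file in; nothing is inlined in a shell

fragment = screenshot_to_fragment(Path("shots/home.png"), Path("report"))
```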

4. Silent Structure Compression

During the final report generation, the model compressed the per-page modules, screenshots, and UI/UX sections, producing a report that looked complete but missed critical details. A structural gate checklist was introduced to enforce the following (a gate sketch appears after the list):

Page module count equals sitemap page count.

Every page has a screenshot.

Every page includes a UI/UX review.

Every page contains a functional test table.

Every page has a problem summary.

HTML page‑card count matches page count.

Overall problem‑summary and fix‑plan tables exist.

If any checklist item fails, the next step cannot start and the stage cannot be declared complete.
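A sketch of the structural gate, assuming the sitemap and generated report have been parsed into simple structures (the field names below are hypothetical):

```python
def structural_gate(pages: list[dict], report: dict) -> list[str]:
    """Return the list of violated invariants; non-empty means blocked."""
    failures = []
    if len(report["page_modules"]) != len(pages):
        failures.append("page-module count != sitemap page count")
    for module in report["page_modules"]:
        # Every page must carry all four mandatory sections.
        for field in ("screenshot", "uiux_review", "test_table", "problem_summary"):
            if not module.get(field):
                failures.append(f"{module['url']}: missing {field}")
    if report["html_card_count"] != len(pages):
        failures.append("HTML page-card count != page count")
    if not (report.get("overall_summary") and report.get("fix_plan")):
        failures.append("overall problem-summary or fix-plan table missing")
    return failures
```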

Skill Architecture

The web-testing skill is organised as:

```
web-testing/
├── SKILL.md                      – main driver: triggers, workflow, gate rules, failure handling
└── references/
    ├── checklist-template.md     – progress controller and stage-gate definitions
    ├── report-template.md        – contract for Markdown/HTML report output
    └── ui-ux-checklist.md        – detailed UI/UX scoring criteria
```

By coupling SKILL.md with a concrete checklist, the AI is forced to remember constraints, perform self‑checks, and only proceed when all quality gates are satisfied.

Iterative Training Process

Run the skill on 3‑5 real tasks.

Record what was missed, skipped, or partially completed.

Prompt the AI to analyse the failures, identify root causes (e.g., missing rule, missing gate, context overflow).

Ask the AI to propose concrete rule additions (trigger condition, mandatory action, self‑check, penalty).

Automatically apply the suggested modifications to the skill.

Re‑run the task to verify the fix.

Repeat the loop until the output consistently meets the checklist; a minimal harness sketch follows.
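In Python, the loop might look like this (run_skill and analyse_and_patch are hypothetical stand-ins for the real tooling):

```python
def run_skill(skill_path: str, task: str) -> list[str]:
    """Hypothetical: run the skill on one task, return failed checklist items."""
    return []

def analyse_and_patch(skill_path: str, failures: list[str]) -> None:
    """Hypothetical: ask the model for root causes and apply rule additions."""
    ...

def train_skill(skill_path: str, tasks: list[str], max_rounds: int = 10) -> None:
    for round_no in range(1, max_rounds + 1):
        failures = [f for task in tasks for f in run_skill(skill_path, task)]
        if not failures:  # checklist consistently met: quality has stabilised
            print(f"stable after {round_no - 1} revision round(s)")
            return
        analyse_and_patch(skill_path, failures)  # close the loop, then re-run

train_skill("web-testing/SKILL.md", ["task-1", "task-2", "task-3"])
```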

This turns the AI from a mere executor into a co‑designer of its own SOP, steadily raising the reliability floor.

Key Takeaway

Training a skill is not about making the model smarter; it is about giving the model a professional SOP, self‑inspection mechanisms, and enforceable gates so that its delivery quality depends on a repeatable process rather than on occasional brilliance.
