
Treating Automated Testing as AI Coding: Xiaohongshu GUI Agent Real‑World Review

During the 2026 Spring Festival promotion, Xiaohongshu replaced manual UI testing with a three‑layer AI‑driven GUI Agent that executed over 43,000 runs across 106 devices and 128 scenarios, achieving 58% automation, 82% AI‑generated case adoption, 68% bug recall, 98% stability and roughly $1 per test case while drastically cutting token costs.

Xiaohongshu Tech REDtech

May 12, 2026


The presentation, originally delivered at QCon Beijing 2026, details how Xiaohongshu’s Quality‑Efficiency R&D team engineered a GUI Agent to turn automated UI testing into an AI‑coding problem. During the two‑week Spring Festival promotion, four business lines released three times, and the traditional manual compatibility testing (QA engineers operating devices one‑by‑one) was replaced by an Agent that handled 106 device models across 128 test scenarios.

Key production metrics recorded were:

43,000+ automated executions (including CI regression and Agent exploration)

58% overall automation rate (automated cases / total cases)

82% AI‑generated case adoption (human‑reviewed cases retained / total AI‑generated)

68% compatibility‑bug auto‑recall (bugs caught by automation / total bugs)

98% execution stability (multiple consecutive passes after filtering environment/network noise)

≈ $1 per test case (including model token, device and platform resource costs)

The team identified two root causes of previous testing pain points: (1) case stability – UI changes broke XPath scripts and text‑based locators; (2) business understanding – test knowledge was scattered across markdown, internal docs, and QA notes, making it hard for an Agent to act without human guidance.

To address these, they built a three‑layer architecture:

Business‑Intent Layer – structured natural‑language descriptions of “what to test”, stored in Git, reviewed and version‑controlled.

Agent‑Exploration Layer – the Agent interacts with the app in real time, following the intent while autonomously handling pop‑ups and UI changes without altering the original goal.

Executable‑Code Layer – deterministic test code generated by a Coding Agent, stored as CI‑regression scripts that incur zero token cost after verification.
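As a rough illustration of the Business-Intent Layer, an intent could be kept as a small structured record that renders to plain text for the Agent. The field names and the sample scenario below are assumptions made for this sketch, not the team's actual schema.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Intent:
    """Hypothetical intent record: states what to test, not how to click."""

    intent_id: str
    goal: str
    preconditions: Tuple[str, ...] = ()
    expectations: Tuple[str, ...] = ()

    def to_prompt(self) -> str:
        """Render the intent as plain text an exploration agent could read."""
        lines = [f"Goal: {self.goal}"]
        lines += [f"Given: {p}" for p in self.preconditions]
        lines += [f"Expect: {e}" for e in self.expectations]
        return "\n".join(lines)


# Illustrative scenario only; it is not taken from the talk.
intent = Intent(
    intent_id="campaign-entry-001",
    goal="Open the campaign page from the home feed",
    preconditions=("user is logged in",),
    expectations=("campaign page title is visible",),
)
print(intent.to_prompt())
```

Because the record is plain data, it diffs cleanly in Git, which fits the review and version-control workflow described for this layer.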

A clear rollback mechanism links the layers: CI failure → code‑layer fix → if unresolved, return to exploration → if intent outdated, return to intent layer for human update.
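The escalation path can be sketched as a small function. This is a minimal sketch: the `Layer` enum and the two callbacks are hypothetical names, and the real system presumably carries much more state than a pair of booleans.

```python
from enum import Enum
from typing import Callable


class Layer(Enum):
    CODE = "executable-code"
    EXPLORATION = "agent-exploration"
    INTENT = "business-intent"


def handle_ci_failure(
    fix_code: Callable[[], bool],
    re_explore: Callable[[], bool],
) -> Layer:
    """Return the layer that resolved the failure or must take it over.

    fix_code and re_explore are hypothetical callbacks that return True
    when the failure is resolved at that layer.
    """
    if fix_code():
        return Layer.CODE
    if re_explore():
        return Layer.EXPLORATION
    # Neither a code fix nor a fresh exploration helped, so the intent
    # itself is treated as outdated and handed back for a human update.
    return Layer.INTENT
```

For example, `handle_ci_failure(lambda: False, lambda: True)` returns `Layer.EXPLORATION`: the script could not be patched, but re-exploring against the unchanged intent succeeded.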

For element location, the team ranked strategies from weakest to strongest UI‑change resistance (semantic → DOM → visual) and observed a trade‑off: the more resistant a locator is to UI changes, the less deterministic its execution.

Given the limitations of a single large model, the system adopts two principles: (a) context‑level partitioning – each layer processes only the information it needs; (b) model‑level partitioning – high‑capacity LLMs handle intent and planning, while a lightweight visual sub‑Agent performs atomic perception.

The visual sub‑Agent uses Gemini 3 Flash, selected for its cost‑effectiveness and sufficient accuracy (≈ 69% on the ScreenSpot‑Pro benchmark). Although higher‑end models score better, their per‑call cost is 10–30× higher, making them unsuitable for high‑frequency element‑location tasks. The architecture mitigates the visual model’s weaknesses (long‑context attention drift, limited coding ability) by restricting its role to single‑step screenshot‑to‑coordinate inference.
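The single-step restriction can be pictured as a stateless function: one screenshot and one target description in, one coordinate pair out. In the sketch below the model is an injected callable, and the `'x,y'` reply format, the prompt wording, and the bounds check are all assumptions; no real model API is shown.

```python
import re
from typing import Callable, Tuple

# Hypothetical model interface: (screenshot bytes, instruction) -> raw text reply.
VisionModel = Callable[[bytes, str], str]

_COORD = re.compile(r"(\d+)\s*,\s*(\d+)")


def locate_on_screenshot(
    model: VisionModel,
    screenshot: bytes,
    target: str,
    screen_size: Tuple[int, int],
) -> Tuple[int, int]:
    """Single-step perception: no history is passed to the model."""
    reply = model(screenshot, f"Return 'x,y' for: {target}")
    match = _COORD.search(reply)
    if match is None:
        raise ValueError(f"no coordinates in model reply: {reply!r}")
    x, y = int(match.group(1)), int(match.group(2))
    width, height = screen_size
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError(f"coordinates {(x, y)} fall outside the screen")
    return x, y


def fake_model(screenshot: bytes, prompt: str) -> str:
    """Stand-in for a real vision model call."""
    return "540, 1200"
```

Since the function takes no conversation history, the prompt stays the same size on every call, which is the property that sidesteps long-context attention drift.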

Three fallback strategies for element location are employed:

Semantic Understanding – natural‑language interpretation, most stable across versions.

DOM Structure – parsing the view hierarchy when IDs or hierarchy are reliable.

Visual Recognition – screenshot + visual model as a last resort.
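The fallback order above can be sketched as a chain that tries each strategy in turn and reports which one succeeded. The three strategy functions here are stubs backed by hard-coded tables; the element names and coordinates are invented for illustration.

```python
from typing import Callable, Dict, Optional, Sequence, Tuple

Point = Tuple[int, int]
Strategy = Callable[[str], Optional[Point]]


def locate(target: str, strategies: Sequence[Tuple[str, Strategy]]) -> Tuple[str, Point]:
    """Try each strategy in order; return the first hit and its strategy name."""
    for name, strategy in strategies:
        point = strategy(target)
        if point is not None:
            return name, point
    raise LookupError(f"no strategy could locate {target!r}")


# Stub data standing in for real semantic, DOM, and visual lookups.
SEMANTIC_INDEX: Dict[str, Point] = {"Search": (90, 60)}
DOM_INDEX: Dict[str, Point] = {"Publish": (540, 2300)}


def by_semantics(target: str) -> Optional[Point]:
    return SEMANTIC_INDEX.get(target)


def by_dom(target: str) -> Optional[Point]:
    return DOM_INDEX.get(target)


def by_vision(target: str) -> Optional[Point]:
    # A real implementation would call the visual sub-Agent here.
    return (200, 800) if target == "Red packet icon" else None


CHAIN = [("semantic", by_semantics), ("dom", by_dom), ("visual", by_vision)]
```

Returning the strategy name alongside the point makes it easy to log how often the chain falls through to the visual model, which is the call that costs tokens.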

The execution engine mirrors this design in a three‑tier stack: a Business layer (Python unittest plus a runner), an Agent layer (a unified driver abstracting Android, iOS, and HarmonyOS), and an Element layer (providing the three locating strategies). As a result, a single test script can run on all three platforms without duplication.
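A minimal sketch of how one unittest case might run unchanged on every platform through a unified driver follows. The `Driver` interface, the `FakeDriver`, and the campaign scenario are hypothetical; the talk summary does not describe the real driver API.

```python
import unittest
from abc import ABC, abstractmethod
from typing import List


class Driver(ABC):
    """Hypothetical unified driver hiding platform differences."""

    @abstractmethod
    def tap(self, target: str) -> None: ...

    @abstractmethod
    def is_visible(self, target: str) -> bool: ...


class FakeDriver(Driver):
    """In-memory stand-in so the sketch runs without a device."""

    def __init__(self, platform: str) -> None:
        self.platform = platform
        self.taps: List[str] = []

    def tap(self, target: str) -> None:
        self.taps.append(target)

    def is_visible(self, target: str) -> bool:
        # The fake shows the campaign page once the banner has been tapped.
        return target == "campaign page" and "campaign banner" in self.taps


def open_campaign(driver: Driver) -> None:
    """Business-level step written once against the abstract driver."""
    driver.tap("campaign banner")


class CampaignEntryTest(unittest.TestCase):
    def test_entry_on_every_platform(self) -> None:
        for platform in ("android", "ios", "harmonyos"):
            with self.subTest(platform=platform):
                driver = FakeDriver(platform)
                open_campaign(driver)
                self.assertTrue(driver.is_visible("campaign page"))


def run_suite() -> unittest.TestResult:
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(CampaignEntryTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

The test body never branches on the platform name; in a real setup only the driver construction would differ per device.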

Cost analysis shows that pure visual pipelines explode token usage (≈ $5 for a 5‑step case) and suffer from context overflow, while the hybrid approach reduces per‑case cost to about $1 and drives the regression cost toward “zero token” – a direction rather than a fully achieved state.
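The shape of this cost difference can be illustrated with a toy model. It assumes that a pure visual pipeline re-sends all earlier steps on every call, that the hybrid approach sends only the current step, and that regression runs consume no tokens once code has been generated. The function names and cost model are assumptions for this sketch, not measurements from the talk.

```python
def accumulated_context_tokens(steps: int, tokens_per_step: int) -> int:
    """Total input tokens when step k re-sends all k steps seen so far."""
    return tokens_per_step * steps * (steps + 1) // 2


def single_step_tokens(steps: int, tokens_per_step: int) -> int:
    """Total input tokens when each call sees only the current step."""
    return tokens_per_step * steps


def amortized_cost(exploration_cost: float, regression_runs: int) -> float:
    """Per-run cost when only the one-off exploration consumes tokens."""
    if regression_runs < 1:
        raise ValueError("regression_runs must be at least 1")
    return exploration_cost / regression_runs
```

Under these assumptions a 5-step case costs three times as many input tokens with accumulated context as with single-step calls, and the gap widens with step count. The amortized figure shrinks toward zero as the same generated script is rerun, matching the "zero token" direction described above.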

Lessons learned include:

Benchmarks should guide but not dictate iteration; over‑fitting to public suites leads to brittle solutions.

Pure exploration without business context yields low success (≈ 50%); adding natural‑language directives and knowledge‑base guidance raises success to ≈ 78%.

Future work aims to extend the framework beyond the Spring Festival scenario to search, instant‑messaging, and advertising modules, and to embed the Agent earlier in the development lifecycle (requirements review) so that testing becomes an integral, AI‑augmented part of product development rather than a post‑hoc fix.

Overall, the case study demonstrates how treating UI automation as AI coding—using a Coding Agent for codified, reviewable test logic and a lightweight ToolCall visual sub‑Agent for atomic perception—can dramatically improve coverage, stability, and cost efficiency while keeping the system maintainable and extensible.


