How Trip.com Cut Multilingual UI QA Costs by 90% with GUI Agent and Multi‑Agent AI

Trip.com built the "慧鉴天工" system that combines a GUI Agent, multi‑agent LQA algorithms, OODA‑loop architecture, and a knowledge‑graph‑enhanced pipeline to automate page collection, multilingual text extraction, and quality inspection across 31 languages, achieving over 90% cost reduction and 70%+ detection accuracy.

Ctrip Technology

Project Background

As Trip.com expands globally, its multilingual UI content comes from several sources (Google Translate, external vendors, hard‑coded strings), which leads to frequent translation errors, UI glitches, and inconsistent information. In Q4 2024, multilingual translation issues accounted for 15.8% of QA tickets, and manual inspection across 31 languages suffered from high cost and low consistency.

Technical Architecture

The solution, named "慧鉴天工," follows an OODA (Observe‑Orient‑Decide‑Act) loop to create an end‑to‑end autonomous system that discovers problems, judges root causes, and executes repairs.

Observe & Locate: A visual‑capable GUI Agent automatically captures pages and plans navigation paths.

Orient & Decide: A set of cooperating agents performs precise defect detection and root‑cause analysis.

Act: The system triggers data modifications in the multilingual copy system and creates QA tickets, turning intelligent judgments into concrete fixes.
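The three phases above can be sketched as a simple control loop. This is a minimal illustration only: the `Finding` type, the `run_ooda_cycle` function, and the stub callables are hypothetical names, not part of the system described in the article.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Finding:
    page_id: str
    locale: str
    issue: str


def run_ooda_cycle(
    observe: Callable[[], Iterable[dict]],
    orient_decide: Callable[[dict], Iterable[Finding]],
    act: Callable[[Finding], None],
) -> List[Finding]:
    """Run one pass: capture pages, detect defects, then act on each defect."""
    handled = []
    for page in observe():                   # Observe & Locate
        for finding in orient_decide(page):  # Orient & Decide
            act(finding)                     # Act
            handled.append(finding)
    return handled


# Stub phases (assumptions for illustration): flag empty strings as defects.
def fake_observe():
    return [
        {"page_id": "hotel_list", "locale": "de-DE", "texts": ["Jetzt buchen", ""]},
        {"page_id": "hotel_detail", "locale": "de-DE", "texts": ["Zimmer"]},
    ]


def fake_orient_decide(page):
    return [
        Finding(page["page_id"], page["locale"], "empty string")
        for text in page["texts"]
        if not text
    ]


tickets = []
handled = run_ooda_cycle(
    fake_observe,
    fake_orient_decide,
    lambda f: tickets.append(f"{f.page_id}/{f.locale}: {f.issue}"),
)
```

In a real deployment each callable would wrap a much larger component (the GUI Agent, the detection agents, the ticketing integration); the loop only shows how the phases hand data to one another.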

Automated GUI Collection

Traditional script‑based crawlers (Selenium, Appium) cannot keep up with Trip.com’s rapid multi‑platform releases. Using a vision‑language model (VLM), the team built a "plan‑act‑decide" navigation agent that perceives layouts and interactions without page‑specific prior knowledge, shrinking page‑onboarding time from days to minutes.

Closed‑source models exhibited two major defects in OTA scenarios: they failed to understand Trip.com UI components and showed low success rates on long‑range tasks such as navigating from the homepage to a hotel‑booking form.

Examples of failures include repeatedly missing the "Expand more" button and selecting a four‑star rating when five stars were required.

Model Fine‑Tuning

A three‑stage training pipeline was designed for the Qwen 3.6 27B model:

Stages 1 and 2 improve generalization on OTA scenes; Stage 3 targets long‑tail issues.

Stage 2 replaces binary rewards with a Gaussian‑distributed continuous reward, giving partial credit (e.g., 0.8 for near‑miss clicks) and greatly stabilizing training on long‑trajectory tasks.

Stage 3 adds a DPO‑based post‑training step using a small set of positive‑negative samples to refine rare error cases.
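The reward shaping in Stage 2 and the preference objective in Stage 3 can be sketched as follows. The article does not give the exact formulas or hyperparameters, so the reward form (a Gaussian over pixel distance to the target), `sigma`, and `beta` are assumptions chosen for illustration; `dpo_loss` is the standard per-pair DPO loss computed from sequence log-probabilities.

```python
import math


def gaussian_click_reward(pred, target, sigma=30.0):
    """Continuous reward in (0, 1] that decays with distance from the target point.

    pred and target are (x, y) pixel coordinates; sigma is an assumed tolerance.
    """
    dx = pred[0] - target[0]
    dy = pred[1] - target[1]
    return math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of log-probabilities.

    loss = -log(sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))))
    """
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


near_miss = gaussian_click_reward((120, 200), (100, 200))  # 20 px off target
print(round(near_miss, 2))  # 0.8
```

With a binary reward the same near-miss click would score 0, which gives the optimizer no signal about how close it was. The continuous version rewards progress toward the target, which is one plausible reason it stabilizes long-trajectory training.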

On the business evaluation set, the fine‑tuned model outperformed the commercial Qwen 3.5 Plus by more than 10 percentage points in success rate, and reduced the average step count per task from 12.28 to 8.12.

Sample LQA detection rules (each defines an error type, with a counterexample where given):

Target-language grammar/syntax error that makes the meaning ambiguous, unprofessional, or harder to parse (e.g., incorrect ordinal/quantity wording like missing “-rd/-th” or hyphenation, wrong agreement, broken sentence boundaries/capitalization). Counterexample: Minor stylistic preference differences that remain grammatical and unambiguous.

Semantic mistranslation (wrong sense/role): The translation selects a different concept, action, relationship, or process than the source expresses (e.g., booking vs seat selection; refund vs cancel; recommend vs instruct; business class vs premium cabin). Counterexample: Acceptable generalization that does not change user‑relevant meaning (e.g., “customer support” for “support team” when ‘human‑agent’ is not specified).

Omission/under‑translation of explicit, user‑relevant meaning: The translation drops key qualifiers or constraints present in the source (time words like “just/temporarily/already”, entities/brand names, non‑transferability, scope/conditions), changing interpretation.

Unclear/non‑native rendering that blocks understanding of the referent or intended UI meaning (e.g., unexplained transliteration, unnatural collocation that users can’t interpret, unclear naming). Counterexample: Standard, widely recognized transliterations/proper nouns that are understandable to the target audience in context.

Placeholder/markup/formatting handling error (non‑terminology): Variables/placeholders/HTML tags or required spacing/punctuation/casing patterns are altered or embedded in a way that breaks display or readability (e.g., wrong spacing around arrows/CTA, broken tag structure). Counterexample: Benign punctuation localization that preserves placeholders and does not affect rendering.

Data Production Pipeline

Large‑scale real‑device automation on Trip.com’s internal mobile platform generates high‑quality GUI trajectories. For the hotel booking flow, the team splits the journey into three segments (list, detail, form), identifies key UI controls, enumerates possible actions, and uses multiple closed‑source models to sample successful trajectories. The resulting instruction‑driven sequences produce thousands of >10‑step trajectories for training.

Knowledge‑Graph‑Enhanced Agent

Because pages evolve (layout tweaks, A/B tests), a knowledge graph is introduced to keep the agent’s understanding fresh. The graph has three layers:

Page Profile: page_id, locale, platform, brand, overview.

Module Profile: information about each page module.

Capability Profile: executable abilities derived from the page.

Update principles include retaining history, smooth incremental updates, automatic retirement of stale modules, perception‑free recall, and quick rollback to previous snapshots.
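One way to picture the three layers and the update principles is a small in-memory structure with snapshots. This is a sketch under stated assumptions: only the Page Profile field names come from the article; `ModuleProfile`, `PageKnowledge`, `last_seen`, and the `stale_after` threshold are hypothetical.

```python
import copy
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ModuleProfile:
    name: str
    description: str
    last_seen: int  # crawl round in which the module was last observed


@dataclass
class PageProfile:
    page_id: str
    locale: str
    platform: str
    brand: str
    overview: str
    modules: Dict[str, ModuleProfile] = field(default_factory=dict)  # Module layer
    capabilities: List[str] = field(default_factory=list)            # Capability layer


class PageKnowledge:
    """Keeps the current profile plus snapshots for rollback."""

    def __init__(self, profile: PageProfile, stale_after: int = 3):
        self.current = profile
        self.history: List[PageProfile] = []
        self.stale_after = stale_after

    def update(self, crawl_round: int, seen_modules: Dict[str, str]) -> None:
        # Retain history: snapshot before every incremental update.
        self.history.append(copy.deepcopy(self.current))
        for name, description in seen_modules.items():
            self.current.modules[name] = ModuleProfile(name, description, crawl_round)
        # Automatically retire modules not seen for `stale_after` rounds.
        self.current.modules = {
            name: module
            for name, module in self.current.modules.items()
            if crawl_round - module.last_seen < self.stale_after
        }

    def rollback(self) -> None:
        # Quick rollback to the previous snapshot, if there is one.
        if self.history:
            self.current = self.history.pop()


kg = PageKnowledge(PageProfile("hotel_list", "en-US", "app", "trip", "Hotel search results"))
kg.update(1, {"filters": "Filter bar above the list"})
kg.update(2, {"map": "Map toggle button"})
kg.update(4, {"map": "Map toggle button"})  # "filters" not seen since round 1
modules_after_retirement = sorted(kg.current.modules)
kg.rollback()
modules_after_rollback = sorted(kg.current.modules)
```

A production knowledge graph would live in a graph store and link pages to one another; the snapshot list here only demonstrates the retain, retire, and rollback behavior.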

Fine‑Grained Page Information Extraction

Downstream multilingual LQA places three requirements on extraction:

Requirement 1: character‑level extraction for monolingual checks.

Requirement 2: bilingual pairing of corresponding text fragments across languages.

Requirement 3: handling of special layout or animation components.

Pure visual VLM extraction is insufficient because any missing or extra character breaks the detection logic. Therefore, the system uses DOM‑based APIs (Chrome DevTools Protocol, Chrome extensions) combined with layout calculations.

Pre‑processing steps:

Resize the viewport via CDP to capture the full page.

Freeze dynamic content by hooking timers and animation methods before preload scripts run.

Core extraction proceeds by traversing the DOM, collecting text nodes and their container elements, merging fragments based on layout proximity, and performing visibility checks that handle zero‑size parents, overflow, and nested visibility.
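The visibility checks mentioned above can be illustrated on a simplified node model. This sketch assumes each node carries its computed style and bounding box (as a DOM API would report them); the `Box` and `Node` types are hypothetical and the clipping test is deliberately coarse.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Box:
    x: float
    y: float
    width: float
    height: float


@dataclass
class Node:
    box: Box
    display: str = "block"
    visibility: str = "visible"  # computed value; a child may override a hidden parent
    overflow: str = "visible"
    parent: Optional["Node"] = None


def _overlaps(a: Box, b: Box) -> bool:
    return (a.x < b.x + b.width and b.x < a.x + a.width
            and a.y < b.y + b.height and b.y < a.y + a.height)


def is_visible(node: Node) -> bool:
    """Return True if the node's text would plausibly be rendered on screen."""
    if node.visibility != "visible":
        return False
    if node.box.width <= 0 or node.box.height <= 0:
        return False
    # display: none on the node or any ancestor removes the whole subtree.
    current = node
    while current is not None:
        if current.display == "none":
            return False
        current = current.parent
    # A clipping ancestor hides content that falls entirely outside its box.
    # A zero-size ancestor only hides children when it actually clips.
    ancestor = node.parent
    while ancestor is not None:
        if ancestor.overflow != "visible" and not _overlaps(node.box, ancestor.box):
            return False
        ancestor = ancestor.parent
    return True


clipping_parent = Node(Box(0, 0, 0, 0), overflow="hidden")
plain_wrapper = Node(Box(0, 0, 0, 0))
hidden_parent = Node(Box(0, 0, 200, 50), visibility="hidden")
label = Box(0, 0, 100, 20)
```

The zero-size wrapper case matters in practice: absolutely positioned children often sit inside a parent with no dimensions of its own, so discarding every descendant of a zero-size element would lose visible text.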

Additional handling normalizes whitespace, respects CSS whitespace rules such as white-space: pre, and computes equivalent spacing based on font metrics to reproduce what users actually see.
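The whitespace part of that handling can be approximated by following the CSS white-space collapsing rules. The sketch below is a simplification: `visible_text` is a hypothetical helper, it ignores line wrapping, and it leaves out the font-metric spacing step entirely.

```python
import re

_ANY_WHITESPACE = re.compile(r"[ \t\n\r\f]+")
_SPACES_AND_TABS = re.compile(r"[ \t]+")


def visible_text(raw: str, white_space: str = "normal") -> str:
    """Approximate the text a user sees for a given computed white-space value."""
    if white_space in ("pre", "pre-wrap", "break-spaces"):
        # Whitespace and line breaks are preserved as written.
        return raw
    if white_space == "pre-line":
        # Spaces and tabs collapse, but line breaks are preserved.
        lines = raw.split("\n")
        return "\n".join(_SPACES_AND_TABS.sub(" ", line).strip() for line in lines)
    # "normal" and "nowrap": every whitespace run collapses to one space.
    return _ANY_WHITESPACE.sub(" ", raw).strip()


print(visible_text("  Book \n  now "))           # Book now
print(repr(visible_text("a  b", "pre")))          # 'a  b'
print(repr(visible_text("a   b \n c", "pre-line")))  # 'a b\nc'
```

Normalizing before comparison avoids false positives where two language versions differ only in source indentation rather than in what is rendered.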

Bilingual Text Pairing

Because Trip.com’s multilingual pages share identical structure, the XPath string of each text node is hashed to create a unique key. Using the English page as the base, the system matches keys on other language pages to produce bilingual text pairs.
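A minimal sketch of the pairing step, assuming each extracted node is available as an (XPath, text) tuple. The hash function (SHA-1), the function names, and the sample XPaths are illustrative choices, not details from the article.

```python
import hashlib
from typing import List, Tuple

TextNode = Tuple[str, str]  # (xpath, text)


def xpath_key(xpath: str) -> str:
    """Stable key for a text node, derived from its XPath string."""
    return hashlib.sha1(xpath.encode("utf-8")).hexdigest()


def pair_texts(base_nodes: List[TextNode], target_nodes: List[TextNode]):
    """Match target-language text to base (English) text by XPath key.

    Returns (pairs, unmatched), where unmatched lists base XPaths that have
    no counterpart on the target page.
    """
    target_by_key = {xpath_key(xp): text for xp, text in target_nodes}
    pairs, unmatched = [], []
    for xp, text in base_nodes:
        key = xpath_key(xp)
        if key in target_by_key:
            pairs.append((text, target_by_key[key]))
        else:
            unmatched.append(xp)
    return pairs, unmatched


english = [
    ("/html/body/div[1]/button[1]", "Book now"),
    ("/html/body/div[2]/span[1]", "Free cancellation"),
]
german = [
    ("/html/body/div[1]/button[1]", "Jetzt buchen"),
]
pairs, unmatched = pair_texts(english, german)
```

The approach depends on the structural assumption stated above: if a locale renders an extra banner or reorders modules, XPaths shift and nodes land in `unmatched`, which is itself a useful signal to inspect.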

Post‑Processing

After DOM extraction, OCR expert models provide secondary corrections for invisible elements or coordinate offsets, improving recall for hidden or animated components.

Multilingual LQA Algorithm

The detection module evaluates translated text against an LQA standard that defines eight major error categories and over twenty sub‑categories, each with severity levels (Major, Critical).

Before deployment, manual QA for 31 languages required massive local annotator effort, leading to high cost and low consistency.

Iterative LLM‑based experiments showed that even top commercial closed‑source models could not achieve usable accuracy across all languages. The team therefore designed an online‑learning system with two key requirements: (1) continuous rule updates from live QA feedback, and (2) focus on Major and Critical issues only.

The pipeline runs:

Load language‑specific detection rules.

First pass using Gemini 3.1 Flash for translation quality judgment.

Second verification with GPT 5.5 to reduce hallucinations.

Output results.
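The four steps above amount to a judge-then-verify pipeline. The sketch below models the two model calls as plain callables so it runs without any API access; `Issue`, `run_lqa`, and the stub judge and verifier are hypothetical, and real prompts and model clients would replace the stubs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Issue:
    source: str
    translation: str
    category: str
    severity: str  # "Major" or "Critical" are reported; anything else is dropped


Judge = Callable[[str, str, List[str]], Optional[Issue]]
Verifier = Callable[[Issue, List[str]], bool]


def run_lqa(
    pairs: List[Tuple[str, str]],
    locale: str,
    rules_by_locale: Dict[str, List[str]],
    judge: Judge,
    verify: Verifier,
) -> List[Issue]:
    rules = rules_by_locale.get(locale, [])           # 1. load language-specific rules
    confirmed = []
    for source, translation in pairs:
        candidate = judge(source, translation, rules)  # 2. first-pass judgment
        if candidate is None or candidate.severity not in ("Major", "Critical"):
            continue
        if verify(candidate, rules):                   # 3. second-model verification
            confirmed.append(candidate)
    return confirmed                                   # 4. output results


# Stubs standing in for the two model calls (assumptions for illustration).
def stub_judge(source, translation, rules):
    if source == translation:
        return Issue(source, translation, "Untranslated text", "Major")
    return None


def stub_verify(issue, rules):
    return issue.source not in {"Trip.com"}  # brand names legitimately stay as-is


issues = run_lqa(
    [("Book now", "Jetzt buchen"),
     ("Free cancellation", "Free cancellation"),
     ("Trip.com", "Trip.com")],
    "de-DE",
    {"de-DE": ["Brand names must not be translated."]},
    stub_judge,
    stub_verify,
)
```

Requiring agreement from a second, independent model trades some recall for precision: a hallucinated issue has to be reproduced by both passes before it reaches a human reviewer.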

On a large‑scale validation set, the system achieved 70.92% precision and 92.87% recall for Major problems. After integration, human experts only need to propose fixes for detected issues, eliminating the "find‑problem" step.

Results and Outlook

The project delivered three major outcomes:

Cost reduction: Automated workflow cut multilingual UI QA cost by over 90%.

Coverage expansion: Quality control now spans 31 languages, seven product lines, and more than 80% of core traffic.

Quality improvement: Issue recall exceeds 90%, and bilingual detection accuracy surpasses 70%.

Future work will focus on further improving GUI Agent success on long‑range tasks and extending the detection capability from multilingual QA to broader user‑experience issues.
