How Multimodal Large Models Revolutionize UI Automation Testing
This article details how Ant Group leverages multimodal large models and a multi‑agent architecture to create a low‑code, AI‑driven UI automation testing framework that improves test coverage, reduces manual effort, and scales across diverse mobile mini‑program scenarios.
Introduction
Zhu Jiali from Alipay Technology presented at QECon 2025 on "UI Automation Testing Based on Multimodal Large Models," introducing a novel AI‑driven testing approach.
Problem Background
Mini‑program quality inspection is highly complex: manual evaluation is subjective, cannot fully cover the entire business flow, and consumes significant resources.
AI Automation Solution
An intelligent solution was developed that uses deep learning and multimodal large models to automatically detect UI pages and interaction flows, generate AI test cases, and keep maintenance overhead low while preserving functional stability and user experience.
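The article stays high‑level, but the core loop is screenshot‑in, action‑out. Below is a minimal sketch of that idea, not the team's actual implementation: `call_multimodal_model` is a hypothetical stand‑in for whatever inference endpoint is used, and `TestStep` is an illustrative structure of our own.

```python
# Minimal sketch: one pass of screenshot-in, test-step-out with a
# multimodal model. `call_multimodal_model` is hypothetical.
from dataclasses import dataclass


@dataclass
class TestStep:
    description: str  # human-readable intent, e.g. "tap the Pay button"
    action: str       # primitive action: tap / swipe / input
    target: str       # UI element the action applies to


def call_multimodal_model(screenshot_png: bytes, prompt: str) -> str:
    """Hypothetical inference call; replace with a real endpoint."""
    raise NotImplementedError


def propose_next_step(screenshot_png: bytes, goal: str) -> TestStep:
    # Ask the model to look at the current screen and emit the next
    # UI action in a simple pipe-delimited format we can parse.
    prompt = (
        f"Goal: {goal}\n"
        "Given the attached mini-program screenshot, return the next "
        "UI action as 'action|target|description'."
    )
    raw = call_multimodal_model(screenshot_png, prompt)
    action, target, description = raw.split("|", 2)
    return TestStep(description=description, action=action, target=target)
```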
TestFun Platform Features
Cross‑terminal support: Seamlessly integrates simulators, virtual machines, and real devices for consistent testing.
Multidimensional testing: Covers compatibility, performance, and other quality dimensions.
Out‑of‑the‑box usage: Provides account pool management and multi‑environment switching to simplify test preparation.
Closed‑loop management: Automates regression testing and full‑process quality management for continuous improvement.
Challenges in Existing Testing
Test case freshness is hard to maintain due to rapid iteration and platform fragmentation.
Business scenarios are complex with intricate interactions across multiple tech stacks.
High resource consumption: real‑device costs and large task volumes.
Low stability caused by device and network anomalies.
Methodology
1. Data‑Driven Approach
Large‑model training relies on massive, accurate UI data collected from the platform. The pipeline includes data filtering, model refinement, human verification, preprocessing, training, and business evaluation, ultimately producing new model weights that reduce manual effort.
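As a rough illustration of this pipeline shape (an assumption, not the team's code), each stage can be modeled as a dataset‑to‑dataset function chained in order; the stage names below mirror the article, and the bodies are placeholders.

```python
# Illustrative data-pipeline sketch: stages run in sequence, each
# taking a dataset and returning a (smaller or cleaner) dataset.
from typing import Callable, List

Dataset = List[dict]  # e.g. {"screenshot": ..., "label": ...}
Stage = Callable[[Dataset], Dataset]


def filter_samples(data: Dataset) -> Dataset:
    # Drop malformed or incomplete UI records before human review.
    return [s for s in data if s.get("screenshot") and s.get("label")]


def human_verify(data: Dataset) -> Dataset:
    # Placeholder for the manual verification step; in practice this
    # would queue samples to annotators and keep only confirmed ones.
    return data


def preprocess(data: Dataset) -> Dataset:
    # Placeholder for normalization/augmentation before training.
    return data


PIPELINE: List[Stage] = [filter_samples, human_verify, preprocess]


def run_pipeline(raw: Dataset) -> Dataset:
    # The cleaned output would feed model training and business
    # evaluation, ultimately producing new model weights.
    for stage in PIPELINE:
        raw = stage(raw)
    return raw
```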
2. Multi‑Agent Construction
Complex task flows are decomposed using a suite of agents:
Planning Agent: Breaks down complex intents into simple, single‑step intents.
Action Agent: Maps each simple intent to concrete actions and parameters.
Reflection Agent: Reviews and corrects erroneous actions.
Additional visual and textual tools enrich input and assist decision‑making.
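A minimal sketch of how such a planning/action/reflection loop might be wired together is shown below; the agent classes and method names are illustrative assumptions, with the underlying model calls stubbed out.

```python
# Sketch of the planning -> action -> reflection loop. Each agent
# would be backed by an LLM call in practice (stubbed here).
from typing import List, Optional


class PlanningAgent:
    def decompose(self, intent: str) -> List[str]:
        """Break a complex intent into simple, single-step intents."""
        raise NotImplementedError


class ActionAgent:
    def execute(self, step: str) -> dict:
        """Map a simple intent to a concrete action and run it."""
        raise NotImplementedError


class ReflectionAgent:
    def review(self, step: str, result: dict) -> Optional[str]:
        """Return a corrected step if the result looks wrong, else None."""
        raise NotImplementedError


def run_task(intent: str, planner: PlanningAgent,
             actor: ActionAgent, reflector: ReflectionAgent) -> None:
    # Decompose the complex intent, execute each step, and let the
    # reflection agent retry any step whose result looks erroneous.
    for step in planner.decompose(intent):
        result = actor.execute(step)
        corrected = reflector.review(step, result)
        if corrected is not None:
            actor.execute(corrected)
```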
3. Route RAG (Retrieval‑Augmented Generation)
Known paths are stored in a knowledge base; agents retrieve relevant routes and domain knowledge before making decisions, improving success rates for long interaction sequences.
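One plausible shape for this retrieval step (an assumption, not the platform's implementation) is an embedding‑based knowledge base ranked by cosine similarity; `embed` below is a hypothetical stand‑in for a real embedding model.

```python
# Route-RAG sketch: store known interaction routes keyed by an
# embedding of their description, retrieve the top-k for an intent.
import math
from typing import List, Tuple


def embed(text: str) -> List[float]:
    """Hypothetical text-embedding call; replace with a real model."""
    raise NotImplementedError


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


class RouteKnowledgeBase:
    def __init__(self) -> None:
        self.routes: List[Tuple[List[float], str]] = []

    def add(self, description: str, route: str) -> None:
        self.routes.append((embed(description), route))

    def retrieve(self, intent: str, k: int = 3) -> List[str]:
        # Rank stored routes by similarity to the current intent.
        query = embed(intent)
        ranked = sorted(self.routes,
                        key=lambda r: cosine(query, r[0]),
                        reverse=True)
        return [route for _, route in ranked[:k]]
```

In this framing, the retrieved routes and domain knowledge would be injected into the Planning Agent's prompt as context before it decomposes the intent, which is what improves success rates on long interaction sequences.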
Business Impact
More than 12,000 AI‑generated test cases were produced, raising automation coverage from 50% to 70%.
Deployed for Alipay mini‑program review and daily quality inspection, reducing user complaints and labor costs.
Earned multiple awards and patents, including the 2024 AI Pioneer Case by the China AI Industry Alliance.
Published research such as "MobileFlow: A Multimodal LLM For Mobile GUI Agent" (NeurIPS 2024 workshop).
Future Outlook
Building on the current solution and advancing large‑model capabilities, the team aims to further enhance technical performance and user experience, extending the framework to more complex tasks and broader application domains.