Multimodal UI Interaction Intent Recognition for Automated Front‑End Testing
Meituan's in‑store platform team and Prof. Zhou Yangfan's group at Fudan University built a lightweight multimodal UI interaction intent recognizer that fuses screenshots, visible text, and render‑tree attributes through a self‑attention model combining a Vision Transformer and Chinese BERT, then groups nodes into intent clusters with a supervised pairwise classifier. The model achieved the highest F1 scores among baselines on 158 annotated order‑page screenshots from four business lines. Test cases generated from the recognized intents executed correctly on 89% of 100 unseen pages, demonstrating robust cross‑app generalization for large‑scale front‑end quality assurance.
Meituan's in‑store platform technology team and the research group of Prof. Zhou Yangfan at Fudan University jointly developed a multimodal UI interaction intent recognition model and an accompanying UI interaction framework to improve large‑scale UI testing.
The motivation stems from the growing volume of UI testing tasks: static visual regression testing can be automated, but interactive functional testing still relies heavily on manually written scripts. Two main challenges stand out: the diversity of front‑end technology stacks, which yields heterogeneous component trees, and the visual variability of UI elements, which hampers pure computer‑vision approaches.
The proposed solution treats UI intent recognition as a multimodal classification problem, fusing image snapshots, visible text, and render‑tree attributes. A two‑stage pipeline first classifies each render‑tree node using a self‑attention multimodal model (Vision Transformer for images and Chinese BERT for text) and then aggregates nodes into intent clusters via a supervised clustering model.
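The pipeline's first stage can be sketched in miniature: per-node embeddings from the image, text, and attribute modalities are concatenated, contextualized with self-attention over all render-tree nodes, and classified. This is a toy sketch with random weights and illustrative dimensions, not the published architecture; the encoder outputs are stand-ins for what a ViT and Chinese BERT would produce.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_and_classify(img_emb, txt_emb, attr_emb, n_classes=4):
    """img_emb/txt_emb/attr_emb: (n_nodes, d) arrays from frozen encoders
    (e.g. a ViT for screenshots, Chinese BERT for visible text)."""
    x = np.concatenate([img_emb, txt_emb, attr_emb], axis=1)  # (n, 3d) fused
    d = x.shape[1]
    # Toy single-head self-attention over the node sequence, so each
    # node's representation attends to every other node on the page.
    Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)  # (n, n) node-to-node weights
    ctx = attn @ v                                 # contextualized node features
    Wc = rng.normal(scale=0.1, size=(d, n_classes))
    return softmax(ctx @ Wc, axis=-1)              # per-node intent probabilities

n, d = 6, 8  # 6 render-tree nodes, 8-dim embedding per modality
probs = fuse_and_classify(rng.normal(size=(n, d)),
                          rng.normal(size=(n, d)),
                          rng.normal(size=(n, d)))
```

Each row of `probs` is a distribution over intent classes for one node; the second stage then groups the classified nodes into clusters.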
Experiments were conducted on 158 annotated order‑page screenshots from four Meituan business lines (hotel, KTV, escape room, tickets). The multimodal self‑attention model achieved the highest F1 scores compared with single‑modality baselines and a YOLOv7 detector, demonstrating the benefit of combining modalities and contextual attention.
For clustering, a supervised pairwise model was trained to predict whether two nodes belong to the same intent cluster, achieving reasonable Rand index scores despite the multi‑level nature of the task.
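The clustering stage reduces to scoring node pairs and merging the positives, which can be sketched with union-find; the `same_intent` rule below is a hypothetical stand-in for the trained pairwise classifier, and the Rand index is computed the standard way (agreeing pairs over all pairs).

```python
from itertools import combinations

def cluster_nodes(nodes, same_intent):
    """Merge nodes that the pairwise model says share an intent (union-find)."""
    parent = list(range(len(nodes)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i
    for i, j in combinations(range(len(nodes)), 2):
        if same_intent(nodes[i], nodes[j]):
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(nodes))]  # cluster id per node

def rand_index(pred, truth):
    """Fraction of node pairs on which the two clusterings agree."""
    pairs = list(combinations(range(len(pred)), 2))
    agree = sum((pred[i] == pred[j]) == (truth[i] == truth[j])
                for i, j in pairs)
    return agree / len(pairs)

# Hypothetical nodes: (stage-one intent label, y-coordinate on the page).
nodes = [("order", 100), ("order", 110), ("detail", 300), ("detail", 310)]
pred = cluster_nodes(nodes, lambda a, b: a[0] == b[0] and abs(a[1] - b[1]) < 50)
print(rand_index(pred, [0, 0, 1, 1]))  # → 1.0
```

In the real system the lambda would be replaced by the learned pairwise model's decision, but the merge-and-score loop is the same shape.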
Using the recognized intents, intelligent test cases were generated (e.g., “order the first product” and “order the cheapest product”). These cases were executed across different apps and technology stacks, achieving 89% correct execution on 100 unseen pages, showing strong robustness and cross‑app generalization.
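A case like "order the cheapest product" can be grounded in the recognized intents roughly as follows: filter nodes whose cluster carries the order intent, parse a price from each node's visible text, and act on the cheapest. The node fields and the `tap` callback here are illustrative assumptions, not the framework's actual API.

```python
import re

def order_cheapest(nodes, tap):
    """nodes: dicts with a recognized 'intent' and visible 'text'."""
    candidates = [n for n in nodes if n["intent"] == "order"]
    def price(n):
        m = re.search(r"(\d+(?:\.\d+)?)", n["text"])
        return float(m.group(1)) if m else float("inf")
    target = min(candidates, key=price)  # cheapest order button
    tap(target)                          # execute the interaction
    return target

nodes = [
    {"intent": "order", "text": "¥58 抢购"},   # order button, 58 yuan
    {"intent": "order", "text": "¥39 抢购"},   # order button, 39 yuan
    {"intent": "detail", "text": "门店详情"},  # detail link, ignored
]
tapped = []
chosen = order_cheapest(nodes, tapped.append)
print(chosen["text"])  # → ¥39 抢购
```

Because the selection logic keys on intents rather than selectors or pixel positions, the same case can run unchanged across apps and technology stacks, which is what the 89% cross-page execution rate measures.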
Future directions include expanding the dataset, enhancing pre‑training, and integrating the intent recognizer with large language models (LLM‑as‑controller and multimodal LLMs) to support tasks such as UI diff across resolutions, node matching, and automated test‑case generation.
The work demonstrates that a lightweight multimodal intent recognizer can provide accurate, generalizable UI understanding for large‑scale front‑end quality assurance.
Meituan Technology Team