Multimodal UI Interaction Intent Recognition for Automated Front‑End Testing
Meituan's in-store platform team and Prof. Zhou Yangfan's group at Fudan University built a lightweight multimodal recognizer of UI interaction intent. The model fuses screenshots, visible text, and render-tree attributes through a self-attention model built on a Vision Transformer and Chinese BERT, then groups render-tree nodes into clusters with a supervised pairwise classifier. On 158 annotated order-page screenshots from four business lines, it achieved higher F1 scores than every baseline it was compared against. Test cases generated automatically from its output executed correctly on 89% of 100 unseen pages, which suggests the approach generalizes across apps and can support large-scale front-end quality assurance.
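The clustering step can be illustrated with a minimal sketch. It assumes the pairwise classifier is exposed as a yes/no predicate over two nodes and that positive pairs are merged transitively with union-find. The summary does not say how pairwise predictions are turned into clusters, so the merge rule, the function name `cluster_by_pairs`, and the toy predicate standing in for the trained classifier are all illustrative assumptions, not the authors' implementation.

```python
from itertools import combinations
from typing import Callable, List, Sequence, TypeVar

Node = TypeVar("Node")


def cluster_by_pairs(
    nodes: Sequence[Node],
    same_group: Callable[[Node, Node], bool],
) -> List[List[Node]]:
    """Group nodes by merging every pair the predicate marks as related.

    `same_group` is a hypothetical stand-in for a trained pairwise
    classifier; merging is transitive (union-find), which is an assumption.
    """
    parent = list(range(len(nodes)))

    def find(i: int) -> int:
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Query the predicate once per unordered pair and union positive pairs.
    for i, j in combinations(range(len(nodes)), 2):
        if same_group(nodes[i], nodes[j]):
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i

    # Collect members per root; groups appear in order of first member.
    groups: dict = {}
    for i, node in enumerate(nodes):
        groups.setdefault(find(i), []).append(node)
    return list(groups.values())


# Toy predicate (hypothetical): nodes sharing a name prefix belong together.
def toy_same_group(a: str, b: str) -> bool:
    return a.split("_")[0] == b.split("_")[0]


demo_nodes = ["title", "price", "qty_minus", "qty_plus", "submit"]
demo_clusters = cluster_by_pairs(demo_nodes, toy_same_group)
```

Because every unordered pair is scored, the sketch makes a number of classifier calls that grows quadratically with the node count. A real system would likely prune candidate pairs first, for example by render-tree proximity, though the summary does not describe such a step.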
