MP-GUI: Modality Perception with Multimodal Large Language Models for GUI Understanding
The CVPR 2025 paper "MP-GUI: Modality Perception with MLLMs for GUI Understanding" presents an algorithm that strengthens multimodal large language models' ability to perceive and reason about graphical user interfaces. By integrating textual, visual, and spatial signals through specialized perception modules and a dynamic fusion gate, MP-GUI achieves state-of-the-art performance on multiple GUI benchmarks.
The paper, a collaboration between Ant Group and Zhejiang University's EAGLE lab, was accepted at CVPR 2025.
Graphical user interfaces (GUIs) are ubiquitous, but their handcrafted visual elements and dense layouts pose challenges for multimodal large language models (MLLMs), which struggle with fine‑grained element localization and spatial relationship understanding.
To address these issues, the authors propose the MP‑GUI algorithm, which employs three dedicated GUI perception modules to extract textual, visual, and spatial relationship signals from screens, and a semantic‑guided dynamic fusion gate to combine these signals effectively.
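The fusion idea can be illustrated with a minimal sketch: given per-modality features and a query (semantic) embedding, score each modality against the query, normalize the scores with a softmax, and mix the features by those weights. This is an assumption-laden toy (the function name `fusion_gate`, the bilinear scoring matrix `W`, and the feature shapes are all illustrative, not the paper's actual design):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def fusion_gate(text_feat, vis_feat, spa_feat, query_feat, W):
    """Toy semantic-guided gate: score each modality feature against the
    query embedding via a bilinear form, softmax the scores, and return
    the weighted mixture of the three modality features."""
    feats = np.stack([text_feat, vis_feat, spa_feat])  # (3, d)
    scores = feats @ W @ query_feat                    # (3,) per-modality scores
    weights = softmax(scores)                          # (3,) sum to 1
    fused = weights @ feats                            # (d,) weighted mixture
    return fused, weights

# Tiny demo with random features (d = 4).
d = 4
rng = np.random.default_rng(0)
t, v, s, q = (rng.normal(size=d) for _ in range(4))
W = rng.normal(size=(d, d))
fused, w = fusion_gate(t, v, s, q, W)
```

In the actual model the gate would operate on learned token embeddings inside the MLLM; the sketch only shows the gating arithmetic, softmax-weighted mixing conditioned on a semantic query.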
The system also introduces a data‑generation pipeline that leverages Qwen2VL‑72B to synthesize large‑scale GUI‑related training data, supporting the fusion gate’s training.
A multi‑stage training strategy (MTS) is designed for MP‑GUI’s architecture, incorporating distinct training objectives and a novel spatial relationship prediction (SRP) task that explicitly models UI element spatial hierarchies, thereby improving the model’s contextual awareness.
Extensive benchmark evaluations demonstrate MP‑GUI's superiority on GUI understanding tasks such as widget localization, screen summarization, and screen‑based question answering, as well as on screen navigation benchmarks (AITW/Mind2Web), especially for detecting small widgets.
The authors highlight practical applications, noting that MP‑GUI can accelerate automated software testing and enable GUI agents for mobile office automation, offering significant efficiency and cost benefits.