
MP-GUI: Modality Perception with Multimodal Large Language Models for GUI Understanding

The CVPR 2025 paper "MP-GUI: Modality Perception with MLLMs for GUI Understanding" presents an algorithm that improves the ability of multimodal large language models to perceive and reason about graphical user interfaces. It extracts text, visual, and spatial signals with specialized perception modules, combines them through a dynamic fusion gate, and achieves state‑of‑the‑art performance on multiple GUI benchmarks.

AntTech

The paper "MP-GUI: Modality Perception with MLLMs for GUI Understanding", a collaboration between Ant Group and Zhejiang University’s EAGLE lab, was accepted at CVPR 2025.

Graphical user interfaces (GUIs) are ubiquitous, but their handcrafted visual elements and dense layouts pose challenges for multimodal large language models (MLLMs), which struggle with fine‑grained element localization and spatial relationship understanding.

To address these issues, the authors propose the MP‑GUI algorithm, which employs three dedicated GUI perception modules to extract textual, visual, and spatial relationship signals from screens, and a semantic‑guided dynamic fusion gate to combine these signals effectively.
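The article does not describe the gate's internals, so the following is only a minimal sketch of the general idea of a semantic-guided gate: each modality signal is weighted by how well it matches a query vector, and the weighted signals are summed. The function names, the dot-product scoring, the softmax weighting, and the toy vectors are all assumptions made for illustration, not the authors' design.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse(query, signals):
    """Weight each modality signal by its similarity to the query, then sum.

    query   -- hypothetical semantic vector derived from the instruction
    signals -- dict mapping a modality name to a feature vector of the
               same length as the query
    Returns (weights per modality, fused feature vector).
    """
    names = list(signals)
    weights = softmax([dot(query, signals[n]) for n in names])
    fused = [
        sum(w * signals[n][i] for w, n in zip(weights, names))
        for i in range(len(query))
    ]
    return dict(zip(names, weights)), fused


# Toy features: one orthogonal unit vector per modality (illustrative only).
signals = {
    "text": [1.0, 0.0, 0.0],
    "visual": [0.0, 1.0, 0.0],
    "spatial": [0.0, 0.0, 1.0],
}
query = [2.0, 0.0, 0.0]  # a query that aligns with the text signal

weights, fused = fuse(query, signals)
print(weights)
print(fused)
```

With this toy query the text signal receives the largest weight, and the visual and spatial signals share the remainder equally. A learned gate would replace the fixed dot-product scoring with trained parameters.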

The authors also introduce a data‑generation pipeline that uses Qwen2VL‑72B to synthesize large‑scale GUI training data, which supports training of the fusion gate.

A multi‑stage training strategy (MTS) is designed for MP‑GUI’s architecture, incorporating distinct training objectives and a novel spatial relationship prediction (SRP) task that explicitly models UI element spatial hierarchies, thereby improving the model’s contextual awareness.
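To make the idea of spatial relationship labels concrete, here is a minimal sketch of how a coarse relation between two UI elements could be derived from their bounding boxes. The label set, the `Box` type, the rule ordering, and the example widgets are assumptions chosen for illustration; the article does not specify how the SRP task defines its targets.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in screen coordinates (y grows downward)."""
    x1: float
    y1: float
    x2: float
    y2: float


def contains(outer: Box, inner: Box) -> bool:
    """True if inner lies entirely within outer (edges may touch)."""
    return (outer.x1 <= inner.x1 and outer.y1 <= inner.y1
            and outer.x2 >= inner.x2 and outer.y2 >= inner.y2)


def relation(a: Box, b: Box) -> str:
    """Return a coarse label for where a sits relative to b.

    Containment is checked first so that parent/child hierarchy takes
    priority over directional labels.
    """
    if contains(a, b):
        return "contains"
    if contains(b, a):
        return "inside"
    if a.x2 <= b.x1:
        return "left of"
    if a.x1 >= b.x2:
        return "right of"
    if a.y2 <= b.y1:
        return "above"
    if a.y1 >= b.y2:
        return "below"
    return "overlaps"


# Hypothetical screen: a toolbar with a back button and title, above a body.
toolbar = Box(0, 0, 100, 20)
back_button = Box(5, 5, 15, 15)
title = Box(20, 5, 80, 15)
body = Box(0, 20, 100, 200)

print(relation(toolbar, back_button))
print(relation(back_button, title))
print(relation(toolbar, body))
```

Labels like these could serve as supervision targets for pairs of elements on a screen, which is one plausible way to expose layout hierarchy to a model during training.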

Extensive benchmark evaluations demonstrate MP‑GUI’s superiority on GUI understanding tasks such as widget localization, screen summarization, and screen‑based question answering, as well as on screen navigation benchmarks (AITW/Mind2Web), especially for small‑size widget detection.

The authors highlight practical applications, noting that MP‑GUI can accelerate automated software testing and enable GUI agents for mobile office automation, offering significant efficiency and cost benefits.

computer vision, multimodal LLM, MLLM, CVPR2025, GUI Understanding