MP-GUI: Modality Perception with Multimodal Large Language Models for GUI Understanding
The CVPR 2025 paper "MP-GUI: Modality Perception with MLLMs for GUI Understanding" presents an algorithm that strengthens multimodal large language models' ability to perceive and reason about graphical user interfaces. By integrating textual, visual, and spatial signals through specialized perception modules and a dynamic fusion gate, MP-GUI achieves state-of-the-art performance on multiple GUI benchmarks.
The paper, a collaboration between Ant Group and Zhejiang University's EAGLE lab, was accepted at CVPR 2025.
Graphical user interfaces (GUIs) are ubiquitous, but their handcrafted visual elements and dense layouts pose challenges for multimodal large language models (MLLMs), which struggle with fine‑grained element localization and spatial relationship understanding.
To address these issues, the authors propose the MP‑GUI algorithm, which employs three dedicated GUI perception modules to extract textual, visual, and spatial relationship signals from screens, and a semantic‑guided dynamic fusion gate to combine these signals effectively.
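The fusion idea can be illustrated with a minimal sketch: given per-modality features and a query (semantic) embedding, score each modality against the query, normalize the scores with a softmax, and mix the features by those weights. This is an assumption-laden toy (the function name `fusion_gate`, the bilinear scoring matrix `W`, and the feature shapes are all illustrative, not the paper's actual design):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def fusion_gate(text_feat, vis_feat, spa_feat, query_feat, W):
    """Toy semantic-guided gate: score each modality feature against the
    query embedding via a bilinear form, softmax the scores, and return
    the weighted mixture of the three modality features."""
    feats = np.stack([text_feat, vis_feat, spa_feat])  # (3, d)
    scores = feats @ W @ query_feat                    # (3,) per-modality scores
    weights = softmax(scores)                          # (3,) sum to 1
    fused = weights @ feats                            # (d,) weighted mixture
    return fused, weights

# Tiny demo with random features (d = 4).
d = 4
rng = np.random.default_rng(0)
t, v, s, q = (rng.normal(size=d) for _ in range(4))
W = rng.normal(size=(d, d))
fused, w = fusion_gate(t, v, s, q, W)
```

In the actual model the gate would operate on learned token embeddings inside the MLLM; the sketch only shows the gating arithmetic, softmax-weighted mixing conditioned on a semantic query.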
The system also introduces a data‑generation pipeline that leverages Qwen2VL‑72B to synthesize large‑scale GUI‑related training data, supporting the fusion gate’s training.
A multi‑stage training strategy (MTS) is designed for MP‑GUI’s architecture, incorporating distinct training objectives and a novel spatial relationship prediction (SRP) task that explicitly models UI element spatial hierarchies, thereby improving the model’s contextual awareness.
Extensive benchmark evaluations demonstrate MP‑GUI's superiority on GUI understanding tasks such as widget localization, screen summarization, and screen‑based question answering, as well as on screen navigation benchmarks (AITW/Mind2Web), especially for detecting small widgets.
The authors highlight practical applications, noting that MP‑GUI can accelerate automated software testing and enable GUI agents for mobile office automation, offering significant efficiency and cost benefits.