How DiMo-GUI Boosts Multimodal LLMs for GUI Grounding Without Training

DiMo-GUI is a plug‑and‑play framework that markedly improves how multimodal large language models locate GUI elements. Through a hierarchical dynamic visual reasoning loop and modality‑aware optimization, it achieves up to double the performance on high‑resolution GUI benchmarks without any additional training data.

vivo Internet Technology

Abstract: This paper presents DiMo-GUI, a zero‑training GUI grounding framework for multimodal large language models (MLLMs). By employing dynamic visual reasoning and modality‑aware optimization, DiMo‑GUI iteratively crops focus regions and separates text and icon modalities, substantially reducing visual redundancy and balancing multimodal processing. Evaluations on the latest ScreenSpot‑Pro benchmark show significant performance gains, and the framework is applicable to web navigation and mobile app automation.

Introduction

Graphical User Interfaces (GUIs) are pervasive in automation and OS control, making natural‑language‑based GUI grounding a crucial research direction for MLLMs. However, visual complexity, linguistic ambiguity, and spatial clutter in GUI environments pose serious challenges for precise grounding.

Key Improvements

Dynamic Visual Localization: DiMo‑GUI adopts a multi‑stage scaling mechanism that starts with a coarse prediction, generates candidate focus regions, and iteratively crops the image to home in on the target. The process stops when successive coordinate shifts fall below one‑sixth of the image diagonal, preventing over‑thinking.
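The zoom‑in loop above can be sketched roughly as follows. Here `predict` is a hypothetical wrapper around the grounding model (it returns an (x, y) point for the instruction within a given region), and the half‑size crop centred on each prediction is an illustrative choice rather than the paper's exact cropping recipe:

```python
import math

def dynamic_localization(size, instruction, predict, max_iters=4):
    """Iteratively zoom toward a predicted GUI target.

    size: (width, height) of the full screenshot.
    predict(box, instruction): hypothetical model call returning an
    (x, y) point inside `box`, where `box` is a (left, top, right,
    bottom) region in full-image coordinates.
    """
    w, h = size
    diagonal = math.hypot(w, h)
    box = (0.0, 0.0, float(w), float(h))
    prev = None

    for _ in range(max_iters):
        x, y = predict(box, instruction)

        # Stop when successive predictions barely move
        # (shift below one-sixth of the image diagonal).
        if prev is not None and math.hypot(x - prev[0], y - prev[1]) < diagonal / 6:
            return (x, y)
        prev = (x, y)

        # Shrink the focus region to half size, centred on the
        # prediction but clamped to stay inside the current box.
        bw, bh = (box[2] - box[0]) / 2, (box[3] - box[1]) / 2
        left = min(max(box[0], x - bw / 2), box[2] - bw)
        top = min(max(box[1], y - bh / 2), box[3] - bh)
        box = (left, top, left + bw, top + bh)

    return prev
```

A model whose predictions stabilize terminates early; the iteration cap guards against oscillating predictions that never satisfy the distance criterion.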

Modality‑Aware Optimization: GUI elements are split into text and icon groups, each processed independently to produce text coordinates (C_text) and icon coordinates (C_icon). The final target (C*) is selected by jointly evaluating the original instruction and the full‑resolution image, effectively balancing text and icon handling.
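In code, the two‑branch selection might look like the sketch below. Both `ground` and `judge` are hypothetical stand‑ins: `ground` queries the grounding model restricted to one element type, and `judge` re‑evaluates both candidates against the original instruction on the full‑resolution image:

```python
def modality_aware_ground(image, instruction, ground, judge):
    """Ground an instruction per modality, then arbitrate.

    ground(image, instruction, modality): hypothetical model call
    returning a candidate point for 'text' or 'icon' elements only.
    judge(image, instruction, candidates): hypothetical final pass
    that picks the winning coordinate (C*) from both candidates.
    """
    # Query the model once per modality so text-heavy screens do not
    # drown out icon targets (and vice versa).
    candidates = {
        "text": ground(image, instruction, modality="text"),  # C_text
        "icon": ground(image, instruction, modality="icon"),  # C_icon
    }
    # Final arbitration on the original instruction and full image.
    return judge(image, instruction, candidates)
```

Keeping the branches independent until the final judgment is what balances the two modalities: neither candidate is discarded before the full‑resolution comparison.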

Experimental Results

Without any extra training or data, DiMo‑GUI markedly improves performance at inference time. On the high‑resolution ScreenSpot‑Pro dataset, OS‑Atlas‑7B more than doubles its accuracy (18.9% → 49.7%), while UGround‑7B and UGround‑V1‑7B each improve by more than 10%. Similar gains are observed on the simpler ScreenSpot dataset, and qualitative analysis shows the dynamic localization progressively converging on the correct target.

[Figure: Performance comparison chart]

Conclusion

DiMo‑GUI offers an efficient, universal, and training‑free GUI grounding framework that significantly enhances multimodal LLM performance in complex GUI settings through dynamic visual reasoning and modality‑aware optimization. Its plug‑and‑play nature enables seamless integration into existing GUI agents for web navigation and mobile automation, with future work exploring backtracking mechanisms to further improve robustness.

[Figure: Qualitative result illustration]

Keywords: multimodal LLM, test-time scaling, dynamic visual reasoning, GUI grounding, modality-aware optimization
Written by

vivo Internet Technology

Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.
