How DiMo-GUI Boosts Multimodal LLMs for GUI Grounding Without Training
DiMo‑GUI is a plug‑and‑play framework that substantially improves the ability of multimodal large language models to locate GUI elements. By combining a hierarchical dynamic visual reasoning loop with modality‑aware optimization, it achieves up to double the performance on high‑resolution GUI benchmarks without any additional training data.
Abstract: This paper presents DiMo-GUI, a zero‑training GUI grounding framework for multimodal large language models (MLLMs). By employing dynamic visual reasoning and modality‑aware optimization, DiMo‑GUI iteratively crops focus regions and separates text and icon modalities, substantially reducing visual redundancy and balancing multimodal processing. Evaluations on the latest ScreenSpot‑Pro benchmark show significant performance gains, and the framework is applicable to web navigation and mobile app automation.
Introduction
Graphical User Interfaces (GUIs) are pervasive in automation and OS control, making natural‑language‑based GUI grounding a crucial research direction for MLLMs. However, visual complexity, linguistic ambiguity, and spatial clutter in GUI environments pose serious challenges for precise grounding.
Key Improvements
Dynamic Visual Localization: DiMo‑GUI adopts a multi‑stage scaling mechanism that starts with a coarse prediction, generates candidate focus regions, and iteratively crops the image to home in on the target. The process stops once successive coordinate shifts fall below one‑sixth of the image diagonal, preventing over‑thinking.
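The iterative loop described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `predict_point` is a hypothetical stand‑in for an MLLM grounding call, and the halving of the focus region per step is an assumption.

```python
import math

def dynamic_localize(image_size, predict_point, max_steps=4):
    """Iteratively crop toward the predicted point until successive
    coordinate shifts fall below 1/6 of the image diagonal."""
    w, h = image_size
    threshold = math.hypot(w, h) / 6     # stopping criterion from the paper
    region = (0, 0, w, h)                # current crop: (x0, y0, x1, y1)
    prev = predict_point(region)         # coarse initial prediction
    cur = prev
    for _ in range(max_steps):
        # Center a half-size candidate focus region on the last prediction,
        # clamped so it stays inside the current region (assumed policy).
        rw = (region[2] - region[0]) / 2
        rh = (region[3] - region[1]) / 2
        x0 = min(max(prev[0] - rw / 2, region[0]), region[2] - rw)
        y0 = min(max(prev[1] - rh / 2, region[1]), region[3] - rh)
        region = (x0, y0, x0 + rw, y0 + rh)
        cur = predict_point(region)      # re-ground on the cropped view
        if math.dist(prev, cur) < threshold:
            break                        # small shift: stop, avoid over-thinking
        prev = cur
    return cur
```

A real integration would convert between crop‑local and full‑image coordinates and pass the cropped pixels to the model; both details are elided here.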
Modality‑Aware Optimization: GUI elements are split into text and icon groups, each processed independently to produce text coordinates (C_text) and icon coordinates (C_icon). The final target (C*) is selected by jointly evaluating both candidates against the original instruction and the full‑resolution image, effectively balancing text and icon handling.
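The selection step can be sketched as below. Again a hypothetical outline: `ground` and `judge` stand in for separate MLLM queries (per‑modality grounding and joint final evaluation), and their signatures are assumptions, not the paper's API.

```python
def modality_aware_ground(instruction, image, ground, judge):
    """Query text and icon candidates separately, then let a joint
    evaluation over the full-resolution image pick the final target."""
    c_text = ground(instruction, image, modality="text")  # C_text
    c_icon = ground(instruction, image, modality="icon")  # C_icon
    # Joint evaluation: the judge sees the original instruction, the
    # full image, and both candidates, and returns the final C*.
    return judge(instruction, image, candidates=[c_text, c_icon])
```

Keeping the two modality queries independent is what prevents dense on‑screen text from drowning out icon candidates (and vice versa) before the final joint decision.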
Experimental Results
Without any extra training or data, DiMo‑GUI markedly improves performance at inference time. On the high‑resolution ScreenSpot‑Pro dataset, OS‑Atlas‑7B improves from 18.9% to 49.7%, more than doubling its accuracy, while UGround‑7B and UGround‑V1‑7B each gain over 10%. Similar gains appear on the simpler ScreenSpot dataset, and qualitative analysis shows the dynamic localization progressively converging on the correct target.
Conclusion
DiMo‑GUI offers an efficient, universal, and training‑free GUI grounding framework that significantly enhances multimodal LLM performance in complex GUI settings through dynamic visual reasoning and modality‑aware optimization. Its plug‑and‑play nature enables seamless integration into existing GUI agents for web navigation and mobile automation, with future work exploring backtracking mechanisms to further improve robustness.
vivo Internet Technology
