How UI‑Venus Achieves SOTA in Multimodal GUI Agent Benchmarks
Ant Group's open‑source native GUI agent UI‑Venus leverages multimodal large‑model and reinforcement‑learning techniques to outperform prior models on grounding and navigation benchmarks, while using a high‑quality data pipeline and a self‑evolving alignment mechanism to push the limits of GUI automation.
Ant Group recently released and open‑sourced a native GUI intelligent agent called UI‑Venus. Built on a multimodal large model, UI‑Venus can autonomously operate phones, computers, and web interfaces to execute complex GUI tasks, achieving state‑of‑the‑art (SOTA) results on several authoritative benchmark datasets.
Evaluation Dimensions
GUI‑Agent research focuses on two core capabilities: Grounding (mapping a natural‑language instruction to a visual target on screen, typically a single click) and Navigation (planning and executing multi‑step interactions such as clicks, swipes, and text input). Grounding is the foundation for more complex tasks, while Navigation demands strong instruction comprehension, adherence, and generalization.
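The contrast between the two capabilities can be sketched in code. This is a toy illustration, not UI‑Venus's implementation: the element labels and function names are invented, and matching is done by text lookup where a real model reasons over pixels.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Toy UI model: each element has a label and the center of its bounding
# box on screen. A real grounding model sees only a screenshot.
@dataclass
class Element:
    label: str
    center: Tuple[int, int]

def ground(instruction: str, elements: List[Element]) -> Tuple[int, int]:
    """Grounding: map one instruction to a single click coordinate.
    Here we cheat with label matching; a GUI agent does this visually."""
    for el in elements:
        if el.label.lower() in instruction.lower():
            return el.center
    raise ValueError("no matching element on screen")

def navigate(steps: List[str], elements: List[Element]) -> List[Tuple[int, int]]:
    """Navigation: a multi-step task reduces to repeated grounding plus
    planning and state tracking; here we simply ground each step in order."""
    return [ground(step, elements) for step in steps]
```

In this simplification, navigation is just grounding in a loop; the hard part the benchmarks actually measure is deciding *which* step to take next as the screen changes.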
Grounding Performance
On the ScreenSpot‑V2 and ScreenSpot‑Pro grounding leaderboards, UI‑Venus achieved 95.3% / 61.9% respectively, surpassing previous leaders GTA1 (94.8% / 58.4%) and UI‑TARS‑1.5 (94.2% / 61.6%).
Navigation Performance
In online evaluation on the Android World leaderboard, UI‑Venus reached a SOTA score of 65.9, outperforming UI‑TARS‑1.5 (64.2) and SeedVL‑1.5 (62.1). Offline benchmarks such as Android Control and GUI‑Odyssey also show UI‑Venus achieving near‑optimal results.
Training Strategy and Data Pipeline
UI‑Venus adopts a reinforcement‑learning (RL) approach, which delivers stronger performance with far fewer training tokens compared to standard supervised fine‑tuning (SFT) or continued pre‑training. While models like UI‑TARS used 5 × 10¹⁰ tokens, UI‑Venus reaches SOTA with only about 4 × 10⁸ high‑quality tokens, thanks to a sophisticated data production pipeline.
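A common way to score grounding rollouts in RL training of this kind is a point‑in‑box reward: the model earns credit when its predicted click lands inside the target element's bounding box. The function below is a minimal sketch of that idea, not UI‑Venus's actual reward design, which the article does not detail.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)

def grounding_reward(pred_click: Tuple[int, int], target_box: Box) -> float:
    """Binary RL reward for a grounding action: 1.0 if the predicted
    click falls inside the ground-truth element's box, else 0.0."""
    x, y = pred_click
    x0, y0, x1, y1 = target_box
    return 1.0 if x0 <= x <= x1 and y0 <= y <= y1 else 0.0
```

A sparse, verifiable reward like this is part of why RL can be so token‑efficient: every rollout is automatically checkable, so the model learns from its own attempts rather than from ever more annotated demonstrations.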
The pipeline includes:
Data filtering using a multimodal model to summarize each action and verify trajectory correctness.
Data reconstruction, especially for information‑retrieval tasks, to generate richer trajectories.
Data generation via an automated framework that runs dozens of virtual phones in a cloud environment, producing high‑quality trajectories.
Three‑stage quality control (rule‑based, reward‑model, and human review) that yields roughly 350 K high‑quality trajectories.
Self‑Evolving Historical Thinking Alignment
To address inconsistencies between generated “Thinking” (reasoning) content and the model’s internal inference ability, UI‑Venus introduces a self‑evolving historical alignment mechanism. The model replaces original reasoning annotations with its own reasoning from the previous training round, dynamically adjusting the historical context and improving navigation performance.
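The core move of the mechanism described above, rewriting each step's reasoning annotation with what the model itself produced in the previous round, can be sketched as a simple substitution over a trajectory. Field names here are hypothetical; the article does not specify the data layout.

```python
from typing import Dict, List

def align_history(
    trajectory: List[Dict],           # per step: {"obs": ..., "action": ..., "thinking": str}
    prev_round_thinking: List[str],   # model's own reasoning from the prior training round
) -> List[Dict]:
    """Self-evolving alignment (sketch): replace each step's annotated
    'thinking' with the reasoning the model generated last round, so the
    historical context matches what the model would actually produce."""
    return [
        {**step, "thinking": own}
        for step, own in zip(trajectory, prev_round_thinking)
    ]
```

The intuition is that training on someone else's reasoning style creates a train/inference mismatch; feeding the model its own prior reasoning keeps the history distribution consistent with what it sees when acting.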
Future Outlook
Ant Group plans to continue investing in the GUI‑Agent domain, aiming for breakthroughs in complex interface understanding, automated operation, and multimodal interaction. Enhanced GUI‑Agent capabilities are expected to bring significant value to finance, customer service, and office automation, driving smarter enterprises and better user experiences.