How UI‑Venus Achieves SOTA in Multimodal GUI Agent Benchmarks

Ant Group's open‑source native GUI agent UI‑Venus combines a multimodal large model with reinforcement learning to outperform prior models on grounding and navigation benchmarks, while a high‑quality data pipeline and a self‑evolving alignment mechanism push the limits of GUI automation.

Ant Group recently released and open‑sourced a native GUI agent called UI‑Venus. Built on a multimodal large model, UI‑Venus can autonomously operate phone, desktop, and web interfaces to execute complex GUI tasks, achieving state‑of‑the‑art (SOTA) results on several authoritative benchmarks.

Evaluation Dimensions

GUI‑Agent research focuses on two core capabilities: Grounding (understanding a natural‑language instruction and mapping it to a visual action, typically a single‑step click) and Navigation (planning and executing multi‑step interactions such as clicks, swipes, and text input). Grounding is the foundation for more complex tasks, while Navigation additionally requires strong instruction comprehension, adherence, and generalization.
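
To make the distinction concrete, here is an illustrative sketch; the schemas below are hypothetical, not UI‑Venus's actual input/output format. Grounding maps one instruction plus one screenshot to a single click point, while navigation emits an ordered sequence of steps drawn from a richer action space:

```python
# Illustrative schemas only; not UI-Venus's actual I/O format.
from dataclasses import dataclass
from typing import Literal, Optional, Tuple

@dataclass
class GroundingResult:
    """Grounding: one instruction + one screenshot -> one click point."""
    instruction: str                   # e.g., "Open the settings menu"
    click_xy: Tuple[int, int]          # predicted pixel coordinates on screen

@dataclass
class NavigationStep:
    """Navigation: one step in a multi-step episode over a richer action space."""
    action: Literal["click", "swipe", "type", "wait", "finish"]
    target_xy: Optional[Tuple[int, int]] = None   # for click / swipe origin
    text: Optional[str] = None                    # for type actions

# A navigation episode is an ordered list of steps:
episode = [
    NavigationStep("click", target_xy=(540, 1820)),    # open the search bar
    NavigationStep("type", text="wireless earbuds"),   # enter the query
    NavigationStep("finish"),                          # task complete
]
```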

Grounding Performance

On the ScreenSpot‑V2 and ScreenSpot‑Pro grounding leaderboards, UI‑Venus achieved 95.3% / 61.9% respectively, surpassing previous leaders GTA1 (94.8% / 58.4%) and UI‑TARS‑1.5 (94.2% / 61.6%).

Navigation Performance

In online evaluation on the Android World leaderboard, UI‑Venus reached a SOTA score of 65.9, outperforming UI‑TARS‑1.5 (64.2) and SeedVL‑1.5 (62.1). On offline benchmarks such as Android Control and GUI‑Odyssey, UI‑Venus also scores close to the best reported results.

Training Strategy and Data Pipeline

UI‑Venus adopts a reinforcement‑learning (RL) approach, which delivers stronger performance with far fewer training tokens than standard supervised fine‑tuning (SFT) or continued pre‑training. While models like UI‑TARS used 5 × 10¹⁰ tokens, UI‑Venus reaches SOTA with only about 4 × 10⁸ high‑quality tokens, thanks to a sophisticated data production pipeline.
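
As a rough illustration of why RL with verifiable rewards pairs well with GUI data, here is a minimal group‑relative policy‑gradient (GRPO‑style) sketch in PyTorch: a rule‑based reward checks whether a sampled click lands inside the ground‑truth box, and the group mean serves as the baseline. This is a generic sketch, not UI‑Venus's published training objective.

```python
import torch

def click_reward(pred_xy, gt_box):
    """Rule-based, verifiable reward: 1.0 if the click lands inside the box."""
    x, y = pred_xy
    x0, y0, x1, y1 = gt_box
    return 1.0 if (x0 <= x <= x1 and y0 <= y <= y1) else 0.0

def grpo_loss(log_probs, rewards):
    """Group-relative policy gradient: advantage = reward minus group mean.

    log_probs: tensor of shape (group_size,), log-prob of each sampled action.
    rewards:   list of floats, one verifiable reward per sampled action.
    """
    r = torch.tensor(rewards)
    advantages = r - r.mean()                 # group baseline, no critic needed
    return -(advantages * log_probs).mean()   # maximize advantage-weighted log-prob

# Toy usage: 4 sampled clicks for one instruction, some of which hit the target.
log_probs = torch.randn(4, requires_grad=True)
rewards = [click_reward(xy, (10, 10, 50, 50))
           for xy in [(20, 20), (60, 60), (30, 40), (5, 5)]]
loss = grpo_loss(log_probs, rewards)
loss.backward()
```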

The pipeline includes four stages (a simplified sketch follows the list):

- Data filtering: a multimodal model summarizes each action and verifies trajectory correctness.
- Data reconstruction: information‑retrieval tasks in particular are rebuilt to generate richer trajectories.
- Data generation: an automated framework runs dozens of virtual phones in a cloud environment, producing high‑quality trajectories.
- Three‑stage quality control: rule‑based checks, a reward model, and human review together yield roughly 350K high‑quality trajectories.
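
A highly simplified sketch of the filtering and quality‑control ideas follows; `vlm_judge`, `rule_checks`, `reward_model`, and `human_review` are hypothetical stand‑ins, and the real pipeline's prompts and thresholds are not detailed in the announcement.

```python
# Hypothetical sketch of trajectory filtering and three-stage quality control.
# All helper callables here are assumed, not actual UI-Venus components.

def filter_trajectory(trajectory, vlm_judge):
    """Keep a trajectory only if every step passes the multimodal check.

    Each step holds a screenshot and the action taken; the judge summarizes
    the action and verifies that it advances the task.
    """
    for step in trajectory.steps:
        verdict = vlm_judge(
            image=step.screenshot,
            question=(f"Task: {trajectory.instruction}\n"
                      f"Action taken: {step.action}\n"
                      "Summarize this action. Does it correctly advance "
                      "the task? Answer yes or no."),
        )
        if "yes" not in verdict.lower():
            return False  # drop trajectories with any unverified step
    return True

def three_stage_qc(trajectories, rule_checks, reward_model, human_review):
    """Three-stage quality control: rules -> reward model -> human review."""
    stage1 = [t for t in trajectories if all(check(t) for check in rule_checks)]
    stage2 = [t for t in stage1 if reward_model(t) >= 0.5]  # threshold assumed
    return [t for t in stage2 if human_review(t)]
```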

Self‑Evolving Historical Thinking Alignment

To address inconsistencies between the generated “Thinking” (reasoning) content and the model’s actual inference behavior, UI‑Venus introduces a self‑evolving historical thinking alignment mechanism: the model replaces the original reasoning annotations with its own reasoning from the previous training round, dynamically adjusting the historical context and improving navigation performance.
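
One way to picture the mechanism is the loose sketch below; helper names such as `generate_reasoning` and `rl_train` are assumed, not UI‑Venus's actual implementation. Between training rounds, the stored “Thinking” text in each trajectory is overwritten with reasoning produced by the previous round's checkpoint, so the history the model conditions on matches what it would actually generate.

```python
# Sketch of self-evolving historical thinking alignment; `generate_reasoning`
# and `rl_train` are hypothetical stand-ins.

def realign_history(trajectories, model):
    """Replace annotated reasoning with the model's own prior-round reasoning."""
    for traj in trajectories:
        for t, step in enumerate(traj.steps):
            # Re-generate the 'Thinking' content for this step using the
            # checkpoint from the previous training round.
            step.thinking = generate_reasoning(
                model,
                instruction=traj.instruction,
                history=traj.steps[:t],   # earlier steps, already realigned
                screenshot=step.screenshot,
            )
    return trajectories

# Training loop: realign histories between rounds, then continue training.
# for round_idx in range(num_rounds):
#     data = realign_history(data, model)
#     model = rl_train(model, data)
```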

Future Outlook

Ant Group plans to continue investing in the GUI‑Agent domain, aiming for breakthroughs in complex interface understanding, automated operation, and multimodal interaction. Enhanced GUI‑Agent capabilities are expected to bring significant value to finance, customer service, and office automation, driving smarter enterprises and better user experiences.
