How AI Agents Like UFO, Mobile-Agent, and UI-TARS Are Shaping 2025 Smartphones
The article examines the underlying GUI‑Agent technologies behind the 2025 “Doubao” smartphone, comparing Microsoft’s UFO series, Alibaba’s Mobile‑Agent v2/v3, and ByteDance’s UI‑TARS, detailing their model foundations, input modalities, action spaces, planning mechanisms, learning strategies, open‑source status, and multi‑agent frameworks.
Key Dimensions Comparison
Core Positioning: Alibaba Mobile-Agent – mobile‑focused multi‑agent system; ByteDance UI‑TARS – cross‑platform native agent model; Microsoft UFO – heterogeneous cross‑platform framework.
Input Modality: Mobile‑Agent uses screenshots + OCR + icon detection; UI‑TARS relies on pure visual screenshots; UFO combines UI Automation (UIA), visual cues, and text.
Model Base: Mobile‑Agent builds on GUI‑Owl, a self‑trained multimodal model derived from Qwen2.5‑VL; UI‑TARS uses a self‑trained Vision‑Language Model ranging from 2 B to 72 B parameters; UFO leverages GPT‑4 with vision capabilities.
Action Space: Mobile‑Agent issues Android ADB commands; UI‑TARS unifies GUI atomic actions, keyboard/mouse, terminal commands, and APIs; UFO supports UIA, Win32, COM, and generic GUI actions (a minimal ADB sketch follows this comparison).
Planning Mechanism: Mobile‑Agent employs multi‑agent collaboration with ReAct‑style reflection; UI‑TARS follows a System‑2 reasoning chain (thought → action); UFO adopts a dual‑brain HostAgent + AppAgent architecture.
Continual Learning: Mobile‑Agent relies on manual rules and trajectory replay; UI‑TARS uses a multi‑turn reinforcement‑learning data flywheel; UFO incorporates Retrieval‑Augmented Generation (documents + Bing + experience).
Open‑Source Status: Mobile‑Agent’s model and demo are open; UI‑TARS models are fully released on HuggingFace; UFO is MIT‑licensed and completely open.
Key Strength: Mobile‑Agent excels at multi‑agent division of labor and self‑reflection; UI‑TARS offers an end‑to‑end VLM across platforms; UFO provides system‑level APIs and RAG‑backed knowledge.
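To make the action‑space differences concrete, here is a minimal sketch of the ADB‑style action space Mobile‑Agent uses on Android. The structured‑action schema and the `execute_action` helper are illustrative assumptions, not Mobile‑Agent's actual interface; only the underlying `adb shell input` commands are standard Android tooling.

```python
import subprocess

def run_adb(*args: str) -> str:
    """Run an adb command and return its stdout (raises on failure)."""
    result = subprocess.run(["adb", *args], capture_output=True, text=True, check=True)
    return result.stdout

def execute_action(action: dict) -> None:
    """Map a structured agent action onto `adb shell input` commands."""
    kind = action["type"]
    if kind == "tap":
        run_adb("shell", "input", "tap", str(action["x"]), str(action["y"]))
    elif kind == "swipe":
        run_adb("shell", "input", "swipe",
                str(action["x1"]), str(action["y1"]),
                str(action["x2"]), str(action["y2"]), str(action.get("ms", 300)))
    elif kind == "text":
        # `adb shell input text` needs spaces escaped as %s
        run_adb("shell", "input", "text", action["text"].replace(" ", "%s"))
    elif kind == "key":
        run_adb("shell", "input", "keyevent", action["keycode"])  # e.g. "KEYCODE_BACK"
    else:
        raise ValueError(f"unsupported action type: {kind}")

# Example: tap the coordinates the model grounded for a "Login" button.
execute_action({"type": "tap", "x": 540, "y": 1720})
```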
Alibaba Mobile‑Agent
GUI‑Owl – Unified Multimodal Foundation Model
Positioning: First native end‑to‑end multimodal GUI‑agent model that unifies perception, localization, reasoning, planning, and execution.
Base Model: Built on Qwen2.5‑VL and further trained on large‑scale GUI interaction data.
Capabilities: Supports cross‑platform GUI automation (Android, Windows, macOS, Web) and both single‑agent autonomy and multi‑agent collaboration.
Multi‑Agent Framework
Manager: Strategic planner that decomposes user commands into sub‑goals and dynamically adjusts plans.
Worker: Executor that selects and performs actionable sub‑goals based on the current state.
Reflector: Self‑evaluation module that judges execution success and generates feedback.
Notetaker: Memory module that records key information (e.g., verification codes, order numbers) for reuse across steps.
RAG Module: Real‑time retrieval of external knowledge such as weather or tutorials.
State‑Driven Loop: Execute → Feedback → Update Plan → Continue (sketched in code below).
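Here is a minimal sketch of how these roles could compose into the state‑driven loop, assuming each role wraps a call to the underlying GUI‑Owl model. The class and method names are illustrative placeholders, not Mobile‑Agent‑v3's actual API.

```python
# Illustrative Manager/Worker/Reflector/Notetaker loop; collaborators are
# duck-typed placeholders standing in for GUI-Owl-backed agents.
def run_task(instruction: str, env, manager, worker, reflector, notetaker,
             max_steps: int = 30) -> dict:
    plan = manager.decompose(instruction)           # sub-goals from the user command
    notes: dict = {}                                # cross-step memory (codes, order numbers, ...)
    for _ in range(max_steps):
        state = env.screenshot()
        subgoal = manager.next_subgoal(plan, state)
        if subgoal is None:                         # plan exhausted -> task done
            break
        action = worker.act(subgoal, state, notes)  # pick a concrete GUI action
        env.execute(action)
        new_state = env.screenshot()
        feedback = reflector.judge(subgoal, state, new_state)
        notes.update(notetaker.extract(new_state))  # persist key info for later steps
        if not feedback.success:                    # reflection drives re-planning
            plan = manager.revise(plan, feedback)
    return notes
```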
https://arxiv.org/abs/2508.15144
Mobile-Agent‑v3: Fundamental Agents for GUI Automation
https://github.com/X-PLUG/MobileAgent
ByteDance UI‑TARS
UI‑TARS compresses perception, reasoning, memory, and action into a single Vision‑Language Model trained on 50 B tokens. Three model sizes are released on HuggingFace: 2 B (on‑device), 7 B (edge), and 72 B (cloud).
System‑2 Reasoning Chain: Generates an explicit “thought” draft before producing an action, enabling dynamic decomposition, reflection, and error correction (see the parsing sketch after this list).
Data Flywheel: Uses sandboxed task generation and reinforcement learning to continuously create new training data; model updates occur bi‑weekly.
Mixed Action Flow: A single task can invoke GUI clicks, terminal commands, and APIs; demo shows opening Notion, crawling data, running Python analysis, and writing results back to the page.
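As an illustration of the System‑2 chain, here is a sketch that parses a “Thought: … Action: …” response of the kind UI‑TARS emits. The example response string and the `parse_response` helper are assumptions for illustration; the authoritative output schema lives in the UI‑TARS model cards and repository.

```python
import re

# A hypothetical model response in the "thought draft, then action" style.
RESPONSE = """\
Thought: The search results loaded, but the filter panel is still open.
I should close it before tapping the first result.
Action: click(start_box='(312, 148)')"""

def parse_response(text: str) -> tuple[str, str]:
    """Split a model response into its reasoning draft and the action call."""
    thought = re.search(r"Thought:\s*(.*?)\s*Action:", text, re.S).group(1)
    action = re.search(r"Action:\s*(.*)", text, re.S).group(1).strip()
    return thought, action

thought, action = parse_response(RESPONSE)
print("model reasoning:", thought)
print("action to execute:", action)  # -> click(start_box='(312, 148)')
```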
https://arxiv.org/pdf/2509.02544
https://github.com/bytedance/ui-tars
UI‑TARS‑2 Technical Report: Advancing GUI Agent with Multi‑Turn Reinforcement Learning
Microsoft UFO
The UFO series (UFO → UFO2 → UFO3) evolves from basic UI automation to a multi‑device orchestration framework called Galaxy, which coordinates agents across heterogeneous platforms.
Declarative DAG Decomposition: Requests are broken into a dynamic Directed Acyclic Graph of TaskStar nodes with dependencies for automatic scheduling and runtime rewriting (a scheduling sketch follows this list).
Result‑Driven Graph Evolution: The DAG adapts continuously based on execution feedback.
Heterogeneous, Asynchronous, Secure Orchestration: Capability‑based device matching, asynchronous execution, safety locks, and formal verification ensure reliable cross‑platform operation.
Unified Agent Interaction Protocol (AIP): WebSocket‑based secure coordination layer with fault tolerance and auto‑reconnect (reconnect pattern sketched below).
Template‑Based MCP Toolkit: Lightweight SDK for rapid agent development, integrating the Model Context Protocol (MCP) to extend tool functionality.
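To illustrate the DAG decomposition, here is a minimal scheduling sketch: tasks become nodes with dependency edges and run as soon as their parents complete. The TaskStar name comes from the article; the dataclass fields and wave scheduler are illustrative assumptions, not UFO3's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class TaskStar:
    name: str
    device: str                        # capability-based device match target
    depends_on: list[str] = field(default_factory=list)

def topological_schedule(tasks: dict[str, TaskStar]) -> list[list[str]]:
    """Group tasks into waves; every task in a wave can run concurrently."""
    done: set[str] = set()
    waves: list[list[str]] = []
    while len(done) < len(tasks):
        wave = [t.name for t in tasks.values()
                if t.name not in done and set(t.depends_on) <= done]
        if not wave:
            raise ValueError("dependency cycle detected")
        waves.append(wave)
        done.update(wave)
    return waves

dag = {
    "fetch_report": TaskStar("fetch_report", device="windows-desktop"),
    "summarize":    TaskStar("summarize", device="cloud", depends_on=["fetch_report"]),
    "notify_phone": TaskStar("notify_phone", device="android", depends_on=["summarize"]),
}
print(topological_schedule(dag))  # [['fetch_report'], ['summarize'], ['notify_phone']]
```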
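And a sketch of the fault‑tolerance behavior the AIP layer implies: a persistent WebSocket channel that re‑registers after drops, with exponential backoff. The endpoint URL and message shape are invented for illustration; AIP's actual wire protocol is defined in the UFO repository.

```python
import asyncio
import json
import websockets  # pip install websockets

async def agent_channel(uri: str = "ws://orchestrator.local:8765/agent"):
    """Keep a persistent channel to the orchestrator; reconnect on failure."""
    backoff = 1
    while True:
        try:
            async with websockets.connect(uri) as ws:
                backoff = 1  # reset after a successful connect
                await ws.send(json.dumps({"type": "register", "device": "android"}))
                async for raw in ws:
                    task = json.loads(raw)
                    print("received task:", task)   # hand off to a local executor
        except (OSError, websockets.ConnectionClosed):
            await asyncio.sleep(backoff)            # auto-reconnect with backoff
            backoff = min(backoff * 2, 30)

# asyncio.run(agent_channel())  # run against a live orchestrator endpoint
```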
https://arxiv.org/pdf/2511.11332
UFO3: Weaving the Digital Agent Galaxy
https://github.com/microsoft/UFO/
