How AI Agents Like UFO, Mobile-Agent, and UI-TARS Are Shaping 2025 Smartphones
The article examines the underlying GUI‑Agent technologies behind the 2025 “Doubao” smartphone, comparing Microsoft’s UFO series, Alibaba’s Mobile‑Agent v2/v3, and ByteDance’s UI‑TARS, detailing their model foundations, input modalities, action spaces, planning mechanisms, learning strategies, open‑source status, and multi‑agent frameworks.
Key Dimensions Comparison
Core Positioning: Alibaba Mobile-Agent – mobile‑focused multi‑agent system; ByteDance UI‑TARS – cross‑platform native agent model; Microsoft UFO – heterogeneous cross‑platform framework.
Input Modality: Mobile‑Agent uses screenshots + OCR + icon detection; UI‑TARS relies on pure visual screenshots; UFO combines UI Automation (UIA), visual cues, and text.
Model Base: Mobile‑Agent builds on GUI‑Owl, a self‑trained multimodal model derived from Qwen2.5‑VL; UI‑TARS uses a self‑trained Vision‑Language Model ranging from 2 B to 72 B parameters; UFO leverages GPT‑4 with vision capabilities.
Action Space: Mobile‑Agent issues Android ADB commands; UI‑TARS unifies GUI atomic actions, keyboard/mouse, terminal commands, and APIs; UFO supports UIA, Win32, COM, and generic GUI actions (a minimal ADB sketch follows this comparison).
Planning Mechanism: Mobile‑Agent employs multi‑agent collaboration with ReAct‑style reflection; UI‑TARS follows a System‑2 reasoning chain (thought → action); UFO adopts a dual‑brain HostAgent + AppAgent architecture.
Continual Learning: Mobile‑Agent relies on manual rules and trajectory replay; UI‑TARS uses a multi‑turn reinforcement‑learning data flywheel; UFO incorporates Retrieval‑Augmented Generation (documents + Bing + experience).
Open‑Source Status: Mobile‑Agent’s model and demo are open; UI‑TARS models are fully released on HuggingFace; UFO is MIT‑licensed and completely open.
Key Strength: Mobile‑Agent excels at multi‑agent division of labor and self‑reflection; UI‑TARS offers an end‑to‑end VLM across platforms; UFO provides system‑level APIs and RAG‑backed knowledge.
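To make the action‑space differences concrete, here is a minimal sketch of the ADB‑style action space Mobile‑Agent uses on Android. The structured‑action schema and the `execute_action` helper are illustrative assumptions, not Mobile‑Agent's actual interface; only the underlying `adb shell input` commands are standard Android tooling.

```python
import subprocess

def run_adb(*args: str) -> str:
    """Run an adb command and return its stdout (raises on failure)."""
    result = subprocess.run(["adb", *args], capture_output=True, text=True, check=True)
    return result.stdout

def execute_action(action: dict) -> None:
    """Map a structured agent action onto `adb shell input` commands."""
    kind = action["type"]
    if kind == "tap":
        run_adb("shell", "input", "tap", str(action["x"]), str(action["y"]))
    elif kind == "swipe":
        run_adb("shell", "input", "swipe",
                str(action["x1"]), str(action["y1"]),
                str(action["x2"]), str(action["y2"]), str(action.get("ms", 300)))
    elif kind == "text":
        # `adb shell input text` needs spaces escaped as %s
        run_adb("shell", "input", "text", action["text"].replace(" ", "%s"))
    elif kind == "key":
        run_adb("shell", "input", "keyevent", action["keycode"])  # e.g. "KEYCODE_BACK"
    else:
        raise ValueError(f"unsupported action type: {kind}")

# Example: tap the coordinates the model grounded for a "Login" button.
execute_action({"type": "tap", "x": 540, "y": 1720})
```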
Alibaba Mobile‑Agent
GUI‑Owl – Unified Multimodal Foundation Model
Positioning: First native end‑to‑end multimodal GUI‑agent model that unifies perception, localization, reasoning, planning, and execution.
Base Model: Built on Qwen2.5‑VL and further trained on large‑scale GUI interaction data.
Capabilities: Supports cross‑platform GUI automation (Android, Windows, macOS, Web) and both single‑agent autonomy and multi‑agent collaboration.
Multi‑Agent Framework
Manager: Strategic planner that decomposes user commands into sub‑goals and dynamically adjusts plans.
Worker: Executor that selects and performs actionable sub‑goals based on the current state.
Reflector: Self‑evaluation module that judges execution success and generates feedback.
Notetaker: Memory module that records key information (e.g., verification codes, order numbers) for reuse across steps.
RAG Module: Real‑time retrieval of external knowledge such as weather or tutorials.
State‑Driven Loop: Execute → Feedback → Update Plan → Continue (sketched in code below).
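Here is a minimal sketch of how these roles could compose into the state‑driven loop, assuming each role wraps a call to the underlying GUI‑Owl model. The class and method names are illustrative placeholders, not Mobile‑Agent‑v3's actual API.

```python
# Illustrative Manager/Worker/Reflector/Notetaker loop; collaborators are
# duck-typed placeholders standing in for GUI-Owl-backed agents.
def run_task(instruction: str, env, manager, worker, reflector, notetaker,
             max_steps: int = 30) -> dict:
    plan = manager.decompose(instruction)           # sub-goals from the user command
    notes: dict = {}                                # cross-step memory (codes, order numbers, ...)
    for _ in range(max_steps):
        state = env.screenshot()
        subgoal = manager.next_subgoal(plan, state)
        if subgoal is None:                         # plan exhausted -> task done
            break
        action = worker.act(subgoal, state, notes)  # pick a concrete GUI action
        env.execute(action)
        new_state = env.screenshot()
        feedback = reflector.judge(subgoal, state, new_state)
        notes.update(notetaker.extract(new_state))  # persist key info for later steps
        if not feedback.success:                    # reflection drives re-planning
            plan = manager.revise(plan, feedback)
    return notes
```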
https://arxiv.org/abs/2508.15144
Mobile-Agent‑v3: Fundamental Agents for GUI Automation
https://github.com/X-PLUG/MobileAgent
ByteDance UI‑TARS
UI‑TARS compresses perception, reasoning, memory, and action into a single Vision‑Language Model trained on 50 B tokens. Three model sizes are released on HuggingFace: 2 B (on‑device), 7 B (edge), and 72 B (cloud).
System‑2 Reasoning Chain: Generates an explicit “thought” draft before producing an action, enabling dynamic decomposition, reflection, and error correction (see the parsing sketch after this list).
Data Flywheel: Uses sandboxed task generation and reinforcement learning to continuously create new training data; model updates occur bi‑weekly.
Mixed Action Flow: A single task can invoke GUI clicks, terminal commands, and APIs; demo shows opening Notion, crawling data, running Python analysis, and writing results back to the page.
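As an illustration of the System‑2 chain, here is a sketch that parses a “Thought: … Action: …” response of the kind UI‑TARS emits. The example response string and the `parse_response` helper are assumptions for illustration; the authoritative output schema lives in the UI‑TARS model cards and repository.

```python
import re

# A hypothetical model response in the "thought draft, then action" style.
RESPONSE = """\
Thought: The search results loaded, but the filter panel is still open.
I should close it before tapping the first result.
Action: click(start_box='(312, 148)')"""

def parse_response(text: str) -> tuple[str, str]:
    """Split a model response into its reasoning draft and the action call."""
    thought = re.search(r"Thought:\s*(.*?)\s*Action:", text, re.S).group(1)
    action = re.search(r"Action:\s*(.*)", text, re.S).group(1).strip()
    return thought, action

thought, action = parse_response(RESPONSE)
print("model reasoning:", thought)
print("action to execute:", action)  # -> click(start_box='(312, 148)')
```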
https://arxiv.org/pdf/2509.02544
https://github.com/bytedance/ui-tars
UI‑TARS‑2 Technical Report: Advancing GUI Agent with Multi‑Turn Reinforcement Learning
Microsoft UFO
The UFO series (UFO → UFO2 → UFO3) evolves from basic UI automation to a multi‑device orchestration framework called Galaxy, which coordinates agents across heterogeneous platforms.
Declarative DAG Decomposition: Requests are broken into a dynamic Directed Acyclic Graph of TaskStar nodes with dependencies for automatic scheduling and runtime rewriting (a scheduling sketch follows this list).
Result‑Driven Graph Evolution: The DAG adapts continuously based on execution feedback.
Heterogeneous, Asynchronous, Secure Orchestration: Capability‑based device matching, asynchronous execution, safety locks, and formal verification ensure reliable cross‑platform operation.
Unified Agent Interaction Protocol (AIP): WebSocket‑based secure coordination layer with fault tolerance and auto‑reconnect (reconnect pattern sketched below).
Template‑Based MCP Toolkit: Lightweight SDK for rapid agent development, integrating the Model Context Protocol (MCP) to extend tool functionality.
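To illustrate the DAG decomposition, here is a minimal scheduling sketch: tasks become nodes with dependency edges and run as soon as their parents complete. The TaskStar name comes from the article; the dataclass fields and wave scheduler are illustrative assumptions, not UFO3's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class TaskStar:
    name: str
    device: str                        # capability-based device match target
    depends_on: list[str] = field(default_factory=list)

def topological_schedule(tasks: dict[str, TaskStar]) -> list[list[str]]:
    """Group tasks into waves; every task in a wave can run concurrently."""
    done: set[str] = set()
    waves: list[list[str]] = []
    while len(done) < len(tasks):
        wave = [t.name for t in tasks.values()
                if t.name not in done and set(t.depends_on) <= done]
        if not wave:
            raise ValueError("dependency cycle detected")
        waves.append(wave)
        done.update(wave)
    return waves

dag = {
    "fetch_report": TaskStar("fetch_report", device="windows-desktop"),
    "summarize":    TaskStar("summarize", device="cloud", depends_on=["fetch_report"]),
    "notify_phone": TaskStar("notify_phone", device="android", depends_on=["summarize"]),
}
print(topological_schedule(dag))  # [['fetch_report'], ['summarize'], ['notify_phone']]
```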
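And a sketch of the fault‑tolerance behavior the AIP layer implies: a persistent WebSocket channel that re‑registers after drops, with exponential backoff. The endpoint URL and message shape are invented for illustration; AIP's actual wire protocol is defined in the UFO repository.

```python
import asyncio
import json
import websockets  # pip install websockets

async def agent_channel(uri: str = "ws://orchestrator.local:8765/agent"):
    """Keep a persistent channel to the orchestrator; reconnect on failure."""
    backoff = 1
    while True:
        try:
            async with websockets.connect(uri) as ws:
                backoff = 1  # reset after a successful connect
                await ws.send(json.dumps({"type": "register", "device": "android"}))
                async for raw in ws:
                    task = json.loads(raw)
                    print("received task:", task)   # hand off to a local executor
        except (OSError, websockets.ConnectionClosed):
            await asyncio.sleep(backoff)            # auto-reconnect with backoff
            backoff = min(backoff * 2, 30)

# asyncio.run(agent_channel())  # run against a live orchestrator endpoint
```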
https://arxiv.org/pdf/2511.11332
UFO3: Weaving the Digital Agent Galaxy
https://github.com/microsoft/UFO/
