Multimodal Mobile AI Agent (Mobile‑Agent): From V1 to V2 and Open‑Source Practice
This article introduces Alibaba Tongyi Lab's multimodal mobile AI agent, Mobile‑Agent, covering the background of large‑model agents, the design and capabilities of V1 and V2, the multi‑agent framework, evaluation results, open‑source resources, and future development directions.
The presentation begins with an overview of large‑model agents, highlighting their extensive world knowledge, reasoning, planning, and tool‑calling abilities, which enable a wide range of AI applications.
It then defines a model‑based AI agent as a system that observes its environment, thinks, and takes autonomous actions, requiring a profile, memory, and planning components.
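The observe–think–act loop with profile, memory, and planning components can be sketched minimally as below. All class and method names here are illustrative assumptions, not the actual Mobile‑Agent code; the `plan` method stands in for a real multimodal LLM call.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    profile: str                                 # role/persona prompt for the model
    memory: list = field(default_factory=list)   # past (observation, action) pairs

    def plan(self, observation: str) -> str:
        # Stub for an LLM call that would combine profile, memory,
        # and the current observation into the next action.
        return f"act-on:{observation}"

    def step(self, observation: str) -> str:
        action = self.plan(observation)
        self.memory.append((observation, action))  # persist for later planning
        return action

agent = Agent(profile="mobile UI operator")
print(agent.step("home screen"))  # -> act-on:home screen
```

The point of the sketch is the separation of concerns: the profile shapes behavior, memory accumulates context across steps, and planning turns an observation into an autonomous action.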
Recent advances such as MetaGPT, Auto‑GPT, HuggingGPT, and ModelScope‑Agent are cited as influential frameworks.
The authors describe three research tracks: efficiency‑oriented assistants, personalized profile agents, and multimodal agents that operate across mobile and PC terminals.
Mobile‑Agent V1, released in January 2024, uses a pure‑vision approach that requires no system metadata (such as XML view hierarchies), supports multi‑app operations, and performs holistic perception, planning, and reflection. It demonstrates tasks like weather queries, video browsing, and navigation on Android emulators.
Mobile‑Agent V2 addresses V1's limitations on complex and Chinese instructions by adopting a Multi‑Agent architecture (Planning, Decision, and Reflection agents), enabling better handling of long‑sequence tasks, multi‑language support, and cross‑app operations such as ride‑hailing, social media interactions, and video searches.
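The division of labor among the three V2 roles can be illustrated with the toy loop below. The function names, message formats, and control flow are assumptions for exposition, not the actual V2 implementation: in the real system each role would be a multimodal LLM call operating on screenshots.

```python
def planning_agent(task: str, history: list) -> str:
    """Condense progress so far into a compact plan, so the decision
    step is not burdened with the full long-sequence history."""
    return f"plan for '{task}' after {len(history)} steps"

def decision_agent(plan: str, screen: str) -> str:
    """Choose the next concrete UI action from the plan and current screen."""
    return f"action on '{screen}' given ({plan})"

def reflection_agent(before: str, after: str) -> bool:
    """Verify the action took effect by comparing screens; a real
    reflection agent would also trigger retries on failure."""
    return before != after

def run(task: str, screens: list) -> list:
    history = []
    # Each iteration pairs the screen before an action with the screen after it.
    for before, after in zip(screens, screens[1:]):
        plan = planning_agent(task, history)
        action = decision_agent(plan, before)
        if reflection_agent(before, after):
            history.append(action)
    return history

print(run("search a video", ["home", "app opened", "results shown"]))
```

The design point this mirrors is that planning compresses history, decision acts locally, and reflection closes the loop, which is what lets V2 stay coherent over long operation sequences where a single monolithic agent drifts.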
Extensive dynamic evaluation on built‑in and third‑party apps shows V2’s superior success rate, step completion, and decision/reflection accuracy compared to V1, especially on longer sequences.
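The metrics named above can be stated concretely; the definitions below are a plausible reading (fractions of tasks and of required steps), and the exact formulations in the evaluation may differ.

```python
def success_rate(tasks_completed: int, tasks_total: int) -> float:
    """Fraction of instructions completed fully, end to end."""
    return tasks_completed / tasks_total

def step_completion(correct_steps: int, required_steps: int) -> float:
    """Fraction of required operation steps executed correctly;
    rewards partial progress on long sequences."""
    return correct_steps / required_steps

# Illustrative numbers only, not results from the paper.
print(f"success: {success_rate(17, 20):.0%}, steps: {step_completion(88, 100):.0%}")
```

Step completion matters for long-sequence tasks precisely because an agent can make substantial correct progress yet still miss full task success.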
The open‑source release on GitHub includes demos, deployment scripts, and integration with ModelScope‑Agent, allowing users to run the system via Android emulators or screenshot‑based demos on ModelScope and Hugging Face.
Future work includes extending Mobile‑Agent to desktop terminals, iOS, gaming, and personalized services, as well as developing Mobile‑Agent V3 based on GPT‑4V/4o and exploring fully open‑source models.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.