
From J.A.R.V.I.S. to Real AI Agents: A Must‑Read Guide to Modern GUI Agents

This article provides a comprehensive overview of AI agents, focusing on GUI‑based agents, their definitions, classifications, core capabilities, recent research such as OpenAI's ComputerUse, SpiritSight and MobileFlow, practical applications, technical and security challenges, and future development directions.

Software Engineering 3.0 Era

Mar 14, 2025

1. Definition and Classification

An AI agent is an autonomous system that perceives its environment, makes decisions, and takes actions to achieve specific goals, possessing memory, planning, tool use, and reflection capabilities. Compared with traditional AI, agents are self‑directed, continuous, and adaptable.

1.1 What is an Agent?

An agent senses, plans, and acts. This cycle is usually drawn as a loop of perception, reasoning (including chain‑of‑thought), and execution.

1.2 OS Agent (Operating‑System Agent)

OS Agents interact with the graphical user interfaces (GUIs) of computers and mobile devices. According to the recent OS Agent survey, they consist of three key components:

Environment – the operating system context (Windows, macOS, Android, etc.).

Observation space – how the agent gathers information (screen captures, DOM structures, etc.).

Action space – the set of possible operations (click, type, swipe, etc.).
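The three components above can be sketched as plain data types. This is a minimal illustration, not an interface taken from the survey: the names Observation, Action, Environment, and FakeEnvironment, and all of their fields, are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Protocol


@dataclass
class Observation:
    """What the agent sees at one step (hypothetical fields)."""
    screenshot_png: Optional[bytes] = None  # raw pixels, for vision-based agents
    dom_text: Optional[str] = None          # HTML/XML structure, if available


@dataclass
class Action:
    """One operation from the action space (hypothetical schema)."""
    kind: str                               # e.g. "click", "type", "swipe"
    args: dict = field(default_factory=dict)


class Environment(Protocol):
    """The OS context the agent runs in, e.g. Windows, macOS, or Android."""
    def observe(self) -> Observation: ...
    def step(self, action: Action) -> Observation: ...


class FakeEnvironment:
    """In-memory stand-in so the sketch runs without a real device."""
    def __init__(self) -> None:
        self.log: List[Action] = []

    def observe(self) -> Observation:
        return Observation(dom_text="<button id='ok'>OK</button>")

    def step(self, action: Action) -> Observation:
        self.log.append(action)             # record instead of touching a GUI
        return self.observe()


env = FakeEnvironment()
env.step(Action("click", {"x": 40, "y": 12}))
```

Keeping the observation and action types separate from the environment makes it possible to swap a real device for a recorded or simulated one without changing agent code.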

1.3 Main Categories of GUI Agents

Based on input modality and implementation, GUI agents are divided into:

Language‑only agents: use textual descriptions such as HTML/XML.

Vision‑only agents: rely solely on screenshots.

Vision‑language hybrid agents: combine screenshots with textual cues.

Vision‑only agents (e.g., SpiritSight) and hybrid agents (e.g., MobileFlow) are currently research hotspots because of their cross‑platform compatibility and rich perception.

2. Core Capabilities of Agents

2.1 Understanding

Understanding refers to interpreting user commands and task goals. MobileFlow introduces a GUI Chain‑of‑Thought (CoT) technique that enables step‑by‑step reasoning similar to human thought, improving comprehension of complex tasks.

2.2 Perception and Localization

Perception is the foundation for GUI agents. A major challenge is element grounding:

SpiritSight proposes Universal Block Parsing (UBP) to resolve the positional ambiguity that arises when high‑resolution screenshots are split into blocks under dynamic‑resolution input.

MobileFlow’s hybrid visual encoder supports variable‑resolution inputs, enhancing detail perception.

OpenAI’s ComputerUse runs a closed loop: the model analyzes a screenshot of the entire screen, issues an action, and checks the resulting screen before choosing the next one.
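Whatever the grounding method, the end result has to be a screen coordinate the agent can click. When a screenshot is cut into fixed-size blocks before encoding, a position predicted inside one block must be mapped back to the full screen. The sketch below shows only that bookkeeping. It is not the UBP algorithm from SpiritSight; the square blocks, row-major numbering, and function names are assumptions for illustration.

```python
from typing import Tuple


def block_to_screen(block_index: int, local_xy: Tuple[float, float],
                    block_size: int, blocks_per_row: int) -> Tuple[int, int]:
    """Map a point given relative to one block (0..1 per axis) to screen pixels.

    Blocks are assumed square and numbered row by row from the top-left.
    """
    row, col = divmod(block_index, blocks_per_row)
    x = (col + local_xy[0]) * block_size
    y = (row + local_xy[1]) * block_size
    return round(x), round(y)


def bbox_center(bbox: Tuple[int, int, int, int]) -> Tuple[int, int]:
    """Return the click point for a (left, top, right, bottom) bounding box."""
    left, top, right, bottom = bbox
    return (left + right) // 2, (top + bottom) // 2


# A 1344x896 screenshot split into 448-pixel blocks has 3 blocks per row;
# block 4 is the middle block of the second row.
point = block_to_screen(4, (0.5, 0.25), block_size=448, blocks_per_row=3)
```

Without the block index, the same local coordinate would be valid in every block, which is one way to picture the ambiguity that grounding methods have to remove.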

2.3 Planning

Planning splits complex tasks into sequential steps. The survey distinguishes two planning styles:

Global planning – a complete action sequence is generated before execution.

Iterative planning – the plan is continuously refined based on environmental feedback.

MobileFlow exemplifies iterative planning with a four‑step loop: observe → reason → act → summarize.
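The difference between the two planning styles can be shown on a toy problem. The sketch below is an assumption-laden illustration, not MobileFlow's implementation: the Counter environment, the "inc"/"dec" actions, and both function names are invented for this example.

```python
from typing import List


class Counter:
    """Toy environment: a number that must be moved to a target value."""
    def __init__(self, start: int) -> None:
        self.value = start

    def apply(self, action: str) -> None:
        self.value += 1 if action == "inc" else -1


def global_plan(start: int, target: int) -> List[str]:
    """Global planning: produce the whole action sequence before acting."""
    step = "inc" if target > start else "dec"
    return [step] * abs(target - start)


def iterative_run(env: Counter, target: int, max_steps: int = 20) -> List[str]:
    """Iterative planning: observe, reason, act, summarize, then repeat."""
    history: List[str] = []
    for _ in range(max_steps):
        observed = env.value                              # observe
        if observed == target:
            break
        action = "inc" if observed < target else "dec"    # reason
        env.apply(action)                                 # act
        history.append(f"{observed}->{env.value} via {action}")  # summarize
    return history


plan = global_plan(2, 5)
env = Counter(2)
history = iterative_run(env, 5)
```

A global plan computed from the starting state becomes wrong if the state drifts during execution (for example, an unexpected dialog appears), whereas the iterative loop re-observes before every action at the cost of one extra model call per step.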

2.4 Execution

Typical GUI actions include mouse/touch clicks, long‑presses, drags, keyboard input, shortcuts, scrolling, paging, and tab switching.
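Models usually emit these actions as text, which has to be parsed into a structured command before anything is executed. The sketch below assumes a call-like text format such as click(120, 340); that format, the action names in ALLOWED, and the parse_action helper are hypothetical and not taken from any of the systems discussed here.

```python
import ast
import re
from typing import Tuple

# Hypothetical action vocabulary for this example.
ALLOWED = {"click", "long_press", "drag", "type", "hotkey", "scroll"}
_CALL = re.compile(r"^\s*(\w+)\((.*)\)\s*$", re.DOTALL)


def parse_action(text: str) -> Tuple[str, tuple]:
    """Parse 'name(arg, ...)' into (name, args); reject anything unexpected."""
    match = _CALL.match(text)
    if not match:
        raise ValueError(f"not an action call: {text!r}")
    name, arg_src = match.groups()
    if name not in ALLOWED:
        raise ValueError(f"unknown action: {name}")
    # literal_eval accepts only Python literals, so no model output is executed.
    args = ast.literal_eval(f"({arg_src},)") if arg_src.strip() else ()
    return name, args


click = parse_action("click(120, 340)")
typed = parse_action('type("hello")')
```

Validating the action name against a fixed vocabulary before execution also gives a natural place to reject malformed or unexpected model output.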

3. State‑of‑the‑Art Agent Technologies

3.1 OpenAI ComputerUse

ComputerUse enables AI agents to directly manipulate computer interfaces. It builds on the Computer‑Using Agent (CUA) model and leverages GPT‑4o’s visual and reasoning abilities.

Workflow: command understanding → action generation → execution & feedback → state update → iterative improvement.

Supported environments: browsers, macOS, Windows, Ubuntu (mobile platforms not yet supported).

Applications: automated testing, exploratory testing, regression testing, cross‑platform consistency testing.

3.2 SpiritSight – Vision‑Driven GUI Agent

Innovation: introduces the GUI‑Lasagne multi‑level dataset and Universal Block Parsing method.

Features: end‑to‑end pure visual perception, no need for HTML/XML assistance.

Performance: surpasses existing methods on benchmarks such as Multimodal‑Mind2Web.

Cross‑language ability: can be fine‑tuned on a small amount of target‑language data to operate GUIs in other languages, such as Chinese.

3.3 MobileFlow – Mobile‑Focused Agent

Architecture: based on Qwen‑VL‑Chat, with a hybrid visual encoder and about 21 billion parameters.

Features: variable‑resolution input, strong multilingual support, mixture‑of‑experts (MoE) structure.

Training: GUI alignment tasks (localization, referring, QA, description) combined with GUI Chain‑of‑Thought.

Deployments: successfully used in software testing and ad‑preview review scenarios.

4. Application Scenarios

4.1 GUI Automation Testing

Exploratory testing – agents autonomously explore app functions and detect abnormal UI states.

Regression testing – agents remember historical interaction paths and adapt to UI changes.

Cross‑platform testing – agents validate functionality across devices, browsers, and OSes.

Visual reporting – agents generate textual descriptions and screenshots for developers.

Compared with traditional automation, agent‑based testing requires no element‑locating code and adapts to UI changes thanks to multimodal understanding.
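Exploratory testing can be pictured as a search over screens. The sketch below is a simplified model, assuming the app can be described as a small graph of screens and actions; in a real agent the is_abnormal check would be a multimodal judgment on a screenshot. The APP graph, screen names, and both functions are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional

# Hypothetical app model: screen -> {action label: next screen}.
APP: Dict[str, Dict[str, str]] = {
    "home": {"tap_search": "search", "tap_profile": "profile"},
    "search": {"submit_empty": "crash_dialog", "back": "home"},
    "profile": {"back": "home"},
    "crash_dialog": {},
}


def is_abnormal(screen: str) -> bool:
    """Stand-in for a model-based check of the current screenshot."""
    return "crash" in screen


def explore(start: str) -> Optional[List[str]]:
    """Breadth-first exploration; returns the shortest action path
    that reaches an abnormal screen, or None if none is found."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        screen, path = queue.popleft()
        if is_abnormal(screen):
            return path
        for action, nxt in APP[screen].items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [action]))
    return None


repro_steps = explore("home")
```

The returned path doubles as reproduction steps for the bug report, which is the kind of output the visual reporting item describes.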

4.2 Mobile App Automation

E‑commerce – automatic product search, comparison, checkout, and payment.

Form filling – auto‑populate registration or application forms.

Content aggregation – collect and consolidate information from multiple apps.

Intelligent assistants – execute multi‑step tasks such as travel booking or meeting scheduling.

4.3 Desktop System Automation

Document processing – create, edit, and format documents automatically.

Data analysis – collect, clean, analyze, and visualize data.

System management – file handling, software install/uninstall, configuration.

Creative tools – assist image editing, video clipping, and other creative workflows.

5. Challenges Facing Agents

5.1 Technical Challenges

Reliability – OpenAI reports a CUA success rate of 38.1% on OSWorld, a benchmark of full operating‑system tasks, versus higher scores on browser benchmarks.

Element grounding – despite UBP, precise localization remains difficult.

Long‑sequence tasks – reliability drops for multi‑step, time‑consuming operations.

Complex reasoning – limited ability to handle multi‑page, multi‑condition logic.

Multilingual support – performance on non‑English interfaces is weaker.
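The drop in reliability on long‑sequence tasks follows from simple compounding. The sketch below assumes every step succeeds independently with the same probability, which is a simplification, and the 95% figure is illustrative rather than a measured result.

```python
def task_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability")
    return per_step ** steps


for n in (1, 5, 20, 50):
    print(f"{n:>2} steps at 95% per step -> {task_success(0.95, n):.1%}")
```

Under these assumptions a 20‑step task completes only about 36% of the time, which is why per‑step accuracy that looks high can still produce unreliable end‑to‑end behavior.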

5.2 Security and Privacy

Prompt injection – malicious interfaces may inject harmful prompts.

Privacy leakage – agents may encounter sensitive data during operation.

Permission control – need mechanisms to restrict agents to authorized actions.

Potential abuse – unauthorized automated actions could be exploited.
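Permission control can be enforced outside the model, so that even a successfully injected prompt cannot trigger an unauthorized action. The sketch below is one possible design under stated assumptions: the Policy fields, the PermissionDenied exception, the authorize function, and the app and action names are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Policy:
    """Allow-list of what the agent may do (hypothetical schema)."""
    allowed_actions: FrozenSet[str]
    allowed_apps: FrozenSet[str]
    needs_confirmation: FrozenSet[str] = frozenset()


class PermissionDenied(Exception):
    """Raised when the policy blocks an action."""


def authorize(policy: Policy, app: str, action: str,
              confirmed: bool = False) -> None:
    """Raise PermissionDenied unless the policy allows this action in this app."""
    if app not in policy.allowed_apps:
        raise PermissionDenied(f"app not allowed: {app}")
    if action not in policy.allowed_actions:
        raise PermissionDenied(f"action not allowed: {action}")
    if action in policy.needs_confirmation and not confirmed:
        raise PermissionDenied(f"user confirmation required: {action}")


policy = Policy(
    allowed_actions=frozenset({"click", "type", "pay"}),
    allowed_apps=frozenset({"shop"}),
    needs_confirmation=frozenset({"pay"}),
)
authorize(policy, "shop", "click")  # passes silently
```

Calling authorize before every executed action keeps the check deterministic and auditable, independent of what the model was persuaded to output.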

5.3 Deployment and Integration

Compute cost – high‑quality GUI agents require large models and significant resources.

Latency – real‑time interaction demands low latency despite heavy visual processing.

System integration – seamless incorporation into existing workflows needs extra development.

Version compatibility – UI updates require agents to continuously adapt.

6. Future Directions

6.1 Technical Evolution

Self‑improvement – agents learn from test results to refine strategies.

Multimodal fusion – deeper integration of vision, text, audio, etc.

Domain specialization – tailored agents for finance, healthcare, and other sectors.

Tool augmentation – embedding OCR, computer‑vision, search, and other utilities.

6.2 Cross‑Platform and Generalization

Unified interfaces – standard APIs that work across devices and platforms.

Mobile‑desktop collaboration – coordinated tasks between phones and PCs.

Web‑native convergence – support for both web and native applications.

IoT control – extending agents to smart‑home and industrial device interfaces.

6.3 Personalization and Self‑Evolution

User preference learning – adapt to individual habits and preferences.

Continuous adaptation – evolve as user behavior changes.

Proactive suggestions – propose task optimizations based on historical data.

Self‑assessment – agents evaluate their performance and improve strategies.

Conclusion

GUI‑based AI agents have progressed rapidly, from OpenAI’s ComputerUse to research systems such as SpiritSight and MobileFlow. They are reshaping software testing and human‑computer interaction. Technical, security, and ethical hurdles remain, but their potential to become capable digital assistants is substantial.
