Survey of Computer-Use Agents: Terminal/CLI vs GUI Paths

The article surveys recent advances in computer-use agents, categorizing them into terminal/CLI‑based and GUI‑based routes, detailing representative systems, benchmarks, and open challenges such as error accumulation, safety, and evaluation gaps.


Large language models are extending from pure text generation to operating computers, enabling tasks such as file manipulation, command execution, web browsing, and graphical‑interface interaction. These systems are called computer‑use agents and follow a tool‑use loop: observe → decide → act → receive feedback → re‑plan.
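
A minimal sketch of that loop, with hypothetical observe_env, llm_decide, and execute helpers standing in for the environment and the model call:

```python
from dataclasses import dataclass

@dataclass
class Action:
    kind: str      # "shell", "edit", "click", ... or "finish"
    payload: str   # command text, patch, coordinates, or final answer

def observe_env() -> str:
    """Stub: return the current observation (shell output, screenshot, DOM)."""
    return "cwd: /workspace"

def llm_decide(goal: str, history: list, observation: str) -> Action:
    """Stub: in a real agent this is an LLM call that plans the next action."""
    return Action(kind="finish", payload="done")

def execute(action: Action) -> str:
    """Stub: apply the action to the real environment and return feedback."""
    return "ok"

def run_agent(goal: str, max_steps: int = 20) -> str:
    history: list = []
    for _ in range(max_steps):
        observation = observe_env()                      # observe
        action = llm_decide(goal, history, observation)  # decide
        if action.kind == "finish":
            return action.payload
        feedback = execute(action)                       # act + feedback
        history.append((observation, action, feedback))  # re-plan next turn
    raise TimeoutError("step budget exhausted")
```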

Terminal/CLI‑Based Agents

Terminal agents interact with the computer purely through text: shells, code execution, and file operations. Their development has progressed through several stages.

Code‑completion → Conversational Assistant

Early tools such as GitHub Copilot provided line‑level autocomplete without autonomous action. Later products like Cursor and Aider support task‑level natural‑language requests (e.g., “fix this bug”) but still require a human to apply the suggested changes.

Autonomous Programming Agents

Agents receive a goal, locate the problem, generate patches, run tests, and iterate until success, using the tool‑use loop.

SWE‑Agent introduces an agent‑computer interface (ACI) with dedicated code‑navigation and edit commands; experiments show ACI outperforms raw bash interaction.
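
The intuition behind an ACI can be shown with two commands. This is a simplified sketch, with names modeled on (but not identical to) SWE-Agent's interface: file views are windowed and line-numbered so the model receives compact, navigable observations instead of raw `cat` output:

```python
from pathlib import Path

WINDOW = 100  # lines per view: keeps each observation short for the model

def open_file(path: str, line: int = 1) -> str:
    """Show a numbered WINDOW-line slice of the file starting near `line`."""
    lines = Path(path).read_text().splitlines()
    start = max(0, line - 1)
    view = lines[start:start + WINDOW]
    return "\n".join(f"{start + i + 1}: {text}" for i, text in enumerate(view))

def edit(path: str, start: int, end: int, replacement: str) -> str:
    """Replace lines start..end (1-indexed, inclusive) and confirm the edit."""
    lines = Path(path).read_text().splitlines()
    lines[start - 1:end] = replacement.splitlines()
    Path(path).write_text("\n".join(lines) + "\n")
    return f"{path}: lines {start}-{end} replaced"
```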

OpenHands is an open‑source platform that supports multiple agent architectures, typically runs in Docker sandboxes, and provides a complete experimental environment for building and comparing autonomous programming agents.

Agentless adopts a three‑stage pipeline (locate → fix → verify) without a full loop, offering a simpler and cheaper design while remaining competitive.
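
A sketch of that pipeline's skeleton, with hypothetical locate and fix stubs standing in for the two LLM calls; only verify touches the environment, and there is no retry loop:

```python
import subprocess

def locate(issue_text: str, repo_dir: str) -> list[str]:
    """Stub: one LLM call that ranks suspicious files/functions."""
    return ["src/parser.py"]

def fix(issue_text: str, locations: list[str]) -> str:
    """Stub: one LLM call that drafts a unified-diff patch for the locations."""
    return "--- a/src/parser.py\n+++ b/src/parser.py\n..."

def verify(repo_dir: str) -> bool:
    """Run the test suite once; pass/fail is the verdict."""
    return subprocess.run(["pytest", "-q"], cwd=repo_dir).returncode == 0

def agentless(issue_text: str, repo_dir: str) -> bool:
    patch = fix(issue_text, locate(issue_text, repo_dir))
    applied = subprocess.run(["git", "apply", "-"], cwd=repo_dir,
                             input=patch.encode()).returncode == 0
    return applied and verify(repo_dir)
```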

CodeAct uses executable Python code as the action format instead of JSON tool calls, leveraging code’s compositional power to improve task success.
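
The difference is easiest to see side by side. A hedged illustration (not CodeAct's exact prompt format): a JSON-call agent needs one round-trip per primitive operation, while a code action composes them in a single turn:

```python
# JSON-style tool calling issues one round-trip per primitive call:
#   {"tool": "read_file", "args": {"path": "data.csv"}}
#   {"tool": "mean",      "args": {"column": "price"}}
# A CodeAct-style action does the same work in one executable snippet,
# composing "tools" with variables and loops in a single turn:
import csv, io

data = io.StringIO("price\n3.0\n4.5\n7.5\n")   # stand-in for a real file
prices = [float(row["price"]) for row in csv.DictReader(data)]
print(sum(prices) / len(prices))               # printed output becomes
                                               # the agent's observation
```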

Productized Forms

Commercial “AI software engineer” products expose the capability directly to developers:

Devin – cloud‑IDE assistant that executes end‑to‑end software‑engineering tasks.

Claude Code, Codex CLI, Gemini CLI – terminal agents that run locally and integrate with existing toolchains.

IDE Integration

Agents embedded in editors enable multi‑step workflows without leaving the IDE:

Cursor Agent Mode and Windsurf read the project, edit files, run tests, and iterate based on results.

General‑Purpose Agents

When an LLM can write and run code, manage files, and access the network, its scope expands beyond software engineering.

Manus provides isolated environments per task and grants full access to shell, filesystem, package managers, and browsers.

Anthropic Agent SDK wraps Claude Code capabilities into a developer library for easy integration.

LLM‑in‑Sandbox demonstrates that even a minimal code sandbox dramatically improves performance on chemistry, physics, and mathematics tasks.
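
A minimal sketch of such a sandbox, assuming nothing beyond the standard library; a production sandbox would also drop network and filesystem privileges (e.g., via containers):

```python
import subprocess, sys, tempfile, textwrap

def run_in_sandbox(code: str, timeout: float = 10.0) -> str:
    """Execute model-written Python in a fresh process and working directory.
    A real sandbox would additionally restrict network/filesystem access."""
    with tempfile.TemporaryDirectory() as workdir:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            cwd=workdir, capture_output=True, text=True, timeout=timeout,
        )
    return proc.stdout if proc.returncode == 0 else f"error:\n{proc.stderr}"

# e.g. let the model check a physics computation instead of guessing:
print(run_in_sandbox(textwrap.dedent("""
    g, t = 9.81, 3.0
    print(0.5 * g * t**2)   # distance fallen in 3 s
""")))
```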

GUI‑Based Agents

These agents interact with visual interfaces via mouse clicks and keyboard input. The main difficulty is visual grounding—mapping natural‑language goals to specific screen regions and actions.

Web Agents

Web agents handle browser tasks (navigation, shopping, form filling). Perception follows two strategies, combined in the sketch after this list:

DOM-based: parse the HTML structure.

Vision-based: interpret screenshots. Most systems combine both.
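
A minimal sketch of both observation channels using Playwright (an assumption of this example, not a tool named by the survey; requires `pip install playwright` plus `playwright install`):

```python
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://example.com")

    dom_observation = page.content()        # DOM-based: full HTML
    page.screenshot(path="page.png")        # vision-based: raw pixels
    # Many agents also enumerate interactable elements from the DOM:
    links = page.eval_on_selector_all(
        "a", "els => els.map(e => ({text: e.innerText, href: e.href}))"
    )
    print(len(dom_observation), links[:3])
    browser.close()
```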

SeeAct (GPT‑4V) annotates screenshots to locate elements, showing the feasibility of visual web interaction.

WebVoyager is an end‑to‑end multimodal web agent that relies solely on screenshots to complete complex tasks.

Browser Use is an open‑source Python framework that abstracts browser session management and action execution.
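
A hedged quick-start modeled on the project's published examples; exact class names and parameters may differ across versions, and the LangChain model wrapper here is one common choice rather than a requirement:

```python
import asyncio
from langchain_openai import ChatOpenAI
from browser_use import Agent

async def main():
    agent = Agent(
        task="Find the current top story on Hacker News and return its title",
        llm=ChatOpenAI(model="gpt-4o"),  # any supported chat model works
    )
    result = await agent.run()           # the framework drives the browser
    print(result)

asyncio.run(main())
```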

Operator (OpenAI) is a cloud‑hosted browser agent that performs web tasks on behalf of the user.

Project Mariner (Google) is a Chrome extension that runs locally, executing web actions within the user’s own browser context.

Desktop / OS Agents

Desktop agents target full operating-system environments, which requires understanding arbitrary desktop applications. Approaches rely on pixel screenshots, accessibility trees, or both.

Computer Use (Anthropic) offers a pure‑pixel API that perceives and acts via screen captures, providing strong generality.
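
The pure-pixel style can be approximated locally with off-the-shelf tooling. This sketch uses `pyautogui` (an assumption of this example, not Anthropic's actual API) to show the observation and action primitives such an agent needs:

```python
import pyautogui  # pip install pyautogui

def act(step: dict) -> None:
    """Execute one pixel-level action; `step` is a hypothetical LLM output."""
    if step["type"] == "click":
        pyautogui.click(step["x"], step["y"])
    elif step["type"] == "type":
        pyautogui.write(step["text"], interval=0.02)
    elif step["type"] == "key":
        pyautogui.press(step["key"])

screenshot = pyautogui.screenshot()    # observation: raw pixels only
screenshot.save("screen.png")          # would be sent to the model
act({"type": "click", "x": 200, "y": 120})
```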

UFO (Microsoft) uses the Windows UI Automation API to read structured control information for precise interaction.
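
For contrast, a Windows-only sketch of reading structured control information via the UI Automation backend of `pywinauto` (an illustrative stand-in, not UFO's own code):

```python
# Windows-only: enumerate a window's UI Automation tree the way
# UFO-style agents do, via pywinauto's "uia" backend (pip install pywinauto).
from pywinauto import Desktop

win = Desktop(backend="uia").window(title_re=".*Notepad.*")
for ctrl in win.descendants():
    # Structured info: control type + name, no pixel guessing required.
    print(ctrl.element_info.control_type, repr(ctrl.window_text()))
```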

OmniParser trains a vision model to convert GUI screenshots into structured element representations, bridging pure vision and structured understanding.

UI‑TARS and UI‑TARS‑2 are models trained for GUI grounding; UI‑TARS‑2 adds multi‑turn reinforcement learning to improve error recovery.

OS‑Copilot is a unified framework supporting CLI commands, GUI actions, and API calls, illustrating a trend toward multimodal unified execution interfaces.

Mobile Agents

Mobile agents must address small-screen constraints, touch gestures, and fragmented app ecosystems.

AppAgent learns app interaction patterns to build a cross‑app UI knowledge base.

Mobile‑Agent extends visual GUI automation to Android and iOS, supporting diverse gestures.

Benchmarks

Agent benchmarks require a complete interactive environment so that the system finishes the task end‑to‑end, evaluating not only reasoning but also tool use, environment interaction, and multi‑step planning.
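
A minimal sketch of what end-to-end scoring means in practice, with a hypothetical run_agent callable: the environment itself renders the verdict (here, a test suite), rather than a judge grading the agent's transcript:

```python
import subprocess, tempfile

def evaluate(instruction: str, run_agent) -> bool:
    """Score one task end-to-end: let the agent act in a fresh environment,
    then let the environment judge success (here: does pytest pass?)."""
    workdir = tempfile.mkdtemp()        # stand-in for a fresh container/VM
    run_agent(instruction, workdir)     # the agent observes and acts here
    return subprocess.run(["pytest", "-q"], cwd=workdir).returncode == 0
```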

SWE‑bench, a software‑engineering benchmark built from real GitHub issues and their patches, saw top success rates climb from under 15% in 2024 to over 80% in early 2026, reflecting the rapid progress of autonomous programming agents.

Terminal‑Bench contains 89 high‑difficulty terminal tasks; even state‑of‑the‑art models score below 65%.

WebArena / VisualWebArena use self‑hosted website instances to evaluate multi‑step web‑agent performance.

OSWorld is a cross‑platform desktop benchmark covering Ubuntu, Windows, and macOS.

AndroidWorld includes 20+ real Android apps for dynamic mobile evaluation.

TheAgentCompany simulates a realistic company environment where agents must combine programming tools, browsers, and communication platforms.

GAIA comprises 422 comprehensive problems requiring multi‑step reasoning, web browsing, and tool use, stressing open‑ended execution ability.

Open Challenges

Error accumulation in long-horizon tasks: small mistakes early in a chain amplify, reducing the overall success rate.

Insufficient error recovery: failures often trigger repeated attempts rather than strategic replanning.

Safety and trust: direct control over real environments raises risks of mis-operations, unauthorized access, and privacy leaks.

Evaluation vs. real-world usability gap: benchmarks capture controlled settings, but real user scenarios are messier and less predictable.

Conclusion

Terminal and GUI routes are complementary. Terminal agents excel at structured, programmable tasks, while GUI agents cover scenarios lacking API access. Future computer‑use agents are likely to combine both capabilities under a unified architecture, enabling cross‑environment, cross‑application collaboration.
