Why the Real Power of AI Coding Tools Lies Beyond the Model

The article explains how Cursor's cloud agents use isolated VMs, parallel orchestration, video artifacts, and multi‑model routing to deliver end‑to‑end code. It argues that the surrounding harness, not the LLM itself, determines productivity, cost, and reliability, and compares this approach with Claude Code and OpenAI Codex.


Cursor Cloud Agent Architecture

Cursor cloud agents run autonomous coding tasks inside isolated Linux VMs that contain a full development environment. Each user can run 10‑20 agents in parallel, each on a separate branch without resource contention.

Five‑Layer Harness

Interface Layer: entry points such as Slack, GitHub, Linear, mobile devices, and the IDE.

Orchestration Layer: creates a detailed plan, selects models, decides the number of parallel agents, and dispatches sub‑agents.

Execution Layer: runs the plan inside a dedicated VM; the VM includes a filesystem, terminal, browser, and a running instance of the application.

Verification Layer: agents prove their output by interacting with the UI, recording a 30‑second video, taking screenshots, and collecting logs.

Output Layer: bundles code, video, screenshots, logs, and a clean commit history (rebase, conflict resolution, squash) into a pull request.

Models sit below the five layers; any LLM (GPT‑5, Claude, Gemini, or Cursor’s Composer 2) can be swapped without changing harness behavior. The harness can route different sub‑tasks to different models and run multiple models on the same task in parallel (“race” mode) to select the best result.
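To make the layer boundaries concrete, here is a minimal TypeScript sketch of the harness as a set of interfaces. Every name below is illustrative; none of it is Cursor's actual API.

```typescript
// Illustrative only: these types mirror the five layers described above;
// none of the names come from Cursor's actual codebase.

type Trigger = "slack" | "github" | "linear" | "mobile" | "ide";
type Model = "gpt-5" | "claude" | "gemini" | "composer-2";

interface Plan {
  steps: string[];        // ordered work items derived from the task
  model: Model;           // chosen by the orchestration layer (swappable)
  parallelAgents: number; // how many sub-agents to dispatch
}

interface ExecutionResult {
  diff: string;           // code changes produced inside the VM
  logs: string[];
}

interface Evidence {
  video: string;          // the ~30-second recording
  screenshots: string[];
  logs: string[];
}

interface Harness {
  entry(trigger: Trigger, task: string): Promise<Plan>;              // Interface Layer
  orchestrate(plan: Plan): Promise<ExecutionResult[]>;               // Orchestration Layer
  execute(plan: Plan): Promise<ExecutionResult>;                     // Execution Layer (VM)
  verify(result: ExecutionResult): Promise<Evidence>;                // Verification Layer
  output(result: ExecutionResult, proof: Evidence): Promise<string>; // Output Layer -> PR URL
}
```

Because models sit below these interfaces, swapping an LLM changes only the `model` field of a plan, not the harness contract.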

Codebase Integration

Before writing code, an agent reads the repository using a custom embedding model for large‑scale code retrieval. Sub‑agents concurrently index the front‑end component tree, map API routes, and read the database schema, producing a contextual map that guides subsequent edits.
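The concurrent indexing pass might look roughly like the following sketch, where the three sub‑agent functions are hypothetical stand‑ins for Cursor's real retrieval pipeline.

```typescript
// Hypothetical sketch of the concurrent indexing pass described above;
// none of these functions are Cursor's real API.

interface CodebaseMap {
  componentTree: string[]; // front-end component hierarchy
  apiRoutes: string[];     // discovered HTTP endpoints
  dbSchema: string[];      // tables and columns read from the schema
}

// Stub sub-agents; a real pipeline would embed and rank files with the
// custom retrieval model instead of returning canned values.
async function indexComponentTree(repo: string): Promise<string[]> {
  return [`${repo}/src/App.tsx`];
}
async function mapApiRoutes(repo: string): Promise<string[]> {
  return ["POST /api/auth/login", "GET /api/users/:id"];
}
async function readDatabaseSchema(repo: string): Promise<string[]> {
  return ["users(id, email, password_hash)"];
}

// The three sub-agents run concurrently before any code is written.
async function buildCodebaseMap(repo: string): Promise<CodebaseMap> {
  const [componentTree, apiRoutes, dbSchema] = await Promise.all([
    indexComponentTree(repo),
    mapApiRoutes(repo),
    readDatabaseSchema(repo),
  ]);
  return { componentTree, apiRoutes, dbSchema };
}
```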

For new repositories, the onboarding flow at cursor.com/onboard configures the environment, installs dependencies, and records a demo video to verify the agent’s understanding.

VM Sandbox

Each cloud agent receives its own Linux VM with full filesystem, terminal, browser, and a running instance of the target application. The isolation enables true parallelism: 10‑20 agents can operate simultaneously on different branches.

In March 2026 a self‑hosted version was released, keeping code, build artifacts, and secrets inside the customer’s network while preserving the same toolchain.

Planning and Model Routing

Workflows split into two phases:

1. The developer collaborates with a local model to create a detailed plan (features, affected files, acceptance criteria).

2. The plan is handed to the cloud agent for execution while the developer proceeds to the next task.

Routing decisions occur at the harness level (a sketch follows the list below):

GPT‑5 for long‑running tasks.

Claude for reasoning‑heavy subtasks.

Gemini for large context windows.

Composer 2 for routine coding.

“Race” mode dispatches the same problem to multiple models in parallel and selects the best output.
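
Under those heuristics, harness-level routing reduces to a lookup like the sketch below; the thresholds and the race scoring are assumptions, not Cursor's published logic.

```typescript
// Minimal sketch of harness-level routing that mirrors the heuristics above;
// thresholds and the race scoring are assumptions, not Cursor's real logic.

type Model = "gpt-5" | "claude" | "gemini" | "composer-2";

interface SubTask {
  description: string;
  estimatedMinutes: number; // expected wall-clock duration
  contextTokens: number;    // how much code/context the task must see
  reasoningHeavy: boolean;  // e.g. tricky refactors or algorithm design
}

function routeModel(task: SubTask): Model {
  if (task.estimatedMinutes > 30) return "gpt-5";    // long-running tasks
  if (task.reasoningHeavy) return "claude";          // reasoning-heavy subtasks
  if (task.contextTokens > 200_000) return "gemini"; // large context windows
  return "composer-2";                               // routine coding
}

// "Race" mode: run the same task on several models and keep the best result.
async function race(
  task: SubTask,
  run: (model: Model, task: SubTask) => Promise<{ model: Model; score: number }>,
): Promise<Model> {
  const candidates: Model[] = ["gpt-5", "claude", "gemini"];
  const results = await Promise.all(candidates.map((m) => run(m, task)));
  return results.reduce((best, r) => (r.score > best.score ? r : best)).model;
}
```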

Computer Use & Verification

Since April 2026, agents can launch a browser inside their VM, build the app, navigate to localhost, and interact with the UI (click buttons, fill forms, verify rendering). If a verification step fails, the agent returns to the code, fixes the issue, rebuilds, and retests until the change passes. The final PR includes a 30‑second video, screenshots, and logs as artifacts.
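
The loop described above amounts to a bounded build‑verify‑fix cycle, roughly like this sketch (all callbacks are hypothetical stand‑ins for the agent's real tooling):

```typescript
// Hedged sketch of the build -> interact -> verify -> fix loop described
// above; every callback is a hypothetical stand-in, not Cursor's API.

interface Artifacts {
  video: string;        // path to the ~30-second recording of the UI run
  screenshots: string[];
  logs: string[];
}

async function verifyChange(
  build: () => Promise<void>,
  interactWithUi: () => Promise<{ passed: boolean; failure?: string }>,
  fix: (failure: string) => Promise<void>,
  record: () => Promise<Artifacts>,
  maxAttempts = 5,
): Promise<Artifacts> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    await build();                          // compile and launch the app in the VM
    const result = await interactWithUi();  // click buttons, fill forms, check rendering
    if (result.passed) return record();     // capture video/screenshots/logs for the PR
    await fix(result.failure ?? "unknown"); // agent edits the code, then loops
  }
  throw new Error(`still failing after ${maxAttempts} attempts`);
}
```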

PR as Proof

The generated pull request contains not only a diff but also video, screenshots, logs, and a clean commit history (including rebase, conflict resolution, and squash). Reviewers watch the short video instead of mentally simulating the code, shifting review focus from “does it work?” to “does it achieve the intended outcome?”.

Mapping to Harness Layers

Interface Layer: agents can be triggered from Slack, GitHub, Linear, the web UI, mobile, or desktop IDE. Cursor 3 replaces the chat panel with a persistent “Agents Window” that shows task cards with statuses (Planning, Executing, Reviewing, Done) and diff links.

Orchestration Layer: implements planning, model routing, race mode, and sub‑agent dispatch. It decides *how* work is done and *which* model executes it.

Execution Layer: provides VM isolation and the self‑hosted option, deciding *where* code runs and *who* controls the environment.

Verification Layer: captures video, screenshots, and logs, turning code changes into verifiable evidence.

Output Layer: presents the artifacts as a PR, closing the loop between autonomous work and developer review.

Platform Capabilities Built on the Harness

Automation: event‑driven agents can be scheduled or triggered by external tools. Automation currently does not support UI interaction, so visual verification must be done by a non‑automated agent.
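
A scheduled or webhook‑triggered automation might be declared with a configuration along these lines; the shape is invented for illustration, not Cursor's actual format.

```typescript
// Invented configuration shape for an event-driven agent; Cursor's real
// automation format may differ entirely.

interface Automation {
  name: string;
  trigger:
    | { kind: "schedule"; cron: string }
    | { kind: "webhook"; source: "github" | "linear" | "slack" };
  task: string;          // natural-language instruction for the agent
  uiVerification: false; // automations cannot interact with the UI (see above)
}

const nightlyDepAudit: Automation = {
  name: "nightly-dependency-audit",
  trigger: { kind: "schedule", cron: "0 3 * * *" }, // every night at 03:00
  task: "Audit dependencies for known CVEs and open a PR bumping safe upgrades",
  uiVerification: false,
};
```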

Bugbot Autofix: when a bug is detected in a PR, a cloud agent tests a fix and submits it. Over 35% of Bugbot suggestions are merged; the fix‑rate rose from 52% to 76% in six months, and the average number of issues found per run roughly doubled.

Plugin Ecosystem: 30+ plugins from partners (Atlassian, Datadog, GitLab, Glean, Hugging Face, monday.com, PlanetScale) expose MCP (Model Context Protocol) tools that agents can call, enabling cross‑service workflows (e.g., read from Jira, write to Datadog, query PlanetScale).
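
Conceptually, each plugin exposes tools the agent can call by name; the sketch below shows the shape of such a cross‑service workflow, with an entirely hypothetical callTool signature and tool names.

```typescript
// Conceptual sketch of a cross-service workflow through MCP-style tools.
// The callTool signature and all tool names are hypothetical.

type ToolCall = (tool: string, args: Record<string, unknown>) => Promise<unknown>;

async function triageSlowEndpoint(callTool: ToolCall): Promise<void> {
  // Read the bug report from Jira (Atlassian plugin).
  const issue = await callTool("jira.getIssue", { key: "PERF-123" });

  // Pull the matching latency metrics from Datadog.
  const metrics = await callTool("datadog.queryMetrics", {
    query: "avg:http.request.duration{endpoint:/api/auth/login}",
  });

  // Check the query plan on PlanetScale before proposing a fix.
  const plan = await callTool("planetscale.explain", {
    sql: "SELECT * FROM users WHERE email = ?",
  });

  console.log({ issue, metrics, plan }); // the agent would now draft a fix
}
```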

Cursor 3 “Glass” IDE: launched 2 April 2026, rebuilt around agent orchestration. It supports a multi‑repo layout and a design mode where natural‑language edits to UI components trigger code changes with live preview.

Design Choices and Trade‑offs

Success Factors

Video artifacts accelerate review up to 10× because reviewers validate intent rather than code diffs.

Parallel agents multiply throughput; ten agents each saving 30 minutes on a defined task yield five hours saved per session.

Multi‑surface access lets developers launch agents from Slack, review PRs on mobile, and assign tasks from GitHub issues.

Self‑hosted deployment removes the “code cannot leave the network” barrier for enterprises.

Current Limitations

Implicit Deletion: placeholder comments like `// ... existing code ...` can silently delete real code when an edit is applied verbatim; reviewers must scrutinize diffs, as the sketch below illustrates.
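
For example (names invented), a verbatim application of a model edit containing the placeholder drops the validation that used to live there:

```typescript
// Illustration of the implicit-deletion trap; all names are invented.
declare function authenticate(email: string, password: string): string;

// Before: the file contains real validation logic.
function loginBefore(email: string, password: string) {
  if (!email.includes("@")) throw new Error("invalid email");
  if (password.length < 8) throw new Error("password too short");
  return authenticate(email, password);
}

// After: if the model's edit is applied verbatim, the placeholder comment
// replaces (and thereby deletes) the validation block above.
function loginAfter(email: string, password: string) {
  // ... existing code ...
  return authenticate(email, password);
}
```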

Compute Consumption: heavy agentic workflows quickly exhaust quota; a complex parallel session can cost several dollars per minute (Claude Sonnet ≈ 2.4× Gemini per request), with each model call costing roughly $0.04.

Task Scoping: overly broad tasks cause massive changes; overly narrow tasks waste resources. Medium‑sized, well‑defined tasks work best (e.g., “add rate‑limiting middleware to /api/auth/login”).

Legacy Codebases: agents thrive on modern, well‑tested repos; monoliths with inconsistent conventions reduce success rates.

Agent Recovery: error recovery adds friction compared to conversational chat models; Cursor 3 improves this with task cards that can be duplicated and restarted, but the experience remains less smooth than pure chat.

Shadow Code: enterprises worry about AI‑generated logic they cannot audit; Cursor provides an AI code‑tracking API and audit logs, but governance must be enforced by the team.

Comparison with Claude Code and OpenAI Codex

All three tools aim to let AI deliver code autonomously, but their harnesses differ dramatically.

Claude Code

Terminal‑first design; code stays on the local machine, no cloud VMs.

Worktree isolation enables parallel branch handling on a single host.

Uses Opus with a 200k‑token context window for deep reasoning.

Lacks video artifacts, cloud VM sandbox, and GUI orchestration.

Output is a diff and conversation log; trust model is “local‑first”.

OpenAI Codex

Runs tasks in cloud sandboxes triggered via web UI or GitHub issues.

Linear, single‑task flow without parallelism.

No video proof or model‑race capability.

Simple harness: read issue → create branch → implement change → open PR (sketched below).
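
That flow reduces to a single linear pipeline, sketched here with stub helpers invented for illustration:

```typescript
// Hypothetical reduction of the Codex harness to one linear pipeline;
// all helpers are stubs invented for illustration.

async function readIssue(url: string): Promise<{ id: number; body: string }> {
  return { id: 123, body: `task described at ${url}` }; // stub
}
async function createBranch(issueId: number): Promise<string> {
  return `codex/issue-${issueId}`; // stub
}
async function implementChange(branch: string, task: string): Promise<void> {
  console.log(`editing ${branch}: ${task}`); // stub
}
async function openPullRequest(branch: string): Promise<string> {
  return `https://github.com/example/repo/pull/1?from=${branch}`; // stub
}

async function codexFlow(issueUrl: string): Promise<string> {
  const issue = await readIssue(issueUrl);     // read issue
  const branch = await createBranch(issue.id); // create branch
  await implementChange(branch, issue.body);   // implement change
  return openPullRequest(branch);              // open PR
}
```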

Run‑time Harness Comparison

Cursor excels when teams need parallel processing of >10 well‑defined tasks, visual verification, and a self‑hosted option. Claude Code excels for deep, single‑task reasoning on local machines. Codex excels for minimal‑overhead issue‑to‑PR automation.

Pricing Model

Cursor uses a credit‑based billing model. A $20 professional plan provides roughly 225 Claude Sonnet calls, 550 Gemini calls, or 500 GPT‑5 calls. Each agentic task may invoke multiple models; each call costs about $0.04. A five‑agent parallel session can spend several dollars in minutes. The cost reflects platform services (VM infrastructure, code indexing, artifact pipelines, plugin ecosystem, orchestration) rather than raw model API fees.
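
The implied per‑call costs can be checked directly from the plan figures; the session at the end is an invented example of how quickly a parallel run adds up.

```typescript
// Back-of-the-envelope credit math from the plan figures above ($20 plan).
const planPrice = 20; // USD
const callsPerPlan = { claudeSonnet: 225, gemini: 550, gpt5: 500 };

// Implied per-call cost: ~$0.089, ~$0.036, ~$0.040 respectively.
for (const [model, calls] of Object.entries(callsPerPlan)) {
  console.log(`${model}: $${(planPrice / calls).toFixed(3)} per call`);
}

// Claude Sonnet vs Gemini: (20 / 225) / (20 / 550) = 550 / 225 ≈ 2.4x.

// An invented five-agent session, ~10 calls per agent at the GPT-5 rate:
const sessionCost = 5 * 10 * (planPrice / callsPerPlan.gpt5);
console.log(`example session: $${sessionCost.toFixed(2)}`); // $2.00
```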

Summary

Cursor cloud agents execute autonomous coding inside isolated VMs, attach video proof to PRs, and support parallelism, model routing, and self‑hosting. The surrounding five‑layer harness—interface, orchestration, execution, verification, output—determines productivity, cost, and trust more than the underlying LLM. Understanding this harness helps teams choose a toolchain that matches their trust model, parallelism needs, and workflow preferences.

Tags: AI agents, software development, Cursor, OpenAI Codex, cloud orchestration, Claude Code, comparative analysis
Written by DevOps Coach

Master DevOps precisely and progressively.
