Artificial Intelligence 21 min read

Rethinking Harness Engineering: Designing Deletable Workspaces for Real‑World Agents

The article analyzes Harness Engineering by breaking down the five layers of Agent systems—Model, Tool, Skill, Sub‑agent, and Harness—showing how to design a workspace that not only runs agents but also enables verification, hand‑off, correction, and the disciplined removal of outdated constraints.

Architect

Jun 9, 2026

Rethinking Harness Engineering: Designing Deletable Workspaces for Real‑World Agents

First Split the Five Layers

Agents are not just models; they are systems that continuously act toward a goal. The author extracts the subject from the sentence

Agent 不是一个模型，而是一个能围绕目标持续行动的系统。

and contrasts ordinary chat models ("you ask, it answers") with agents that decide the next step, act, observe feedback, and repeat.

Read a code repository and locate a bug;

Modify files based on an issue and run tests;

Reproduce a page problem in a browser;

Inspect logs, traces, change configuration, then verify the result;

Fact‑check statements in a technical article.

These tasks cannot be completed by a single answer; they require a coordinated set of Model, Tool, State, Validation, and Permissions. Focusing only on the model or only on tools misses half the picture.

Model Drives Reasoning

The Model is the inference core: it understands goals, reads context, generates the next action, or judges result plausibility. It does not directly execute external actions. It may say "I want to read this file" or "run this test", but the actual file read, command execution, API call, and result injection are performed by the surrounding system.

Products such as Claude Code, Codex, Cursor, and Hermes Agent rely on model capabilities, yet the observable "can it get work done" depends heavily on the surrounding infrastructure:

Visibility of files;

Safe tool invocation;

Knowledge of project startup procedures;

Ability to run tests and understand failures;

Recording of performed actions;

Safety checks before dangerous operations.

The model is like an engine; a car needs more than an engine to drive.

Hand and Craftsmanship

Tools are the most visible layer—they are the agent’s hand reaching into the external world (file system, shell, browser, database, search, code interpreter, internal APIs). The model only expresses intent; the external system actually performs the call, which introduces side‑effects.

Risk varies: reading files is low risk, while deleting data, sending requests, changing configuration, or committing code carries higher risk. Thus tools also define permission boundaries.

Skill is not a single action but a reusable method for doing something. Examples include:

Diagnosing a front‑end page issue;

Fact‑checking a technical article;

Handling a database migration;

Reviewing security‑sensitive code;

Breaking a long task into plan, execution, and acceptance.

Skills require steps, experience, checkpoints, and output formats. The author prefers to view Tool as the "hand" and Skill as the "craftsmanship".

Division Is Not Magic

Sub‑agents have been popular lately. While they suggest parallelism (one sub‑agent searches, another writes code, another runs tests, another reviews), they introduce management overhead. A Sub‑agent is essentially a task handed to another agent that has its own context, tools, goal, and output.

Key questions for Sub‑agents:

Is the sub‑task sufficiently independent?

Can the output be verified?

Can the main agent or a human merge and prune the results?

If these are satisfied, Sub‑agents behave like engineered division; otherwise they merely multiply chaos.

Harness Runtime

Harness governs how the whole agent runs. It includes:

What context the model sees at each step;

Which tools are allowed and which need confirmation;

Where task state is stored;

When to trigger tests;

How failure results are fed back to the model;

Which actions are dry‑run only;

How logs, screenshots, traces, and commits are recorded;

Criteria for stopping;

How the next round of agents takes over.

A useful distinction: Scaffolding tells "how to think", Harness tells "how to run".

Effective Harness not only starts the agent but also ensures the run can be inspected, handed off, and corrected—something ordinary tool‑calling frameworks lack.

Look at the Site

OpenAI’s Harness Engineering article describes an internal product where all code is generated by Codex. The author notes that a massive AGENTS.md file quickly becomes a burden: it fills the context window, becomes stale, and obscures which constraints are truly critical.

They replaced the monolithic file with an entry‑map, moving detailed knowledge into structured docs, execution plans, quality records, specs, and design docs—similar to a newcomer onboarding guide that points to the right place instead of dumping everything on one page.

Anything the agent cannot see effectively does not exist. Implicit knowledge (Slack consensus, meeting decisions, senior engineer experience) must be materialized in repositories, scripts, tests, logs, or queryable tools to influence future runs.

Key takeaway: an Agent‑first codebase starts by making the system readable—not just for humans but for agents.

Front and Back Fit

ThoughtWorks and Martin Fowler split Harness into Guides (pre‑action guidance) and Sensors (post‑action feedback). Guides include AGENTS.md, Skills, architecture docs, API specs, type info, task templates, and acceptance criteria—reducing guesswork before the agent acts.

Sensors include tests, type checks, linters, structural rules, logs, browser screenshots, traces, and code‑review agents—providing real‑world validation after the agent finishes.

Only Guides can become "many rules with unknown usefulness"; only Sensors can cause the agent to repeatedly hit walls. Harness is therefore a set of tightly coupled controls rather than a single component.

State Handover

Anthropic’s long‑task Harness article shows that agents lose context after the first round fills the window. Their solution: the initializer agent creates an environment, a feature list, a progress file, a git repo, and a startup script. Subsequent coding agents read the current directory, git log, progress file, and feature list, start the service, run basic end‑to‑end checks, and then pick an unfinished feature to continue.

The important insight is not the JSON format but turning a long task into a hand‑over‑able state. A later agent can see:

Which features are still failing;

What test steps define "pass";

Recent commits;

Whether the application can start;

Progress left by the previous round;

The specific small goal for this round.

Long context helps, but true long‑term state management is about clear, up‑to‑date status rather than sheer document size.

Too Thick Is Slow

Vercel’s article on removing 80% of tools from an internal text‑to‑SQL agent illustrates that a thinner harness can improve success rate from 80% to 100% while reducing latency, steps, and token usage. The reason is that the previous harness made too many choices for the model; letting Claude read raw files (using grep, cat, ls) proved more natural.

The lesson: a good Harness is not necessarily thicker; it should avoid over‑constraining the model with unnecessary intermediate tools.

Small Process Start

For teams wanting to try Harness Engineering, the author recommends starting with a small, low‑risk workflow rather than building a full‑blown Agent platform. Example processes include:

Technical article fact‑checking;

Small bug fixes;

Documentation‑code consistency checks;

Configuration drift scans;

Test‑failure attribution;

PR risk pre‑screening.

All share visible results, reproducible failures, and easy permission control. The author uses a tiny task card to surface the essential questions:

目标：这次要解决什么
边界：哪些明确不做
输入：代码、文档、日志、issue、截图
可用工具：只读还是可写，哪些工具需要确认
验证：要跑哪些命令，留下什么证据
状态：候选、已验证、已提交怎么区分
停止：做到什么可以停，遇到什么需要停
回写：失败经验写回哪里

If these items are unclear, the agent may run, but review, merge, rollback, and hand‑off become painful. Running a few rounds reveals which constraints to solidify, which documentation to promote to an entry map, which dependency checks to add, which verification templates to create, and which stale rules to delete.

Final

Model, Tool, Skill, Sub‑agent, and Harness are each understandable in isolation, but in real engineering they quickly intertwine. The model proposes actions; tools cause side‑effects; skills embed team experience; sub‑agents introduce division and merging challenges; Harness stitches context, state, feedback, and permissions together.

Thus Harness Engineering is not a brand‑new term but a re‑exposure of classic software‑engineering problems in the Agent era: organizing context, committing state, feeding back into the next round, sealing permissions, continuously cleaning quality, deciding where human judgment belongs, and knowing when to delete obsolete constraints.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

AI Tool Integration Agent Skill Management Sub-Agent harness engineering

Written by

Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.