Why Agent‑First Systems Fail and How Harness Engineering Fixes Them
The article analyzes OpenAI’s Harness Engineering approach, explains four systemic failure modes of LLM‑driven agents, and details five modular components—readable environment, task state machine, verification loop, architectural constraints, and loop detection—that together enable reliable, large‑scale agent development.
Background and Motivation
In February 2026 OpenAI published a blog post titled “Harness Engineering: Leveraging Codex in an Agent‑First World”. The post reports that the engineering team used Codex to generate one million lines of production code and 1,500 pull requests without any manually written code. In a parallel experiment, LangChain engineers kept the same model (gpt‑5.2‑codex) but changed only the way it was used, raising the Terminal Bench 2.0 score from 52.8 % to 66.5 % and moving the ranking from 30th to 5th. The key insight is that Agent = Model + Harness: the model defines the theoretical ceiling, while the harness determines how close the system gets to that ceiling.
Four Systemic Failure Modes of Agents
State loss across sessions: LLMs have no persistent memory; each session starts with a clean context window, so later sessions cannot know what earlier sessions did.
One‑shot greed: Agents try to complete the entire goal in a single pass, exhausting the context window and leaving a half‑finished codebase.
Premature completion: Agents claim a task is finished even though the functionality has not been verified, because their self‑evaluation relies on the same context that just produced the code.
Doom loop: Agents repeatedly retry the same solution space without breaking out, consuming tokens linearly without making progress.
Five Harness Components that Counter the Failures
Component 1 – Readable Environment (addresses Failure 1)
All state must be externalized. Instead of a monolithic AGENTS.md file, the document is split into a stable entry point and a hierarchy of focused sub‑documents:
AGENTS.md ← lightweight entry, stable, rarely changes
├── product-specs/ ← user stories + acceptance criteria (by feature)
├── design-docs/ ← architecture decisions (ADRs)
├── exec-plans/ ← current execution plan, high‑frequency updates
├── db-schema/ ← database schema, generated when possible
└── security/ ← security guidelines, manually maintained
The entry file changes rarely, allowing agents to locate the latest sub‑documents without loading the entire tree. LangChain implements this via LocalContextMiddleware, which scans the directory and injects only the needed context for each session.
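The post does not show LocalContextMiddleware's internals. As a rough sketch of the idea, a loader might assemble the stable entry point plus only the sub‑documents the current session needs. The function name and the assumption that sub‑documents are Markdown files are illustrative, not LangChain's API:

```python
from pathlib import Path


def load_context(root: str, needed: list[str]) -> str:
    """Assemble a minimal session context: the stable AGENTS.md entry point
    plus only the requested sub-document directories (hypothetical sketch)."""
    root_path = Path(root)
    parts = [(root_path / "AGENTS.md").read_text(encoding="utf-8")]
    for sub in needed:  # e.g. ["exec-plans", "db-schema"]
        for doc in sorted((root_path / sub).glob("*.md")):
            parts.append(f"## {sub}/{doc.name}\n{doc.read_text(encoding='utf-8')}")
    return "\n\n".join(parts)
```

Because the entry point is stable, only the `needed` list varies per session, keeping the injected context small.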
Component 2 – Task State Machine (addresses Failures 1‑3)
Task management is externalized as a JSON state machine. Each task includes an ID, title, specification, acceptance criteria, and a status field that defaults to fail (meaning “not yet passed”). Agents iteratively turn fail into pass after verification.
{
  "id": "auth-001",
  "title": "User email login",
  "spec": "Support email+password login, return JWT on success, 401 on failure",
  "acceptance_criteria": [
    "POST /auth/login accepts email and password",
    "Invalid credentials return { error: 'invalid_credentials' }",
    "Token from a successful login expires in 24 h and is stored in an httpOnly cookie"
  ],
  "status": "fail"
}
Design decisions:
Avoid a neutral pending/done pair: the status defaults to fail, meaning “not yet passed”, which enforces a verification mindset in which a task counts as failing until verification proves otherwise.
Acceptance criteria must be machine‑readable contracts, not human‑focused documentation.
Combine the JSON file with git log to reconstruct the current project state within ~30 seconds, eliminating hidden assumptions from previous agents.
Two agents collaborate:
Initializer Agent: Generates the readable environment, creates the JSON feature list, starts the development server, writes a progress summary, and makes the initial git commit.
Coding Agent: Reads the highest‑priority fail task, implements it, runs verification, updates the status to pass, commits, and updates the progress file.
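The Coding Agent's read‑and‑update cycle over the JSON file can be sketched minimally as follows. The function names and the assumption that the array is stored in priority order are mine, not from the post:

```python
import json
from pathlib import Path
from typing import Optional


def next_fail_task(path: str) -> Optional[dict]:
    """Return the highest-priority task still marked 'fail'.
    Assumes the JSON array is stored in priority order."""
    tasks = json.loads(Path(path).read_text(encoding="utf-8"))
    return next((t for t in tasks if t["status"] == "fail"), None)


def mark_passed(path: str, task_id: str) -> None:
    """Flip a verified task from 'fail' to 'pass' and persist the change,
    so the next session can reconstruct progress from the file alone."""
    p = Path(path)
    tasks = json.loads(p.read_text(encoding="utf-8"))
    for t in tasks:
        if t["id"] == task_id:
            t["status"] = "pass"
    p.write_text(json.dumps(tasks, indent=2), encoding="utf-8")
```

Because all transitions are persisted to disk, any later agent can recover the project state without relying on the previous agent's context window.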
Component 3 – Verification Loop (addresses Failure 3)
Self‑evaluation in the same context leads to premature completion. LangChain’s PreCompletionChecklistMiddleware injects a system message that forces the agent to run a full verification checklist before marking a task complete.
class PreCompletionChecklistMiddleware(AgentMiddleware):
    def before_complete(self, state: AgentState) -> AgentState:
        if not state.get("verification_done"):
            state.inject(SystemMessage(
                "Before marking complete, run the full verification checklist: "
                "1) All acceptance_criteria tests pass "
                "2) No regressions in existing tests "
                "3) End-to-end flow verified"
            ))
            # The agent sets verification_done to True only after
            # the checklist actually passes.
        return state
The middleware can also attach before/after videos to a pull request, allowing reviewers to validate fixes without rebuilding the environment.
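At bottom, the checklist reduces to running commands and gating completion on their exit codes. A hedged sketch of such a runner (the helper and its interface are hypothetical, not LangChain API):

```python
import subprocess


def run_checklist(commands: list[list[str]]) -> bool:
    """Run each verification command in order; the task may be marked
    'pass' only if every command exits with code 0 (illustrative sketch)."""
    for cmd in commands:
        if subprocess.run(cmd).returncode != 0:
            return False  # stop at the first failing check
    return True
```

An agent would call this with its test and end‑to‑end commands and only flip a task's status to pass when it returns True.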
Component 4 – Architectural Constraints (addresses Failures 2‑3)
Agents tend to copy every pattern in the repository, including bad ones. To prevent technical‑debt amplification, architectural rules are encoded directly into the toolchain via pre‑commit hooks that enforce one‑way dependency flow, naming conventions, and structural tests.
#!/bin/sh
# .git/hooks/pre-commit
# Check dependency direction
npx check-deps --config .dep-rules.json || exit 1
# Enforce naming conventions
npx lint-names --strict || exit 1
# Verify architectural boundaries
go test ./cmd/check-arch/... || exit 1
Strong typing (Go, TypeScript, Protobuf) provides free compile‑time architectural enforcement, moving constraints earlier in the development cycle.
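A dependency‑direction check like the one above essentially compares observed import edges against an allow‑list per layer. An illustrative sketch (the rule shape below is an assumption, not the real .dep-rules.json format):

```python
def dependency_violations(
    rules: dict[str, list[str]],
    edges: list[tuple[str, str]],
) -> list[tuple[str, str]]:
    """Return import edges that break the allowed one-way dependency flow.
    `rules` maps each layer to the layers it may import (assumed config shape);
    `edges` are observed (importer, imported) pairs."""
    return [(src, dst) for src, dst in edges if dst not in rules.get(src, [])]
```

Running such a check in a pre‑commit hook rejects the commit before an agent can entrench a backwards dependency.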
Component 5 – Loop Detection (addresses Failure 4)
Agents can get stuck in repetitive attempts. LangChain’s LoopDetectionMiddleware tracks edit counts per file and intervenes when a threshold is exceeded, prompting the agent to change strategy.
from typing import Dict, Optional

class LoopDetectionMiddleware(AgentMiddleware):
    def __init__(self, threshold: int = 5):
        self.file_edit_counts: Dict[str, int] = {}
        self.threshold = threshold

    def after_edit(self, file: str) -> Optional[Intervention]:
        self.file_edit_counts[file] = self.file_edit_counts.get(file, 0) + 1
        if self.file_edit_counts[file] > self.threshold:
            return Intervention(
                f"You've edited {file} {self.file_edit_counts[file]} times. "
                "Consider a different approach or ask for help."
            )
        return None
For long‑running cross‑session tasks, LangChain recommends re‑injecting the highest‑priority unfinished task into a clean context window at the start of each session (the “Ralph Loop” pattern).
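The “Ralph Loop” re‑injection can be sketched as building a fresh prompt from the highest‑priority unfinished task at session start. The prompt wording and function name are illustrative, not LangChain's:

```python
from typing import Optional


def session_start_prompt(tasks: list[dict]) -> Optional[str]:
    """Build the prompt for a fresh context window from the first task
    not yet marked 'pass' (sketch of the 'Ralph Loop' pattern)."""
    task = next((t for t in tasks if t["status"] != "pass"), None)
    if task is None:
        return None  # nothing left to do
    criteria = "\n".join(f"- {c}" for c in task["acceptance_criteria"])
    return (
        f"Current task: {task['title']} ({task['id']})\n"
        f"Spec: {task['spec']}\n"
        f"Acceptance criteria:\n{criteria}"
    )
```

Each session thus starts from externalized state rather than from whatever a previous, possibly stuck, session left in its context.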
How the Five Components Work Together
The readable environment supplies the foundational documents (feature list, architectural rules, verification criteria). The task state machine provides the JSON acceptance_criteria that feeds the verification loop. Architectural pre‑commit hooks become part of the verification feedback cycle. Loop detection safeguards the task state machine’s progress by preventing endless token consumption.
Together they answer the central question: How can a stateless, greedy, self‑evaluating LLM operate reliably on complex, long‑term projects?
Reference Implementations
Open‑source projects that implement the described middleware include:
LangChain DeepAgents – repository: https://github.com/langchain-ai/deepagents. Provides write_todos, LocalContextMiddleware, PreCompletionChecklistMiddleware, and LoopDetectionMiddleware.
AgentsMesh – a 52‑day case study with 960 k lines of throughput, demonstrating DDD‑layered architecture and a four‑layer feedback loop.
DeerFlow (ByteDance open source) – package deerflow-harness decouples the agent engineering layer from business logic.
A community‑maintained list of harness‑engineering resources is available at https://github.com/walkinglabs/awesome-harness-engineering.