Why AI Agents Stumble at Code and How a Harness Can Make Them Reliable
The article explains why large‑language‑model agents often lose context and violate architectural rules when generating code, and proposes a Harness framework that treats the repository as an operating system, adds layered linting, pre‑validation, automated verification, and cross‑model review to keep agents on track.
When an AI agent is asked to implement a feature, it may write hundreds of lines of code only to fail the lint check because it imported a configuration package that violates the project's architectural layering, a rule the agent never learned.
Root Cause: Invisible Context
Agents lack access to the repository‑wide facts that define architecture, naming conventions, and layer constraints. Prompt engineering cannot enumerate every implicit rule, and the limited context window quickly fills with diffs, errors, and logs, causing the agent to "forget" its original goal.
Solution Overview: Harness as the Agent’s Operating System
Harness treats the codebase itself as the single source of truth. All architectural decisions, layer constraints, and naming standards are versioned in the repository, typically under docs/ and referenced from a concise AGENTS.md navigation file (~100 lines).
Key Principles
Encode every rule in the repository – no external wiki or chat notes.
Keep the navigation file short; detailed rules live in separate markdown files under docs/.
Enforce a hierarchical import policy: lower layers (e.g., internal/types/) may not import higher layers (e.g., internal/config/).
Shift human effort from writing code to designing the verification environment.
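The hierarchical import policy above can be sketched as a small layer check in the spirit of scripts/lint-deps. This is a minimal illustration: the layer numbers assigned here are assumptions based on the examples later in the article, and a real implementation would also parse actual import statements from source files.

```python
"""Sketch of a dependency-layer check (in the spirit of scripts/lint-deps).
Layer assignments are illustrative assumptions, not the project's real map."""

# Lower-numbered layers must not import higher-numbered layers.
LAYERS = {
    "internal/types": 0,   # Layer 0: pure data types, no dependencies
    "internal/config": 1,  # configuration, may use types
}

def check_import(importer: str, imported: str) -> tuple[bool, str]:
    """Allow an import only if the importer sits at the same or a higher layer."""
    src, dst = LAYERS[importer], LAYERS[imported]
    if src >= dst:
        return True, f"OK: {importer} (L{src}) -> {imported} (L{dst})"
    return False, f"VIOLATION: {importer} (L{src}) may not import {imported} (L{dst})"
```

With this map, `check_import("internal/config", "internal/types")` passes, while `check_import("internal/types", "internal/config")` is flagged as a violation, matching the rule stated above.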
Concrete Project Layout
my-project/
├── AGENTS.md ← navigation (~100 lines)
├── docs/
│ ├── ARCHITECTURE.md ← layer rules
│ ├── DEVELOPMENT.md ← build/test/lint commands
│ ├── PRODUCT_SENSE.md ← business context
│ ├── design-docs/ ← component specs
│ └── exec-plans/ ← execution plans
├── scripts/
│ ├── lint-deps.* ← dependency‑layer checks
│ ├── lint-quality.* ← style rules (max 500 lines, no console.log/print)
│ ├── verify/ ← end‑to‑end checks
│ └── validate.py ← unified validation pipeline
├── harness/
│ ├── tasks/ ← task state & checkpoints
│ ├── trace/ ← execution logs
│ └── memory/ ← learned lessons
└── [business code...]

Verification Pipeline
The executor follows a strict workflow: detect environment → load context → plan → human approval → execute → verify → complete. Verification consists of four stages: build → lint-arch → test → verify. A failure at any stage aborts the remaining stages, preventing costly later-stage fixes.
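The four-stage gate can be sketched as a small driver script in the spirit of scripts/validate.py. The stage commands below are hypothetical stand-ins for the project's real build, lint, and test tooling:

```python
"""Minimal sketch of a unified validation pipeline (cf. scripts/validate.py).
Stage commands are hypothetical stand-ins, not the project's real tooling."""
import subprocess

# Ordered stages: each later stage assumes the earlier ones passed.
DEFAULT_STAGES = [
    ("build",     ["make", "build"]),
    ("lint-arch", ["python3", "scripts/lint-deps.py"]),      # hypothetical path
    ("test",      ["make", "test"]),
    ("verify",    ["python3", "scripts/verify/run_all.py"]), # hypothetical path
]

def run_pipeline(stages=DEFAULT_STAGES) -> bool:
    """Run stages in order; abort at the first failure to avoid late-stage fixes."""
    for name, cmd in stages:
        if subprocess.run(cmd).returncode != 0:
            print(f"stage '{name}' failed; skipping remaining stages")
            return False
        print(f"stage '{name}' passed")
    return True
```

Aborting on the first non-zero exit code is the design point: there is no value in running tests against code that does not build or that already violates the layer rules.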
Pre‑validation
Before any structural change (e.g., creating a file in internal/types/ or adding a cross‑package import) the executor runs a lightweight script such as:
python3 scripts/verify_action.py --action "create file internal/types/user.go"
# ✓ VALID: internal/types/ is Layer 0, naming follows convention
python3 scripts/verify_action.py --action "import internal/core from internal/handler"
# ✗ INVALID: internal/handler (L4) cannot import internal/core (L3)
# Fix: handler should depend on core through interfaces

Cross‑Model Review
After code generation and mechanical checks, a second agent with a different model reviews the diff against architecture docs, naming, and performance considerations. Example prompt:
review_result = Agent(
description="Review: rate‑limiter implementation",
model="codex",
prompt=f"""
Review the following changes for:
1. Logic correctness and edge cases
2. Consistency with ARCHITECTURE.md
3. Naming clarity
4. Performance implications
Changes: {coding_result.diff}
Task context: {task_description}
"""
)

Memory, Critic, and Refiner Loop
All validation failures are stored under harness/trace/failures/. A periodic Critic script analyses these logs, discovers recurring patterns (e.g., missing layer mapping for internal/cache), and produces recommendations. The Refiner then updates lint rules, improves error messages, and adds missing documentation, closing the feedback loop.
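A Critic pass can be approximated by grouping failure logs by error signature and surfacing repeats. This is a sketch only: the JSON record format and the per-file layout under harness/trace/failures/ are assumptions, not something the article specifies.

```python
"""Sketch of a Critic pass: count recurring failure signatures from
harness/trace/failures/ and surface patterns worth a lint-rule or doc fix.
The one-record-per-JSON-file format here is an illustrative assumption."""
import json
from collections import Counter
from pathlib import Path

def recurring_failures(trace_dir: str, threshold: int = 3) -> list[tuple[str, int]]:
    """Return (signature, count) pairs seen at least `threshold` times."""
    counts: Counter = Counter()
    for path in Path(trace_dir).glob("*.json"):
        record = json.loads(path.read_text())
        # e.g. record["signature"] == "missing layer mapping: internal/cache"
        counts[record["signature"]] += 1
    return [(sig, n) for sig, n in counts.most_common() if n >= threshold]
```

The Refiner would then consume these (signature, count) pairs and turn each recurring one into a concrete change: a new lint rule, a clearer error message, or a missing documentation entry.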
Trajectory Compilation
When a task pattern is executed successfully three times with identical steps (e.g., adding an API endpoint), Harness can compile it into a deterministic script such as make add-endpoint NAME=foo, allowing future executions to bypass LLMs entirely.
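The three-successes rule can be tracked with a small ledger that fingerprints each step sequence. This is a sketch under stated assumptions: the in-memory storage and the hashing scheme are illustrative, and real Harness state would presumably live under harness/tasks/.

```python
"""Sketch of trajectory compilation bookkeeping: after the same step sequence
succeeds three times, the pattern is marked compilable into a deterministic
script. Storage and hashing scheme are illustrative assumptions."""
import hashlib

COMPILE_THRESHOLD = 3  # identical successful runs before bypassing the LLM

class TrajectoryLedger:
    def __init__(self) -> None:
        self._counts: dict[str, int] = {}

    def record_success(self, task_pattern: str, steps: list[str]) -> bool:
        """Record a successful run; return True once the pattern is compilable."""
        fingerprint = hashlib.sha256("\n".join(steps).encode()).hexdigest()
        key = f"{task_pattern}:{fingerprint}"
        self._counts[key] = self._counts.get(key, 0) + 1
        return self._counts[key] >= COMPILE_THRESHOLD

ledger = TrajectoryLedger()
steps = ["create handler file", "register route", "add test"]
ledger.record_success("add-endpoint", steps)          # 1st success: not yet
ledger.record_success("add-endpoint", steps)          # 2nd success: not yet
ready = ledger.record_success("add-endpoint", steps)  # 3rd identical success
```

Once `ready` is true, Harness can emit a deterministic entry point such as make add-endpoint NAME=foo, and future executions of that pattern skip the LLM entirely.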
Practical Adoption
Even a minimal AGENTS.md plus a basic lint script yields immediate benefits for any project that can run shell commands. Larger codebases gain the most from a full six‑layer infrastructure, while solo developers can start with a simple navigation file and grow the system over time.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
