Why AI Agents Stumble at Code and How a Harness Can Make Them Reliable

The article explains why large-language-model agents lose context and violate architectural rules when generating code, and proposes Harness, a framework that acts as the agent's operating system: every rule lives in the repository itself, backed by layered linting, pre-validation, automated verification, and cross-model review to keep agents on track.

When an AI agent is asked to implement a feature, it may write hundreds of lines of code only to fail lint because it imported a configuration package in violation of the project's architectural layering, a rule the agent never learned.

Root Cause: Invisible Context

Agents lack access to the repository‑wide facts that define architecture, naming conventions, and layer constraints. Prompt engineering cannot enumerate every implicit rule, and the limited context window quickly fills with diffs, errors, and logs, causing the agent to "forget" its original goal.

Solution Overview: Harness as the Agent’s Operating System

Harness treats the codebase itself as the only source of truth. All architectural decisions, layer constraints, and naming standards are versioned in the repository, typically under docs/ and referenced from a concise AGENTS.md navigation file (~100 lines).
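
For illustration, the navigation file can stay well under its ~100-line budget. The following AGENTS.md excerpt is a hypothetical sketch; the referenced file names match the layout shown below:

# AGENTS.md (navigation only; details live in docs/)

Read before any change:
- Layer rules and import policy: docs/ARCHITECTURE.md
- Build, test, and lint commands: docs/DEVELOPMENT.md
- Business context: docs/PRODUCT_SENSE.md

Hard rules:
- Respect the import layering defined in docs/ARCHITECTURE.md.
- Run python3 scripts/validate.py before marking any task complete.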

Key Principles

Encode every rule in the repository – no external wiki or chat notes.

Keep the navigation file short; detailed rules live in separate markdown files under docs/.

Enforce a hierarchical import policy: lower layers (e.g., internal/types/) may not import higher layers (e.g., internal/config/); a sketch of such a check follows this list.

Shift human effort from writing code to designing the verification environment.
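
The dependency-layer check is small enough to sketch. Below is a minimal, illustrative Python version of scripts/lint-deps; the layer map, Go file scanning, and import regex are assumptions that a real script would derive from docs/ARCHITECTURE.md:

#!/usr/bin/env python3
"""Illustrative sketch of scripts/lint-deps: flag imports that point
from a lower layer to a higher one."""
import re
import sys
from pathlib import Path

# Hypothetical layer map; lower numbers are lower layers.
LAYERS = {"internal/types": 0, "internal/config": 1,
          "internal/core": 3, "internal/handler": 4}
IMPORT_RE = re.compile(r'"(?:[\w.\-/]+/)?(internal/\w+)')

def layer_of(path):
    for pkg, layer in LAYERS.items():
        if pkg in path:
            return layer
    return None

violations = 0
for src in Path(".").rglob("*.go"):
    src_layer = layer_of(str(src))
    if src_layer is None:
        continue
    for match in IMPORT_RE.finditer(src.read_text()):
        dep_layer = layer_of(match.group(1))
        # Rule: a lower layer must not import a higher layer.
        if dep_layer is not None and dep_layer > src_layer:
            print(f"✗ {src}: layer {src_layer} imports {match.group(1)} (layer {dep_layer})")
            violations += 1
sys.exit(1 if violations else 0)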

Concrete Project Layout

my-project/
├── AGENTS.md               ← navigation (~100 lines)
├── docs/
│   ├── ARCHITECTURE.md     ← layer rules
│   ├── DEVELOPMENT.md      ← build/test/lint commands
│   ├── PRODUCT_SENSE.md    ← business context
│   ├── design-docs/        ← component specs
│   └── exec-plans/         ← execution plans
├── scripts/
│   ├── lint-deps.*         ← dependency-layer checks
│   ├── lint-quality.*      ← style rules (max 500 lines, no console.log/print)
│   ├── verify/             ← end-to-end checks
│   └── validate.py         ← unified validation pipeline
├── harness/
│   ├── tasks/              ← task state & checkpoints
│   ├── trace/              ← execution logs
│   └── memory/             ← learned lessons
└── [business code...]
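
The lint-quality script in this layout enforces the two style rules noted above (file-length cap, no leftover debug output). A minimal sketch; the file globs and skipped directories are assumptions:

#!/usr/bin/env python3
"""Illustrative sketch of scripts/lint-quality: cap source files at 500
lines and reject console.log/print debug output in business code."""
import sys
from pathlib import Path

BANNED = ("console.log(", "print(")
failures = 0
for pattern in ("*.go", "*.ts", "*.py"):
    for src in Path(".").rglob(pattern):
        if src.parts[0] in ("scripts", "harness"):
            continue  # tooling may print; business code may not
        lines = src.read_text().splitlines()
        if len(lines) > 500:
            print(f"✗ {src}: {len(lines)} lines (max 500)")
            failures += 1
        for i, line in enumerate(lines, 1):
            if any(tok in line for tok in BANNED):
                print(f"✗ {src}:{i}: debug output not allowed")
                failures += 1
sys.exit(1 if failures else 0)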

Verification Pipeline

The executor follows a strict workflow: detect environment → load context → plan → human approval → execute → verify → complete. Verification itself consists of four stages: build → lint-arch → test → verify. A failure at any stage aborts the remaining stages, preventing costly later-stage fixes.
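
A unified validate.py can encode this fail-fast ordering directly. A minimal sketch; the per-stage commands are assumptions that would really come from docs/DEVELOPMENT.md:

#!/usr/bin/env python3
"""Illustrative sketch of scripts/validate.py: run the four verification
stages in order and abort on the first failure."""
import subprocess
import sys

STAGES = [
    ("build", ["go", "build", "./..."]),
    ("lint-arch", ["python3", "scripts/lint-deps.py"]),
    ("test", ["go", "test", "./..."]),
    ("verify", ["python3", "scripts/verify/run_all.py"]),
]

for name, cmd in STAGES:
    print(f"→ {name}: {' '.join(cmd)}")
    if subprocess.run(cmd).returncode != 0:
        # Cheap checks run first, so a failure skips the expensive stages.
        print(f"✗ {name} failed; skipping remaining stages")
        sys.exit(1)
print("✓ all stages passed")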

Pre‑validation

Before any structural change (e.g., creating a file in internal/types/ or adding a cross-package import), the executor runs a lightweight pre-validation script such as:

python3 scripts/verify_action.py --action "create file internal/types/user.go"
# ✓ VALID: internal/types/ is Layer 0, naming follows convention
python3 scripts/verify_action.py --action "import internal/core from internal/handler"
# ✗ INVALID: internal/handler (L4) cannot import internal/core (L3)
#   Fix: handler should depend on core through interfaces
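
The pre-validation script itself can be a thin rule table. A minimal sketch matching the calls above; the action format and the allow-list are assumptions, and a real script would load its rules from docs/ARCHITECTURE.md:

#!/usr/bin/env python3
"""Illustrative sketch of scripts/verify_action.py: check a proposed
structural action against repository rules before code is written."""
import argparse
import sys

# Hypothetical allow-list: which internal packages each package may import directly.
ALLOWED_IMPORTS = {
    "internal/handler": {"internal/types"},  # reach core through interfaces only
    "internal/core": {"internal/types"},
    "internal/types": set(),                 # Layer 0 imports nothing internal
}

parser = argparse.ArgumentParser()
parser.add_argument("--action", required=True)
action = parser.parse_args().action

if action.startswith("import "):
    # Assumed action format: "import <imported> from <importer>"
    imported, _, importer = action[len("import "):].partition(" from ")
    if imported not in ALLOWED_IMPORTS.get(importer, set()):
        print(f"✗ INVALID: {importer} cannot import {imported}")
        print("  Fix: depend on it through interfaces, or update ARCHITECTURE.md")
        sys.exit(1)
print(f"✓ VALID: {action}")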

Cross‑Model Review

After code generation and mechanical checks, a second agent with a different model reviews the diff against architecture docs, naming, and performance considerations. Example prompt:

review_result = Agent(
    description="Review: rate-limiter implementation",
    model="codex",  # deliberately a different model from the coding agent
    prompt=f"""
    Review the following changes for:
    1. Logic correctness and edge cases
    2. Consistency with ARCHITECTURE.md
    3. Naming clarity
    4. Performance implications
    Changes: {coding_result.diff}
    Task context: {task_description}
    """,
)

Memory, Critic, and Refiner Loop

All validation failures are stored under harness/trace/failures/. A periodic Critic script analyses these logs, discovers recurring patterns (e.g., missing layer mapping for internal/cache), and produces recommendations. The Refiner then updates lint rules, improves error messages, and adds missing documentation, closing the feedback loop.
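
The Critic pass can start as little more than counting recurring failure signatures. A minimal sketch; the JSON log schema ("stage", "message") is an assumption:

#!/usr/bin/env python3
"""Illustrative sketch of a Critic pass: scan harness/trace/failures/
for recurring failure signatures and emit recommendations."""
import json
from collections import Counter
from pathlib import Path

signatures = Counter()
for log in Path("harness/trace/failures").glob("*.json"):
    entry = json.loads(log.read_text())
    # Group failures by stage plus the first line of the error message.
    signatures[(entry["stage"], entry["message"].splitlines()[0])] += 1

for (stage, message), count in signatures.most_common():
    if count >= 3:  # a recurring pattern worth a lint rule or doc fix
        print(f"RECOMMEND: seen {count}x at {stage}: {message}")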

Trajectory Compilation

When a task pattern is executed successfully three times with identical steps (e.g., adding an API endpoint), Harness can compile it into a deterministic script such as make add-endpoint NAME=foo, allowing future executions to bypass LLMs entirely.
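
Detecting a compilable pattern can be mechanical. A minimal sketch that looks for three identical successful runs under harness/tasks/; the task-record schema ("pattern", "status", "steps") is an assumption:

#!/usr/bin/env python3
"""Illustrative sketch of trajectory compilation: flag task patterns that
succeeded three times with identical steps as candidates for a
deterministic script (e.g., a make target)."""
import json
from collections import Counter, defaultdict
from pathlib import Path

runs = defaultdict(Counter)
for record in Path("harness/tasks").glob("*.json"):
    task = json.loads(record.read_text())
    if task["status"] == "success":
        # Identical step sequences collapse to the same tuple key.
        runs[task["pattern"]][tuple(task["steps"])] += 1

for pattern, counts in runs.items():
    steps, n = counts.most_common(1)[0]
    if n >= 3:
        print(f"COMPILE: '{pattern}' succeeded {n}x with identical steps:")
        print("  " + " -> ".join(steps))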

Practical Adoption

Even a minimal AGENTS.md plus a basic lint script yields immediate benefits for any project that can run shell commands. Larger codebases gain the most from a full six‑layer infrastructure, while solo developers can start with a simple navigation file and grow the system over time.

Tags: code generation, LLM, linting
Written by Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.