AI Coding Needs Discipline: My Two‑Month Harness Framework Experience

The article analyzes why the bottleneck in AI‑assisted coding has shifted from model capability to workflow stability, introduces a three‑layer "harness" framework that externalizes discipline, details its evolution through four development phases, and presents a deterministic evaluation platform that quantifies the framework’s effectiveness.

Alibaba Cloud Developer

Jun 16, 2026

Core Insight

AI coding is no longer limited by model intelligence; the real challenge is the instability of the workflow. The author observes that models are "smart enough" but forgetful, and that stability must be supplied by an external engineering framework.

What Is a Harness?

A harness is defined as a structured, executable, and testable framework that turns "what AI should do" into concrete, enforceable rules. Unlike ad‑hoc prompts, a harness separates discipline (rules) from intelligence (model).

Key Principle: Prompt engineering is persuasive; harness provides hard constraints.
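To make the distinction concrete, here is a minimal sketch of a rule written as an executable check instead of prompt text. The Rule class, the check_command function, and the rule name are hypothetical and do not come from the author's framework; the only detail borrowed from the article is the mvn -am pitfall it cites as an example.

```python
import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One single-responsibility rule: a program, a forbidden flag, and a reason."""
    name: str
    program: str
    forbidden_flag: str
    reason: str


# Hypothetical rule modeled on the article's `mvn -am` example.
NO_MVN_AM = Rule(
    name="no-mvn-also-make",
    program="mvn",
    forbidden_flag="-am",
    reason="mvn -am caused a deadlock in the author's environment",
)


def check_command(command: str, rules: list[Rule]) -> list[str]:
    """Return the violations for a shell command; an empty list means allowed."""
    tokens = shlex.split(command)
    if not tokens:
        return []
    violations = []
    for rule in rules:
        if tokens[0] == rule.program and rule.forbidden_flag in tokens[1:]:
            violations.append(f"{rule.name}: {rule.reason}")
    return violations
```

A prompt can only ask the model not to run the command; a check like this can refuse it before execution. This sketch matches only the short flag, so the long form --also-make would need its own rule.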

Three‑Layer Architecture

Entry Layer (persistent): CLAUDE.md + CLAUDE.local.md store role definitions, trigger rules, and a tiny static context (≤8 KB).

Atomic Rule Layer: A rules/ directory with seven single‑responsibility rules that encode every pitfall the author has encountered (e.g., mvn -am causing a deadlock).

On‑Demand Context Layer: A context/ directory that is loaded only when a specific phase needs it (TDD guide, pre‑mortem template, etc.), keeping the main context small.

This design treats the main session as a thin dispatcher that reads state.json and decides which agent to invoke, while each agent runs in an isolated context.
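The layering above can be sketched roughly as follows. The names CLAUDE.md, CLAUDE.local.md, state.json, and context/ come from the article; the "phase" key, the PHASE_CONTEXT mapping, and the individual context file names are assumptions made for illustration.

```python
import json
from pathlib import Path

MAX_PERSISTENT_BYTES = 8 * 1024  # the article's 8 KB budget for the entry layer

# Hypothetical mapping from workflow phase to an on-demand context file.
PHASE_CONTEXT = {
    "tdd": "context/tdd-guide.md",
    "pre-mortem": "context/pre-mortem-template.md",
}


def persistent_context_size(root: Path) -> int:
    """Total size in bytes of the entry-layer files that exist under root."""
    files = [root / "CLAUDE.md", root / "CLAUDE.local.md"]
    return sum(f.stat().st_size for f in files if f.exists())


def context_for_current_phase(root: Path) -> str:
    """Read state.json and load only the context file the current phase needs."""
    if persistent_context_size(root) > MAX_PERSISTENT_BYTES:
        raise ValueError("entry layer exceeds the 8 KB budget")
    state = json.loads((root / "state.json").read_text(encoding="utf-8"))
    relative_path = PHASE_CONTEXT.get(state["phase"])  # "phase" is an assumed key
    if relative_path is None:
        return ""  # this phase needs no extra context
    return (root / relative_path).read_text(encoding="utf-8")
```

Treating the size budget as a hard failure, not a guideline, keeps the persistent layer from growing back into the giant prompt the layering was meant to replace.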

Agent Design

Dispatcher: Routes tasks based on intent × risk classification.

Orchestrator: Synthesizes the three role agents’ outputs (requirement‑analyst, tech‑architect, quality‑guardian) and asks the user for confirmation.

Developer, Verifier, Deployer, Tester: Execute the actual steps (code, compile, test, deploy) with strict tool whitelists.

Design Rule: The main session never reads business code directly; it only follows dispatcher commands.
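A minimal sketch of intent × risk routing with per-agent tool whitelists might look like this. The agent names follow the article, but the intent and risk categories, the routing table, and the tool names are all hypothetical; the article does not spell them out.

```python
from enum import Enum


class Intent(Enum):
    FEATURE = "feature"
    BUGFIX = "bugfix"


class Risk(Enum):
    LOW = "low"
    HIGH = "high"


# Hypothetical routing table: high-risk work goes through the orchestrator,
# which asks the user for confirmation; low-risk work goes straight to coding.
ROUTES = {
    (Intent.FEATURE, Risk.HIGH): "orchestrator",
    (Intent.FEATURE, Risk.LOW): "developer",
    (Intent.BUGFIX, Risk.HIGH): "orchestrator",
    (Intent.BUGFIX, Risk.LOW): "developer",
}

# Hypothetical per-agent tool whitelists; the tool names are illustrative.
TOOL_WHITELIST = {
    "developer": {"read_file", "write_file"},
    "verifier": {"read_file", "run_build"},
    "tester": {"read_file", "run_tests"},
    "deployer": {"run_deploy"},
}


def route(intent: Intent, risk: Risk) -> str:
    """Pick the agent for a task; unknown combinations fall back to the orchestrator."""
    return ROUTES.get((intent, risk), "orchestrator")


def require_tool(agent: str, tool: str) -> None:
    """Raise PermissionError if the agent tries a tool outside its whitelist."""
    if tool not in TOOL_WHITELIST.get(agent, set()):
        raise PermissionError(f"{agent} may not use {tool}")
```

Falling back to the orchestrator for unmapped combinations is one conservative choice: anything the table does not recognize ends up in front of the user instead of being executed silently.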

Evolution Stages

Copy‑and‑Paste Phase: Started with open‑source specs (e.g., oh‑my‑claudecode) but quickly hit context overflow and rule‑conflict issues.

Prompt‑Heavy Phase: Packed all workflow steps into a giant prompt; after three days the model ignored many rules and the context filled with code output.

Layered Reduction Phase: Split the prompt into three layers, reduced the persistent prompt to ≤8 KB, and moved deep content to on‑demand files. This restored the model’s ability to read code but exposed a new problem: context loss in long sessions.

Agent‑Oriented Phase: Introduced a dispatcher‑driven agent architecture, externalized state to files, and achieved crash‑resilient continuation across days.
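The file-based state that makes cross-day continuation possible can be sketched as below. This is an illustration under assumptions, not the author's implementation: the "completed_steps" field and the helper names are hypothetical, and the atomic write via a temporary file is a common technique swapped in here to show why a crash need not corrupt the state.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_state(path: Path, state: dict) -> None:
    """Write state to a temp file, then rename it over the real file."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        json.dump(state, tmp, indent=2, sort_keys=True)
        tmp.flush()
        os.fsync(tmp.fileno())
    os.replace(tmp_name, path)  # the rename is atomic on POSIX systems


def load_state(path: Path) -> dict:
    """Load saved state, or start fresh if no state file exists yet."""
    if not path.exists():
        return {"completed_steps": []}  # assumed field name
    return json.loads(path.read_text(encoding="utf-8"))


def mark_done(path: Path, step: str) -> None:
    """Record a finished step so a later session can skip it."""
    state = load_state(path)
    if step not in state["completed_steps"]:
        state["completed_steps"].append(step)
    save_state(path, state)


def next_step(path: Path, pipeline: list[str]) -> Optional[str]:
    """Return the first pipeline step not yet recorded as done, or None."""
    done = set(load_state(path)["completed_steps"])
    for step in pipeline:
        if step not in done:
            return step
    return None
```

Because the progress lives in a file and not in the model's context, a new session only needs to call next_step to resume where the previous one stopped.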

Evaluation Platform

The author built a deterministic Python evaluator that treats the harness as the system under test. It runs three independent tracks:

Full‑run evaluation (/dev): Executes the entire pipeline on real infrastructure, checking every gate (G1‑G8) and confirming deployment.

Query mode (/eval): Runs multiple versions and cases without side effects, producing a reproducible score.

Issue tracing (/query): Reads logs or trace IDs to locate failures without performing actions.

The scoring system combines seven dimensions (process completeness, artifact quality, code correctness, efficiency, security, evolvability, integration testing) with explicit weights (e.g., 22% for completeness, 22% for correctness). Scores are fully deterministic: the same run always yields the same hash, enabling reliable A/B comparisons.
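A weighted, hash-stamped score of this kind could be computed along the following lines. Only the two 22% weights come from the article; the other five weights are placeholders chosen so the total is 100, and the report layout and function name are assumptions. Integer arithmetic and canonical JSON are used here so that identical inputs always serialize, and therefore hash, identically.

```python
import hashlib
import json

# Weights in percent. Only the two 22% values are from the article;
# the remaining five are placeholder assumptions that bring the sum to 100.
WEIGHTS = {
    "process_completeness": 22,
    "code_correctness": 22,
    "artifact_quality": 14,
    "efficiency": 10,
    "security": 12,
    "evolvability": 10,
    "integration_testing": 10,
}
assert sum(WEIGHTS.values()) == 100


def score_run(dimension_scores: dict[str, int]) -> dict:
    """Combine per-dimension scores (0-100) into a total plus a content hash."""
    if set(dimension_scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the seven dimensions")
    # Integer math only: the total is the weighted score multiplied by 100.
    weighted_sum = sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)
    report = {"dimensions": dict(dimension_scores), "total_hundredths": weighted_sum}
    # Canonical JSON (sorted keys, fixed separators) makes the hash order-independent.
    canonical = json.dumps(report, sort_keys=True, separators=(",", ":"))
    report["hash"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return report
```

Two harness versions can then be compared case by case: equal hashes mean the runs scored identically on every dimension, and any difference is attributable to the change under test.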

“Prefer a reproducible coarse score over a drifting precise one.”

Key Findings

Externalizing state and using file‑based hand‑offs prevents context loss and allows cross‑day continuation.

Hard gate checks (G1‑G8) are far more predictive of success than model‑generated confidence.

Deterministic evaluation exposes self‑reported “honesty gaps” where the model claims success but compilation fails.
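The "honesty gap" finding suggests a simple check: run the gate command and compare its exit code with what the model reported. The sketch below assumes a gate is just a command whose zero exit status means pass; the GateResult type, the run_gate function, and the choice of command are hypothetical, since the article does not define what each of G1‑G8 checks.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    gate: str
    claimed_pass: bool  # what the model reported
    actual_pass: bool   # what the command's exit code says

    @property
    def honesty_gap(self) -> bool:
        """True when the model claimed success but the gate actually failed."""
        return self.claimed_pass and not self.actual_pass


def run_gate(gate: str, command: list[str], claimed_pass: bool) -> GateResult:
    """Run the gate command and record the claim next to the real outcome."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return GateResult(gate, claimed_pass, completed.returncode == 0)


if __name__ == "__main__":
    # Stand-in for a failing compile step: a Python process that exits with 1.
    failing = [sys.executable, "-c", "raise SystemExit(1)"]
    result = run_gate("compile", failing, claimed_pass=True)
    print(result.gate, "honesty gap:", result.honesty_gap)
```

The gate outcome depends only on the exit code, never on the model's own summary, which is what lets a hard check override self-reported confidence.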

Future Directions

Integrate structured memory layers (e.g., VikingMem) to replace raw token storage.

Build a persistent knowledge graph of code dependencies to avoid repeated file scans.

Compare the current dispatcher‑file approach with high‑parallel workflow engines (e.g., Apache Burr) via A/B testing.

Overall, the article demonstrates a rigorous, engineering‑first methodology for making AI‑driven coding reliable, turning “discipline‑less AI” into a controllable, auditable process.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
