Prompt, Context, Harness: Decoding the Three‑Layer Architecture of AI Agent Engineering

This article traces the evolution from Prompt Engineering through Context Engineering to Harness Engineering, explains why each layer is needed, and provides concrete code examples, diagnostic scripts, and practical guidelines for building reliable AI coding agents.


Three Layers of AI Agent Engineering

Since 2023 the term "Prompt Engineering" has been ubiquitous; in 2025 the focus shifted to "Context Engineering"; and by April 2026 "Harness Engineering" had become a recurring topic in the community. These three terms describe the same problem at increasing depths of abstraction, and understanding their boundaries provides a highly practical cognitive framework.

Why the Shift Occurs

Most failures of AI‑generated code are not caused by the model itself—models can write code—but by the starting conditions and lack of self‑correction mechanisms such as requirement clarification, boundary checks, and result validation. HumanLayer’s engineering team spent over a year observing coding agents that ignored instructions, executed dangerous commands without confirmation, or entered infinite loops on simple tasks. Their conclusion: "This is not a model problem, it is a configuration problem." Smarter models simply receive harder tasks and exhibit the same failure patterns because the underlying issue is the nondeterministic nature of the system.

Prompt Engineering: Getting the Task Expression Right

Prompt Engineering remains essential: structured output, chain-of-thought, role-setting, few-shot examples, and iterative wording optimization are all still valid techniques, but they become insufficient at the scale of multi-step agents. Prompt Engineering solves the "expression" problem: how to phrase a request so that the model behaves correctly. It deals with the interface between human intent and model input: setting roles, tone, and constraints; breaking complex requests into ordered steps; providing format-matching examples; and repeatedly testing wording until the output stabilises.

Prompt Engineering cannot inject private knowledge bases, inform the model about recent code changes, maintain cross-session memory, or replace permission systems, tool availability, and error-recovery logic. It operates on a single request-response pair and is ideal for drafting emails, generating summaries, or one-off format conversions. When a task requires tool invocation, state tracking, or multi-step collaboration, prompting alone cannot sustain the system.

# Naive approach
prompt = "Fix the bug in my code"

# Prompt‑engineered approach
prompt = """You are a senior Python engineer reviewing a production bug.
Context:
- The bug causes a KeyError on line 47 of orders.py
- It only occurs during weekend batch processing
- The system uses PostgreSQL with a read replica
Your task:
1. Identify the root cause without changing any code
2. Describe the data condition that triggers the error
3. Propose a backward‑compatible fix
4. List any tests that should be added
Do not modify any files until I confirm your diagnosis."""

The engineered prompt is far superior, but if the model cannot access orders.py, run the test suite, or verify the fix, the quality of the prompt hits a hard ceiling.

Context Engineering: What the Model Sees When Making Decisions

Context Engineering operates at a higher level of abstraction. While Prompt Engineering asks "how to express the task", Context Engineering asks "what information environment the model should operate in". Anthropic defines the core challenge for agents with longer time spans and multi‑turn reasoning as "managing the entire context state: system instructions, tools, MCP servers, external data, message history". The context window is not just a prompt; it is all the information the model can attend to for each decision.

Context Engineering must solve three core problems:

Retrieval (RAG): When required knowledge exceeds the context window, index it and retrieve the most relevant fragments at the right moment. This works well for document lookup or policy Q&A, but poorly for debugging, where the signal is call chains, git blame, symbol definitions, and migration history: a job more akin to grep than to vector similarity.

Tool Exposure: Without tools, the LLM has no access to the current time, file I/O, command execution, or external APIs. Deciding which tools to expose, how to describe them, and how to prevent tool overload is crucial. HumanLayer found that loading too many MCP servers quickly fills the context window and "enters the dumb-down zone". A small, composable tool set (Read, Write, Grep, Glob, Bash) works better than a sprawling list.

Memory Management: LLMs are stateless; every session is a cold start. Context Engineering decides what stays in the active window, what is summarised, what is persisted, and what is retrieved on demand. Short-term memory is the conversation buffer; long-term memory is structured storage for user preferences, project rules, and cross-session decision records. LangGraph's memory system supports such persistent context, including user profiles and accumulated preferences.
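
To make the short-term/long-term split concrete, here is a minimal, framework-agnostic sketch in plain Python (the class and its persistence format are illustrative assumptions, not LangGraph's actual API):

import json
from collections import deque
from pathlib import Path

class AgentMemory:
    """Minimal sketch: a bounded short-term buffer plus a persistent
    long-term store for rules and preferences that survive sessions."""

    def __init__(self, store_path: str = "memory.json", window: int = 20):
        self.buffer = deque(maxlen=window)    # short-term: recent turns only
        self.store_path = Path(store_path)    # long-term: structured storage on disk
        self.long_term = (
            json.loads(self.store_path.read_text())
            if self.store_path.exists() else {}
        )

    def add_turn(self, role: str, content: str) -> None:
        """Record a conversation turn; old turns fall off the buffer."""
        self.buffer.append({"role": role, "content": content})

    def remember(self, key: str, value: str) -> None:
        """Persist a durable fact: a project rule, preference, or decision."""
        self.long_term[key] = value
        self.store_path.write_text(json.dumps(self.long_term, indent=2))

    def build_context(self) -> list[dict]:
        """What the model sees each turn: durable rules first, then recent turns."""
        rules = [{"role": "system", "content": f"{k}: {v}"}
                 for k, v in self.long_term.items()]
        return rules + list(self.buffer)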

ETH Zurich tested 138 markdown files (e.g., AGENTS.md, CLAUDE.md) injected into the system prompt and found that LLM-generated versions actually harmed performance, increasing cost by over 20% and consuming 14–22% more inference tokens without improving success rates. The lesson: fewer, high-signal files are better.

# CLAUDE.md — minimal, high‑signal example
## Project conventions
- All database access goes through /src/db/queries/, never raw SQL inline
- Use `pnpm run typecheck` after every change
- Never modify migration files after they've been committed
## Linear workflow
- Fetch issues: `linear get-issue ENG-XXXX`
- Update status: `linear update-status ENG-XXXX "in dev"`
## Verification
Run `pnpm run typecheck && pnpm test --changed` before stopping.
Failures are required reading - do not ignore them.

Sub‑agents act as context firewalls. Chroma research shows model performance degrades as context length grows, especially when semantic similarity between the problem and context is low. Sub‑agents allocate a fresh, small, highly relevant context window for each sub‑task, while the parent agent only receives the compressed result, isolating tool calls, grep output, and file reads.

Cost can be tuned: the parent session uses a high‑cost model (e.g., Opus) for orchestration, while isolated sub‑agents run cheaper, faster models (e.g., Sonnet or Haiku).

# Conceptual sub‑agent scheduling pattern
def trace_request_flow(parent_agent, service_name):
    sub_agent_prompt = f"""
    Trace the request flow for {service_name}.
    Return only:
    1. Entry point file:line
    2. Key middleware in order
    3. DB queries triggered
    Cite sources as filepath:line.
    Do not include intermediate steps in your response.
    """
    # Sub‑agent runs in an isolated context window
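    # Cost tuning: the isolated sub-agent can run a cheaper model (e.g., Sonnet or Haiku)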
    result = dispatch_sub_agent(sub_agent_prompt, tools=["read", "grep"])
    # Parent agent sees only the condensed result
    return result

Hooks create deterministic feedback loops. A verified pattern is to run type‑check and linter after every agent stop; only errors are fed back to the agent, keeping successful runs silent.

#!/bin/bash
cd "$CLAUDE_PROJECT_DIR" || exit 1

# Lint/format first (retrying once, since the first --write pass may itself
# fix what it flagged), then typecheck; capture all output either way
LINT=$(biome check . --write --unsafe 2>&1 || biome check . --write --unsafe 2>&1)
LINT_STATUS=$?
TYPES=$(turbo run typecheck 2>&1)
TYPE_STATUS=$?

if [ "$LINT_STATUS" -ne 0 ] || [ "$TYPE_STATUS" -ne 0 ]; then
    printf '%s\n%s\n' "$LINT" "$TYPES" >&2
    exit 2  # Exit code 2 reactivates the agent to fix the errors
fi
# Success: stay silent, do not pollute context.

Verification layers act as back-pressure: if code coverage drops, a hook alerts; TypeScript errors prevent task completion. These feedback mechanisms bridge the gap between "probably works" and "provably works", and HumanLayer found them to yield the highest return on Harness investment.
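
As an illustration of back-pressure, a minimal coverage hook might look like the sketch below; the file path and JSON shape assume Istanbul/Jest's json-summary reporter, and the threshold is an arbitrary assumption:

#!/usr/bin/env python3
import json
import sys

THRESHOLD = 80.0  # assumed coverage floor, in percent

# Assumes the JSON summary emitted by Istanbul/Jest's json-summary reporter
with open("coverage/coverage-summary.json") as f:
    pct = json.load(f)["total"]["lines"]["pct"]

if pct < THRESHOLD:
    print(f"Coverage {pct:.1f}% fell below the {THRESHOLD:.0f}% floor", file=sys.stderr)
    sys.exit(2)  # exit code 2 reactivates the agent, as in the Stop hook above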

# Diagnostic framework: which layer is failing?
def diagnose_agent_failure(failure_type):
    if failure_type == "wrong_output_format":
        return "Prompt Engineering - constrain output format"
    elif failure_type == "hallucinated_fact_about_codebase":
        return "Context Engineering - add retrieval or inject relevant file"
    elif failure_type == "wrong_tool_selected":
        return "Context Engineering - improve tool descriptions or reduce tool count"
    elif failure_type == "drifts_on_long_task":
        return "Harness Engineering - add sub‑agent isolation or loop detection"
    elif failure_type == "destructive_action_taken":
        return "Harness Engineering - add permission hooks and approval gates"
    elif failure_type == "silent_failure_no_error_surfaced":
        return "Harness Engineering - add back‑pressure verification and hooks"
    elif failure_type == "good_code_regresses_unknown":
        return "Harness Engineering - add entropy management and documentation linting"

Practical Harness Configuration for a TypeScript Monorepo

## Production TypeScript monorepo Harness checklist
### Prompt layer
- [ ] System prompt defines role, scope, and prohibited actions
- [ ] Output format is constrained for structured responses
### Context layer
- [ ] AGENTS.md under 60 lines, universally applicable only
- [ ] Skills for debugging, refactoring, PR creation, dependency auditing
- [ ] MCP servers: only 2‑3 active at a time, disable unused ones
- [ ] Memory: conversation buffer + structured long‑term rules store
### Harness layer
- [ ] Pre‑commit hook: biome + typecheck
- [ ] PostToolUse hook: surface linter errors to agent on every file write
- [ ] Stop hook: run changed test files only, return errors
- [ ] Coverage hook: alert if coverage drops below threshold
- [ ] Loop detection: flag if same file edited 3+ times in same session (see the sketch after this checklist)
- [ ] Sub‑agent patterns defined for research, codebase tracing, QA
- [ ] Escalation rule: if blocked for 3+ tool calls, stop and ask
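
The loop-detection item is straightforward to implement; here is a minimal sketch (the hook wiring and threshold are assumptions, not any specific product's API):

from collections import Counter

class LoopDetector:
    """Flag when the agent edits the same file 3+ times in one session,
    a common signature of an agent stuck in a fix/break loop."""

    def __init__(self, max_edits: int = 3):
        self.max_edits = max_edits
        self.edit_counts: Counter[str] = Counter()

    def record_edit(self, filepath: str) -> bool:
        """Returns True once this file crosses the loop threshold."""
        self.edit_counts[filepath] += 1
        return self.edit_counts[filepath] >= self.max_edits

# Hypothetical wiring inside a post-edit hook:
detector = LoopDetector()
if detector.record_edit("src/orders.py"):
    # Surface a warning instead of letting the loop continue silently
    print("Loop suspected: src/orders.py edited 3+ times; stop and reassess.")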

Missing any of these layers yields a deficient agent: a Context‑only agent has sufficient information but no feedback; a Harness‑only agent has feedback but lacks the necessary information.

Best-Practice Recommendations (as of March 2026)

Start with a minimal AGENTS.md (≤ 60 lines) containing only universally applicable constraints; generated files tend to degrade performance.

Add a verification hook immediately; running typecheck or lint after each agent stop provides the highest leverage.

Identify the two most used MCP tools and disable the rest; tool description bloat is a common cause of context saturation.

Load a Skill only after the same failure occurs twice; ETH Zurich found premature loading harms performance.

Delegate tasks requiring > 15 tool calls to a Sub‑agent; this prevents context corruption and keeps the parent thread coherent.

Treat git as the agent’s native memory—small, semantically clear commits support queries and improve Harness reliability.
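
As a concrete illustration of git-as-memory, a small helper can recover past decisions by searching commit messages; the function is illustrative, but the git flags are standard:

import subprocess

def recall_decisions(keyword: str, limit: int = 10) -> str:
    """Treat git history as long-term memory: find past commits whose
    messages mention a keyword, so prior decisions can be recovered."""
    # --grep searches commit messages; --oneline keeps the result cheap to inject
    result = subprocess.run(
        ["git", "log", "--oneline", f"--grep={keyword}", f"-n{limit}"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout

# Example: why did we start routing reads through a replica?
print(recall_decisions("read replica"))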

Impact and Outlook

As base models become increasingly commodified toward the end of 2026, Harness Engineering becomes the differentiator. LangChain improved its coding benchmark by 14 percentage points by redesigning the Harness, without changing the underlying model. OpenAI built a million-line production app with zero human-written code, with engineers focusing solely on Harness design. Stripe's internal Minions system generates over 1,000 merged PRs per week, with the Harness handling test execution, CI verification, style compliance, and documentation updates.

The core skill shift is from "how to write a Prompt" to "how to design an environment where AI reliably does the right thing"—requiring systems thinking, architecture, observability, and well‑defined stop conditions.

Prompt Engineering provides better questioning, Context Engineering supplies better information, and Harness Engineering delivers a trustworthy system that can be deployed for real work.
