Production‑Ready Agent Harness: 7‑Layer Architecture for Scalable AI Agents

The article presents Agent Harness, a production‑grade AI agent framework built on a seven‑layer pyramid that addresses stability, tool safety, cost, hallucination, autonomous decision‑making, multi‑agent collaboration, work‑tree isolation and observability, and validates each layer with real‑world case studies and concrete benchmarks.

Linyb Geek Road

May 27, 2026

1. Core Execution Engine – Stability First

Typical agent tutorials drive the agent with a bare while True loop, which in production quickly leads to infinite loops, crashes, token bloat and runs that cannot be interrupted. The article describes how these issues caused daily failures and large bills.

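# Naive tutorial loop: no step cap, no error handling, no state persistence.
def naive_agent(llm, messages, execute_tool):
    while True:  # nothing bounds the number of iterations
        response = llm.chat(messages)
        if response.has_tool_call():
            result = execute_tool(response.tool_call)  # an exception here kills the run
            messages.append({"role": "tool", "content": result})  # history grows without limit
        else:
            return response.content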

The proposed solution is a double‑loop engine: a fast execution loop (compared to a CPU's microcode) runs the individual steps, and a slow "thinking" loop fires every three steps or on any error to re‑plan, audit progress and break infinite loops. According to the article, this change raised task success from 60 % to over 90 % and eliminated endless cycles.
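The double‑loop idea can be sketched in a few lines. This is a minimal illustration, not the article's engine: `step_fn` and `review_fn` are hypothetical callables standing in for LLM‑backed logic, and the hard step cap is an added assumption.

```python
from typing import Callable, Optional

REVIEW_EVERY = 3  # the text's "every three steps"


def run_double_loop(
    step_fn: Callable[[list], Optional[str]],
    review_fn: Callable[[list, Optional[Exception]], None],
    max_steps: int = 30,
) -> Optional[str]:
    """Fast loop runs steps; the slow loop reviews every few steps or on error.

    step_fn(history) returns a final answer, or None to keep going.
    review_fn(history, error) may rewrite history in place (re-plan, audit).
    Both callables are hypothetical stand-ins, not part of the article.
    """
    history: list = []
    for step in range(1, max_steps + 1):
        error: Optional[Exception] = None
        try:
            answer = step_fn(history)      # fast execution loop
            if answer is not None:
                return answer
        except Exception as exc:           # errors also wake the slow loop
            error = exc
        if error is not None or step % REVIEW_EVERY == 0:
            review_fn(history, error)      # slow "thinking" loop
    return None                            # hard cap (assumed): no endless cycles
```

Returning `None` at the cap lets the caller decide whether to escalate to a human or to fail the task.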

Additionally, a checkpoint‑and‑resume mechanism persists the full state (context, step, intermediate results) after each step, allowing a four‑hour job to resume after a server reboot instead of restarting from scratch.
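A minimal sketch of per‑step persistence, assuming a JSON file as the store (the article does not say where state is kept). The function names and the state layout are illustrative.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(path: Path, state: dict) -> None:
    """Write state atomically so a crash mid-write cannot corrupt the file."""
    fd, tmp_name = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(state, fh)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp_name, path)  # atomic rename on the same filesystem


def load_checkpoint(path: Path) -> dict:
    """Return the last saved state, or a fresh one if no checkpoint exists."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {"step": 0, "context": [], "results": []}


def run_job(path: Path, total_steps: int, do_step) -> dict:
    """Run the remaining steps, persisting after each one (do_step is hypothetical)."""
    state = load_checkpoint(path)
    while state["step"] < total_steps:
        state["results"].append(do_step(state["step"]))
        state["step"] += 1
        save_checkpoint(path, state)
    return state
```

Calling `run_job` again with the same path after a crash picks up at the first step that was not checkpointed.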

To avoid vendor lock‑in, the engine abstracts model calls behind a Provider layer, enabling cheap models for simple steps and expensive ones for complex reasoning, cutting model costs by more than 50 % while providing automatic fail‑over when a provider is unavailable.
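One way such a Provider layer could look, as a sketch only: the `Provider` dataclass, the two tier names and the `ProviderUnavailable` exception are assumptions made for illustration, and `complete` stands in for whatever client call a real provider exposes.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


class ProviderUnavailable(Exception):
    """Raised by a provider that cannot serve the request right now."""


@dataclass
class Provider:
    name: str
    tier: str                        # "cheap" or "strong" (illustrative tiers)
    complete: Callable[[str], str]   # hypothetical call into a model API


def route(prompt: str, complex_step: bool, providers: Sequence[Provider]) -> str:
    """Try providers of the preferred tier first, then fail over to the rest."""
    preferred = "strong" if complex_step else "cheap"
    # sorted() is stable, so providers keep their configured order within a tier.
    ordered = sorted(providers, key=lambda p: p.tier != preferred)
    last_error = None
    for provider in ordered:
        try:
            return provider.complete(prompt)
        except ProviderUnavailable as exc:
            last_error = exc         # remember and try the next provider
    raise RuntimeError("all providers failed") from last_error
```

Because the caller only states whether a step is complex, swapping vendors or adding a tier does not touch agent code.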

2. Tool System – Safety First

Tools are the agent’s hands and feet; misusing them can be disastrous. The author recounts two real incidents: an rm -rf src/* command that wiped a codebase and an uncontrolled API loop that generated a ¥3000 bill.

To prevent such accidents, tools are classified into five risk levels (L0‑L4) with corresponding execution policies:

L0 (no risk): read‑only actions, auto‑executed.

L1 (low risk): file creation, auto‑executed with logging.

L2 (medium risk): file modification; requires a diff preview and auto‑executes if no objection is raised within 10 s.

L3 (high risk): system commands and file deletion; require manual approval.

L4 (critical): disk format, database drop, absolutely blocked.

All new tools must be evaluated and assigned a level before deployment. This policy blocked >90 % of tool‑related incidents.
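The level‑to‑policy mapping above can be expressed as a small lookup table. In this sketch the tool names and policy strings are hypothetical, and treating unrated tools as blocked is an assumption that follows from the rule that every tool must be rated before deployment.

```python
from enum import IntEnum


class Risk(IntEnum):
    L0 = 0  # read-only
    L1 = 1  # file creation
    L2 = 2  # file modification
    L3 = 3  # system commands, deletion
    L4 = 4  # disk format, database drop


# Hypothetical tool registry; real tool names will differ.
TOOL_RISK = {
    "read_file": Risk.L0,
    "create_file": Risk.L1,
    "edit_file": Risk.L2,
    "run_shell": Risk.L3,
    "drop_database": Risk.L4,
}

POLICY = {
    Risk.L0: "auto",
    Risk.L1: "auto_with_log",
    Risk.L2: "diff_preview_then_auto_after_10s",
    Risk.L3: "manual_approval",
    Risk.L4: "blocked",
}


def policy_for(tool_name: str) -> str:
    """Return the execution policy for a tool, failing closed on unknown tools."""
    risk = TOOL_RISK.get(tool_name)
    if risk is None:
        return "blocked"  # assumption: unrated tools are treated like L4
    return POLICY[risk]
```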

Path safety is enforced with a simple Java validator that rejects any path escaping the designated work directory:

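// Requires java.io.File and java.io.IOException.
public void validatePath(String path, String baseDir) throws IOException {
    // Canonicalize both sides so "..", symlinks and a relative baseDir are resolved.
    File base = new File(baseDir).getCanonicalFile();
    File target = new File(base, path).getCanonicalFile();
    // Compare path components, not raw strings: "/work-evil" must not pass as a child of "/work".
    if (!target.toPath().startsWith(base.toPath())) {
        throw new SecurityException("Path escapes the working directory: " + path);
    }
}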

The framework also adopts the Model Context Protocol (MCP) as a universal, USB‑like standard for tools, allowing agents to plug in community‑maintained utilities (database access, GitHub ops, email, web crawling, cloud APIs) without writing adapters.

3. Context Engineering – Cutting Token Costs by 52 %

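Analysis of early agents showed that more than 60 % of tokens per task were redundant (repeated system prompts, stale history, noisy tool logs). Context length grows roughly linearly with the number of steps, and because the whole context is resent on every model call, the cumulative tokens billed for a job grow roughly quadratically, so long‑running jobs quickly explode in cost.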

The solution is a hierarchical compression pipeline that scores each message and assigns it to one of four tiers:

L0 (100+ pts): keep unchanged.

L1 (70‑100 pts): light summarisation.

L2 (30‑70 pts): heavy summarisation to core points.

L3 (<30 pts): discard.

After enabling this pipeline, average token consumption fell to 48 % of the baseline, a 52 % reduction that roughly halved cost while preserving success rates.
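The tiering step can be sketched as follows. The article does not say how scores are computed or which band owns the boundary values, so `score_fn`, `light` and `heavy` are hypothetical callables, and the rule that each band includes its lower bound is an assumption.

```python
from typing import Any, Callable, List


def compression_tier(score: float) -> str:
    """Map a message score to a tier.

    Boundary handling is an assumption: each band includes its lower
    bound (100 -> L0, 70 -> L1, 30 -> L2).
    """
    if score >= 100:
        return "L0"
    if score >= 70:
        return "L1"
    if score >= 30:
        return "L2"
    return "L3"


def compress(
    messages: List[Any],
    score_fn: Callable[[Any], float],
    light: Callable[[Any], Any],
    heavy: Callable[[Any], Any],
) -> List[Any]:
    """Keep, lightly summarise, heavily summarise or drop each message."""
    out = []
    for msg in messages:
        tier = compression_tier(score_fn(msg))
        if tier == "L0":
            out.append(msg)
        elif tier == "L1":
            out.append(light(msg))
        elif tier == "L2":
            out.append(heavy(msg))
        # L3 messages are discarded
    return out
```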

Session isolation is also enforced: each user or tenant gets an independent session stored in a database, with LRU eviction for inactive sessions and automatic persistence for long‑running jobs.
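A minimal in‑memory sketch of per‑tenant sessions with LRU eviction. The `persist` callback is a hypothetical hook for the database write; reloading an evicted session from the database on a later miss is omitted, and the capacity is assumed to be at least 1.

```python
from collections import OrderedDict
from typing import Callable, Tuple


class SessionStore:
    """Session cache keyed by (tenant, user) with LRU eviction and write-back."""

    def __init__(self, capacity: int, persist: Callable[[Tuple[str, str], dict], None]):
        self.capacity = capacity      # assumed >= 1
        self.persist = persist        # hypothetical database hook
        self._sessions: "OrderedDict[Tuple[str, str], dict]" = OrderedDict()

    def get(self, tenant_id: str, user_id: str) -> dict:
        key = (tenant_id, user_id)    # sessions are never shared across tenants or users
        if key in self._sessions:
            self._sessions.move_to_end(key)   # mark as most recently used
        else:
            self._sessions[key] = {"messages": []}
            if len(self._sessions) > self.capacity:
                # Evict the least recently used session and hand it to the store.
                old_key, old_session = self._sessions.popitem(last=False)
                self.persist(old_key, old_session)
        return self._sessions[key]
```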

4. Memory System – Reducing Hallucinations to <5 %

Standard RAG suffers from chunk boundary loss, redundant context and hallucinations, yielding ~70 % accuracy. The author built a three‑layer memory architecture:

L1 Short‑term: in‑memory current session.

L2 Mid‑term: per‑user persisted DB of preferences and history.

L3 Long‑term: global vector store of vetted knowledge.

Beyond RAG, a proprietary knowledge compilation pipeline converts raw documents into structured QA units with confidence scores and source references, e.g.:

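{
  "id":"K-001",
  "question":"How many days of annual leave do employees get?",
  "answer":"Employees receive 5 days of annual leave after one full year of service, plus 1 additional day for each further year of service, up to a maximum of 15 days.",
  "alternative_questions":["annual leave days","annual leave policy","how many days of annual leave can I take each year"],
  "source":"Employee Handbook, Section 3.2",
  "confidence":0.98,
  "review_status":"APPROVED",
  "tags":["HR","attendance","leave"]
}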

This eliminates the need for the LLM to synthesize answers on the fly, dropping hallucination rates from 30 % to <5 % and raising end‑to‑end accuracy above 95 %.
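Serving answers from compiled units might look like the sketch below. It uses a unit shaped like the example above, rendered in English. Exact matching on normalised text is a stand‑in for whatever retrieval the real system uses, and the `min_confidence` threshold is an assumed parameter.

```python
from typing import List, Optional

UNITS = [{
    "id": "K-001",
    "question": "How many days of annual leave do employees get?",
    "answer": "5 days after one full year of service, plus 1 day per "
              "additional year, up to 15 days.",
    "alternative_questions": ["annual leave days", "annual leave policy"],
    "source": "Employee Handbook, Section 3.2",
    "confidence": 0.98,
    "review_status": "APPROVED",
}]


def _norm(text: str) -> str:
    """Lower-case, collapse whitespace and drop a trailing question mark."""
    return " ".join(text.lower().split()).rstrip("?")


def answer_from_units(
    query: str, units: List[dict] = UNITS, min_confidence: float = 0.9
) -> Optional[dict]:
    """Return a vetted answer with its source, or None so the caller can escalate."""
    q = _norm(query)
    for unit in units:
        if unit["review_status"] != "APPROVED" or unit["confidence"] < min_confidence:
            continue  # never serve unreviewed or low-confidence units
        known = [unit["question"], *unit["alternative_questions"]]
        if q in (_norm(k) for k in known):
            return {"id": unit["id"], "answer": unit["answer"], "source": unit["source"]}
    return None
```

Returning `None` instead of a generated guess keeps the "no on‑the‑fly synthesis" property: the caller can fall back to a human or to an explicit "not found" reply.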

5. Autonomous Decision Engine – OPEA Loop

Agents evolve from “tool users” to “colleagues” by managing their own goals. The framework defines four goal tiers (Vision → Goal → Task → Action) and lets an agent decompose a high‑level vision into concrete actions dynamically.

The execution follows an OPEA cycle:

Observe: assess current state, progress, problems.

Plan: decide next tool, strategy, or re‑plan.

Execute: run the chosen action.

Reflect: evaluate outcome, adjust plan, or request human help.
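The four phases above can be sketched as a small driver loop. All three callables are hypothetical, and the "ask a human after two consecutive failures" rule is an assumed, simplified form of the Reflect phase.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class OpeaAgent:
    observe: Callable[[dict], dict]   # state -> observation
    plan: Callable[[dict], str]       # observation -> next action name
    execute: Callable[[str], bool]    # action -> succeeded?
    max_failures: int = 2             # assumed threshold before asking a human
    log: List[str] = field(default_factory=list)

    def run(self, state: dict, max_cycles: int = 10) -> str:
        failures = 0
        for _ in range(max_cycles):
            observation = self.observe(state)        # Observe
            if observation.get("done"):
                return "done"
            action = self.plan(observation)          # Plan
            ok = self.execute(action)                # Execute
            self.log.append(f"{action}:{'ok' if ok else 'fail'}")
            failures = 0 if ok else failures + 1     # Reflect
            if failures >= self.max_failures:
                return "needs_human"
        return "gave_up"
```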

A code‑repair agent using this loop fixed hundreds of production bugs automatically.

6. Multi‑Agent Collaboration – Auction, Voting, Arbitration

Single agents hit a capability ceiling. The system introduces a team of specialised agents (code writer, tester, documenter, architect) that communicate via a shared protocol.

Task allocation uses an auction: the master broadcasts a task, each sub‑agent scores its own suitability, and the highest‑scoring agent wins.
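A minimal sketch of the auction step, assuming each sub‑agent exposes a scoring callable. The tie‑break rule (alphabetically first agent name) is an assumption; the article does not specify one.

```python
from typing import Callable, Dict


def run_auction(task: str, bidders: Dict[str, Callable[[str], float]]) -> str:
    """Broadcast the task, collect each agent's suitability score, pick the top bid.

    Ties go to the alphabetically first agent name (an assumed rule).
    """
    if not bidders:
        raise ValueError("no agents available to bid")
    bids = {name: score(task) for name, score in bidders.items()}
    return min(bids, key=lambda name: (-bids[name], name))
```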

When a collective decision is needed, agents vote and the majority wins; a tie triggers elimination of the lowest‑voted option and a re‑vote.
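The voting rule can be sketched as a loop that eliminates and re‑votes. Two simplifications are assumed here: the winner is the option with the most votes (plurality), and an unbreakable tie returns `None` so the caller can escalate.

```python
from collections import Counter
from typing import Callable, List, Optional


def vote(options: List[str], voters: List[Callable[[List[str]], str]]) -> Optional[str]:
    """Vote; on a tie for first place, drop the lowest-voted options and re-vote.

    Each voter is a hypothetical callable that picks one of the remaining options.
    Returns None when every remaining option is tied.
    """
    remaining = list(options)
    while len(remaining) > 1:
        tally = Counter(voter(remaining) for voter in voters)
        counts = [tally.get(opt, 0) for opt in remaining]
        top = max(counts)
        leaders = [opt for opt, c in zip(remaining, counts) if c == top]
        if len(leaders) == 1:
            return leaders[0]
        low = min(counts)
        if low == top:
            return None   # full tie: nothing to eliminate
        remaining = [opt for opt, c in zip(remaining, counts) if c > low]
    return remaining[0] if remaining else None
```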

If disagreement persists, a neutral arbitration agent steps in, reviews both arguments and issues a binding decision.

In a code‑refactor benchmark, a single agent took 7 h 20 min with three failures, while a five‑agent team completed the same job in 1 h 45 min with zero errors – a >4× speedup and higher quality.

7. Work‑Tree Isolation – Safe Sandboxes

To prevent agents from corrupting the host environment, each task runs in an isolated work‑tree with three layers of isolation:

File‑system: dedicated directory, path checks block "../" or absolute paths.

Process: separate process group per work‑tree.

Network (optional): Docker‑based network namespace.

Work‑trees are pooled, auto‑cleaned, snapshot‑able for rollback, and have resource quotas (CPU, memory, disk). This guarantees that even a malicious or buggy agent cannot affect the main system.
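The first two isolation layers can be sketched with the standard library alone. This is an illustration under assumptions: the function names are hypothetical, `Path.is_relative_to` needs Python 3.9 or later, and `start_new_session` only creates a new session and process group on POSIX systems. Resource quotas, snapshots and network namespaces are not covered.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def resolve_in_worktree(root: Path, relative: str) -> Path:
    """Resolve a path inside the work-tree, rejecting absolute paths and '..' escapes."""
    root = root.resolve()
    candidate = Path(relative)
    if candidate.is_absolute():
        raise PermissionError(f"absolute path not allowed: {relative}")
    resolved = (root / candidate).resolve()
    if not resolved.is_relative_to(root):   # Python 3.9+
        raise PermissionError(f"path escapes work-tree: {relative}")
    return resolved


def run_in_worktree(root: Path, argv: list, timeout_s: float = 30.0) -> str:
    """Run a command with the work-tree as cwd, in its own session (POSIX)."""
    done = subprocess.run(
        argv,
        cwd=root,
        start_new_session=True,   # separate session and process group
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,
    )
    return done.stdout
```

A separate process group makes it possible to signal every process a task spawned at once, which is what cleanup of an abandoned work‑tree needs.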

8. Observability – Logs, Metrics & Traces

Production agents need full observability. For every LLM call the framework records the user, task, model, input and output token counts, latency, cost and a success flag. Aggregations answer questions such as "today's total spend", "cost per scenario" and "top spenders", and trigger budget alerts (e.g., spend above 120 % of the daily budget).
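A minimal sketch of the per‑call record and two of the aggregations. The field names follow the list in the text; the `LlmCall` class and both function names are illustrative, not the framework's API.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class LlmCall:
    user: str
    task: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    cost: float
    success: bool


def spend_by(calls: Iterable[LlmCall], attr: str) -> Dict[str, float]:
    """Sum cost grouped by one field, e.g. 'user', 'task' or 'model'."""
    totals: Dict[str, float] = defaultdict(float)
    for call in calls:
        totals[getattr(call, attr)] += call.cost
    return dict(totals)


def budget_alert(calls: Iterable[LlmCall], daily_budget: float, ratio: float = 1.2) -> bool:
    """True when total spend exceeds ratio x budget (1.2 mirrors the >120 % example)."""
    return sum(c.cost for c in calls) > daily_budget * ratio
```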

Full‑stack tracing captures each step’s tool call, duration, token usage and input/output, enabling instant root‑cause analysis when a job fails.

A Grafana dashboard displays key health indicators: active agents, queue length, success rate, average latency, model‑wise metrics, work‑tree count, resource utilisation, error and alarm rates.

9. Real‑World Cases

Case 1 – Local Literature Review CLI: scans local PDFs/Word files, performs cross‑document QA, auto‑generates outlines and de‑duplicates papers. All processing stays on‑premise, preserving unpublished research.

Case 2 – Medical Record QA Agent: combines rule‑based checks, LLM semantic validation and a medical knowledge base to audit hundreds of records per day. Manual review drops from 15‑20 min per record to 30 s, with higher accuracy.

10. Seven Practical Recommendations for Teams

Focus on concrete business problems, not AGI hype.

Invest 80 % of effort in engineering details (stability, cost, safety, observability, memory) rather than prompt tweaking.

Make safety the top priority – permission checks, work‑tree isolation, manual approvals, anomaly detection.

Build observability before going live.

Reuse open‑source tools where possible, but own the core engine, memory and scheduler.

Keep humans in the loop for final decisions and edge cases.

Start now – current models already solve many real problems; early adopters gain the biggest advantage.

Agent Harness is positioned as the "operating system" for production AI agents, turning them from experimental toys into reliable, auditable, cost‑effective workers that can be safely deployed across industries.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
