Agent Principles, Architecture, and Engineering Practices for Stable AI Systems
The article breaks down the core loop of AI agents, distinguishes agents from static workflows, and presents engineering practices—such as harness testing, context management, skill loading, tool design, memory handling, multi‑agent coordination, evaluation reliability, and security—that are essential for building robust, cost‑effective agents.
Agent Core Loop
The fundamental operation of an agent can be expressed in fewer than 20 lines of code and consists of four steps: perceive the current situation, decide, execute an action, and obtain feedback. The loop repeats until the task is complete. The loop itself does not change as capabilities grow: extensions such as sub‑agents, context compression, and skill loading are added around it rather than inside it.
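The four steps can be sketched in a few lines of Python. This is a minimal illustration, not the article's code: `run_agent`, `fake_decide`, and the `add` tool are hypothetical names, and `fake_decide` stands in for a real model call.

```python
from typing import Callable

def run_agent(task: str, decide: Callable[[list[str]], dict],
              tools: dict, max_steps: int = 10) -> str:
    """Repeat perceive -> decide -> act -> feedback until the model finishes."""
    history = [f"task: {task}"]                  # perceive: everything observed so far
    for _ in range(max_steps):
        step = decide(history)                   # decide: the model picks the next action
        if step["action"] == "finish":
            return step["input"]
        result = tools[step["action"]](step["input"])     # execute the chosen tool
        history.append(f"{step['action']} -> {result}")   # feedback for the next turn
    raise RuntimeError("step budget exhausted")

# Stand-in for a model call: request one addition, then finish with its result.
def fake_decide(history: list[str]) -> dict:
    if len(history) == 1:
        return {"action": "add", "input": "2 3"}
    return {"action": "finish", "input": history[-1].split(" -> ")[1]}

tools = {"add": lambda s: str(sum(int(x) for x in s.split()))}
answer = run_agent("add 2 and 3", fake_decide, tools)
```

Extensions such as compression or skill loading would wrap `decide` or edit `history`; the loop body itself would not need to change.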
Workflow vs. Agent
If the execution path is hard‑coded, the system is a workflow; if each step is chosen dynamically by a large model, the system is an agent. Simple, predictable tasks are better served by workflows, while tasks that require flexibility benefit from agents.
Harness Engineering
Stability depends more on the surrounding testing, validation, and constraint infrastructure (the "harness") than on model size. In an OpenAI example, three engineers produced over a million lines of code across nearly 1,500 pull requests and achieved a ten‑fold speedup through disciplined harness practices. Key decisions include encoding knowledge in the codebase, turning specifications into machine‑executable linter and CI rules, and allowing flaky tests to be retried without blocking progress.
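One of these decisions, retrying flaky tests without blocking progress, might look like the sketch below. The helper name `run_with_retries` and the report fields are assumptions made for illustration; the source does not describe how the retries were implemented.

```python
from typing import Callable

def run_with_retries(check: Callable[[], bool], attempts: int = 3) -> dict:
    """Run a check up to `attempts` times and record whether it needed a retry."""
    for attempt in range(1, attempts + 1):
        if check():
            # A pass after the first attempt is flagged so the flake can be fixed later.
            return {"passed": True, "attempts": attempt, "flaky": attempt > 1}
    return {"passed": False, "attempts": attempts, "flaky": False}

# Hypothetical flaky check: fails on the first call, passes on the second.
calls = {"n": 0}

def sometimes_fails() -> bool:
    calls["n"] += 1
    return calls["n"] >= 2

report = run_with_retries(sometimes_fails)
```

Recording the `flaky` flag keeps the retry from hiding the problem: the pipeline moves on, but the unstable test stays visible.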
Context Management (Context Rot)
Long context windows dilute attention, a problem termed "Context Rot." The author proposes hierarchical context layers:
Resident layer: immutable rules such as identity definitions and absolute prohibitions.
On‑demand layer: domain knowledge and skill descriptions loaded only when needed.
Runtime layer: dynamic information like current time or user preferences.
Memory layer: cross‑session experience retrieved only when required.
Any deterministic logic that can be expressed as code, hooks, or tools should stay out of the model prompt.
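The four layers can be sketched as a small prompt builder. The class and field names here (`ContextLayers`, `build`, and so on) are hypothetical, and the ordering (resident text first, runtime values last) is an assumption chosen to keep the beginning of the prompt stable.

```python
from dataclasses import dataclass, field
from typing import Sequence

@dataclass
class ContextLayers:
    resident: list[str]                                       # always included
    on_demand: dict[str, str] = field(default_factory=dict)   # skill name -> text
    memory: dict[str, str] = field(default_factory=dict)      # topic -> remembered fact

    def build(self, runtime: dict[str, str],
              skills: Sequence[str] = (), topics: Sequence[str] = ()) -> str:
        parts = list(self.resident)                       # resident layer
        parts += [self.on_demand[s] for s in skills]      # only the requested skills
        parts += [self.memory[t] for t in topics]         # only the requested memories
        parts += [f"{k}: {v}" for k, v in sorted(runtime.items())]  # runtime layer
        return "\n".join(parts)

layers = ContextLayers(
    resident=["You are a support agent.", "Never reveal credentials."],
    on_demand={"refunds": "Refund policy: 30 days with receipt."},
    memory={"user_lang": "User prefers replies in English."},
)
prompt = layers.build({"time": "2024-01-01T09:00"}, skills=["refunds"])
```

Here the memory entry is left out because no topic was requested, which is the point of the on‑demand and memory layers: text enters the prompt only when a step needs it.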
Prompt Caching Cost Savings
Stable system prompts enable token‑level caching. When the input prefix exactly matches a previous request, the model can reuse the cached key‑value pairs, reducing downstream computation cost by up to 90% compared with frequently changing short prompts.
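Why prefix stability matters can be shown by comparing two ways of ordering the same prompt. This sketch counts shared leading characters as a rough proxy for cacheable tokens; real caches work on tokens and follow provider‑specific rules, and the function names are illustrative assumptions.

```python
import os

STABLE_SYSTEM = "You are a coding agent. Follow the repository rules.\n"

def prompt_stable_first(now: str, question: str) -> str:
    # Unchanging text first, volatile values afterwards.
    return f"{STABLE_SYSTEM}time: {now}\n{question}"

def prompt_volatile_first(now: str, question: str) -> str:
    # A changing timestamp at the top breaks the shared prefix almost immediately.
    return f"time: {now}\n{STABLE_SYSTEM}{question}"

def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    return len(os.path.commonprefix([a, b]))

good = shared_prefix_len(prompt_stable_first("09:00", "Fix bug A"),
                         prompt_stable_first("09:05", "Fix bug B"))
bad = shared_prefix_len(prompt_volatile_first("09:00", "Fix bug A"),
                        prompt_volatile_first("09:05", "Fix bug B"))
```

With the stable text first, the whole system prompt falls inside the shared prefix; with the timestamp first, only the first few characters match.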
Skill Loading as Routing Conditions
Skills are loaded on demand, and their descriptions act as routing conditions that tell the model when to use them. Experiments showed that a skill without negative examples achieved 53% accuracy; adding negative examples (i.e., when the skill should not be used) raised accuracy to 85%.
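A skill description that carries both positive and negative conditions could be represented as follows. The `Skill` class, its field names, and the `pdf-extract` example are hypothetical; the source reports the accuracy figures but not a concrete format.

```python
from dataclasses import dataclass, field

@dataclass
class Skill:
    name: str
    use_when: list[str]
    do_not_use_when: list[str] = field(default_factory=list)

    def routing_text(self) -> str:
        """Render the description the model reads when deciding whether to load the skill."""
        lines = [f"Skill: {self.name}", "Use when:"]
        lines += [f"- {c}" for c in self.use_when]
        if self.do_not_use_when:
            lines.append("Do not use when:")
            lines += [f"- {c}" for c in self.do_not_use_when]
        return "\n".join(lines)

pdf_skill = Skill(
    name="pdf-extract",
    use_when=["the user supplies a PDF and asks for its text or tables"],
    do_not_use_when=[
        "the file is an image or scanned photo",
        "the user only wants the file renamed",
    ],
)
description = pdf_skill.routing_text()
```

Making the negative cases a first‑class field means a skill without them is easy to spot in review.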
Tool Design Evolution
Tool quality outweighs quantity. The definitions of five MCP tools can consume about 55,000 tokens, occupying roughly 30% of the context window before any dialogue begins. Tool design evolved through three generations:
Fine‑grained API wrappers that required coordination of many tools for a single goal.
Goal‑oriented single‑purpose tools, each matching one agent objective.
Dynamic discovery where the agent searches for tool definitions on demand instead of loading all definitions upfront.
Most mis‑selections stem from inaccurate tool descriptions rather than model limitations.
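The third generation, dynamic discovery, can be sketched as a search over a registry, so that only matching definitions enter the context. The registry contents, the `search_tools` name, and the simple word‑overlap scoring are all assumptions for illustration; a real system might use embeddings or a provider's own search tool.

```python
TOOL_REGISTRY = {
    "create_invoice": "Create and send an invoice to a customer.",
    "refund_payment": "Refund a completed payment to the original method.",
    "search_orders": "Find orders by customer, date, or status.",
}

def search_tools(query: str, limit: int = 2) -> list[str]:
    """Rank tools by how many query words appear in their name or description."""
    words = set(query.lower().split())
    scored = []
    for name, desc in TOOL_REGISTRY.items():
        text = (name.replace("_", " ") + " " + desc).lower()
        tokens = set(text.replace(".", "").replace(",", "").split())
        score = len(words & tokens)
        if score:
            scored.append((score, name))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))   # best score first, then by name
    return [name for _, name in scored[:limit]]

matches = search_tools("refund payment")
```

Because ranking depends entirely on the names and descriptions, an inaccurate description produces a wrong match here just as it would with a model doing the selection.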
Error Message Design
Well‑designed tools return structured error information indicating both the problem and a remediation step. Poor tools return only a generic message such as "Error: update failed," leaving the agent without guidance.
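The difference is easiest to see in a tool's return value. The following sketch uses a hypothetical `update_record` tool; the error codes, field names, and the `search_records` tool mentioned in the hint are assumptions, not an established schema.

```python
def update_record(record_id: str, fields: dict, known_ids: set[str]) -> dict:
    """Hypothetical tool that reports what went wrong and what to try next."""
    if record_id not in known_ids:
        return {
            "ok": False,
            "error": {
                "code": "RECORD_NOT_FOUND",
                "message": f"No record with id '{record_id}'.",
                "remediation": "Call search_records to look up a valid id, then retry.",
            },
        }
    if not fields:
        return {
            "ok": False,
            "error": {
                "code": "EMPTY_UPDATE",
                "message": "No fields were supplied.",
                "remediation": "Pass at least one field to change.",
            },
        }
    return {"ok": True, "updated": sorted(fields)}

result = update_record("r-9", {"status": "closed"}, known_ids={"r-1", "r-2"})
```

An agent receiving this result knows both why the call failed and which step to take next, which a bare "Error: update failed" cannot convey.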
Memory System
Agents lack intrinsic memory; memory must be externalized. The author categorizes memory into four types:
Working memory: in‑context information needed for the current task.
Procedural memory: on‑demand operational procedures.
Situational memory: log files that record what happened.
Semantic memory: a dedicated MEMORY file containing facts the agent deems important.
ChatGPT’s implementation uses roughly 33 key facts plus a lightweight summary of the last 15 turns, avoiding heavyweight vector databases.
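A store of that shape (a bounded set of key facts plus short summaries of recent turns) needs nothing beyond the standard library. This is an illustrative sketch and not ChatGPT's actual implementation: the class name, the oldest‑first eviction, and the rendering format are assumptions, while the limits of 33 and 15 are taken from the figures above.

```python
from collections import deque

class SimpleMemory:
    def __init__(self, max_facts: int = 33, max_turns: int = 15):
        self.facts: dict[str, str] = {}
        self.max_facts = max_facts
        self.recent = deque(maxlen=max_turns)   # older turns fall off automatically

    def remember(self, key: str, value: str) -> None:
        # When full, evict the oldest fact (dicts preserve insertion order).
        if key not in self.facts and len(self.facts) >= self.max_facts:
            del self.facts[next(iter(self.facts))]
        self.facts[key] = value

    def add_turn(self, one_line_summary: str) -> None:
        self.recent.append(one_line_summary)

    def render(self) -> str:
        """Text block to place in the prompt."""
        facts = [f"- {k}: {v}" for k, v in self.facts.items()]
        return "\n".join(["Known facts:", *facts, "Recent turns:", *self.recent])

memory = SimpleMemory()
memory.remember("name", "Ada")
for i in range(20):
    memory.add_turn(f"turn {i}")
```

After 20 turns only the last 15 summaries remain, while the fact about the user persists.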
Long‑Running Tasks
For tasks that exceed a single session, the author proposes splitting responsibilities between two agents:
Initializer Agent: in the first round, decomposes the task into verifiable subtasks and writes a JSON checklist file.
Coding Agent: iteratively reads the checklist, implements a subtask, runs tests, updates the checklist, and persists progress to the file. Crashes can resume from the last saved state.
The principle is to externalize progress rather than keep it in the prompt.
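Externalized progress can be sketched with a JSON checklist on disk. The function names, the checklist schema (`task`, `done`), and the `run_subtask` callback are hypothetical; in a real system the callback would be the coding agent implementing the subtask and running its tests.

```python
import json
import os
import tempfile
from typing import Callable, Optional

def init_checklist(path: str, subtasks: list[str]) -> None:
    """Initializer step: write every subtask as not done."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump([{"task": t, "done": False} for t in subtasks], f, indent=2)

def work_one(path: str, run_subtask: Callable[[str], bool]) -> Optional[str]:
    """Worker step: attempt the first unfinished subtask and persist success."""
    with open(path, encoding="utf-8") as f:
        items = json.load(f)
    for item in items:
        if not item["done"]:
            if run_subtask(item["task"]):          # e.g. implement, then run tests
                item["done"] = True
                with open(path, "w", encoding="utf-8") as f:
                    json.dump(items, f, indent=2)
            return item["task"]
    return None                                    # nothing left to do

path = os.path.join(tempfile.mkdtemp(), "checklist.json")
init_checklist(path, ["parse input", "write output"])
first = work_one(path, lambda task: True)
second = work_one(path, lambda task: True)   # a restarted process would resume here
third = work_one(path, lambda task: True)
```

Because every call re-reads the file, a process that crashes between calls loses nothing beyond the subtask in flight.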
Multi‑Agent Coordination
Effective multi‑agent systems require structured protocols rather than natural‑language agreements. Two modes are distinguished:
Commander mode: synchronous human‑agent loop where each decision is reviewed.
Orchestrator mode: asynchronous delegation where the human sets goals, multiple agents run in parallel, and the human reviews final outputs.
Cross‑validation—having an independent agent verify another’s result—prevents error amplification.
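Cross‑validation reduces to a simple rule: no answer is accepted until a second, independent check approves it. In this sketch the worker and verifier are plain functions standing in for two agents; the names and the retry limit are assumptions.

```python
from typing import Callable

def cross_validated(task: str,
                    worker: Callable[[str], str],
                    verifier: Callable[[str, str], bool],
                    max_attempts: int = 2) -> str:
    """Return the worker's answer only after the verifier approves it."""
    for _ in range(max_attempts):
        answer = worker(task)
        if verifier(task, answer):
            return answer
    raise ValueError("no answer passed independent verification")

# Stand-ins for two agents: the worker's first answer is wrong, the second is right.
attempts = iter(["5", "4"])

def worker(task: str) -> str:
    return next(attempts)

def verifier(task: str, answer: str) -> bool:
    # Recompute the sum independently instead of trusting the worker.
    return int(answer) == sum(int(x) for x in task.split("+"))

result = cross_validated("2+2", worker, verifier)
```

The wrong first answer never leaves the function, so it cannot be passed on to a downstream agent and amplified.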
Evaluation System Pitfalls
When agent performance appears to degrade, the evaluation infrastructure is often at fault. Common issues include resource limits that kill processes, buggy scorers that misclassify correct answers, and mismatched test cases. An experiment showed that relaxing resource limits eliminated most errors while model scores remained unchanged. The recommended troubleshooting order is: check environment → check scorer → modify agent.
Security First
Agents with code execution, file access, or network capabilities must enforce:
Whitelist authorization (who can invoke the agent).
Workspace isolation (operations confined to a designated directory).
Audit logging (record every action).
Prompt injection attacks are mitigated by denying unnecessary tools, requiring explicit user confirmation for sensitive operations, and flagging untrusted inputs.
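The three controls listed above (authorization, workspace isolation, and audit logging) can be combined in a single guard in front of a file tool. This is a minimal sketch with hypothetical names (`ALLOWED_USERS`, `guarded_read`, `audit_log`); a production system would also need OS‑level sandboxing, which path checks alone do not provide.

```python
import tempfile
from pathlib import Path

ALLOWED_USERS = {"alice"}          # who may invoke the agent
audit_log: list[dict] = []         # every attempted action is recorded here

def resolve_in_workspace(workspace: Path, relative: str) -> Path:
    """Return the resolved path, refusing anything that escapes the workspace."""
    root = workspace.resolve()
    target = (root / relative).resolve()
    if target != root and root not in target.parents:
        raise PermissionError(f"path escapes workspace: {relative}")
    return target

def guarded_read(user: str, workspace: Path, relative: str) -> Path:
    user_allowed = user in ALLOWED_USERS
    # Log before enforcing, so denied attempts are recorded too.
    audit_log.append({"user": user, "action": "read",
                      "path": relative, "user_allowed": user_allowed})
    if not user_allowed:
        raise PermissionError(f"user not on allowlist: {user}")
    return resolve_in_workspace(workspace, relative)

ws = Path(tempfile.mkdtemp())
ok = guarded_read("alice", ws, "notes/todo.txt")
```

Resolving the path before the check is what defeats `../` traversal and absolute paths; comparing raw strings would not.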
Implementation Order (OpenClaw Example)
The open‑source project OpenClaw demonstrates a rollout sequence:
Start with a single‑channel end‑to‑end flow; avoid abstracting multiple channels in the first version.
Establish security boundaries before adding new features.
Integrate memory early; without it, conversations longer than ~20 turns break.
Prioritize skill‑based knowledge over adding new tools.
Build a test suite as soon as the first failure occurs.
Common Pitfalls
Eight recurring anti‑patterns are identified:
Overly long system prompts that hide critical rules.
Tool proliferation leading to frequent mis‑selection.
Unverifiable task completion claims.
Unclear boundaries between multiple agents causing state pollution.
Lack of memory integration resulting in decision decay after long dialogs.
Missing testing framework, making regressions invisible.
Premature scaling to multiple agents, where coordination overhead outweighs parallelism benefits.
Relying on human discipline instead of enforced constraints.
Conclusion
Stable agents rely less on ever‑larger models and more on disciplined engineering: decoupled messaging, externalized state, layered prompting, memory integration, and robust security. Engineering scaffolding often yields greater gains than merely upgrading the model.
[1] https://x.com/HiTw93/status/2034627967926825175
