Why Raw AI Models Fail and How Harness Turns Them Into Powerful Agents
The article explains the four fundamental shortcomings of raw large language models—no memory, no code execution, outdated knowledge, and no workspace—and shows how a six‑component Harness (file system, Bash + sandbox, AGENTS.md memory, web search + MCP, context engineering, and orchestration + hooks) systematically resolves each issue to make AI agents practical and reliable.
Why a bare LLM is insufficient for engineering tasks
Large language models (LLMs) excel at generating text, but when used as a "bare" engine they suffer from four fundamental limitations:
No cross‑session memory: after each request the model’s context is discarded, so any background information must be repeated.
No code execution: the model can emit source code but cannot compile, run, test or debug it.
Stale knowledge: the model’s training data has a cutoff date; it cannot answer questions about APIs, security patches or product releases that appeared after that date.
No workspace: there is no file system, project structure or dependency manager, making it impossible to perform multi‑step, engineering‑grade workflows.
These gaps prevent a bare model from reliably handling complex, multi‑stage software development or operations tasks.
Harness: the engineering layer that turns a model into an Agent
"Harness" is the collection of infrastructure and orchestration components that surround a model, providing memory, execution, up‑to‑date perception and workspace capabilities. The six core components are:
1. Persistent File System
A hierarchical file system acts as external memory that extends the limited token window of the model. It stores source files, intermediate artifacts, documentation and version history, enabling:
Long‑term storage of code and data that can be loaded on demand.
Shared "whiteboard" for multiple agents to read/write and collaborate.
Git‑backed version control, allowing rollback, branching and audit of every change.
A typical layout follows standard project conventions (e.g., src/, tests/, README.md) so the agent can navigate the directory structure with minimal cognitive load.
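As a sketch, the external-memory pattern might look like this in Python; the directory names and file contents here are illustrative, not a fixed convention:

```python
from pathlib import Path

# Hypothetical workspace root with a conventional project skeleton.
ws = Path("workspace")
for d in ("src", "tests", "notes"):
    (ws / d).mkdir(parents=True, exist_ok=True)

# Persist an intermediate artifact so it survives beyond the token window.
(ws / "notes" / "plan.md").write_text("1. parse config\n2. add unit tests\n")

# On a later turn (or in another agent), reload only what is needed.
plan = (ws / "notes" / "plan.md").read_text()
```

Because the plan lives on disk rather than in the prompt, a second agent or a later session can pick it up without the original conversation being replayed.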
2. Bash + Secure Sandbox
Embedding a Bash environment lets the agent run the code it generates, install dependencies, execute build commands, run test suites and debug failures. Execution always occurs inside an isolated sandbox that enforces:
Resource limits (CPU, memory, disk) to prevent runaway processes.
Network isolation (default deny, optional whitelist) to avoid unauthorized outbound traffic.
File‑system isolation so the agent can only access its own workspace.
Execution timeouts that automatically terminate long‑running commands.
Common sandbox implementations include Docker containers, gVisor/Firecracker micro‑VMs, or lightweight WebAssembly runtimes. The sandbox is a prerequisite for a reliable "write‑run‑verify" loop.
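A minimal sketch of the timeout part of this write‑run‑verify loop, assuming a plain subprocess call; real harnesses layer container or micro‑VM isolation, resource limits and network policy on top of this:

```python
import subprocess
import sys

def run_checked(cmd, timeout_s=10):
    """Run a command with a hard wall-clock timeout and capture its output."""
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # A runaway command is terminated rather than blocking the agent.
        return {"ok": False, "error": f"timed out after {timeout_s}s"}
    return {"ok": proc.returncode == 0,
            "stdout": proc.stdout, "stderr": proc.stderr}

# Example: run a trivial "test suite" under the timeout.
result = run_checked([sys.executable, "-c", "print('tests passed')"], timeout_s=5)
```

The agent inspects `result["ok"]` and the captured output to decide whether to move on or regenerate the code.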
3. AGENTS.md Memory Store
Instead of fine‑tuning the model, long‑term knowledge is kept in a markdown file (AGENTS.md) that is automatically injected into the model’s context when needed. This provides:
Versioned, editable knowledge about project conventions, architecture decisions, known pitfalls and team policies.
Transparent, human‑readable storage that can be reviewed, edited or audited via Git.
Zero‑cost knowledge extension: adding a new rule or best practice only requires editing the markdown file, not retraining the model.
The injection workflow works as follows:
During execution the agent writes new facts to AGENTS.md (e.g., "React 19 introduces the new use() hook").
The file is stored in the project repository.
On the next run, the Harness loader reads AGENTS.md and prepends its contents to the system prompt, giving the model immediate access to the updated knowledge.
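The loader step might be sketched as follows; the file location and prompt text are illustrative assumptions:

```python
from pathlib import Path

SYSTEM_PROMPT = "You are a frontend development assistant."

def build_system_prompt(repo_root="."):
    """Prepend the contents of AGENTS.md (if present) to the system prompt."""
    memory = Path(repo_root) / "AGENTS.md"
    if memory.exists():
        return memory.read_text() + "\n\n" + SYSTEM_PROMPT
    return SYSTEM_PROMPT

# A fact the agent recorded on a previous run.
Path("AGENTS.md").write_text("React 19 introduces the new use() hook.\n")
prompt = build_system_prompt()
```

Editing the markdown file is the entire update mechanism: the next call to `build_system_prompt` picks up the new rule with no retraining.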
4. Web Search + Model Context Protocol (MCP)
To overcome the static knowledge cutoff, the agent can perform live web searches. A robust Web Search component must:
Translate the model’s natural‑language intent into concise search queries.
Rank and filter results, preferring official documentation, RFCs or trusted technical blogs.
Extract the core content from HTML pages, stripping navigation, ads and unrelated sections.
Present the cleaned information back to the model in a token‑efficient format.
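The rank-and-filter step could look roughly like this; the trusted-domain list and scores are illustrative assumptions:

```python
# Prefer official documentation and standards bodies over arbitrary pages.
TRUSTED = {"developer.mozilla.org": 3, "docs.python.org": 3,
           "www.rfc-editor.org": 2}

def rank_results(results):
    """Sort search hits so trusted hosts come first.

    Each hit is a dict like {"host": ..., "snippet": ...}.
    """
    return sorted(results, key=lambda r: TRUSTED.get(r["host"], 0), reverse=True)

hits = [
    {"host": "random-blog.example", "snippet": "..."},
    {"host": "docs.python.org", "snippet": "..."},
]
best = rank_results(hits)[0]
```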
The Model Context Protocol (MCP) standardizes how the model interacts with external tools and data sources. By exposing APIs for code repositories, issue trackers, monitoring dashboards or internal knowledge bases, MCP turns the agent from a pure language model into a connector that can read logs, query databases and invoke services on demand.
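As a rough illustration of the tool-registry shape such a connector takes — this is not the actual MCP SDK; the names and signatures here are invented for the sketch:

```python
TOOLS = {}

def tool(name):
    """Register a function as a callable tool under a stable name."""
    def register(fn):
        TOOLS[name] = fn
        return fn
    return register

@tool("read_log")
def read_log(service: str) -> str:
    # A real server would query a log store; this stub returns canned data.
    return f"[{service}] 200 OK"

def call_tool(name, **kwargs):
    """Dispatch a model-issued tool call to the registered implementation."""
    return TOOLS[name](**kwargs)
```

The protocol’s value is the stable contract: the model only needs to know tool names and parameters, while the implementations behind them can change freely.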
5. Context Engineering
Long sessions quickly fill the model’s context window with irrelevant or contradictory information, a phenomenon known as Context Rot. Harness mitigates this by applying a set of engineering strategies:
Compression: periodically summarise earlier dialogue, tool outputs and intermediate results, replacing verbose logs with concise abstracts.
Tool‑output offloading: store large outputs (e.g., generated code, logs, search results) as files and keep only a short reference in the prompt.
Skill‑based loading: load only the knowledge (skills) required for the current phase (design, coding, testing) to keep the token budget focused.
Layered context: maintain three tiers—core (system prompt & immutable constraints), work (current task data) and history (compressed summaries). Each tier has its own eviction policy.
These techniques keep the signal‑to‑noise ratio high, reduce token waste and preserve model reasoning quality over multi‑step workflows.
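One of these strategies, tool-output offloading with a short in-prompt reference, can be sketched as follows; the size threshold and file name are illustrative:

```python
from pathlib import Path

# Outputs longer than this stay on disk; only a reference enters the prompt.
MAX_INLINE_CHARS = 200

def offload_if_large(output: str, path: str) -> str:
    """Return the output inline if small, else save it and return a pointer."""
    if len(output) <= MAX_INLINE_CHARS:
        return output
    Path(path).write_text(output)
    return f"[{len(output)} chars saved to {path}; load on demand]"

# A 10,000-character build log collapses to a one-line reference.
log = "E" * 10_000
ref = offload_if_large(log, "workspace_build.log")
```

The model sees only the pointer; if it later needs the details, it reads the file back through the file-system component.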
6. Orchestration + Hooks
Complex problems are decomposed into sub‑tasks handled by specialized sub‑agents. The orchestration layer decides:
Task scheduling and dependency ordering (parallel vs. sequential).
Model selection per sub‑task (small, fast model for formatting; large, accurate model for design).
Result aggregation and conflict resolution.
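Dependency ordering and per-sub-task model selection might be sketched like this; the model names and task fields are assumptions:

```python
def pick_model(task):
    """Route mechanical work to a cheap model, reasoning work to a large one."""
    return "small-fast" if task["kind"] in ("format", "rename") else "large-accurate"

def schedule(tasks):
    """Emit (task id, model) pairs in an order that respects dependencies."""
    done, order = set(), []
    while len(done) < len(tasks):
        ready = [t for t in tasks
                 if t["id"] not in done and set(t["deps"]) <= done]
        if not ready:
            raise ValueError("dependency cycle")
        for t in ready:
            order.append((t["id"], pick_model(t)))
            done.add(t["id"])
    return order

plan = schedule([
    {"id": "design", "kind": "design", "deps": []},
    {"id": "format", "kind": "format", "deps": ["design"]},
])
```

Tasks in the same `ready` batch have no mutual dependencies, so a real orchestrator could dispatch them to sub-agents in parallel.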
Hooks are deterministic checks inserted at key execution points to guarantee quality and safety:
Lint/format validation: run linters on generated code and request regeneration on failure.
Schema enforcement: verify JSON, YAML or API responses against predefined schemas.
Security policies: block dangerous commands (e.g., rm -rf /) or disallow unauthorized network calls.
Cost monitoring: track token usage and abort or throttle when budgets are exceeded.
By combining probabilistic generation with deterministic validation, the system achieves higher reliability without sacrificing the model’s creative ability.
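Two such hooks, a command blocklist and a schema check, might look like this; the blocklist entries and required keys are illustrative:

```python
import json

# Commands that must never reach the sandbox shell.
BLOCKED = ("rm -rf /", "mkfs", ":(){ :|:& };:")

def command_hook(cmd: str) -> str:
    """Reject dangerous commands before execution; pass safe ones through."""
    if any(bad in cmd for bad in BLOCKED):
        raise PermissionError(f"blocked by security policy: {cmd!r}")
    return cmd

def schema_hook(raw: str, required=("status", "files")):
    """Parse a model response as JSON and require a fixed set of keys."""
    data = json.loads(raw)  # raises on malformed JSON
    missing = [k for k in required if k not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data
```

On a hook failure the orchestrator asks the model to regenerate rather than letting a malformed or unsafe output propagate downstream.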
System Prompt – the neural backbone
The System Prompt is injected into every interaction and defines the agent’s role, immutable constraints and high‑level policies. It:
Specifies the agent’s domain (e.g., "frontend development assistant") and explicitly delegates out‑of‑scope requests to other agents.
Encodes safety rules, file‑naming conventions and coding standards that all downstream components must obey.
Provides the minimal essential knowledge that must be present in every turn, while larger, mutable knowledge lives in AGENTS.md.
In this way the System Prompt acts as the central nervous system, coordinating the file system, sandbox, memory store, web search, context manager and orchestrator.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.
