Why a Tiny Agent Loop Exposes the Real Engineering Hurdles of AI Agents

The article walks through building a minimal 20‑line agent loop, explains each step—from reading a task to invoking tools and feeding observations back—then shows how real systems like Claude Code, OpenClaw and Pi add layers of harness, memory, permission and validation to make the loop safe and reliable in production.

When asked "What does the minimal execution chain of an Agent look like?" the author decides to hand‑craft the simplest possible agent to expose the core loop and then discuss why production agents need far more surrounding infrastructure.

Minimal Agent Loop

The loop does exactly four things:

Read the user task.

Ask the model which step to take next.

Call the selected tool according to the model's request.

Feed the tool's result back into the context and repeat.

Implemented in about 20 lines of TypeScript, the core function looks like this:

// Minimal loop: `model`, `tools`, and `runTool` are defined elsewhere (see below).
async function runAgent(task: string) {
  const messages = [{ role: "user", content: task }];
  for (let step = 0; step < 8; step++) {              // hard cap on iterations
    const response = await model.create({ messages, tools });
    messages.push({ role: "assistant", content: response.content });
    if (!response.toolCall) { return response.text; } // no tool requested: final answer
    const observation = await runTool(response.toolCall);
    messages.push({ role: "tool", content: observation }); // feed the result back
  }
  return "Stopped: step limit reached.";
}

The loop is deliberately tiny: it does not handle permissions, retries, output trimming, or any other safety checks. It simply demonstrates that an agent can be made to run with a handful of lines.

Tools Are Not Just Functions

Instead of exposing raw JavaScript functions, each tool must be described with a name, a human‑readable description, and a JSON schema for its parameters. For example:

const tools = [{
  name: "read_file",
  description: "Read a text file inside the current workspace.",
  inputSchema: {
    type: "object",
    properties: { path: { type: "string" } },
    required: ["path"]
  }
}];

This structure is what modern Function‑Calling or Tools APIs expect, allowing the model to emit a structured call like {"name":"read_file","arguments":{"path":"README.md"}} instead of free‑form text.
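
On the loop side, runTool is then just a dispatcher: it looks up the named tool, validates that it exists, and feeds any failure back to the model as an observation instead of crashing the loop. A minimal sketch, with a hypothetical toolImpls registry that is not from the article:

import * as fs from "fs";

// Hypothetical registry mapping tool names to implementations.
const toolImpls: Record<string, (args: any) => Promise<string>> = {
  read_file: async ({ path }: { path: string }) => fs.promises.readFile(path, "utf8"),
};

// Dispatch a structured call like {"name":"read_file","arguments":{...}}.
async function runTool(call: { name: string; arguments: any }): Promise<string> {
  const impl = toolImpls[call.name];
  if (!impl) return `Error: unknown tool "${call.name}"`; // errors become observations
  try {
    return await impl(call.arguments);
  } catch (err) {
    return `Error: ${call.name} failed: ${String(err)}`;
  }
}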

Why the Loop Needs a Harness

Running the bare loop is easy, but it quickly runs into problems:

Infinite tool calls (the model never stops).

Tool failures that are silently ignored.

Context bloat from reading too many files.

Premature answers before verification.

Log output being mistaken for final results.

To prevent these issues, a "harness" layer adds runtime guards such as the following (a sketch of the first few appears after the list):

Maximum loop iterations.

Maximum tool‑call count.

Per‑tool timeout.

Token and cost budgeting.

Output trimming.

Error classification and recovery.

Confirmation for high‑risk actions.

Logging and replay.

Task‑completion criteria.
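
To make the first few of these concrete, here is one way to wrap runTool with a tool-call budget, a per-tool timeout, and output trimming. The limits and names (guardedRunTool, MAX_TOOL_CALLS) are illustrative choices, not the article's code:

let toolCallsUsed = 0;
const MAX_TOOL_CALLS = 20;       // illustrative budget
const TOOL_TIMEOUT_MS = 30_000;  // illustrative per-tool timeout
const MAX_OUTPUT_CHARS = 4_000;  // illustrative trim threshold

async function guardedRunTool(call: { name: string; arguments: any }): Promise<string> {
  if (++toolCallsUsed > MAX_TOOL_CALLS) return "Error: tool-call budget exhausted.";
  const timeout = new Promise<string>((resolve) =>
    setTimeout(() => resolve(`Error: ${call.name} timed out.`), TOOL_TIMEOUT_MS)
  );
  const result = await Promise.race([runTool(call), timeout]);
  return result.length > MAX_OUTPUT_CHARS
    ? result.slice(0, MAX_OUTPUT_CHARS) + "\n[output trimmed]"
    : result;
}

Because failures come back as plain strings, the model sees them on the next turn and can retry or change course, which is error classification and recovery in its simplest form.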

Pi’s implementation, for instance, provides beforeToolCall and afterToolCall hooks that let the system validate parameters, enforce whitelists, and post‑process results before the model sees them again.
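
The snippet below guesses at what such hooks might look like: the hook names come from the article, but the surrounding shape and the specific checks are assumptions rather than Pi's actual API:

// Hypothetical hook pair in the spirit of beforeToolCall / afterToolCall.
const hooks = {
  beforeToolCall(call: { name: string; arguments: any }) {
    const whitelist = ["list_files", "read_file", "run_command"];
    if (!whitelist.includes(call.name)) {
      throw new Error(`Tool ${call.name} is not whitelisted.`);
    }
  },
  afterToolCall(call: { name: string }, observation: string): string {
    // Post-process the result before the model sees it again,
    // e.g. scrub strings that look like API keys.
    return observation.replace(/sk-[A-Za-z0-9]{8,}/g, "[redacted]");
  },
};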

Minimal Agent Loop diagram

Memory Management

The naive loop stores every message, tool observation and assistant reply in a single messages array. For short tasks this works, but longer sessions suffer from:

Growing tool output.

Stale error paths lingering in context.

Rising token cost.

Model distraction by irrelevant history.

Loss of state after a restart.

The author proposes a three-layer memory model (the first layer is sketched in code after the list):

Current context: only the messages needed for the current step.

Persistent facts: project rules and user preferences, stored in Markdown, a database, or a profile file.

Procedural experience: reusable skills or playbooks that capture how to solve a class of tasks.
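
For the first layer, one simple policy is to keep the original task and the most recent turns while stubbing out older tool output. The trimContext helper below is a sketch under that assumption, not the author's implementation:

type Message = { role: string; content: string };

// Keep the original task and the last few turns; elide older tool output.
function trimContext(messages: Message[], keepRecent = 6): Message[] {
  if (messages.length <= keepRecent + 1) return messages;
  const head = messages[0]; // the original user task
  const middle = messages.slice(1, -keepRecent).map((m) =>
    m.role === "tool" ? { ...m, content: "[older tool output elided]" } : m
  );
  return [head, ...middle, ...messages.slice(-keepRecent)];
}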

OpenClaw and Clawdbot store their memory as files like memory/2023‑04‑01.md so that it can be audited and versioned.
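
A file-based memory layer can be as simple as appending dated notes. The remember helper below is an illustrative sketch, not OpenClaw's or Clawdbot's code:

import * as fs from "fs";

// Append a fact to today's memory file, e.g. memory/2023-04-01.md.
function remember(fact: string) {
  const file = `memory/${new Date().toISOString().slice(0, 10)}.md`;
  fs.mkdirSync("memory", { recursive: true });
  fs.appendFileSync(file, `- ${fact}\n`);
}

Because the notes are plain Markdown on disk, they can be diffed, reviewed, and versioned like any other project file.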

Permission Layers

The most dangerous tool is run_command because it gives the model shell access. The author suggests a four-tier permission model (sketched in code after the list):

Read-only: list_files, read_file – allowed without confirmation.

Safe execution: whitelisted test commands such as npm test, pytest – also auto-approved.

Write operations: write_file, apply_patch – require explicit user confirmation.

High-risk actions: delete, network access, credential use – denied by default.
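
One way to encode those tiers is a table mapping tool names to a required approval level, checked before every call. The tier table and the askUserForConfirmation prompt are illustrative assumptions:

type Permission = "auto" | "confirm" | "deny";

// Illustrative tiers; a real harness would also inspect arguments,
// e.g. which shell command run_command is about to execute.
const permissions: Record<string, Permission> = {
  list_files: "auto",
  read_file: "auto",
  run_command: "confirm", // unless the command matches a whitelist like "npm test"
  write_file: "confirm",
  apply_patch: "confirm",
  delete_file: "deny",
};

// Hypothetical prompt that asks the human operator for approval.
declare function askUserForConfirmation(tool: string): Promise<boolean>;

async function checkPermission(name: string): Promise<boolean> {
  const level = permissions[name] ?? "deny"; // unknown tools are denied by default
  if (level === "auto") return true;
  if (level === "deny") return false;
  return askUserForConfirmation(name);
}

Defaulting unknown tools to "deny" mirrors the deny-by-default stance for high-risk actions.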

Anthropic’s Claude Code reports a 17% false-negative rate on a real-world dataset of over-eager actions, showing that even sophisticated classifiers cannot fully replace human approval for dangerous commands.

Verification Over "Done"

Simply returning "Task completed" is useless. A production coding agent must prove its work by reporting the following (a structured version of the report is sketched after the list):

Which files were read.

Which tools were invoked.

What changes were made.

Which tests passed.

Any lint or type‑check results.

Remaining risks that need human review.
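
In code, this can be a structured report the harness requires before the agent may claim success; the field names mirror the checklist above but are otherwise illustrative:

// Illustrative completion report the harness demands instead of "Task completed".
interface CompletionReport {
  filesRead: string[];
  toolsInvoked: string[];
  changesMade: string[];       // e.g. patched file paths or diffs
  testsPassed: string[];
  lintAndTypeChecks: string[];
  remainingRisks: string[];    // items that still need human review
}

function isVerified(report: CompletionReport): boolean {
  // Deliberately strict: no green tests, no "done".
  return report.testsPassed.length > 0;
}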

These checks turn a confident‑sounding model into a trustworthy engineering tool.

Putting It All Together

When the author looks back at Claude Code, OpenClaw and Pi, the core loop is identical to the minimal version; the difference lies entirely in the surrounding harness: permission checks, memory handling, validation, logging, and hooks. The statement that captures the whole insight is:

Thirty minutes is enough to hand-craft an agent; the loop is the skeleton, and the harness is the flesh that lets it survive in the real world.

From runnable to usable – the missing runtime boundaries

Conclusion

The takeaway is that building a demo‑ready agent is trivial, but engineering a stable, secure, and auditable agent requires a systematic harness that enforces boundaries, manages memory, validates results, and controls permissions. As model capabilities improve, the competitive edge will increasingly come from how well the surrounding harness is designed.
