How to Build Self‑Evolving Multi‑Agent Systems: Lessons from an OpenClaw Deployment
This article details the design, operation, and continuous improvement of a six‑agent autonomous system, covering memory management, protocol design, self‑healing mechanisms, and practical engineering lessons for building reliable, self‑evolving AI workflows.
System Overview
The platform runs a fleet of autonomous agents that operate 24/7 on a single Mac workstation. Core agents are:
Zoe (Orchestrator) – designs technical solutions, runs health checks (10:00, 14:00, 22:00), inspects memory usage, and coordinates round‑table discussions.
AI Sentinel (ainews) – pulls >100 information sources (GitHub Trending, arXiv, RSS, etc.), ranks items, evaluates impact on existing systems and writes daily technical‑intelligence reports.
Trading Spider – quantitative analyst with 21 cron jobs, 20 CLI tools and 15 Skills (≈68 k lines). Uses a hybrid 65 % quantitative / 35 % LLM scoring model for A‑share, US‑stock and commodity markets.
Macro – provides a four‑layer macro‑to‑micro factor package (7 cron jobs) that feeds the trading pipeline.
Content Spider – consumes AI‑sentinel insights, macro analysis and trading views to generate drafts, scores them with a Ripple prediction engine and reflects on outcomes.
Butler – personal‑assistant integration with Apple Reminders, Calendar, Health, Notes and Shortcuts (7 cron jobs).
ACP Coding Experts – pool of LLM‑powered coding agents (Pi, Claude‑Code, Codex, OpenCode, Gemini, GPT‑5.3‑Codex) accessed via the sessions_spawn protocol (max 6 concurrent sessions, 120 min TTL).
All agents share a common .learnings/ directory where errors, lessons and feature requests are recorded instantly. A nightly reflection cron (23:00‑23:45) promotes entries that appear ≥3 times into MEMORY.md (kept < 3 k tokens). This creates a self‑evolving knowledge base without manual rule authoring.
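The promotion rule above (recurring entries graduate from `.learnings/` into a token-capped `MEMORY.md`) can be sketched as follows. The JSONL entry shape, the `*.jsonl` file layout, and the rough 4-characters-per-token estimate are illustrative assumptions, not the system's actual implementation:

```python
import json
from collections import Counter
from pathlib import Path

PROMOTE_THRESHOLD = 3       # promote lessons recorded at least 3 times
MEMORY_TOKEN_BUDGET = 3000  # keep MEMORY.md under ~3k tokens

def nightly_reflection(learnings_dir: Path, memory_file: Path) -> list[str]:
    """Count recurring lessons in .learnings/ and promote them to MEMORY.md."""
    counts = Counter()
    for entry_file in learnings_dir.glob("*.jsonl"):
        for line in entry_file.read_text().splitlines():
            if line.strip():
                counts[json.loads(line)["lesson"]] += 1

    promoted = [lesson for lesson, n in counts.items() if n >= PROMOTE_THRESHOLD]

    lines = memory_file.read_text().splitlines() if memory_file.exists() else []
    for lesson in promoted:
        if f"- {lesson}" not in lines:
            lines.append(f"- {lesson}")
    # Rough token estimate (~4 chars/token); drop oldest lines past the budget.
    while lines and sum(len(l) for l in lines) // 4 > MEMORY_TOKEN_BUDGET:
        lines.pop(0)
    memory_file.write_text("\n".join(lines) + "\n")
    return promoted
```

The key design point is that nothing enters long-term memory on a single occurrence; repetition is the promotion signal.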
Context Management ("Context is the Agent OS")
Unbounded context leads to entropy and crashes. The system uses a two‑layer approach:
Context Engineering – defines the information architecture:
SOUL.md – immutable identity, hard constraints and decision framework (manual edits only).
AGENTS.md – operational specifications and communication protocol.
Skills – loaded on demand via extraDirs to avoid injecting all 68 k lines of Trading Skills into the prompt.
shared-context/ – cross‑agent state files, read as needed.
Obsidian vault – cold storage (no prompt impact).
Harness – automated lifecycle management that runs on every session:
Compaction (memoryFlush) – when a session exceeds 40 000 tokens, a custom prompt extracts decisions, state changes, lessons and blockers into memory/YYYY‑MM‑DD.md.
contextPruning – removes context older than 6 h, keeping the last three assistant messages.
Session reset – daily at 05:00 or after 30 min of idle time.
Session maintenance – deletes files older than 7 days and caps total disk usage at 100 MB.
self‑improving‑agent Skill – on agent start, injects historical experience from .learnings/ into the session.
These mechanisms keep the active LLM context under control, prevent OOM, and ensure that only high‑value information persists.
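As one illustration, the contextPruning rule (6 h TTL, always keep the last three assistant messages) could look like this minimal sketch; the message shape with a `ts` unix timestamp is an assumption, not OpenClaw's actual internal representation:

```python
import time
from typing import Optional

TTL_SECONDS = 6 * 3600       # prune context older than 6 h
KEEP_LAST_ASSISTANTS = 3     # always retain the newest assistant turns

def prune_context(messages: list[dict], now: Optional[float] = None) -> list[dict]:
    """Drop expired messages while protecting the most recent assistant turns.

    Each message is a dict with 'role', 'content' and a 'ts' unix timestamp.
    """
    now = time.time() if now is None else now
    # Indices of the last N assistant messages, which survive regardless of age.
    assistant_idx = [i for i, m in enumerate(messages) if m["role"] == "assistant"]
    protected = set(assistant_idx[-KEEP_LAST_ASSISTANTS:])
    return [
        m for i, m in enumerate(messages)
        if i in protected or now - m["ts"] <= TTL_SECONDS
    ]
```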
Memory Hierarchy
The system mirrors human cognition with five layers:
L1 – Identity Layer: SOUL.md (eternal, edited only with explicit user confirmation).
L2 – Long‑Term Memory: MEMORY.md (< 3 k tokens), automatically maintained by the reflection loop.
L3 – Mid‑Term Memory: daily snapshots in memory/YYYY‑MM‑DD.md plus memory.db, created by the compaction mechanism.
L4 – Short‑Term Memory: instant entries in .learnings/ (errors, lessons, feature requests).
L5 – Persistent Storage: Skills, Obsidian vault, ontology schema and a vector store for semantic retrieval.
During a new session the bootstrap hook reads SOUL.md → AGENTS.md → MEMORY.md → .learnings/, then performs a vector search over memory/ and shared-context/ to restore the agent’s state.
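A minimal sketch of that bootstrap order, with the vector-store lookup abstracted behind a `search` callable (the function name and the exact assembly format are assumptions for illustration):

```python
from pathlib import Path

BOOT_ORDER = ["SOUL.md", "AGENTS.md", "MEMORY.md"]  # fixed read order

def bootstrap_context(agent_dir: Path, query: str, search) -> str:
    """Assemble startup context: identity files first, then instant
    learnings, then semantically relevant snippets from memory/ and
    shared-context/. `search(query, dirs)` stands in for the vector store.
    """
    parts = []
    for name in BOOT_ORDER:
        f = agent_dir / name
        if f.exists():
            parts.append(f"## {name}\n{f.read_text()}")
    learn = agent_dir / ".learnings"
    if learn.is_dir():
        for entry in sorted(learn.glob("*")):
            parts.append(entry.read_text())
    parts.extend(search(query, [agent_dir / "memory", agent_dir / "shared-context"]))
    return "\n\n".join(parts)
```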
Multi‑Agent Communication Protocol
Placing agents in a Discord channel without a strict protocol caused ACK storms. The solution is a three‑state protocol keyed by a unique ack_id:
request – the initiator sends a message with @agent, the desired action, a deadline and the ack_id.
confirmed – the responder acknowledges the request, optionally providing a version or partial result.
final – the responder delivers the conclusive output; all participants enter a silent state.
Thread‑level rules (V1 protocol) enforce a single ack_id per thread, prohibit replies after final, and define escalation timeouts (5 min → reminder, 10 min → Zoe arbitration).
Shared‑context files replace message‑driven polling; agents read structured JSON (e.g., macro factor packages, tech‑radar) directly.
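The three-state protocol with one ack_id per thread can be enforced with a tiny state machine; this is a sketch of the idea, not the system's actual guardrail code:

```python
import uuid
from dataclasses import dataclass, field

# Legal transitions: request -> confirmed -> final -> (silence).
VALID_NEXT = {"request": {"confirmed"}, "confirmed": {"final"}, "final": set()}

@dataclass
class Thread:
    """One conversation thread: exactly one ack_id, three states, then silence."""
    ack_id: str = field(default_factory=lambda: uuid.uuid4().hex[:8])
    state: str = "request"

    def advance(self, new_state: str, ack_id: str) -> None:
        if ack_id != self.ack_id:
            raise ValueError(f"wrong ack_id {ack_id}; thread uses {self.ack_id}")
        if new_state not in VALID_NEXT[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state
```

Because `final` has no legal successors, any reply after `final` raises instead of triggering another ACK round, which is exactly what stops the storm.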
Task Watcher – Asynchronous Callback Event Bus
Agents often promise a task but never complete it. The Task Watcher provides reliable end‑to‑end monitoring:
Task registration in tasks.jsonl.
Watcher polls task status every few minutes.
Adapter plugins for specific services (e.g., Xiaohongshu review, GitHub PR, ACP coding).
Policy engine controls notification frequency, escalation and retry limits (max 1 retry before dead‑letter).
Notifier sends Discord alerts; a dead‑letter queue stores permanently failed tasks.
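One watcher pass over tasks.jsonl might look like the sketch below, assuming a JSONL task record and a per-service adapter reduced to a `check_status` callable returning 'done', 'pending' or 'failed' (all illustrative assumptions):

```python
import json
from pathlib import Path

MAX_RETRIES = 1  # one retry, then the task goes to the dead-letter queue

def sweep_tasks(tasks_file: Path, dead_letter_file: Path, check_status) -> list[dict]:
    """One polling pass: drop finished tasks, retry failures once,
    and move permanently failed tasks to the dead-letter queue."""
    remaining = []
    for line in tasks_file.read_text().splitlines():
        task = json.loads(line)
        status = check_status(task)          # adapter plugin stand-in
        if status == "done":
            continue
        if status == "failed":
            task["retries"] = task.get("retries", 0) + 1
            if task["retries"] > MAX_RETRIES:
                with dead_letter_file.open("a") as f:
                    f.write(json.dumps(task) + "\n")
                continue
        remaining.append(task)
    tasks_file.write_text("".join(json.dumps(t) + "\n" for t in remaining))
    return remaining
```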
Communication Guardrail & Request Lifecycle
To avoid misuse of the message field and ambiguous success states, Zoe built a guardrail library (agent_comm_guardrail.py, 383 lines) and a request‑state model (agent_request_models.py, 289 lines). The lifecycle consists of eleven states; the main path is:
accepted → routed → queued → started → completed → delivered
All state transitions are persisted in requests.jsonl and events.jsonl for full auditability.
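A sketch of how such a state model can reject illegal transitions while appending an audit event; only the main-path states listed above are modeled here, and the record shapes are assumptions:

```python
import json
import time
from pathlib import Path

# Main-path transitions only; the full lifecycle has additional states.
NEXT = {
    "accepted": "routed",
    "routed": "queued",
    "queued": "started",
    "started": "completed",
    "completed": "delivered",
}

def transition(request: dict, new_state: str, events_file: Path) -> dict:
    """Advance a request by one state and persist an audit event."""
    if NEXT.get(request["state"]) != new_state:
        raise ValueError(f"illegal transition {request['state']} -> {new_state}")
    request["state"] = new_state
    with events_file.open("a") as f:
        f.write(json.dumps({"id": request["id"], "state": new_state,
                            "ts": time.time()}) + "\n")
    return request
```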
Security Boundaries
Execution permissions: exec.security: allowlist prevents agents from executing arbitrary commands.
Configuration protection: SOUL.md and openclaw.json are read‑only for agents; any change requires user confirmation.
Key isolation: API keys are stored only in environment variables, never written to session logs or Discord.
Code review: all ACP‑generated code passes a review pipeline before deployment.
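The allowlist idea behind exec.security reduces to a small check; the specific command set here is illustrative, not OpenClaw's actual configuration:

```python
import shlex

EXEC_ALLOWLIST = {"git", "python3", "rg", "ls"}  # illustrative command set

def is_allowed(command: str) -> bool:
    """Permit a shell command only if its executable is on the allowlist."""
    try:
        argv = shlex.split(command)
    except ValueError:  # unbalanced quotes etc. are rejected outright
        return False
    return bool(argv) and argv[0] in EXEC_ALLOWLIST
```

Denying by default and enumerating permitted executables is the inverse of a blocklist, which an agent can almost always route around.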
Five‑Layer Engineering View
Communication Layer – three‑state protocol, ack_id, shared‑context files.
Memory Layer – five‑level storage, automatic compaction, nightly reflection.
Self‑Healing Layer – heartbeat‑guardian, session maintenance scripts, automatic restart on failure.
Evolution Layer – .learnings/ → MEMORY.md → Skills → ClawHub pipeline that lets agents design and publish new capabilities.
Orchestration Layer – Zoe’s inspections, round‑table hosting, Task Watcher coordination and ACP delegation.
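The self‑healing layer's heartbeat check can be sketched with touch files, one per agent; the `.beat` file convention and the timeout value are assumptions chosen to match the 15‑minute health‑check cadence mentioned later:

```python
import time
from pathlib import Path
from typing import Optional

HEARTBEAT_TIMEOUT = 15 * 60  # seconds; matches a 15-minute check cadence

def stale_agents(heartbeat_dir: Path, now: Optional[float] = None) -> list[str]:
    """Return agents whose heartbeat file has not been touched recently.

    Each live agent is expected to touch <name>.beat periodically; a stale
    mtime means the guardian should restart that agent.
    """
    now = time.time() if now is None else now
    return [
        f.stem for f in sorted(heartbeat_dir.glob("*.beat"))
        if now - f.stat().st_mtime > HEARTBEAT_TIMEOUT
    ]
```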
Key Configuration (OpenClaw Harness)
```json
{
  "compaction": {
    "mode": "safeguard",
    "memoryFlush": {
      "enabled": true,
      "softThresholdTokens": 40000,
      "prompt": "Distill to memory/YYYY-MM-DD.md. Focus: decisions, state changes, lessons, blockers."
    }
  },
  "contextPruning": {
    "mode": "cache-ttl",
    "ttl": "6h",
    "keepLastAssistants": 3
  },
  "session": {
    "reset": {"mode": "daily", "atHour": 5, "idleMinutes": 30},
    "maintenance": {"pruneAfter": "7d", "maxDiskBytes": 104857600}
  },
  "acp": {"maxConcurrentSessions": 6, "ttlMinutes": 120}
}
```

These values are the result of multiple incident post‑mortems (e.g., a 235 K‑token session crash, data‑table truncation, rule degradation).
Data Sources
A‑Share: AKShare, TuShare Pro – real‑time quotes, fundamentals, order‑book, north‑bound flow.
US/HK Stocks: yfinance, Finnhub – quotes, news, fundamentals.
Technical Intelligence: Tavily, 13 RSS feeds, GitHub Trending, 54 platform hot‑lists, arXiv.
Browser Rendering: agent-browser (Playwright) for JS‑rendered pages (X/Twitter, Xueqiu, etc.).
Deployment Details
Hardware: a local Mac running 24/7.
Process supervision: launchctl with ThrottleInterval=10.
Self‑healing scripts: 2,086 lines covering heartbeat‑guardian, cron health checks and memory maintenance.
Backup: full backup daily at 03:00.
Monitoring: Zoe performs three daily inspections; a system‑wide crontab health check runs every 15 minutes.
Knowledge archiving: Obsidian vault synchronized via obsidian‑livesync.
LLM Model Stack & Fallback Chain
Main dialogue / reflection / round‑table – GPT‑5.4.
ACP coding tasks – K2.5 / GPT‑5.4 (cost‑based).
Cron daily jobs – Qwen‑3.5+ / K2.5.
Heartbeat / health checks – Ollama qwen3:8b (free).
Fallback order: gpt‑5.4 → k2.5 → qwen3.5‑plus → ollama/qwen3:8b.
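That fallback order reduces to a simple loop over the chain; `call_model` here is a stand-in for the actual provider client, which is not shown in the article:

```python
FALLBACK_CHAIN = ["gpt-5.4", "k2.5", "qwen3.5-plus", "ollama/qwen3:8b"]

def call_with_fallback(prompt: str, call_model) -> tuple[str, str]:
    """Try each model in order; return (model, reply) from the first success.

    `call_model(model, prompt)` stands in for the provider client and is
    expected to raise on failure (rate limit, timeout, outage).
    """
    last_error = None
    for model in FALLBACK_CHAIN:
        try:
            return model, call_model(model, prompt)
        except Exception as e:  # any provider failure triggers the next model
            last_error = e
    raise RuntimeError("all models in the fallback chain failed") from last_error
```

Ending the chain with a local Ollama model means the fleet degrades to free, on‑device inference rather than halting outright.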
Key Takeaways
≈90 % of effort is engineering (session bloat, message storms, config corruption), not model research.
AI “smartness” can be harmful in production; explicit hard rules outperform soft suggestions.
Continuous systems inevitably degrade – layered anti‑entropy mechanisms (compaction, pruning, heartbeat, regular inspections) are essential.
Collaboration hinges on a well‑defined protocol, not on prompting alone.
The greatest value of agents is their ability to participate in system design, not just execution.
Getting Started
Start with a single agent. Keep SOUL.md minimal – only core constraints; move non‑essential rules to on‑demand Skills.
Configure session management from day one:
idleMinutes=30, pruneAfter=7d, maxDiskBytes=100MB.
Enable .learnings/ and the nightly reflection cron immediately – without reflection the system is just a chatbot.
When adding a second agent, allocate a separate Discord bot, tune requireMention, textChunkLimit and delivery.mode, and enforce the three‑state protocol.
Scale to the full six‑agent lineup over a few weeks, allocating ~½ day per new agent for debugging communication pairs, resource contention and rule compatibility.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.