
Deep Dive into Hermes Agent: Memory Architecture That Makes AI Smarter

Hermes Agent is an open‑source, self‑hosted AI agent framework that combines a layered persistent memory system, automatic skill generation, a unified tool registry, and multi‑platform messaging gateways, enabling agents to retain knowledge across sessions and continuously improve their capabilities.

Architect's Guide

May 30, 2026


1. Introduction

From 2024 to 2025, many AI‑agent frameworks suffered from memory hallucination: each new session required the user to re‑explain the codebase, project context and preferences. On 25 Feb 2026 Nous Research released Hermes Agent. Within seven weeks it gathered more than 95 k GitHub stars (≈105 k currently), becoming the fastest‑growing open‑source agent framework of the year. Its success stems from a system‑level solution to the “learning = verification” problem.

2. Project Background and Technical Lineage

Hermes 3 (Aug 2024) – built on Meta Llama 3.1 (8B/70B/405B). Introduced robust tool‑call JSON schema and reliable single‑turn assistant responses.

Hermes 4 (2025) – added mixed‑precision inference and large‑scale synthetic data generation.

Hermes Agent (Feb 2026) – integrates persistent memory, automatic skill creation and multi‑platform support into a single open‑source package. The model is replaceable (supports 200+ models); the architecture is the core asset.

3. System Overview

AIAgent Loop – orchestrated by run_agent.py (~10.7 k lines). Handles provider selection, prompt construction, tool dispatch, fail‑over and state persistence.

Prompt System – modules prompt_builder.py, prompt_caching.py, context_compressor.py manage system prompts and caching.

Provider Resolution – maps (provider, model) tuples to (api_mode, api_key, base_url).

Tool System – central registry tools/registry.py with 47 tools grouped into 19 toolsets.

Infrastructure Subsystems

Session Persistence – SQLite with FTS5 full‑text search.

Messaging Gateway – long‑running process supporting 18 adapters (Telegram, Discord, Slack, etc.).

Plugin System – three‑way discovery (user, project, pip entry points).

Cron Scheduler – first‑class task scheduler.

Directory Structure (excerpt)

hermes-agent/
├── run_agent.py      # AIAgent core loop (~10,700 lines)
├── cli.py            # Interactive terminal UI (~10,000 lines)
├── model_tools.py
├── toolsets.py
├── hermes_state.py
├── hermes_constants.py
├── batch_runner.py
├── agent/
│   ├── prompt_builder.py
│   ├── context_engine.py
│   ├── context_compressor.py
│   ├── prompt_caching.py
│   ├── auxiliary_client.py
│   ├── memory_manager.py
│   └── memory_provider.py
├── tools/
│   ├── registry.py
│   ├── approval.py
│   ├── terminal_tool.py
│   ├── browser_tool.py
│   ├── mcp_tool.py
│   └── environments/   # backend implementations (local, docker, ssh, modal, daytona, singularity)
├── gateway/
│   ├── run.py        # GatewayRunner (~9,000 lines)
│   ├── session.py
│   └── platforms/    # 18 platform adapters
└── cron/

4. Execution Modes

4.1 CLI Session

User input → HermesCLI.process_input()
  → AIAgent.run_conversation()
    → prompt_builder.build_system_prompt()
    → runtime_provider.resolve_runtime_provider()
    → API call (chat_completions / codex_responses / anthropic_messages)
    → tool_calls? → model_tools.handle_function_call() → loop
    → final response → display → save to SessionDB

4.2 Gateway Message

Platform event → Adapter.on_message() → MessageEvent
  → GatewayRunner._handle_message()
    1. User authorization check
    2. Parse session key (per‑platform isolation)
    3. Slash‑command handling (/model, /skills, /compress, …)
    4. Create AIAgent with session history
    5. AIAgent.run_conversation() (progress shown asynchronously)
    6. Return response via adapter

4.3 Cron Task

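Scheduler fires → load due jobs from jobs.json
  → create a new AIAgent (no history)
  → inject attached skills as context
  → run the job prompt
  → deliver the response to the target platform
  → update job status and next_run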

5. Core Engine: AIAgent Loop

Assembles system prompts via prompt_builder.py.

Selects appropriate provider/API mode (chat_completions, codex_responses, anthropic_messages).

Supports interruptible API calls with _api_call_with_interrupt().

Executes tools either sequentially or concurrently via ThreadPoolExecutor. Interactive tools (e.g., clarify) are forced sequential.

Enforces strict message‑role alternation (no consecutive assistant or user messages).

Tracks iteration budget (default 90 turns) and delegation budget (default 50 turns).

Implements provider fallback on rate‑limit, 5xx or authentication failures.
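The fallback behavior in the last item can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in run_agent.py: ProviderError, is_retryable and call_with_fallback are hypothetical names, and the exact status codes that trigger fallback (401, 403, 429, 5xx) are an assumed reading of "rate‑limit, 5xx or authentication failures".

```python
from typing import Callable, Optional, Sequence


class ProviderError(Exception):
    """Hypothetical error type carrying an HTTP-like status code."""

    def __init__(self, status: int, message: str = ""):
        super().__init__(message or f"provider failed with status {status}")
        self.status = status


def is_retryable(status: int) -> bool:
    # Assumed mapping: auth failures (401/403), rate limit (429), any 5xx.
    return status in (401, 403, 429) or 500 <= status <= 599


def call_with_fallback(providers: Sequence[Callable[[str], str]], prompt: str) -> str:
    """Try each provider in order; move on only for retryable failures."""
    last_error: Optional[ProviderError] = None
    for provider in providers:
        try:
            return provider(prompt)
        except ProviderError as exc:
            if not is_retryable(exc.status):
                raise  # e.g. a 400 is a caller bug, not a provider outage
            last_error = exc
    raise RuntimeError("all providers failed") from last_error
```

Non‑retryable errors are re‑raised immediately so that a malformed request does not burn through every configured provider.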

5.1 Turn Lifecycle

run_conversation()
  1. Generate task_id if missing
  2. Append user message to history
  3. Build or reuse cached system prompt
  4. Pre‑compress if context >50%
  5. Build API messages (chat_completions, codex_responses, anthropic_messages)
  6. Inject temporary prompts (budget warnings, context pressure)
  7. Apply Anthropic prompt‑cache flag if needed
  8. Initiate interruptible API call
  9. Parse response:
       - tool_calls → execute → append result → repeat step 5
       - text response → persist session → optionally refresh memory → return

5.2 Message Format (OpenAI compatible)

{"role": "system", "content": "..."}
{"role": "user", "content": "..."}
{"role": "assistant", "content": "...", "tool_calls": [...]}
{"role": "tool", "tool_call_id": "...", "content": "..."}

Strict alternation rules: system → user → assistant → user …; during tool calls the sequence is assistant (with tool_calls) → tool → … → assistant. Two consecutive assistant or user messages are never allowed; only tool roles may appear consecutively.
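The alternation rules above can be checked mechanically. The following is a small sketch, not Hermes code: validate_roles is a hypothetical helper, and the rule that a system message may only appear first is an assumption drawn from the "system → user → assistant" ordering.

```python
from typing import Dict, List


def validate_roles(messages: List[Dict]) -> List[str]:
    """Return a list of problems; an empty list means the sequence looks valid."""
    problems: List[str] = []
    prev = None
    for i, msg in enumerate(messages):
        role = msg.get("role")
        if role == "system" and i != 0:
            problems.append(f"{i}: system message must come first")
        # user and assistant messages may never repeat back to back
        if role in ("user", "assistant") and prev is not None and prev.get("role") == role:
            problems.append(f"{i}: consecutive {role} messages")
        # a tool result must follow an assistant tool call or another tool result
        if role == "tool":
            prev_role = prev.get("role") if prev else None
            follows_call = prev_role == "assistant" and bool(prev.get("tool_calls"))
            if not (follows_call or prev_role == "tool"):
                problems.append(f"{i}: tool message without a preceding tool call")
        prev = msg
    return problems
```

Running a check like this before each API call is one way to catch history corruption early, since several provider APIs reject malformed role sequences.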

5.3 Interruptible API Calls

┌──────────────────────────────────────────────┐
│  Main thread              API thread         │
│  Waits on:                HTTP POST          │
│   - response ready          ──▶ provider     │
│   - interrupt event                          │
│   - timeout                                  │
└──────────────────────────────────────────────┘

When an interrupt (new user message or /stop) occurs, the API thread is aborted, the partial response is discarded, and the agent can process the new input or shut down cleanly.
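The waiting pattern can be sketched with the standard threading module. This is an illustration only, not the body of _api_call_with_interrupt(): call_with_interrupt and its parameters are hypothetical, and because Python cannot forcibly kill a thread, this sketch abandons the worker and discards its result rather than truly aborting the request.

```python
import threading
import time
from typing import Any, Callable, Optional, Tuple


def call_with_interrupt(
    api_call: Callable[[], Any],
    interrupt: threading.Event,
    timeout: float,
    poll: float = 0.01,
) -> Tuple[str, Optional[Any]]:
    """Run api_call in a worker thread; return early on interrupt or timeout."""
    outcome = {}
    done = threading.Event()

    def worker() -> None:
        try:
            outcome["value"] = api_call()
        except Exception as exc:  # surfaced to the caller below
            outcome["error"] = exc
        finally:
            done.set()

    threading.Thread(target=worker, daemon=True).start()
    deadline = time.monotonic() + timeout
    while not done.is_set():
        if interrupt.is_set():
            return ("interrupted", None)  # any late response is discarded
        if time.monotonic() >= deadline:
            return ("timeout", None)
        done.wait(poll)
    if "error" in outcome:
        raise outcome["error"]
    return ("ok", outcome["value"])
```

The main thread polls in short intervals so that a /stop or a new message is noticed quickly even while the HTTP request is still in flight.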

6. Tool Execution Strategy

Sequential vs. Concurrent

Single tool call – executed directly in the main thread.

Multiple tool calls – dispatched to a ThreadPoolExecutor. Interactive tools (e.g., clarify) are forced sequential.

Results are reordered to match the original tool‑call order.
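A minimal sketch of this dispatch strategy is shown below. The function name run_tool_calls, the call dictionary shape and the rule that one interactive tool makes the whole batch sequential are assumptions for illustration. The sketch relies on the documented behavior of Executor.map, which yields results in input order regardless of completion order.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Interactive tools that must not run in parallel (per the text above).
SEQUENTIAL_ONLY = {"clarify"}


def run_tool_calls(calls: List[Dict], handlers: Dict[str, Callable[..., str]]) -> List[str]:
    """Execute tool calls and return results in the original call order."""

    def run(call: Dict) -> str:
        return handlers[call["name"]](**call.get("args", {}))

    # Zero or one call, or any interactive tool: stay on the calling thread.
    if len(calls) <= 1 or any(c["name"] in SEQUENTIAL_ONLY for c in calls):
        return [run(c) for c in calls]

    with ThreadPoolExecutor(max_workers=len(calls)) as pool:
        # map() preserves input order, so no explicit reordering is needed.
        return list(pool.map(run, calls))
```

Preserving order matters because each tool result must be appended to the history next to the tool_call_id it answers.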

Tool Execution Flow

for each tool_call in response.tool_calls:
  1. Resolve handler from tools/registry.py
  2. Trigger pre_tool_call plugin hook
  3. Check for dangerous commands (tools/approval.py)
     - If dangerous: call approval_callback and wait for user confirmation
  4. Execute handler with args and task_id
  5. Trigger post_tool_call plugin hook
  6. Append {"role": "tool", "content": result} to history
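The six steps above can be sketched as one function. This is a simplified illustration, not the real handler in model_tools.py: execute_tool_call, the hook signatures and the denial payload are hypothetical, task_id is omitted, and the tool_call_id field is added to match the message format in section 5.2.

```python
import json
from typing import Callable, Dict, Iterable, List


def execute_tool_call(
    call: Dict,
    registry: Dict[str, Callable[..., str]],
    history: List[Dict],
    is_dangerous: Callable[[Dict], bool],
    approval_callback: Callable[[Dict], bool],
    pre_hooks: Iterable[Callable[[Dict], None]] = (),
    post_hooks: Iterable[Callable[[Dict, str], None]] = (),
) -> str:
    handler = registry[call["name"]]                 # 1. resolve handler
    for hook in pre_hooks:                           # 2. pre_tool_call hooks
        hook(call)
    if is_dangerous(call) and not approval_callback(call):
        # 3. approval gate: user declined, so the handler never runs
        result = json.dumps({"error": "command denied by user"})
    else:
        result = handler(**call.get("args", {}))     # 4. execute handler
    for hook in post_hooks:                          # 5. post_tool_call hooks
        hook(call, result)
    history.append(                                  # 6. record the result
        {"role": "tool", "tool_call_id": call["id"], "content": result}
    )
    return result
```

Returning the denial as an ordinary tool result (rather than raising) lets the model see that the command was refused and choose another approach.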

Agent‑Level Tool Interception

todo – read/write local task state.

memory – write persistent memory file (character limit).

session_search – query session history via SQLite.

delegate_task – create an independent sub‑agent.

7. Memory Architecture: Layered Memory System

The memory system is the most innovative component, providing a clear separation between “hot” prompt memory and “cold” archival storage.

7.1 Memory Layers

Layer 1 – Core Persistent Memory

MEMORY.md – environment facts, project conventions, discovered bugs, completed‑task logs, effective skills. Limit: 2,200 chars (~800 tokens), typically 8‑15 entries.

USER.md – user profile (name, role, timezone, communication preferences, dislikes, workflow habits, technical level). Limit: 1,375 chars (~500 tokens), typically 5‑10 entries.

Layer 2 – Session Search

All CLI and gateway sessions are stored in SQLite (~/.hermes/state.db) with FTS5 full‑text search.

Allows the agent to retrieve weeks‑old conversations even when not in active memory.
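To make the session‑search layer concrete, here is a minimal sketch of full‑text search over stored messages using Python's sqlite3 module. The table name, columns and helper functions are assumptions for illustration and do not reflect the real state.db schema. It requires an SQLite build with the FTS5 extension, which most CPython distributions include.

```python
import sqlite3
from typing import List, Tuple


def open_session_db(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    # Hypothetical schema: only `content` is indexed for full-text search.
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS messages "
        "USING fts5(session_id UNINDEXED, role UNINDEXED, content)"
    )
    return conn


def add_message(conn: sqlite3.Connection, session_id: str, role: str, content: str) -> None:
    conn.execute(
        "INSERT INTO messages (session_id, role, content) VALUES (?, ?, ?)",
        (session_id, role, content),
    )


def search(conn: sqlite3.Connection, query: str, limit: int = 5) -> List[Tuple[str, str]]:
    """Return (session_id, content) pairs, best match first."""
    rows = conn.execute(
        "SELECT session_id, content FROM messages "
        "WHERE messages MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()
```

The default FTS5 tokenizer is case‑insensitive, so a query for docker also matches Docker. Queries containing FTS5 operators or punctuation would need quoting, which this sketch does not handle.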

Layer 3 – External Memory Providers (v0.7.0+)

Eight built‑in providers (Honcho, Mem0, OpenViking, Hindsight, Holographic, RetainDB, ByteRover, Supermemory) can be selected at runtime.

They run in parallel with the built‑in memory, enriching knowledge graphs, semantic search and automatic fact extraction.

7.2 Memory Entry Operations

# Add a new memory entry
memory(action="add", target="memory", content="New fact")

# Replace an existing entry (substring match)
memory(action="replace", target="memory", old_text="dark mode", content="User prefers light theme in VS Code, dark terminal")

# Delete an obsolete entry
memory(action="remove", target="user", old_text="old preference")

7.3 Capacity Management

When memory usage exceeds 80 % of its quota, Hermes returns an error JSON indicating overflow and suggests merging entries before adding new ones. The UI shows the usage percentage at the top of the system prompt.
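The add, replace and remove operations together with the 80 % rule can be sketched as a small bounded store. This is an assumed reading of the behavior described above, not the code in memory_manager.py: BoundedMemory, the JSON field names and the way characters are counted (entries joined by one separator character) are all hypothetical.

```python
import json
from typing import List


class BoundedMemory:
    """Character-limited entry store with a soft 80 % threshold (sketch)."""

    def __init__(self, limit_chars: int = 2200, warn_ratio: float = 0.8):
        self.limit_chars = limit_chars
        self.warn_ratio = warn_ratio
        self.entries: List[str] = []

    @staticmethod
    def _used(entries: List[str]) -> int:
        # Assumption: entries are joined with a single separator character.
        return sum(len(e) for e in entries) + max(len(entries) - 1, 0)

    def usage(self) -> float:
        return self._used(self.entries) / self.limit_chars

    def add(self, content: str) -> str:
        over_soft = self.usage() > self.warn_ratio
        over_hard = self._used(self.entries + [content]) > self.limit_chars
        if over_soft or over_hard:
            return json.dumps({
                "ok": False,
                "error": "memory near capacity; merge entries before adding",
                "usage": round(self.usage(), 2),
            })
        self.entries.append(content)
        return json.dumps({"ok": True, "usage": round(self.usage(), 2)})

    def replace(self, old_text: str, content: str) -> bool:
        # Substring match, mirroring the old_text parameter shown earlier.
        for i, entry in enumerate(self.entries):
            if old_text in entry:
                self.entries[i] = content
                return True
        return False

    def remove(self, old_text: str) -> bool:
        for i, entry in enumerate(self.entries):
            if old_text in entry:
                del self.entries[i]
                return True
        return False
```

Returning a structured error instead of silently truncating pushes the agent to curate its own memory, which is the behavior the section describes.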

7.4 Memory Safety Scanning

Before accepting a new entry, the system automatically rejects exact duplicates and runs a security scan for prompt‑injection patterns, credential leakage, SSH backdoors and invisible Unicode characters.
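A simplified version of such a pre‑write scan might look like the sketch below. The pattern list is illustrative only and far narrower than a real scanner would need; scan_entry and SUSPICIOUS are hypothetical names. Invisible characters are detected through the Unicode "Cf" (format) category, which covers zero‑width spaces, joiners and bidirectional overrides.

```python
import re
import unicodedata
from typing import List

# Illustrative patterns only; these are assumptions, not Hermes's rule set.
SUSPICIOUS = [
    re.compile(r"ignore (all )?previous instructions", re.I),  # prompt injection
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),         # credential leak
    re.compile(r"authorized_keys"),                            # SSH backdoor
]


def scan_entry(content: str, existing: List[str]) -> List[str]:
    """Return the reasons an entry should be rejected (empty list = accept)."""
    reasons: List[str] = []
    if content in existing:
        reasons.append("exact duplicate")
    if any(unicodedata.category(ch) == "Cf" for ch in content):
        reasons.append("invisible Unicode character")
    for pattern in SUSPICIOUS:
        if pattern.search(content):
            reasons.append(f"matches {pattern.pattern!r}")
    return reasons
```

Collecting every reason rather than stopping at the first makes the rejection message more useful when an entry trips several checks at once.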

8. Skill System: Procedural Memory

Skills are Hermes Agent’s way of persisting complex problem‑solving workflows so they can be reused without re‑learning.

8.1 Progressive Disclosure

Level 0: skills_list() → [{name, description, category}, …]   (~3 k tokens)
Level 1: skill_view(name) → full content + metadata (on demand)
Level 2: skill_view(name, path) → specific file content (on demand)

Only the needed level is loaded, minimizing token consumption.

8.2 SKILL.md Format

---
name: my-skill
description: Brief description of the skill
version: 1.0.0
platforms: [macos, linux]   # optional
metadata:
  hermes:
    tags: [python, automation]
    category: devops
    fallback_for_toolsets: [web]   # show only when web tools are missing
    requires_toolsets: [terminal]   # show only when terminal tools are available
    config:
      - key: my.setting
        description: "Controls the behavior"
        default: "value"
        prompt: "Set the value"
---
# Skill Title

## When to Use
Explain trigger conditions.

## Process
1. First step
2. Second step

## Common Errors
- Known failure mode and fix

## Verification
How to confirm success.

8.3 Skill Management Commands (CLI)

hermes skills browse                # List all hub skills
hermes skills search kubernetes      # Search across all sources
hermes skills install openai/skills/k8s   # Install with security scan
hermes skills check                 # Check for upstream updates
hermes skills audit                 # Rescan all hub skills
hermes skills publish skills/my-skill --to github --repo owner/repo

8.4 Conditional Activation

Skills can declare fallback_for_toolsets (show only when those toolsets are missing) or requires_toolsets (show only when required toolsets are present). Example: the built‑in duckduckgo-search skill is hidden when a dedicated web API key is configured and automatically appears as a fallback otherwise.
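The visibility rule can be expressed as a small predicate. This sketch assumes that a fallback skill is hidden as soon as any of the toolsets it substitutes for is available, and that all required toolsets must be present; the source does not spell out the multi‑toolset case. skill_is_visible is a hypothetical helper, and meta stands for the metadata.hermes mapping from SKILL.md.

```python
from typing import Dict, Set


def skill_is_visible(meta: Dict, available_toolsets: Set[str]) -> bool:
    """Decide whether a skill should appear in the skill index."""
    requires = set(meta.get("requires_toolsets", []))
    fallback_for = set(meta.get("fallback_for_toolsets", []))

    # Every required toolset must be available.
    if not requires <= available_toolsets:
        return False
    # Assumption: one available target toolset is enough to hide a fallback.
    if fallback_for & available_toolsets:
        return False
    return True
```

For the duckduckgo-search example, the skill would carry fallback_for_toolsets: [web] and so disappear once the web toolset is configured.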

9. Prompt Assembly System

Prompt construction is performance‑critical. Hermes separates a cached system‑prompt prefix from temporary per‑turn additions.

9.1 Ten‑Layer Prompt Stack (cached prefix)

Agent identity – ~/.hermes/SOUL.md or default identity.

Tool awareness guidelines – how to use memory, session search, and tools.

Honcho static block – dialectic user modeling data.

Optional system messages – user‑provided overrides.

Frozen MEMORY snapshot.

Frozen USER snapshot.

Skill index – compressed list of installed skills.

Context files – project‑specific files (e.g., .hermes.md, AGENTS.md, .cursorrules).

Timestamp / session ID.

Platform‑specific prompt (CLI, Telegram, Discord, …).

9.2 SOUL.md Loading (code excerpt)

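from typing import Optional

# get_hermes_home, _scan_context_content and _truncate_content are helpers
# defined elsewhere in the Hermes codebase (not shown in this excerpt).
def load_soul_md() -> Optional[str]:
    soul_path = get_hermes_home() / "SOUL.md"
    if not soul_path.exists():
        return None                                       # caller falls back to the default identity
    content = soul_path.read_text(encoding="utf-8").strip()
    content = _scan_context_content(content, "SOUL.md")   # security scan
    content = _truncate_content(content, "SOUL.md")       # max 20k chars
    return content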

If SOUL.md exists, it replaces the built‑in default identity; otherwise the default is used.

9.3 Context File Priority

Priority 1 – .hermes.md / HERMES.md (walk up to Git root).

Priority 2 – AGENTS.md (current working directory only).

Priority 3 – CLAUDE.md (current working directory only).

Priority 4 – .cursorrules / .cursor/rules/*.mdc (current working directory only).

All files undergo security scanning, truncation (20 k‑char limit) and YAML front‑matter stripping before inclusion.

9.4 Temporary Per‑Turn Additions (never cached)

ephemeral_system_prompt – temporary system additions.

Prefill messages – initial user context.

Gateway‑derived session context – platform‑specific data.

Honcho recall injected into the current turn.

These are never persisted in the cached prefix, preserving cache stability.

10. Context Compression and Caching

10.1 Compression Triggers

CLI pre‑flight compression – triggered when context > 50 % of the model window (checked before each API call).

Gateway automatic compression – triggered when context > 85 % (run between turns, more aggressive).

10.2 Compression Process

Trigger compression
  ├─ Flush memory to disk (prevent loss)
  ├─ Use auxiliary LLM to summarize intermediate turns
  ├─ Preserve last N messages (default 20)
  ├─ Keep tool call/result pairs intact
  └─ Create new session lineage ID (child session)

The lineage (original → compressed child → further compressed grand‑child) enables full reconstruction of the conversation history.
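The message‑level part of this process (summarize the middle, keep the tail, never split a tool call from its results) can be sketched as follows. The memory flush and lineage steps are omitted. compress_history and the summarize callback are hypothetical, and representing the summary as a single user message is an assumption; a real implementation would also need to keep role alternation valid around it.

```python
from typing import Callable, Dict, List


def compress_history(
    messages: List[Dict],
    summarize: Callable[[List[Dict]], str],
    keep_last: int = 20,
) -> List[Dict]:
    """Replace older turns with a summary while keeping the recent tail intact."""
    if len(messages) <= keep_last + 1:
        return list(messages)

    head = messages[:1] if messages[0]["role"] == "system" else []
    split = len(messages) - keep_last
    # Move the split earlier so the tail never starts with an orphaned
    # tool result: the assistant tool call and its results stay together.
    while split > len(head) and messages[split]["role"] == "tool":
        split -= 1

    middle = messages[len(head):split]
    if not middle:
        return list(messages)

    summary = {
        "role": "user",  # assumed representation of the summary
        "content": "[Summary of earlier turns] " + summarize(middle),
    }
    return head + [summary] + messages[split:]
```

In Hermes the summarize step is performed by an auxiliary LLM; here it is injected as a plain callable so the splitting logic can be tested on its own.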

10.3 Pluggable Context Engine (v0.7.0+)

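from abc import ABC, abstractmethod
from typing import List

class ContextEngine(ABC):
    @abstractmethod
    def should_compress(self, conversation: List[dict], model_context_length: int) -> bool:
        """Return True when the conversation should be compressed."""
        ...

    @abstractmethod
    def compress(self, conversation: List[dict]) -> List[dict]:
        """Return a shorter conversation that replaces the original."""
        ...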

The default implementation (context_compressor.py) provides a lossy summarizer. Users can supply custom engines via the plugin system.
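As an illustration of what a custom engine could look like, the sketch below implements the interface with a deliberately simple policy. TailOnlyEngine is hypothetical: it estimates tokens at roughly four characters each (an assumption) and drops old turns instead of summarizing them, so it may orphan tool results and is not a substitute for the default engine. The ABC is repeated here only to keep the example self‑contained.

```python
from abc import ABC, abstractmethod
from typing import List


class ContextEngine(ABC):
    """Same two-method interface as the excerpt above."""

    @abstractmethod
    def should_compress(self, conversation: List[dict], model_context_length: int) -> bool:
        ...

    @abstractmethod
    def compress(self, conversation: List[dict]) -> List[dict]:
        ...


class TailOnlyEngine(ContextEngine):
    """Keeps the system prompt plus the last N messages (illustrative only)."""

    def __init__(self, threshold: float = 0.5, keep_last: int = 20, chars_per_token: int = 4):
        self.threshold = threshold
        self.keep_last = keep_last
        self.chars_per_token = chars_per_token  # rough heuristic, not a tokenizer

    def should_compress(self, conversation: List[dict], model_context_length: int) -> bool:
        chars = sum(len(str(m.get("content") or "")) for m in conversation)
        estimated_tokens = chars / self.chars_per_token
        return estimated_tokens > self.threshold * model_context_length

    def compress(self, conversation: List[dict]) -> List[dict]:
        system = [m for m in conversation[:1] if m.get("role") == "system"]
        rest = conversation[len(system):]
        return system + rest[-self.keep_last:]
```

The threshold argument corresponds to the trigger ratios in section 10.1, for example 0.5 for CLI pre‑flight compression.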

11. Plugin System and Extensibility

Discovery sources: ~/.hermes/plugins/ (user), .hermes/plugins/ (project), and pip entry points.

Memory provider plugins (e.g., Honcho, Mem0) implement MemoryProvider ABC in plugins/memory/. Only one provider is active at a time and can be selected via CLI or config.yaml.

Context engine plugins implement ContextEngine ABC in plugins/context_engine/.

Built‑in hooks (located in gateway/builtin_hooks/) are always registered. User hooks are discovered via gateway/hooks.py and can react to events such as on_session_start, on_message, on_tool_call, on_session_end.

12. Terminal Back‑Ends and Deployment Architecture

12.1 Six Terminal Back‑Ends

local – no isolation, runs on host (development, trusted users).

ssh – remote machine, dangerous‑command check enabled.

docker – container isolation, dangerous‑command check skipped (container is the security boundary).

singularity – HPC container, check skipped.

modal – cloud sandbox, check skipped.

daytona – cloud sandbox, check skipped.

12.2 Docker Hardening (excerpt)

# _SECURITY_ARGS (tools/environments/docker.py)
--cap-drop ALL                     # drop all Linux capabilities
--cap-add DAC_OVERRIDE            # allow write to bind mounts
--cap-add CHOWN                   # needed for package managers
--cap-add FOWNER                  # needed for package managers
--security-opt no-new-privileges  # prevent privilege escalation
--pids-limit 256                  # limit process count
--tmpfs /tmp:rw,nosuid,size=512m  # limited /tmp
--tmpfs /var/tmp:rw,noexec,nosuid,size=256m
--tmpfs /run:rw,noexec,nosuid,size=64m

Resource limits can be configured in ~/.hermes/config.yaml (CPU, memory, disk, persistence).

12.3 Deployment Options Comparison

Local (personal machine) – $0, best for development and testing.

VPS ($5 tier) – ~$5 / month, always‑online personal agent.

VPS with Docker – ~$5‑15 / month, isolated reproducible deployment.

AWS EC2 (t3.small) – ~$15 / month, production with AWS ecosystem.

Modal (serverless) – pay‑as‑you‑go, cost‑optimized, no idle cost.

Daytona (cloud dev) – pay‑as‑you‑go, persistent cloud workspaces.

13. Security Model: Defense‑in‑Depth

Layer 1: User authorization – allowlist, DM pairing.
Layer 2: Dangerous‑command approval – human‑in‑the‑loop.
Layer 3: Container isolation – Docker/Singularity/Modal.
Layer 4: MCP credential filtering – env‑var isolation.
Layer 5: Context file scanning – prompt‑injection detection.
Layer 6: Cross‑session isolation – separate SQLite stores.
Layer 7: Input sanitization – work‑dir validation for tools.

13.1 Dangerous‑Command Approval

Implemented in tools/approval.py, covering > 30 patterns (e.g., rm -r /, chmod 777, mkfs, DROP TABLE, systemctl stop, kill -9 -1, bash -c, curl … | sh). Mode is configurable in config.yaml:

approvals:
  mode: manual   # manual | smart | off (default manual)
  timeout: 60     # seconds to wait for user response

In container back‑ends (docker, singularity, modal, daytona) the approval check is intentionally skipped because the container provides a security boundary.
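Pattern‑based detection of this kind can be sketched with a handful of regular expressions. The five patterns below are illustrative assumptions covering some of the examples named above; they are not the rule set in tools/approval.py, and find_dangerous is a hypothetical helper.

```python
import re
from typing import Optional

# Illustrative subset; real coverage would need many more patterns.
DANGEROUS_PATTERNS = {
    "recursive delete from root": re.compile(r"\brm\s+(-[a-zA-Z]*r[a-zA-Z]*\s+)+/(\s|$)"),
    "world-writable chmod": re.compile(r"\bchmod\s+(-R\s+)?777\b"),
    "filesystem format": re.compile(r"\bmkfs(\.\w+)?\b"),
    "SQL drop table": re.compile(r"\bDROP\s+TABLE\b", re.I),
    "pipe download to shell": re.compile(r"\bcurl\b[^|]*\|\s*(sudo\s+)?(ba)?sh\b"),
}


def find_dangerous(command: str) -> Optional[str]:
    """Return a label for the first matching pattern, or None if none match."""
    for label, pattern in DANGEROUS_PATTERNS.items():
        if pattern.search(command):
            return label
    return None
```

A non‑None result would then be passed to the approval callback, which waits up to the configured timeout for the user's decision. Regex matching is easy to evade with quoting or indirection, which is one reason the article pairs it with container isolation.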

13.2 Memory Safety Scan

Before accepting a memory entry, the system scans for prompt‑injection patterns, credential leakage, SSH backdoors and invisible Unicode characters.

13.3 Skill Security Risks

Research (Medium, Apr 2026) identified a "Persistent Injection Vector": a malicious entry injected during a session becomes part of a generated SKILL.md file on disk. Because skill files lack signatures, the agent cannot distinguish user‑created skills from maliciously injected ones. Mitigations include enabling prompt‑injection scanning (default in v0.7.0), regular skill audits (hermes skills audit) and running the gateway with Docker isolation.

14. Reinforcement Learning and Trajectory Generation

Hermes integrates with Nous Research’s Atropos RL environment, enabling large‑scale trajectory generation for model training.

Batch runner (batch_runner.py) can generate thousands of tool‑call trajectories in ShareGPT or OpenAI message format.

Configurable workers, batch size, and toolset distributions (e.g., coding 40 %, research 30 %, general 30 %).

Automatic checkpointing and custom JSON export for research pipelines.
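A toolset distribution such as "coding 40 %, research 30 %, general 30 %" amounts to weighted sampling per trajectory. The sketch below shows one way to do that with the standard library. sample_toolsets is a hypothetical helper and does not reflect how batch_runner.py is actually configured.

```python
import random
from typing import Dict, List, Optional


def sample_toolsets(distribution: Dict[str, float], n: int, seed: Optional[int] = None) -> List[str]:
    """Draw n toolset names according to the given probabilities."""
    if abs(sum(distribution.values()) - 1.0) > 1e-9:
        raise ValueError("distribution weights must sum to 1.0")
    rng = random.Random(seed)  # a local RNG keeps batch runs reproducible
    names = list(distribution)
    weights = [distribution[name] for name in names]
    return rng.choices(names, weights=weights, k=n)
```

Passing a fixed seed makes a batch repeatable, which is useful when resuming from a checkpoint or comparing runs.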

15. Design Principles and Engineering Philosophy

Prompt Stability – system prompts never change mid‑conversation (except explicit /model switches).

Observability – every tool call is visible to the user (CLI spinners, gateway messages).

Interruptibility – API calls and tool execution can be cancelled by new input or /stop.

Platform‑agnostic Core – a single AIAgent class serves CLI, gateway, ACP, batch and API server.

Loose Coupling – optional subsystems (MCP, plugins, memory providers, RL) are registered via plug‑ins, not hard dependencies.

Profile Isolation – each profile (hermes -p <name>) has its own HERMES_HOME, config, memory, session DB and gateway PID, allowing concurrent independent agents.

16. Best‑Practice Guide

16.1 Getting the Best Results

Be specific – include file paths, error messages and expected behavior in the initial prompt to reduce clarification loops.

Provide context upfront – paste relevant logs or stack traces; the agent can then use its tools directly.

Use AGENTS.md for recurring instructions (e.g., "We use SQLAlchemy ORM, async/await, never commit .env"). This file is loaded automatically at session start.

Leverage tools – ask the agent to "find and fix the failing test" instead of manually opening files.

Use skills for complex workflows – browse with /skills, install with /install, then invoke the skill (e.g., /github-pr-workflow).

16.2 Memory Management

Memory vs. Skill – store facts (environment, preferences) in MEMORY/USER; store procedures and reusable workflows in skills.

Good memory entry example

# Concise environment description
User runs macOS 14, Homebrew, Docker Desktop, Zsh with oh‑my‑zsh.
Project ~/code/api uses Go 1.22, sqlc for DB queries, chi router.
Run tests with make test. CI via GitHub Actions (deploy.yml).

Capacity management – when usage exceeds 80 %, merge related entries before adding new ones (e.g., combine multiple "project uses X" lines into a single summary).

16.3 Performance and Cost Optimization

Preserve prompt cache – keep system prompt stable (same MEMORY, USER, context files) to benefit from provider‑side caching.

Compress before hitting limits – run /compress to summarize history and free tokens.

Delegate parallel work – use delegate_task to spawn sub‑agents for independent subtasks; each sub‑agent has its own iteration budget (default 50).

Batch operations with execute_code – write a short script to rename all .jpeg files to .jpg instead of issuing many individual terminal commands.

Switch models when appropriate – use a powerful model (Claude Sonnet, GPT‑4o) for complex reasoning, then switch to a cheaper model for formatting or simple tasks via /model.

16.4 Security Best Practices

Production deployment – run with Docker back‑end to enforce read‑only root and dropped capabilities.

Enable prompt‑injection scanning (default in v0.7.0).

Restrict dangerous commands – keep approvals.mode set to manual (the default); switch it to off only in trusted container environments.

Audit installed skills regularly with hermes skills audit.

Limit gateway users – set TELEGRAM_ALLOWED_USERS, DISCORD_ALLOWED_USERS, etc.; avoid GATEWAY_ALLOW_ALL_USERS=true in production.

Monitor logs – hermes logs --follow to watch for unexpected tool calls.

Keep Hermes updated – security patches are released frequently.

16.5 Advanced CLI Usage

Multi‑line input – Alt+Enter (or Ctrl+J) inserts a newline without sending.

Interrupt – Ctrl+C aborts the current response; type a new message to redirect.

Resume previous session – hermes -c or hermes -r "Session Title".

Rename session – /title my-research for easy retrieval.

Toggle verbosity – /verbose cycles through off → new → all → verbose.

17. Comparison with OpenClaw (major competitor)

Core Architecture – Hermes treats the learning loop as a first‑class citizen; OpenClaw is tool‑call centric.

Memory System – Hermes uses layered (hot/warm/cold) memory with active curation; OpenClaw relies mainly on vector retrieval.

Skill System – Hermes auto‑generates and self‑improves skills; OpenClaw uses static skill definitions.

Learning Mechanism – Hermes continuously extracts knowledge and creates skills; OpenClaw has no built‑in learning.

Terminal Back‑Ends – Hermes supports six back‑ends (including cloud sandboxes); OpenClaw offers limited options.

Signal Support – Hermes supports Signal; OpenClaw does not.

Security Model – Hermes implements a seven‑layer defense‑in‑depth model; OpenClaw has fewer layers.

Model Flexibility – Hermes supports 200+ models from 18+ providers; OpenClaw is more restricted.

RL Integration – Hermes integrates with Atropos RL; OpenClaw lacks RL support.

License – Hermes is MIT‑licensed; OpenClaw’s license varies.

18. Known Limitations and Ongoing Research

Persistent Injection Risk – skills are plain Markdown files without signatures; community is working on signed skill packages.

Memory Capacity – fixed character limits for MEMORY.md and USER.md; external providers are recommended for large knowledge bases.

Synchronous Core Loop – AIAgent runs synchronously; high‑throughput batch workloads rely on batch_runner.py with multiprocessing.

19. Conclusion

Hermes Agent delivers a comprehensive, open‑source platform that solves the long‑standing memory hallucination problem by treating the learning loop as a first‑class architectural concern. Its layered memory, automatic skill generation, pluggable components and seven‑layer security model set a new benchmark for autonomous AI assistants. Ongoing work on signed skill distribution, richer external memory back‑ends and deeper RL integration will further solidify its position as the leading framework for self‑improving AI agents.
