How to Inject Four‑Layer Memory into Every Dialogue with system_prompt.py

This article explains Hermes' three‑layer system prompt architecture—Stable, Context, and Volatile—detailing how ordered memory injection, snapshot freezing, SQLite caching, and ephemeral prompts dramatically improve LLM prefix‑cache hit rates while avoiding token waste and security risks.

James' Growth Diary
James' Growth Diary
James' Growth Diary
How to Inject Four‑Layer Memory into Every Dialogue with system_prompt.py

01 Three‑Layer Architecture: Stable / Context / Volatile

Most agents concatenate identity, tool guidance, user memory, and timestamps into a single string, causing the prefix cache to miss on every turn. Hermes solves this by separating the system prompt into three layers based on change frequency, placing the most stable content first.

02 Stable Layer: Identity, Tool Guidance, and Skills Index

The Stable layer forms the prompt skeleton. Its fill order is:

Load SOUL.md for identity (fallback to a default if missing).

Inject guidance only for tools present in agent.valid_tool_names (e.g., MEMORY_GUIDANCE, SKILLS_GUIDANCE, KANBAN_GUIDANCE).

Add model‑specific guidance: absolute paths for Gemini, concise hints for GPT.

Append the Skills index built with a two‑level cache (LRU in memory + disk snapshot) keyed by (skills_dir, platform, available_tools, disabled_list).

Finally, add environment and platform hints.

Tool‑conditional injection prevents unnecessary tokens and avoids hallucinating unavailable tools.

03 Context Layer: cwd‑Aware Project Context

The Context layer injects project‑specific files using build_context_files_prompt. It follows a priority chain, taking the first matching file:

.hermes.md / HERMES.md (search up to git root) → AGENTS.md / agents.md (cwd only) → CLAUDE.md / claude.md (cwd only) → .cursorrules + .cursor/rules/*.mdc

If multiple files exist, only the highest‑priority one is used to avoid contradictory instructions. Each file is scanned by _scan_context_content() to block prompt‑injection patterns such as "ignore previous instructions" or hidden Unicode direction characters.

04 Volatile Layer: Frozen Memory Snapshots

Volatile content is injected in a fixed order: MEMORY.md snapshot → USER.md snapshot → external memory providers → timestamp (last). The snapshot is taken once at session start via load_from_disk() and never updated during the conversation, ensuring the prefix cache remains stable across tool‑call rounds.

New memories written with memory(action="add") affect only the next session, not the current one.

05 One‑Time Build and Persistent Reuse

After constructing the system prompt, Hermes stores it in SQLite. On resume, the exact stored prompt is reloaded, giving a perfect cache hit. The three execution paths are:

New session: build prompt, store in SQLite.

Subsequent turns in the same session: reuse in‑memory cached prompt (O(1)).

Resume a previous session: load prompt from SQLite, cache hits again.

The only trigger for rebuilding is context compression, which calls invalidate_system_prompt() to clear the cache and refresh the frozen snapshot.

06 Ephemeral System Prompt

Ephemeral prompts are appended only at API‑call time and never stored in the cached prompt or SQLite. Examples include indicating a cron job context or warning about an irreversible operation. This separation prevents transient information from polluting the long‑term cache.

07 Injection Order of the Four Layers

Placing the timestamp at the very end preserves cache efficiency because LLM prefix caches only keep the unchanged prefix. If the timestamp appears earlier, any change invalidates all following tokens, wiping out up to 95% of the cache benefit.

Common Pitfalls

Putting the timestamp before the stable content destroys cache hits.

Expecting memory added during a session to be visible immediately; it only appears after the next session or a compression trigger.

Having conflicting context files (e.g., both AGENTS.md and .cursorrules)—use the priority chain to select one.

Forgetting to set TERMINAL_CWD in gateway mode, causing Hermes' own AGENTS.md to be injected and waste tokens.

Storing rapidly changing data (e.g., real‑time weather) in _cached_system_prompt —use the ephemeral prompt instead.

Conclusion

The three‑layer design, frozen snapshots, SQLite caching, and careful separation of ephemeral data together raise prefix‑cache hit rates from 0% to about 95% and cut token costs to roughly 20% of the baseline, while also providing security against prompt‑injection attacks.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Prompt EngineeringHermessystem promptmemory snapshotephemeral promptLLM caching
James' Growth Diary
Written by

James' Growth Diary

I am James, focusing on AI Agent learning and growth. I continuously update two series: “AI Agent Mastery Path,” which systematically outlines core theories and practices of agents, and “Claude Code Design Philosophy,” which deeply analyzes the design thinking behind top AI tools. Helping you build a solid foundation in the AI era.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.