Why Immutable Historical Context Is the Core of Hermes’ Prefix‑Caching Performance Design
The article explains how Hermes relies on prefix caching—keeping the system prompt unchanged throughout a session—to achieve 70‑80% cache‑hit rates, reduce token costs by up to 60%, and shape its architecture across agents, sub‑agents, and background review tasks.
1. Introduction
After covering ACP protocol adaptation, this piece dives into the lower‑level design philosophy of prefix caching and how it shapes the entire Hermes architecture.
2. Why Caching Matters in AI Dialogues
Anthropic’s pricing shows cached input costs half of standard input ( 1.50 / MTok vs 15 / MTok). More important is the efficiency of the context window. A typical multi‑turn Hermes conversation repeatedly sends the same ~8K‑token system prompt each round, so any change invalidates all previous KV‑cache entries, creating a “bucket effect”: longer dialogs benefit more from an unchanged prompt, while a single change wastes all prior work.
Round 1: system prompt (~8K) + user message (~2K) = 10K tokens
Round 2: system prompt (~8K) + history (~2K) + user (~2K) = 12K tokens
Round 10: system prompt (~8K) + history (~18K) + user (~2K) = 28K tokens3. The system_and_3 Strategy
Anthropic allows up to four cache_control breakpoints. Hermes implements the system_and_3 policy (see agent/prompt_caching.py, ~80 lines) that marks:
System prompt (unchanged, always cached)
Third‑last message (usually previous user input or tool result)
Second‑last message
Last message (current user input)
This yields a stable cache‑hit rate of 70‑80% in long conversations.
def apply_anthropic_cache_control(api_messages, cache_ttl="5m"):
messages = copy.deepcopy(api_messages)
if not messages:
return messages
marker = {"type": "ephemeral"}
if cache_ttl == "1h":
marker["ttl"] = "1h"
# Breakpoint 1: system prompt
if messages[0].get("role") == "system":
_apply_cache_marker(messages[0], marker)
remaining = 3
else:
remaining = 4
# Breakpoints 2‑4: last three non‑system messages
non_sys = [i for i in range(len(messages)) if messages[i].get("role") != "system"]
for idx in non_sys[-remaining:]:
_apply_cache_marker(messages[idx], marker)
return messages4. Immutable System Prompt Discipline
Hermes enforces a strict rule: the system prompt is built once during AIAgent.__init__ and never altered for the lifetime of the agent.
# agent/agent_init.py
agent._cached_system_prompt = agent._build_system_prompt(system_message)When a session is restored, the cached prompt is read from the database, avoiding recomputation even after a process restart.
# agent/conversation_loop.py
def _restore_or_build_system_prompt(agent, system_message, conversation_history):
session_row = agent._session_db.get_session(agent.session_id)
stored_prompt = session_row.get("system_prompt")
if stored_prompt:
agent._cached_system_prompt = stored_prompt
return
agent._cached_system_prompt = agent._build_system_prompt(system_message)
agent._session_db.update_system_prompt(agent.session_id, agent._cached_system_prompt)The prompt is divided into three immutable layers:
stable : identity definition, tool guides, skill hints – never change within a session.
context : context files, AGENTS.md – also never change.
volatile : memory snapshots, user profile, timestamps – fixed for the session but may differ across sessions.
Keeping the volatile layer unchanged during a session ensures perfect prefix‑cache hits, at the cost of delayed memory visibility (new memories appear only in the next round).
5. Sub‑Agent Cache Inheritance
When Hermes spawns a sub‑agent via delegate_tool, the sub‑agent inherits the parent’s cached system prompt:
# agent/agent_runtime_helpers.py
def _spawn_sub_agent(agent, task, ...):
sub_agent = AIAgent(
system_message=agent._cached_system_prompt, # inheritance!
...
)This sharing means all sub‑agents in the same conversation hit the same cache prefix, reducing parallel execution cost by roughly 60% (one prompt computation + four cache hits instead of five full computations).
6. Background Review Cache Inheritance
The asynchronous background‑review task also reuses the cached system prompt, creating a chain of cache hits across the main session, sub‑agents, and the review agent.
# agent/background_review.py
def run_background_review(agent, conversation_history):
review_agent = AIAgent(
system_message=agent._cached_system_prompt, # inheritance!
...
)
review_agent.run("请回顾本次对话,提炼可复用的技能和经验。")Result: a main session plus three sub‑agents plus one review task achieve the same cache‑hit pattern, cutting total computation cost by about 60%.
7. Cache TTL Strategy
Anthropic offers two TTL options:
5 minutes – suited for fast, single‑turn queries; standard cache price.
1 hour – for long‑running editing sessions; slightly higher cost but more stable.
Hermes defaults to a 5‑minute TTL but allows configuration:
# config.yaml
prompt_caching:
enabled: true
cache_ttl: "1h" # optional: "5m" (default) or "1h"The default favors high‑frequency interactions typical of IDE plugins or terminal chats, where round‑to‑round gaps are usually under five minutes.
8. Design Insights: Cache‑Friendliness as an Architectural Principle
System prompt built once → shared string across the session; rebuilding each round would miss the cache entirely.
Three‑layer prompt architecture → even the “volatile” part stays fixed, preventing cache misses when memory updates.
Session restoration reads the cached prompt from DB → cross‑process cache reuse; a restart would otherwise miss the cache.
Sub‑agents inherit the prompt → shared cache prefix; independent prompts would multiply computation.
Background review inherits the prompt → asynchronous tasks also hit the cache; otherwise they would add extra cost.
These decisions embody the rule: treat “unchanged” as the default assumption and handle “change” as an exception.
9. Limitations and Outlook
Prefix caching has trade‑offs:
Memory delay : new memories become visible only in the next round.
Cache expiration : a 5‑minute TTL causes a miss if the user pauses longer.
Provider lock‑in : current implementation targets Anthropic’s Prompt Caching; other providers require separate adapters.
Hermes mitigates these via layered compensation: passing new memory through user messages, ensuring the system prompt never changes on resume, and abstracting cache handling per provider. Future work aims to evolve the strategy into a provider‑agnostic cache abstraction layer.
10. Summary
Prefix caching is Hermes’ most critical performance design; immutable system prompts are the strictest architectural discipline.
The system_and_3 policy uses four breakpoints (system prompt + last three messages) to achieve 70‑80% cache‑hit rates in long dialogs.
Sub‑agents and background review inherit the system prompt, sharing the cache prefix and reducing cost by about 60%.
Cache‑friendliness is not a post‑hoc optimization but a foundational architectural principle.
The main cost is delayed memory visibility, which Hermes alleviates through user‑message propagation and asynchronous review updates.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
James' Growth Diary
I am James, focusing on AI Agent learning and growth. I continuously update two series: “AI Agent Mastery Path,” which systematically outlines core theories and practices of agents, and “Claude Code Design Philosophy,” which deeply analyzes the design thinking behind top AI tools. Helping you build a solid foundation in the AI era.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
