Artificial Intelligence 17 min read

Why Compression Isn’t Truncation: Hermes’s Structured Summaries Keep Prefix Cache Hits

The article explains how Hermes Agent avoids the pitfalls of naive sliding‑window truncation—such as orphaned tool calls and broken KV‑cache—by using a three‑segment protection scheme, cheap tool‑result pre‑pruning, and a structured, reference‑only summary that dramatically reduces tokens while preserving and even improving prefix cache hit rates.

James' Growth Diary

Jun 25, 2026

Why Compression Isn’t Truncation: Hermes’s Structured Summaries Keep Prefix Cache Hits

James revisits the prefix‑caching principle that keeping the system prompt and early dialogue unchanged allows LLM services (OpenAI, Anthropic, Google) to reuse KV cache and cut latency by 50‑80%.

01 | Background: Cost of Context Overflow

When an agent’s token count exceeds about 50% of the model window, the naive solution is a sliding‑window truncation that drops the oldest messages. This causes two fatal issues:

Orphaned tool_call / tool_result pairs that make the API return 400 errors.

Cache breakage because the prefix after the system prompt changes, invalidating the cached KV.

Hermes therefore adopts a third path: a three‑segment protection plus structured summarization.

02 | Counter‑Intuitive 1: Orphan Problem and Three‑Segment Design

Agent conversations consist of tool_call → tool_result pairs. Truncating can delete a call while leaving its result (or vice‑versa), leading to API errors. Hermes solves this with a three‑segment protection:

compress_start = self._protect_head_size(messages)  # protect head
compress_end   = self._find_tail_cut_by_tokens(messages, compress_start)  # token budget defines tail
turns_to_summarize = messages[compress_start:compress_end]  # middle segment → summary

Head : first N messages (system prompt + initial dialogue) – kept unchanged.

Middle : all intermediate messages – compressed into a structured summary.

Tail : most recent ~20K tokens – kept unchanged.

The summary is inserted between head and tail, ensuring the system prompt stays at position 0 and the early dialogue remains stable, thus preserving prefix cache hits.

03 | Tool‑Result Pre‑Pruning: Cheap First Filter

Before invoking an LLM for summarization, Hermes runs a lightweight pre‑pruning step that replaces raw tool outputs with concise one‑line abstracts. Example transformations: npm test (847 lines) → [terminal] ran `npm test` → exit 0, 847 lines output Read a 12,000‑character file → [read_file] read config.py from line 1 (12,000 chars) Web search result of 3,500 chars →

[web_search] query='context compression' (3,500 chars result)

This reduces the middle segment size by 60‑80% without any LLM cost.

04 | Structured Summary: Template and Reference‑Only Isolation

Hermes uses a strict template with four sections:

## Historical Task Snapshot
## Historical In-Progress State
## Historical Pending User Asks
## Historical Remaining Work

Each section has a clear semantic meaning, and a preamble marks the summary as REFERENCE ONLY so the LLM treats it as background, not as active instructions. The preamble also emphasizes that the latest user message after the summary is the only source of truth.

05 | Iterative Summarization: Preserving Prefix Stability

Instead of re‑summarizing the entire history on each compression round, Hermes updates the existing summary with only the newly generated middle segment. Pseudocode:

# after first compression
self._previous_summary = summary_body
# on subsequent compression
if self._previous_summary:
    prompt = f"""You are updating a context compaction summary.
Previous summary:
{self._previous_summary}

New turns to incorporate:
{content_to_summarize}
..."""

Benefits:

Lower LLM cost (only new tokens are processed).

Stable wording across rounds, keeping the cache‑friendly prefix.

Higher cache reuse because the same summary text reappears.

06 | Counter‑Intuitive 2: Summaries Save More Than Tokens

Benchmarks show that after compression the total token count drops from ~80K to ~25K, the prefix‑token proportion rises from 2.5% to 20%, request latency falls from ~3 s to ~1.2 s, and cached prefix tokens increase from 2K to 5K. The summary itself becomes a cacheable prefix, so cache hit tokens actually increase.

07 | Common Pitfalls

Leaving orphaned tool_call / tool_result pairs – Hermes cleans them with _sanitize_tool_pairs().

Too short a preamble – the current 200+ token preamble prevents the LLM from treating summary items as active commands.

Cross‑session summary leakage – Hermes clears _previous_summary on session end or reset.

Insufficient summary token budget – Hermes caps the budget at 12K tokens and scales it to 20% of the compressed segment.

08 | Conclusion

Compression is not the same as truncation. By protecting the head, summarizing the middle with a structured, reference‑only template, and iteratively updating the summary, Hermes achieves both token savings and a higher prefix‑cache hit rate, yielding lower latency and cheaper LLM usage.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

LLM context compression Hermes Agent prefix caching structured summarization tool call management

Written by

James' Growth Diary

I am James, focusing on AI Agent learning and growth. I continuously update two series: “AI Agent Mastery Path,” which systematically outlines core theories and practices of agents, and “Claude Code Design Philosophy,” which deeply analyzes the design thinking behind top AI tools. Helping you build a solid foundation in the AI era.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.