Agent Context Compaction: How pi and Claude Code Implement Compression Strategies
This article analyzes context compaction for long-running LLM agents, comparing the pi-mono and Claude Code approaches: when, where, and how to compress, trigger mechanisms, multi-step summarization pipelines, storage formats, reconstruction methods, and the trade-offs between cost, latency, and summary quality.
Context Compaction Overview
LLM context windows are finite; as a session progresses, conversation history, tool call results, and file contents accumulate and eventually exceed the limit. Compaction therefore trades information loss against the ability to continue the session.
1. pi‑mono Compaction Strategy
1.1 Overall Design
pi‑mono targets long‑running sessions that may exhaust the context window. Older messages are summarized while recent messages are kept intact. The compression is lossy, but the complete audit history is retained in a JSONL file that can be accessed via the /tree command, separating “active context for inference” from “full historical record”.
1.2 Trigger Mechanism – Dual‑Mode
Automatic compression runs when contextTokens > model.contextWindow - settings.compaction.reserveTokens. The compaction_start event records the reason as "manual" (user-initiated), "threshold" (proactive, near the limit), or "overflow" (the API returned an overflow error). When the reason is "overflow" and compression succeeds, willRetry is set to true, so the agent automatically retries the request that overflowed.
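In code, the automatic trigger reduces to a simple threshold check. A minimal sketch (not pi-mono's actual source; the function name is mine):
```typescript
// Sketch of pi-mono's automatic compaction trigger: fire once the live
// context exceeds the model window minus the reserved response space.
function shouldAutoCompact(contextTokens: number, contextWindow: number, reserveTokens = 16_384): boolean {
  return contextTokens > contextWindow - reserveTokens
}
```
The relevant compaction settings: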
```typescript
interface Settings {
  compaction?: {
    enabled?: boolean         // default true, toggled via /autocompact
    reserveTokens?: number    // default 16384, space reserved for the LLM response
    keepRecentTokens?: number // default 20000, token budget for recent messages
  }
}
```
1.3 Four‑Step Compression Algorithm
Step 1 – Find Cut Point: Scan backward from the newest message, accumulating token estimates until reaching keepRecentTokens (default 20 K). The cut usually occurs at a turn boundary (a user message plus all assistant responses and tool calls up to the next user message). If a single turn exceeds the budget, the cut point falls inside that turn’s assistant messages.
Step 2 – Extract Messages: Collect all messages from the previous compression’s retained boundary (or from session start) up to the cut point.
Step 3 – Generate Summary: Call an LLM with a structured prompt to produce a summary; if a previous summary exists, it is passed as iterative context.
Step 4 – Append Entry & Reload: Store a CompactionEntry containing the summary and firstKeptEntryId. When the session reloads, the agent reconstructs context by concatenating the summary with messages starting from firstKeptEntryId.
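A minimal sketch of the Step 1 scan (illustrative types and token estimates; not pi-mono's actual code):
```typescript
// Illustrative message shape; pi-mono's real entries are richer.
interface Msg {
  role: "user" | "assistant" | "toolResult"
  tokens: number // rough token estimate for this message
}

// Scan backward from the newest message, accumulating token estimates
// until keepRecentTokens is exhausted. Everything before the returned
// index is summarized; everything from it onward is kept verbatim.
// (The real algorithm additionally snaps the cut to a turn boundary,
// unless a single turn alone exceeds the budget.)
function findCutPoint(messages: Msg[], keepRecentTokens = 20_000): number {
  let remaining = keepRecentTokens
  for (let i = messages.length - 1; i >= 0; i--) {
    remaining -= messages[i].tokens
    if (remaining < 0) return i + 1 // messages[i] no longer fits in the budget
  }
  return 0 // everything fits; nothing to summarize
}
```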
1.4 Iterative Compression & Boundary Rules
During repeated compressions, the span to be summarized starts from the prior compression’s retained boundary (firstKeptEntryId) rather than from the compression entry itself, ensuring continuity across multiple compressions. Before writing a new CompactionEntry, pi‑mono recomputes tokensBefore from the rebuilt session so token counts remain accurate. On a second compression, the keepLastMessages count covers only the messages produced since the previous compression; if the count exceeds the available messages, all of them are kept. Compression never crosses a previous compression boundary.
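Pulling the fields mentioned so far together, a CompactionEntry can be pictured roughly like this (field names as described in this article; the actual shape may differ):
```typescript
// Rough picture of a compaction entry as described above (illustrative).
interface CompactionEntry {
  summary: string           // LLM-generated hand-off summary of the compacted span
  firstKeptEntryId: string  // boundary: first message kept verbatim after compaction
  tokensBefore: number      // recomputed token count of the rebuilt session
  details?: Record<string, unknown> // extension data, e.g. readFiles/modifiedFiles (see 1.7)
}
```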
1.5 Summary Prompt & Tool Result Truncation
The default prompt “CONTEXT CHECKPOINT COMPACTION” asks the LLM to produce a concise, structured hand‑off summary containing current progress, key decisions, important context and constraints, absolute file paths, clear next steps, and any data needed for the next LLM to continue seamlessly.
Serialized tool results are truncated to 2000 characters; excess characters are replaced by a placeholder indicating the number removed. This keeps the summary request within a reasonable token budget because tool results (especially read and bash) are often the largest token contributors.
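A sketch of that truncation (the 2000-character limit is the figure stated above; the placeholder text is illustrative):
```typescript
const TOOL_RESULT_LIMIT = 2_000 // characters kept per serialized tool result

// Truncate an over-long serialized tool result before it is included
// in the summarization request, noting how many characters were dropped.
function truncateToolResult(serialized: string): string {
  if (serialized.length <= TOOL_RESULT_LIMIT) return serialized
  const removed = serialized.length - TOOL_RESULT_LIMIT
  return serialized.slice(0, TOOL_RESULT_LIMIT) + `\n[... ${removed} characters truncated]`
}
```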
Summarization is performed by invoking pi‑ai directly (no tool calls, inference disabled), with maxTokens ≈ 13107 when reserveTokens = 16384 (roughly 80 % of the reserve).
1.6 Branch Summarization
When navigating to a different branch via /tree, pi‑mono generates a summary of the work left behind and injects that summary into the new branch’s context. The algorithm finds the deepest common ancestor node of the two branches, summarizes the messages beyond that ancestor on the branch being left, and inserts the summary into the branch being entered.
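A sketch of the ancestor search (illustrative node type; not pi-mono's actual tree code):
```typescript
// Illustrative session-tree node: each entry points at its parent.
interface TreeNode {
  id: string
  parentId?: string
}

// Find the deepest common ancestor of two branch tips by walking each
// tip's ancestor chain; messages beyond this node on the branch being
// left are what get summarized into the branch being entered.
function deepestCommonAncestor(nodes: Map<string, TreeNode>, leftTip: string, rightTip: string): string | undefined {
  const ancestors = new Set<string>()
  for (let id: string | undefined = leftTip; id; id = nodes.get(id)?.parentId) {
    ancestors.add(id)
  }
  for (let id: string | undefined = rightTip; id; id = nodes.get(id)?.parentId) {
    if (ancestors.has(id)) return id // first shared node found is the deepest
  }
  return undefined
}
```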
1.7 Extensibility – Hook System
Extensions can intercept compression via the SessionBeforeCompactEvent, cancel it, or provide a custom summary. Extensions may store arbitrary JSON‑serializable data in the details field of a CompactionEntry. By default, compression tracks file operations (readFiles, modifiedFiles), but custom implementations can store any structure. After compression ends, the auto_compaction_end event is emitted through AgentSession._emit(), propagating to the embedded runner and incrementing the compaction count.
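Purely as a hypothetical illustration (the event shape and handler below are invented for this sketch; pi-mono's real SessionBeforeCompactEvent and registration API will differ):
```typescript
// Hypothetical event shape, invented for this sketch.
interface SessionBeforeCompactEvent {
  messagesToCompact: { role: string; content: string }[]
  cancelled: boolean
  customSummary?: string
  details?: Record<string, unknown>
}

// An extension's handler can cancel compaction, replace the summary,
// or attach arbitrary JSON-serializable data to the CompactionEntry.
function onBeforeCompact(event: SessionBeforeCompactEvent): void {
  if (event.messagesToCompact.length < 10) {
    event.cancelled = true // too little history to be worth compacting
    return
  }
  event.customSummary = `Compacted ${event.messagesToCompact.length} messages.`
  event.details = { readFiles: [], modifiedFiles: ["src/agent/compaction.ts"] }
}
```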
2. Claude Code Compaction Strategy
2.1 Overall Design
Claude Code implements a three‑tier hierarchical architecture:
user triggers /compact
↓
[Tier 1] Session Memory Compaction
↓ (fallback if unavailable)
[Tier 2] Microcompact
↓ (always runs before Tier 3)
[Tier 3] Traditional Compaction (full LLM summary)
↓
Re‑assemble key context
The principle is “cheapest first, most expensive last”.
Key parameters: effectiveWindow = modelWindow – reserveTokens. The automatic trigger fires when usage reaches effectiveWindow – 13 K tokens; manual /compact is blocked once usage passes effectiveWindow – 3 K tokens. The fixed token buffers (13 K, 3 K) keep the calculation consistent across window sizes (e.g., a 200 K window and a 1 M window both reserve the same 33 K).
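A sketch of those threshold calculations (constants as described above; the helper names are mine, not Claude Code's):
```typescript
const AUTO_COMPACT_BUFFER = 13_000   // auto-compaction margin below the effective window
const MANUAL_COMPACT_BUFFER = 3_000  // /compact is refused inside this margin

function effectiveWindow(modelWindow: number, reserveTokens: number): number {
  return modelWindow - reserveTokens
}

// Automatic compaction fires once usage crosses effectiveWindow - 13 K.
function shouldAutoCompact(usedTokens: number, modelWindow: number, reserveTokens: number): boolean {
  return usedTokens >= effectiveWindow(modelWindow, reserveTokens) - AUTO_COMPACT_BUFFER
}

// Manual /compact is blocked once usage passes effectiveWindow - 3 K.
function canManuallyCompact(usedTokens: number, modelWindow: number, reserveTokens: number): boolean {
  return usedTokens < effectiveWindow(modelWindow, reserveTokens) - MANUAL_COMPACT_BUFFER
}
```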
2.2 Trigger Mechanism
Because the automatic threshold sits close to the hard context limit, rapid token growth (large file reads, verbose tool output) can jump straight to the hard limit, causing both automatic compression and manual /compact to fail—the compression request itself may not fit into the context window.
2.3 Tier 1 – Session Memory Compaction
Claude Code maintains a session‑memory file (a structured Markdown document) that tracks what has happened. When /compact runs, the file is read, already‑summarized sections are identified, recent messages are kept, and oversized sections are truncated (each part limited to 2000 tokens, retaining at least five text‑bearing messages, roughly 10 K–40 K tokens). If this reconstruction succeeds, no LLM call is made; otherwise the system falls back to a full summary.
This layer provides zero API cost because it reuses the pre‑maintained summary file.
2.4 Tier 2 – Microcompact
Microcompact walks the message history and strips high‑token, low‑information content from older turns: it removes base64 data from image blocks and truncates excessively long tool outputs that are no longer in the model’s working memory.
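A rough sketch of what such a local pass might look like (illustrative block types and limits; not Claude Code's actual implementation):
```typescript
// Illustrative content block; real message blocks are richer.
interface Block {
  type: "text" | "image" | "tool_result"
  text?: string
}

const TOOL_OUTPUT_LIMIT = 2_000 // illustrative cap for stale tool output

// Strip high-token, low-information content from an older block:
// drop inline image payloads and truncate long, stale tool outputs.
function microcompactBlock(block: Block): Block {
  if (block.type === "image") {
    return { type: "text", text: "[image removed during microcompact]" }
  }
  if (block.type === "tool_result" && block.text && block.text.length > TOOL_OUTPUT_LIMIT) {
    const removed = block.text.length - TOOL_OUTPUT_LIMIT
    return { ...block, text: block.text.slice(0, TOOL_OUTPUT_LIMIT) + `\n[... ${removed} characters truncated]` }
  }
  return block
}
```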
When the prompt cache is hot, Claude Code does not edit messages locally; instead it sends a cache_edits block with the API request, instructing the server to delete specific tool_use_id blocks while preserving the cache prefix. This achieves near‑zero cache‑invalidating cost.
2.5 Tier 3 – Traditional Compaction (Full LLM Summary)
The final tier uses an LLM to generate a structured summary. The compression prompt requires nine sections:
Primary Request and Intent
Key Technical Concepts
Files and Code Sections
Errors and fixes
Problem Solving
All user messages
Pending Tasks
Current Work
Optional Next Step
The prompt also mandates verbatim quoting of any next step to prevent task drift.
Compression requests explicitly disable tools (tools: []) so the model cannot invoke read or bash during summarization. Tool‑use/result pairs are never split; the start and end points of compression always respect whole tool call/result pairs, preserving dialogue consistency.
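A sketch of how such a request might be assembled (the section list is the one described above; the request shape and helper name are illustrative, not Claude Code's actual code):
```typescript
const SUMMARY_SECTIONS = [
  "Primary Request and Intent",
  "Key Technical Concepts",
  "Files and Code Sections",
  "Errors and fixes",
  "Problem Solving",
  "All user messages",
  "Pending Tasks",
  "Current Work",
  "Optional Next Step",
]

// Illustrative request builder: tools are disabled so the model cannot
// call read/bash while summarizing, and the prompt demands that any
// agreed next step be quoted verbatim to prevent task drift.
function buildCompactionRequest(history: { role: string; content: string }[]) {
  const prompt =
    "Summarize the conversation so far into the following sections:\n" +
    SUMMARY_SECTIONS.map((s, i) => `${i + 1}. ${s}`).join("\n") +
    "\nQuote any agreed next step verbatim."
  return {
    messages: [...history, { role: "user", content: prompt }],
    tools: [], // explicitly no tools during summarization
  }
}
```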
2.6 Post‑Compression Context Reconstruction
After generating the summary, Claude Code rebuilds the dialogue in the following order:
Compression boundary markers (metadata for UI)
Summary itself (≈ 10 K tokens)
Restored files (up to 5 files, each ≤ 5 K tokens)
Restored skills (≤ 25 K tokens total)
Restored memory files (e.g., CLAUDE.md, MEMORY.md, active plans)
Hook results (custom context‑recovery scripts)
Recent messages (≈ 20 full messages)
This design ensures the agent can immediately resume productive work after compression.
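As a sketch, the reconstruction can be pictured as a simple concatenation with per-category budgets (type and helper names are illustrative; the budgets are the approximate figures listed above):
```typescript
// Illustrative container for the pieces listed above.
interface RebuiltContext {
  boundaryMarker: string   // compression boundary metadata for the UI
  summary: string          // ~10 K tokens
  restoredFiles: string[]  // up to 5 files, each <= ~5 K tokens
  restoredSkills: string[] // <= ~25 K tokens total
  memoryFiles: string[]    // e.g. CLAUDE.md, MEMORY.md, active plans
  hookResults: string[]    // output of custom context-recovery scripts
  recentMessages: string[] // roughly the last 20 full messages
}

// Flatten the rebuilt context in the order described above.
function assembleContext(ctx: RebuiltContext): string[] {
  return [
    ctx.boundaryMarker,
    ctx.summary,
    ...ctx.restoredFiles.slice(0, 5),
    ...ctx.restoredSkills,
    ...ctx.memoryFiles,
    ...ctx.hookResults,
    ...ctx.recentMessages.slice(-20),
  ]
}
```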
2.7 Experimental Strategies (Feature Flags)
Two experimental strategies exist in the source:
Context Collapse: progressively compresses older dialogue fragments while keeping recent context clear. Controlled by the CONTEXT_COLLAPSE flag, it introduces persistent entry types (ContextCollapseCommitEntry, ContextCollapseSnapshotEntry) that survive session restarts.
Reactive Compact: when the REACTIVE_COMPACT flag is enabled, the dialogue is divided into “compact groups”. Each group is summarized with an independent LLM call, and the group summary replaces the original messages, preserving semantic cues.
3. Design Trade‑Off Analysis
3.1 Trigger Strategies: Reactive vs Proactive
pi‑mono’s dual‑mode trigger (threshold + overflow) converts a context overflow error into an internal recovery path; overflow automatically retries the truncated prompt, making the failure invisible to the user.
Claude Code’s fixed 13 K buffer is more conservative but can miss the window when token growth is rapid, causing both automatic and manual compression to fail—a tension between fixed token buffers and variable content growth.
3.2 Compression Cost Layering
Claude Code’s three‑tier architecture exemplifies cost layering: Tier 1 (zero cost), Tier 2 (local operations + cache_edits with no API cost), Tier 3 (expensive LLM call). pi‑mono relies on a single LLM‑based path, resulting in higher API cost in high‑frequency compression scenarios.
The cache_edits mechanism is especially clever: when the prompt cache is hot, the server precisely deletes tool‑result blocks without breaking the cache prefix, achieving near‑zero cache‑invalidating cost. pi‑mono’s tool‑result truncation occurs client‑side before the summary request and does not protect the cache.
3.3 Summary Quality & Drift Prevention
Claude Code’s nine‑section structured prompt and verbatim next‑step quoting directly address “task drift” – the gradual deviation of the agent’s understanding from the original intent after multiple compressions. By forcing inclusion of all user messages and exact next‑step text, the prompt builds a strong safeguard against drift.
pi‑mono’s iterative compression (using the previous summary as context for the next) offers a different continuity guarantee: each new summary builds on the prior one, theoretically preserving semantic coherence across many compressions.
3.4 Reconstruction Strategies: Minimal vs Proactive
pi‑mono’s reconstruction is minimal: after the summary, it appends the original messages starting from firstKeptEntryId without adding extra context.
Claude Code’s reconstruction is proactive: beyond the summary, it injects restored files, skills, memory files, and hook results, acknowledging that an agent often cannot resume efficiently with only a summary. This merges “memory recovery” with “work‑state recovery” in a single step.
4. Core Constraints Shared by Both Implementations
Tool‑pair integrity must be preserved; compression never splits a tool_use from its corresponding tool_result (see the sketch after this list).
Lossy compression coexists with complete historical retention (JSONL for pi‑mono, session‑memory file for Claude Code).
Summary prompts dictate which information survives the compaction process.
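A minimal sketch of enforcing tool-pair integrity when choosing a cut point (illustrative types; neither implementation's actual code, and real histories may interleave pairs more loosely):
```typescript
// Illustrative entry kinds; cut means entries[0..cut) are summarized
// and entries[cut..] are kept verbatim.
interface Entry {
  type: "user" | "assistant" | "tool_use" | "tool_result"
}

// If the proposed cut would separate a tool_use from its tool_result,
// move the cut back so the pair stays on the same side of the boundary.
function alignCutToToolPairs(entries: Entry[], cut: number): number {
  while (cut > 0 && entries[cut].type === "tool_result" && entries[cut - 1].type === "tool_use") {
    cut--
  }
  return cut
}
```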
References
https://github.com/badlogic/pi-mono/blob/main/packages/coding-agent/docs/compaction.md
https://platform.claude.com/cookbook/misc-session-memory-compaction
https://barazany.dev/blog/claude-codes-compaction-engine
https://blog.kubesimplify.com/claude-code-leak-what-the-source-actually-teaches
https://zread.ai/badlogic/pi-mono/19-context-compaction-strategy