Context Window Strategies in Agent Harnesses: Pi, OpenClaw, Claude Code, Letta, Alyx
This article analyzes how five Agent Harness frameworks—Pi, OpenClaw, Claude Code, Letta, and Alyx—manage the context window: file pagination, tool‑result limits, session pruning, and sub‑agent isolation. Across them, convergent design patterns emerge that treat the context window as a managed memory system.
1 Context Window Management Is No Longer Just a Prompt Issue
Agent Harnesses face a core engineering problem: the context window is too small to hold everything a model might need to remember. As sessions grow, file reads expand, sub‑agents multiply, and tool outputs accumulate, the Harness must decide what stays in the work set, what gets compressed, and what is deferred for later retrieval.
2 Core Bet: Trust the Model to Manage Its Own Context
Each context‑management decision carries an assumption about model behavior. The key question is whether the Harness should actively constrain context usage or rely on the model to manage the budget correctly.
3 Large‑File Context Management
When a file is too large for the context window, the Harnesses refuse to load it wholesale: Pi, OpenClaw, and Claude Code paginate reads with offset and limit parameters, while Letta serves a managed view backed by search tools.
3.1 Pi (pi‑mono)
Pi imposes a hard ceiling of 2,000 lines or 50 KB, whichever comes first, truncating from the head and appending a clear continuation prompt:
[Showing lines 1-2000 of 50000. Use offset=2001 to continue.]

The tool description also repeats the notice: “output is truncated to 2000 lines or 50KB. Use offset/limit for large files.” Pi’s approach is Harness‑first: the framework enforces the limit first, then teaches the model how to paginate.
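A minimal sketch of this head‑truncation scheme, assuming a Node.js runtime (readFilePage and the constants are illustrative names, not Pi’s actual identifiers):

```typescript
import { readFileSync } from "node:fs";

const MAX_LINES = 2_000;     // hard line ceiling
const MAX_BYTES = 50 * 1024; // hard byte ceiling

// Return one page of a file plus a continuation notice when truncated.
function readFilePage(path: string, offset = 1, limit = MAX_LINES): string {
  const allLines = readFileSync(path, "utf8").split("\n");
  const page: string[] = [];
  let bytes = 0;
  let i = offset - 1;
  // Stop at whichever ceiling is hit first: line count or byte budget.
  for (; i < allLines.length && page.length < limit; i++) {
    bytes += Buffer.byteLength(allLines[i], "utf8") + 1;
    if (bytes > MAX_BYTES) break;
    page.push(allLines[i]);
  }
  let out = page.join("\n");
  if (i < allLines.length) {
    out +=
      `\n[Showing lines ${offset}-${i} of ${allLines.length}. ` +
      `Use offset=${i + 1} to continue.]`;
  }
  return out;
}
```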
3.2 OpenClaw
OpenClaw inherits Pi’s 2,000‑line/50 KB truncation and adds limits for bootstrap files: a maximum of 12,000 characters per file and 60,000 characters total. When the bootstrap budget is exceeded, it keeps 75 % of the head and 25 % of the tail. Tool results have a separate budget of 16,000 characters or 30 % of the context window, whichever is smaller. If the tail appears important (e.g., errors, JSON closing brackets, “summary” keywords), OpenClaw switches to a head‑and‑tail retention mode; otherwise it keeps only the head.
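The head‑versus‑tail decision can be sketched as below; the keyword patterns and function names are assumptions, and the 75/25 split is borrowed from the bootstrap rule described above:

```typescript
// Does the tail of an oversized output look too important to drop?
// The keyword patterns are illustrative; OpenClaw's actual list may differ.
const TAIL_SIGNALS = [/error/i, /summary/i, /[}\]]\s*$/];

function tailLooksImportant(text: string): boolean {
  const tail = text.slice(-2_000); // inspect only the last stretch
  return TAIL_SIGNALS.some((re) => re.test(tail));
}

// Fit text into a character budget: head-only by default, or a head-and-tail
// split (75/25, borrowed from the bootstrap rule) when the tail matters.
function fitToBudget(text: string, budget: number): string {
  if (text.length <= budget) return text;
  if (!tailLooksImportant(text)) {
    return text.slice(0, budget) + "\n[...truncated...]";
  }
  const headLen = Math.floor(budget * 0.75);
  const tailLen = budget - headLen;
  return (
    text.slice(0, headLen) +
    "\n[...middle truncated...]\n" +
    text.slice(text.length - tailLen)
  );
}
```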
OpenClaw’s strategy is layered defense: Pi’s truncation, then bootstrap limits, then tool‑result budgeting.
3.3 Claude Code
Claude Code applies two defensive gates. Before opening a file it checks a 256 KB byte limit; files larger than this are rejected with an error prompting the model to use offset / limit or grep. After reading, the output is re‑counted against a 25,000‑token budget to catch high‑density files. Both limits are adjustable via Anthropic’s GrowthBook feature flags without a new release.
Even when a file is below the byte limit, the tool returns only the first 2,000 lines, truncating any line longer than 2,000 characters. The model must explicitly request more via offset / limit. The tool description itself acts as a prompt, explaining pagination, size limits, supported formats (images, PDFs, notebooks), and encouraging parallel file reads. A conditional instruction can expose the 256 KB limit directly in the prompt.
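A sketch of the two gates, assuming Node.js; all names are illustrative, and the chars/4 estimate stands in for whatever token counter the real tool uses:

```typescript
import { readFileSync, statSync } from "node:fs";

const MAX_FILE_BYTES = 256 * 1024; // pre-read byte gate
const MAX_READ_TOKENS = 25_000;    // post-read token gate
// chars/4 is a rough stand-in for a real token counter.
const estimateTokens = (s: string): number => Math.ceil(s.length / 4);

function readWithGates(path: string, offset = 1, limit = 2_000): string {
  // Gate 1: refuse oversized files up front, with an actionable error.
  const { size } = statSync(path);
  if (size > MAX_FILE_BYTES) {
    throw new Error(
      `File is ${size} bytes (limit ${MAX_FILE_BYTES}). ` +
        `Use offset/limit to page through it, or grep for what you need.`
    );
  }
  const page = readFileSync(path, "utf8")
    .split("\n")
    .slice(offset - 1, offset - 1 + limit)                        // 2,000-line window
    .map((l) => (l.length > 2_000 ? l.slice(0, 2_000) + "…" : l)) // cap line length
    .join("\n");
  // Gate 2: re-count the result against a token budget to catch dense files.
  if (estimateTokens(page) > MAX_READ_TOKENS) {
    throw new Error(
      `Output exceeds ${MAX_READ_TOKENS} tokens; request a smaller range.`
    );
  }
  return page;
}
```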
Claude Code also de‑duplicates reads: repeated reads of an unchanged file return a stub instead of the full content, saving tokens.
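One plausible implementation of the de‑duplication, keyed on modification time (an assumption; the real mechanism may hash content or track versions):

```typescript
import { statSync } from "node:fs";

// Serve a stub when a file is re-read unchanged since the last read.
const lastServed = new Map<string, number>();

function readDeduped(path: string, read: (p: string) => string): string {
  const mtime = statSync(path).mtimeMs;
  if (lastServed.get(path) === mtime) {
    return `[${path} unchanged since last read; content omitted to save tokens]`;
  }
  lastServed.set(path, mtime);
  return read(path);
}
```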
Claude Code’s approach is Harness‑first with remote adjustability: pre‑read byte gate, post‑read token gate, default line/character limits, executable error messages, rich tool prompts, read de‑duplication, and server‑side feature flags.
3.4 Letta
Letta takes a fundamentally different route. Uploaded files are parsed, chunked, and embedded into a vector store, giving the Agent both exact and semantic search. It offers three file tools:
open_files – view raw text
grep_files – exact pattern matching on raw text
semantic_search_files – semantic retrieval over embeddings
When a file is “opened” in the Agent’s context, its visible content is truncated to a per‑file character limit that scales with model context size (5,000 characters at 8K context, 15,000 at 32K, 25,000 at 128K, 40,000 above 200K). The number of simultaneously opened files also scales, from 3 for small models up to 15 for large ones, with a default of 5. Exceeding either limit evicts the least‑recently used open file (LRU).
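A sketch of the scaled view limit and LRU eviction; the names and the handling of the 128K–200K gap are assumptions:

```typescript
// Per-file view limit scales with model context size. Tier values are from
// the text above; handling of the 128K-200K gap is an assumption.
function perFileCharLimit(contextTokens: number): number {
  if (contextTokens <= 8_000) return 5_000;
  if (contextTokens <= 32_000) return 15_000;
  if (contextTokens <= 200_000) return 25_000; // 128K tier, extended upward
  return 40_000; // above 200K
}

// Open files behave as an LRU set: opening past the cap evicts the
// least-recently used view. A Map preserves insertion order, so deleting and
// re-inserting on access keeps the oldest entry first.
class OpenFiles {
  private views = new Map<string, string>();
  constructor(private maxOpen = 5) {}

  open(path: string, content: string, charLimit: number): void {
    if (this.views.has(path)) this.views.delete(path); // refresh recency
    this.views.set(path, content.slice(0, charLimit)); // store truncated view
    if (this.views.size > this.maxOpen) {
      const oldest = this.views.keys().next().value as string;
      this.views.delete(oldest); // LRU eviction
    }
  }
}
```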
Letta’s philosophy is memory‑first: the file exists both as raw text and as an embedding, while the context window only shows a managed view. The model accesses additional content through tools.
4 The Real Engineering Challenge: Session Pruning
As conversations lengthen, each Harness must decide what to keep and what to discard, making compression strategies crucial for long‑running Agents to remain coherent.
4.1 Pi
Pi uses LLM‑driven compaction, triggered when estimated context tokens exceed contextWindow - reserveTokens (default reserve = 16,384 tokens). It keeps the most recent ~20,000 tokens (keepRecentTokens) and sends older content to the LLM for summarization. The summary is inserted as a synthetic user message before the retained tail. Tool calls and results are never orphaned; the system rolls the split point back to preserve paired boundaries.
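In outline, the trigger and split logic reads like this sketch; the message shape, chars/4 estimate, and stubbed summary stand in for Pi’s internals:

```typescript
interface Msg {
  role: "user" | "assistant" | "toolResult";
  content: string;
}

const tokensOf = (m: Msg) => Math.ceil(m.content.length / 4); // rough estimate
const totalTokens = (msgs: Msg[]) => msgs.reduce((n, m) => n + tokensOf(m), 0);

function maybeCompact(
  history: Msg[],
  contextWindow: number,
  reserveTokens = 16_384,
  keepRecentTokens = 20_000
): Msg[] {
  if (totalTokens(history) <= contextWindow - reserveTokens) return history;

  // Walk back from the end until the retained tail holds ~keepRecentTokens.
  let split = history.length;
  let tail = 0;
  while (split > 0 && tail < keepRecentTokens) {
    split--;
    tail += tokensOf(history[split]);
  }
  // Roll back so a tool result is never orphaned from its tool call.
  while (split > 0 && history[split].role === "toolResult") split--;

  // Older content is summarized (an LLM call in Pi; stubbed here) and the
  // summary is inserted as a synthetic user message before the retained tail.
  const summary: Msg = {
    role: "user",
    content: `[Summary of ${split} earlier messages]`, // LLM summary goes here
  };
  return [summary, ...history.slice(split)];
}
```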
4.2 OpenClaw
Building on Pi’s compaction, OpenClaw adds two mechanisms. Compaction triggers when history exceeds 50 % of the context window (maxHistoryShare = 0.5). History is split into token‑balanced chunks; the oldest chunk is dropped, and the remaining chunks are retained after repairing any orphaned tool‑call/result pairs. Summaries of the discarded chunks are generated in multiple LLM passes with merge steps, then inserted as synthetic user messages, just as in Pi.
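The token‑balanced split can be sketched as a greedy pass (the strategy and names are assumptions; OpenClaw’s actual algorithm may differ):

```typescript
// Split a message history into `numChunks` chunks of roughly equal token mass.
function chunkByTokens<T>(
  items: T[],
  tokensOf: (item: T) => number,
  numChunks: number
): T[][] {
  const total = items.reduce((sum, it) => sum + tokensOf(it), 0);
  const target = total / numChunks;
  const chunks: T[][] = [[]];
  let acc = 0;
  for (const item of items) {
    // Start a new chunk once the current one reaches its token target.
    if (acc >= target && chunks.length < numChunks) {
      chunks.push([]);
      acc = 0;
    }
    chunks[chunks.length - 1].push(item);
    acc += tokensOf(item);
  }
  return chunks; // chunks[0] is the oldest; it is the one dropped and summarized
}
```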
Before compression, a silent agentic turn persists the Agent’s state to memory files. Then a second layer performs non‑destructive memory trimming of tool results (soft‑trim then hard‑clear) and sets a 5‑minute TTL cache, protecting persistent dialogue while freeing context for the current request.
4.3 Claude Code
Claude Code runs a pre‑query optimization and LLM‑driven compaction. Compaction triggers when estimated tokens exceed the context window minus a 13,000‑token buffer (≈187 K tokens for a 200 K context model). The entire conversation is sent to the model with a structured nine‑part prompt covering the primary request, key concepts, files/code, errors, problem‑solving steps, all user messages, TODOs, current work, and optional next steps. The summary becomes a user message indicating the session continues from a previously exhausted conversation.
After compression, up to five recently read files are re‑attached within the token budget. Summarization safety is ensured by generating an analysis scratchpad and final summary in separate tag blocks; the scratchpad is stripped before the summary re‑enters the context.
If the compaction call itself hits the context limit, a deterministic head‑drop removes the oldest API round groups (20 % of groups or enough to fill the token gap).
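A sketch of that deterministic fallback; the group type and names are illustrative:

```typescript
// When even the compaction request overflows, fall back to a deterministic
// head-drop: remove the oldest API round groups, at least 20% of them, or
// enough to free the token gap.
function headDrop<Group>(
  groups: Group[],
  tokensOf: (g: Group) => number,
  tokenGap: number
): Group[] {
  const minDrop = Math.ceil(groups.length * 0.2);
  let dropped = 0;
  let freed = 0;
  while (dropped < groups.length && (dropped < minDrop || freed < tokenGap)) {
    freed += tokensOf(groups[dropped]);
    dropped++;
  }
  return groups.slice(dropped); // oldest groups removed, newest retained
}
```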
Claude Code also runs a pre‑query optimization on every API call, regardless of context pressure. Overly large tool results are persisted to disk and replaced with a 2 KB preview. The single‑tool limit is 50,000 characters and the aggregate per‑message limit is 200,000 characters, so even a 60 KB grep result leaves the context in the first round.
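The persist‑and‑preview step might look like this, assuming Node.js; the storage location and file naming are assumptions:

```typescript
import { writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

const SINGLE_TOOL_LIMIT = 50_000; // characters, per the limits above
const PREVIEW_BYTES = 2 * 1024;   // 2 KB in-context preview

// Persist an oversized tool result to disk and keep only a preview plus a
// pointer in context.
function offloadIfLarge(result: string, callId: string): string {
  if (result.length <= SINGLE_TOOL_LIMIT) return result;
  const path = join(tmpdir(), `tool-result-${callId}.txt`);
  writeFileSync(path, result, "utf8");
  return (
    result.slice(0, PREVIEW_BYTES) +
    `\n[Truncated: ${result.length} chars total; full output saved to ${path}]`
  );
}
```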
4.4 Letta
Letta employs multiple compaction strategies and a two‑stage fallback summarizer when the main path overflows. Compaction triggers when estimated usage exceeds 90 % of the context window. A sliding‑window eviction starts at 30 % of messages and increments by 10 % until usage falls below the target, preserving newest messages and discarding oldest.
Self‑compact mode uses the Agent’s own model to generate summaries, avoiding a separate summarizer cost. If the summarizer overflows, a two‑stage fallback first trims tool returns to 5,000 characters and retries; if still overflowing, it truncates the transcript to keep 30 % head and 30 % tail, dropping the middle.
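The sliding‑window loop, sketched with an assumed usage estimator and illustrative names:

```typescript
// Evict from the oldest end: start at 30% of messages and widen in 10% steps
// until estimated usage drops below the 90% target.
function slidingWindowEvict(
  messages: string[],
  usedFraction: (msgs: string[]) => number, // estimated share of context used
  target = 0.9
): string[] {
  let evictFraction = 0.3;
  let kept = messages;
  while (usedFraction(kept) > target && evictFraction < 1) {
    const evictCount = Math.floor(messages.length * evictFraction);
    kept = messages.slice(evictCount); // newest messages are preserved
    evictFraction += 0.1;
  }
  return kept;
}
```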
A separate 75 % memory‑warning threshold alerts before the 90 % compaction trigger.
5 Sub‑Agent Context Management
Across the examined Harnesses, sub‑agents are isolated from the parent session; none copies the full parent transcript into the sub‑agent by default. The key question is how much workspace context a sub‑agent inherits.
5.1 Pi
Pi spawns a new process for each delegated task, using only the task string as the user message; no parent history is passed.
5.2 OpenClaw
OpenClaw gives sub‑agents a fresh isolated session by default. A fork mode can copy the parent transcript for same‑type Agent spawns, but the workspace context is filtered to a minimal allowlist (AGENTS.md, TOOLS.md, SOUL.md).
5.3 Claude Code
Claude Code has two paths. The default typed‑agent path creates an empty dialogue with the delegation prompt as the sole user message. The newer fork path copies the full parent history, adds a synthetic assistant message and a placeholder tool result, and grants an explicit allowlist of skills (Read, Grep, Glob, Shell, Edit, Write, WebSearch, etc.). Skills are pre‑loaded and injected as user messages.
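Purely as an illustration of the fork path’s shape (the message contents here are invented, not Claude Code’s actual strings):

```typescript
// The parent transcript is copied, a synthetic assistant turn and placeholder
// tool result are appended, and the sub-agent starts with the delegation prompt.
interface ChatMsg {
  role: "user" | "assistant" | "tool";
  content: string;
}

function forkSubAgentContext(
  parentHistory: ChatMsg[],
  delegationPrompt: string,
  allowedTools: string[]
): ChatMsg[] {
  return [
    ...parentHistory,
    { role: "assistant", content: "[synthetic: delegating to sub-agent]" },
    { role: "tool", content: "[placeholder result for the delegation call]" },
    // The sub-agent sees its tool allowlist and the task as its first turn.
    {
      role: "user",
      content: `Tools available: ${allowedTools.join(", ")}\n\n${delegationPrompt}`,
    },
  ];
}
```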
5.4 Letta
Letta does not fork for ordinary tool execution; tools run in the main Agent loop. History is accessed via dedicated search tools: conversation search for recall memory and archival‑memory search for the embedding store.
6 Why These Designs Converge
Comparing the four codebases reveals striking similarity rather than difference. Agent engineering is reframing the “context window is too small” problem into a classic systems issue: managing a fixed‑size work set.
All four Harnesses set hard limits on file reads, window or paginate large content, cap tool‑result sizes, isolate sub‑agent sessions, run LLM‑driven compaction at token thresholds, and estimate context usage to detect pressure. These convergences are not coincidental; they are answers to the same engineering constraint.
Specific design choices also rhyme: Pi and OpenClaw truncate file reads at the head and add continuation prompts; Claude Code and OpenClaw persist oversized tool results outside the context. Pi, OpenClaw, and Claude Code all ensure tool call/result boundaries remain intact during compaction. Two systems, OpenClaw and Claude Code, support forking the parent transcript into sub‑agents. Even Alyx, a data‑exploration assistant unrelated to code editing, independently arrived at the same patterns: limiting tool results to a token budget, using binary search to fit the largest slice, de‑duplicating repeated previews, splitting large JSON payloads into compressed previews with server‑side full copies, estimating token pressure with a char/4 heuristic, and checkpointing at 50,000 tokens.
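Alyx’s char/4 estimate and binary‑search fitting can be sketched together (function names are illustrative):

```typescript
// char/4 pressure estimate plus binary search for the largest prefix that
// fits a token budget.
const estimateTokens = (s: string): number => Math.ceil(s.length / 4);

function largestFittingSlice(text: string, tokenBudget: number): string {
  if (estimateTokens(text) <= tokenBudget) return text;
  let lo = 0;
  let hi = text.length;
  // Invariant: slice(0, lo) fits; anything past hi does not.
  while (lo < hi) {
    const mid = Math.ceil((lo + hi) / 2); // bias upward to guarantee progress
    if (estimateTokens(text.slice(0, mid)) <= tokenBudget) lo = mid;
    else hi = mid - 1;
  }
  // The notice itself costs a few tokens; a stricter version reserves for it.
  return text.slice(0, lo) + "\n[...truncated to fit token budget...]";
}
```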
Fifty years of computer history shows that the best memory management is invisible to the program—registers, cache lines, page tables, swap. Agent Harnesses are moving toward the same goal: delivering the right work set to the model at the right time while letting the model dynamically manage its own context.