How CodeGenius Re‑engineered Memory to Tame AI Agent Context Bloat
This article explains how the rapid evolution of AI agents caused context explosion and why the original fixed‑window memory failed. It then describes the layered memory system CodeGenius introduced: unloading stale data, deduplicating files, generating structural summaries, and dynamically compressing dialogue to keep prompts stable, reduce token cost, and improve task continuity.
Background
As large‑language‑model (LLM) capabilities increase, code‑focused AI agents are evolving from simple chatbots into autonomous multi‑step executors. They must analyse user intent, read many files, invoke tools, and iteratively refine results. All of this information accumulates in the prompt, causing runaway context growth, higher latency, higher cost, and dilution of critical signals.
Problems with Fixed‑Window Memory
Context breakage – truncating to the last few turns discards essential information.
Cache invalidation – each truncation changes the prompt, preventing reuse of cached LLM responses.
Redundant noise – repeated file contents and outdated tool outputs waste tokens and confuse the model.
Memory System Goals
Control overall context size.
Preserve key semantics.
Improve model stability.
Reduce cost and latency.
Enable truly continuous task execution.
Key Mechanisms
1. Unloading Stale Information
After several dialogue rounds, historical messages often contain redundant data (e.g., code that has already been executed). The system removes tool inputs/outputs older than five turns and stores them as external files, keeping only file paths and brief hints in the prompt. To avoid constant cache loss, unloading is batched: every five turns the oldest data is purged, balancing token reduction with cache reuse.
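The batched unloading policy described above can be sketched as follows. This is an illustrative reconstruction, not the actual CodeGenius code: the `ToolMessage` shape, the `/tmp/ctx/...` paths, and the function names are assumptions.

```typescript
// Sketch: batched unloading of stale tool messages (illustrative only).
// Assumes each message records the dialogue turn that produced it.
interface ToolMessage {
  turn: number;
  tool: string;
  payload: string;        // full tool input/output
  externalPath?: string;  // set once the payload has been offloaded
}

const STALE_AFTER_TURNS = 5; // messages older than this become candidates
const BATCH_EVERY_TURNS = 5; // purge only every fifth turn to keep the prefix stable

function unloadStale(messages: ToolMessage[], currentTurn: number): ToolMessage[] {
  // Only purge on batch boundaries, so between batches the prompt prefix,
  // and therefore the LLM prompt cache, stays byte-for-byte unchanged.
  if (currentTurn % BATCH_EVERY_TURNS !== 0) return messages;

  return messages.map((m) => {
    const isStale = currentTurn - m.turn > STALE_AFTER_TURNS;
    if (!isStale || m.externalPath) return m;
    // Replace the heavy payload with a path plus a brief hint; a real
    // system would write the full content to this external file.
    const externalPath = `/tmp/ctx/turn-${m.turn}-${m.tool}.txt`;
    return { ...m, payload: `[unloaded: see ${externalPath}]`, externalPath };
  });
}
```

The key design point is that purging continuously on every turn would save slightly more tokens but would invalidate the cached prefix each time; purging in batches trades a little extra context for far better cache reuse.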
This matters because Claude‑series models charge 0.1× the base input rate for reading cached tokens but 1.25× for creating cache entries, so keeping the prompt prefix stable directly lowers cost.
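A rough cost calculation shows why cache reuse dominates. The multipliers come from the pricing note above; the base per‑token price and token counts are placeholder assumptions for illustration.

```typescript
// Rough cost model for prompt-caching multipliers:
// cache reads bill at 0.1x the base input price, cache writes at 1.25x.
const CACHE_READ_MULT = 0.1;
const CACHE_WRITE_MULT = 1.25;

function inputCost(
  baseUsdPerToken: number,
  cachedTokens: number, // prefix tokens reused from cache
  freshTokens: number,  // new tokens written into the cache this turn
): number {
  return baseUsdPerToken * (cachedTokens * CACHE_READ_MULT + freshTokens * CACHE_WRITE_MULT);
}

// Assumed base price: $3 per million input tokens (placeholder, not a quote).
const base = 3e-6;
// Reusing a stable 40k-token prefix vs. re-writing all 41k tokens each turn:
const withCacheHit = inputCost(base, 40_000, 1_000);
const withoutCache = inputCost(base, 0, 41_000);
```

Reading a token from cache is 12.5× cheaper than writing it (0.1× vs 1.25×), which is why the unloading policy goes out of its way to avoid perturbing the prefix.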
2. File Deduplication and Summarisation
File contents dominate token usage. The strategy follows an append‑only model: the full file is sent only on the first read; subsequent edits send only diffs; otherwise only the file path is referenced. For large files (>3000 lines), tree‑sitter extracts a concise summary containing type definitions, variable declarations, and function signatures, allowing the model to fetch only the relevant sections.
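The append‑only file strategy above can be sketched as a small decision function. The names (`FileRef`, `referenceFile`) and the diff handling are illustrative assumptions, not the CodeGenius implementation. (The `RuleNode` interface below is the article's example of what a structural summary retains: type definitions and signatures rather than full file bodies.)

```typescript
// Sketch of the append-only file context policy (illustrative names).
type FileRef =
  | { kind: 'full'; path: string; content: string } // first read: full content
  | { kind: 'diff'; path: string; diff: string }    // edited since last read: diff only
  | { kind: 'path'; path: string };                 // already in context: path reference

function referenceFile(
  seen: Set<string>,   // paths already sent in full earlier in the conversation
  path: string,
  content: string,
  diff?: string,       // present only when the file changed since last reference
): FileRef {
  if (!seen.has(path)) {
    seen.add(path);
    return { kind: 'full', path, content }; // pay the full token cost exactly once
  }
  if (diff) return { kind: 'diff', path, diff };
  return { kind: 'path', path };
}
```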
interface RuleNode {
  title: string;
  key: string;
  hitRate: number;
  totalCases: number;
  nodeType: NodeTypeEnum | 'logicGroup';
  logicType?: 'AND' | 'OR';
  isLeaf?: boolean;
  children?: RuleNode[];
}

3. Dynamic Dialogue Summarisation
Even after unloading and deduplication, context still grows with each turn. The system triggers a summarisation step that collapses the entire conversation into a 2‑3 KB structured summary, preserving intent, technical concepts, file references, errors, problem‑solving steps, pending tasks, and current work. The summary follows nine sections (Primary Request, Key Concepts, Files & Code, Errors, Problem Solving, All User Messages, Pending Tasks, Current Work, Optional Next Step) and is generated using the Claude Code Compact prompt.
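The nine sections listed above can be pictured as a typed structure plus a renderer. The field names mirror the section titles from the article; the rendering format and `renderSummary` helper are illustrative assumptions, not the actual Compact prompt output.

```typescript
// The nine-section compact summary, sketched as a typed structure.
interface CompactSummary {
  primaryRequest: string;      // Primary Request
  keyConcepts: string[];       // Key Concepts
  filesAndCode: string[];      // Files & Code (paths plus relevant snippets)
  errors: string[];            // Errors
  problemSolving: string[];    // Problem Solving
  allUserMessages: string[];   // All User Messages, condensed
  pendingTasks: string[];      // Pending Tasks
  currentWork: string;         // Current Work
  optionalNextStep?: string;   // Optional Next Step
}

// Render into the ~2-3 KB text block that replaces the full history.
function renderSummary(s: CompactSummary): string {
  const section = (title: string, body: string) => `## ${title}\n${body}`;
  return [
    section('Primary Request', s.primaryRequest),
    section('Key Concepts', s.keyConcepts.join('\n')),
    section('Files & Code', s.filesAndCode.join('\n')),
    section('Errors', s.errors.join('\n')),
    section('Problem Solving', s.problemSolving.join('\n')),
    section('All User Messages', s.allUserMessages.join('\n')),
    section('Pending Tasks', s.pendingTasks.join('\n')),
    section('Current Work', s.currentWork),
    ...(s.optionalNextStep ? [section('Optional Next Step', s.optionalNextStep)] : []),
  ].join('\n\n');
}
```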
4. Compression Triggers
When context usage reaches ~70 % of the model window, compression runs pre‑emptively.
If a new user topic is unrelated to the existing context, the system compresses history to free space for the fresh task.
Compression only occurs when the token saving exceeds the cost of generating the summary (typically when history >3 K tokens).
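The three triggers above combine into one gate. This is a minimal sketch: the 70 % fraction and the 3 K‑token floor come from the text, while the function shape and how `topicChanged` is detected are assumptions.

```typescript
// Sketch of the compression gate (thresholds from the article, names illustrative).
const WINDOW_FRACTION = 0.7;      // pre-emptive trigger at ~70% of the model window
const MIN_HISTORY_TOKENS = 3_000; // below this, the summary costs more than it saves

function shouldCompress(opts: {
  usedTokens: number;    // tokens currently occupying the context
  windowTokens: number;  // model context window size
  historyTokens: number; // tokens that compression could reclaim
  topicChanged: boolean; // new user request unrelated to existing context
}): boolean {
  const { usedTokens, windowTokens, historyTokens, topicChanged } = opts;
  // Cost-benefit floor: never compress when the saving can't exceed
  // the cost of generating the summary itself.
  if (historyTokens <= MIN_HISTORY_TOKENS) return false;
  // Trigger 1: approaching the window limit. Trigger 2: topic switch.
  return usedTokens >= windowTokens * WINDOW_FRACTION || topicChanged;
}
```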
Observed Benefits
Significant increase in prompt‑cache hit rate, lowering inference cost.
Improved generation quality for complex, multi‑file, multi‑step development tasks.
Average token consumption dropped due to deduplication, diff‑only updates, and structural summaries.
Future Directions
Context isolation via Sub‑Agent mechanisms to prevent unrelated tasks from contaminating the main context.
Hierarchical memory tiers: short‑term prompt, mid‑term structured summaries, long‑term external knowledge bases.
Dynamic policy optimisation that automatically adjusts thresholds and compression intensity based on context size and task complexity.
References
https://github.com/Yuyz0112/claude-code-reverse/blob/main/results/prompts/compact.prompt.md
https://drive.google.com/file/d/1QGJ-BrdiTGslS71sYH4OJoidsry3Ps9g/view
https://aider.chat/2023/10/22/repomap.html
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.