How to Compress Long LLM Conversations with Smart Summarization and Sliding Window
This article explains how to preserve the essential information in lengthy AI chat histories: an intelligent summarization prompt condenses older turns, the resulting summary is injected as a system message, and a sliding-window strategy keeps the most recent messages intact, reducing token cost while maintaining context continuity.
Problem: LLM short‑term memory and context window limits
Large language models receive the entire messages list with every request. As the conversation grows, token usage and latency increase, and older, less relevant turns can dilute the quality of the model's response.
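To make the cost concrete, here is a minimal sketch of a naive chat loop built on the same IChatClient and ChatMessage types (Microsoft.Extensions.AI) that appear in the code below; the AskAsync helper is illustrative only, not part of OpenDeepWiki. The full messages list is resent on every call, so token usage grows with each turn.

// Illustrative only (assumes: using Microsoft.Extensions.AI;).
// A naive chat loop resends the entire history on every request,
// so token usage and latency grow with each exchange.
var messages = new List<ChatMessage>();

async Task<string> AskAsync(IChatClient chatClient, string userInput, CancellationToken cancellationToken)
{
    messages.Add(new ChatMessage(ChatRole.User, userInput));

    // The whole list - every earlier turn - goes over the wire again.
    var response = await chatClient.GetResponseAsync(messages, cancellationToken: cancellationToken);

    messages.Add(new ChatMessage(ChatRole.Assistant, response.Text));
    return response.Text;
}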
Sliding‑window policy
A sliding‑window strategy discards messages that exceed a user‑defined maximum round count, keeping only the most recent exchanges. This prevents unbounded growth of the context while preserving immediate conversational coherence.
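A minimal sketch of such a window is shown below; the TrimToWindow helper and the maxRounds parameter are hypothetical names used for illustration, not taken from the project.

// Hypothetical sliding-window trim (assumes: using System.Linq; using Microsoft.Extensions.AI;).
// Keeps all system messages plus the last maxRounds user/assistant rounds
// (one round = one user turn + one assistant turn); older dialogue is discarded.
static List<ChatMessage> TrimToWindow(List<ChatMessage> messages, int maxRounds)
{
    var systemMessages = messages.Where(m => m.Role == ChatRole.System).ToList();
    var dialogue = messages.Where(m => m.Role != ChatRole.System).ToList();

    var keep = Math.Min(dialogue.Count, maxRounds * 2);
    var recent = dialogue.Skip(dialogue.Count - keep).ToList();

    var trimmed = new List<ChatMessage>(systemMessages);
    trimmed.AddRange(recent);
    return trimmed;
}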
Intelligent summarization compression
The open‑source project OpenDeepWiki (GitHub: https://github.com/AIDotNet/OpenDeepWiki) implements a concrete “compress‑context” workflow. The core method CompressContextAsync (file src/KoalaWiki/Agents/AutoContextCompress.cs) follows these steps:
private async Task<List<ChatMessage>> CompressContextAsync(CancellationToken cancellationToken)
{
    // 1. If the conversation is short, keep everything
    if (messages.Count <= 5)
    {
        return messages;
    }

    // 2. Always retain the last 3 messages for immediate coherence
    var keepLastCount = 3;
    var messagesToAnalyze = messages.Count - keepLastCount;
    var messagesToCompress = messages.Take(messagesToAnalyze).ToList();

    // 3. Build a system prompt that defines the summarization role
    var analysisMessages = new List<ChatMessage>
    {
        new ChatMessage(ChatRole.System,
            "You are a conversation summarization assistant for AI development conversations. " +
            "Your task is to analyze conversation history and create a concise, accurate summary. " +
            "Focus on extracting the most important information while maintaining context coherence.")
    };
    analysisMessages.AddRange(messagesToCompress);
    analysisMessages.Add(new ChatMessage(ChatRole.User, BuildCompressionPrompt()));

    // 4. Call the LLM to obtain a summary
    var response = await chatClient.GetResponseAsync(analysisMessages, cancellationToken: cancellationToken);
    var summary = response.Text ?? "Unable to generate summary";
    logger?.LogInformation("Compressed {CompressedCount} messages. Kept last {KeepCount} messages.",
        messagesToAnalyze, keepLastCount);

    // 5. Assemble a new message list
    var compressedMessages = new List<ChatMessage>();

    // Preserve all original system messages (they define role, language, constraints)
    var systemMessages = messages.Where(m => m.Role == ChatRole.System).ToList();
    compressedMessages.AddRange(systemMessages);

    // Insert the generated summary as a new system message wrapped in pseudo-XML tags
    compressedMessages.Add(new ChatMessage(ChatRole.System,
        $"<conversation-history-summary>\n[Summary of previous dialogue]\n{summary}\n</conversation-history-summary>"));

    // Append the most recent user/assistant turns
    var recentMessages = messages.Skip(messagesToAnalyze).ToList();
    compressedMessages.AddRange(recentMessages);

    messages = compressedMessages;
    return messages;
}

The helper BuildCompressionPrompt returns a carefully structured prompt:
private string BuildCompressionPrompt()
{
    return "Please compress the above conversation history into a concise summary.\n" +
           "The summary should:\n" +
           "1. Retain key information and context (file paths, class names, function names, code snippets, initial task goals).\n" +
           "2. Record important decisions and outcomes (technical choices, bug fixes, modified files, test results).\n" +
           "3. Preserve the final task requirements and pending issues.\n" +
           "4. Use concise language, not exceeding 1/3 of the original length, and present in bullet points.\n" +
           "5. Provide ONLY the summary without any additional explanations, meta-commentary, or preamble. Start directly with the summarized content.";
}

Key design considerations
Preserve system messages – System‑role messages encode the AI’s persona and constraints; dropping them can cause the model to lose its task definition.
Inject compressed summary – The summary is wrapped in <conversation-history-summary> tags and added as a new system message, making it easy for downstream code to recognise historical context (see the sketch after this list).
Sliding window – keepLastCount = 3 guarantees that the three most recent messages remain untouched, preserving immediate conversational flow.
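For instance, downstream code could locate and extract the injected summary roughly as follows; the ExtractHistorySummary helper is a hypothetical illustration, not part of OpenDeepWiki.

// Hypothetical helper (assumes: using System.Linq; using Microsoft.Extensions.AI;).
// Finds the system message carrying the pseudo-XML tags and returns the summary text.
static string? ExtractHistorySummary(IEnumerable<ChatMessage> messages)
{
    const string openTag = "<conversation-history-summary>";
    const string closeTag = "</conversation-history-summary>";

    var tagged = messages.FirstOrDefault(m =>
        m.Role == ChatRole.System && m.Text.Contains(openTag));
    if (tagged is null) return null;

    var start = tagged.Text.IndexOf(openTag) + openTag.Length;
    var end = tagged.Text.IndexOf(closeTag);
    return end > start ? tagged.Text[start..end].Trim() : null;
}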
Prompt engineering techniques
Role & domain specificity – The system prompt explicitly mentions “AI development conversations”, steering the model to weight code‑related details higher.
Information hierarchy – The user prompt splits extraction into four dimensions (key info, decisions, outcomes, pending tasks), ensuring hard technical facts are retained.
Quantitative constraints – Limit summary length to ≤ 1/3 of the original and require bullet points, producing a compact, token‑efficient output.
Negative constraints – Explicitly forbid any preamble or explanatory text, preventing the model from adding filler (“Sure, here is the summary…”).
Structural tagging – Pseudo‑XML tags allow programs to parse the summary without ambiguity and signal to the LLM that this block is historical context, not current input.
Rationale and trade‑offs
Retaining system messages avoids “role loss”, where the assistant drifts from its intended behavior. The sliding window alone would discard older technical details; the summarization step distills those details into a concise representation, keeping the model’s view of the task continuous while staying within the context window. The quantitative length constraint balances token savings against information loss – the 1/3 ratio was chosen empirically to keep most identifiers (file paths, class names) while cutting redundant dialogue. Negative constraints eliminate hallucinated explanations, which is crucial for automated pipelines that consume the summary directly.
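As a rough, illustrative calculation: if 40 older messages average 150 tokens each (about 6,000 tokens), the 1/3 constraint caps the summary near 2,000 tokens; even after adding back the three retained messages, the history resent on the next request shrinks by well over half.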
Outcome
The combined “summary compression + sliding window” approach reduces token consumption, shortens latency, and maintains the fidelity of technical context, enabling long‑running AI‑assisted development sessions without the model forgetting earlier decisions or code artifacts.