Why Do AI Agents Forget and Hallucinate? A Complete Guide to KV‑Cache Memory Mechanisms
This article traces AI agents' forgetting and hallucinations to token-level attention scores that drive key-value cache eviction before retrieval ever runs. It then surveys KV-cache basics, naive cache growth, StreamingLLM windowing, SnapKV's attention-guided compression, token-retention studies, and Memory Sparse Attention, compares these methods, and discusses practical system pitfalls and design implications.
1. Conclusion: Memory issues are not always retrieval problems
Your AI agent often forgets important context and even hallucinates details. This is usually not because Retrieval‑Augmented Generation (RAG) failed to retrieve anything or because the external memory store is poorly designed. The problem occurs earlier: critical tokens are dropped at the attention and KV-cache level before any retrieval system ever sees them.
Most memory problems arise before retrieval. At around the 4096‑th token, a key piece of information silently disappears from the cache and nothing restores it.
Every time the model processes a new token, attention assigns a score to every token already in context. High‑score tokens are kept; low‑score tokens are evicted when the cache is full.
This simple mechanism, without any external system, determines what the model continues to attend to.
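As a concrete sketch of that rule (an illustration of the principle, not any specific paper's algorithm): assume each cached entry carries an importance score, for example its accumulated attention weight, and the lowest-scoring entry is dropped once a fixed budget is exceeded.
# Minimal sketch: score-based eviction under a fixed cache budget (illustrative only)
def cache_with_eviction(cache, scores, new_entry, budget=4096):
    # cache: list of (key, value); scores: accumulated attention weight per entry
    cache.append(new_entry)
    scores.append(0.0)  # a fresh token has not accumulated any attention yet
    if len(cache) > budget:
        victim = min(range(len(scores)), key=lambda i: scores[i])
        cache.pop(victim)
        scores.pop(victim)
    return cache, scores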
KV‑cache compression papers repeatedly emphasize that the issue is not speed but what the model is allowed to remember. This is more a design problem for agents than an infrastructure issue.
2. What is KV Cache?
When a Transformer processes a sentence, it computes a (K, V) pair for each token and stores them. New tokens attend over all stored (K, V) pairs to decide the next output. Caching avoids recomputing the entire context at every step.
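A rough sketch of a single decoding step over the cache, simplified to one head and plain scaled dot-product attention (PyTorch, illustrative shapes):
# One decoding step: the new token's query attends over all cached (K, V) pairs
import torch

def decode_step(query, cached_keys, cached_values):
    # query: (d,), cached_keys / cached_values: (seq_len, d)
    scores = cached_keys @ query / cached_keys.shape[-1] ** 0.5   # (seq_len,)
    weights = torch.softmax(scores, dim=-1)                       # attention distribution
    return weights @ cached_values                                # output for the new token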
The problem is simple yet brutal:
Each new token permanently adds a (K, V) pair to the cache.
The cache size grows linearly with context length.
Running a 128k‑token context on a large model can consume tens of gigabytes of KV cache once every layer and head is counted.
Hardware does not care about your use case; it simply runs out of memory.
For AI agents this is a daily reality: every tool call, retrieved document, and dialogue turn consumes part of the budget, and once the budget is full something must be discarded.
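A back-of-envelope estimate makes the budget concrete. The settings below are illustrative (roughly a 70B-class model with grouped-query attention in fp16), not measurements:
# Approximate KV-cache size: 2 tensors (K and V) per layer, per KV head, per token
def kv_cache_gb(n_layers=80, n_kv_heads=8, head_dim=128, n_tokens=128_000, bytes_per_value=2):
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value / 1e9

print(f"{kv_cache_gb():.0f} GB")  # ~42 GB with these assumed settings; far more without grouped-query attention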
Problem: what to discard?
Think of it like phone storage:
You travel and keep taking photos.
At some point the phone warns that storage is full.
You can either stop taking photos or delete some.
Most naive systems simply stop.
A smarter system asks: what do you really need to keep?
# Simplest KV cache: never evict, just grow
kv_cache = []

def cache_token(key, value):
    kv_cache.append((key, value))
    # No scoring, no eviction, no budget check
    # After 128k tokens the list becomes huge

print(f"cache size: {len(kv_cache)} pairs")
print(f"approx memory: {len(kv_cache) * 2 * 4096 * 2 / 1e9:.2f} GB")
# For a 70B model with 128k tokens, the cache alone ≈ 64 GB

3. First Approach: Keep Only the Most Recent Content
The most direct method is StreamingLLM. It keeps the first few tokens (the attention sinks, which are important for stability) and a sliding window of recent tokens, discarding everything in between. This fixed‑budget rule is simple.
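A minimal sketch of the rule, assuming the cache is an ordered list of (K, V) pairs; the sink and window sizes here are illustrative defaults, not the paper's exact numbers.
# StreamingLLM-style retention: keep the attention sinks plus a recent sliding window
def streaming_keep(kv_pairs, n_sink=4, window=2048):
    # kv_pairs: list of (key, value) tuples in generation order
    if len(kv_pairs) <= n_sink + window:
        return kv_pairs
    return kv_pairs[:n_sink] + kv_pairs[-window:]   # everything in between is dropped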
It works: the model can run indefinitely without crashing due to cache overflow. The cost is that everything that happened thousands of tokens earlier is essentially lost. If an instruction appears at the beginning of the context and a later tool call pushes it out of the window, the model forgets its goal.
"Recent = important" is a harmful assumption; a directive from 40 k tokens ago may be the most critical piece of information.
4. SnapKV: Let the Model Identify Important Tokens
SnapKV reframes how to think about memory. It observes that each attention head tends to focus on the same class of tokens throughout generation.
This is not random.
Layer 7, head 4 often attends to task‑instruction tokens.
Layer 3, head 2 often attends to entity names.
This pattern is stable from the first output token to the last.
SnapKV asks: if the pattern is stable, why not use it to decide what to keep?
It defines an observation window – the last segment of the prompt – and looks for tokens that receive high attention scores inside this window. Tokens from earlier context that score highly become "heavy hitters" and are retained, while the rest are compressed.
It also adds a clustering step: simply picking the highest‑score tokens would produce isolated fragments. By max‑pool clustering, neighboring tokens of a heavy hitter are also kept, preserving enough context for understanding.
Results (with 1024 cache slots):
Compression ratio of 92 %.
Generation speed ↑ 3.6×.
Memory efficiency ↑ 8.2× for 16k‑token scenarios.
Runs 380k‑token context on a single A100.
Accuracy on a needle‑in‑a‑haystack test barely drops.
# Simplified SnapKV: select tokens per head
def snapkv_select(keys, values, obs_window_size=16, budget=1024):
    obs_queries = keys[-obs_window_size:]                   # observation window
    attn_scores = obs_queries @ keys[:-obs_window_size].T   # (obs, prefix_len)
    token_votes = attn_scores.sum(dim=0)                    # aggregate votes per prefix token
    topk_idx = token_votes.topk(budget).indices             # top-k heavy hitters
    # clustering: bring in neighbors of each heavy hitter
    cluster_idx = expand_with_neighbors(topk_idx, kernel_size=5)
    kept_keys = keys[cluster_idx]
    kept_values = values[cluster_idx]
    # new cache = selected prefix + full observation window
    return (concat(kept_keys, keys[-obs_window_size:]),
            concat(kept_values, values[-obs_window_size:]))

The crucial test for agents is whether a fact buried deep in a long context can still be retrieved. SnapKV answers yes: even after discarding 92% of the cache, the model can still locate the fact, delivering faster inference and a tightly constrained working memory at the same time.
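For completeness, here is one way the two helpers assumed in the sketch above could look; expand_with_neighbors and concat are illustrative stand-ins written for this article, not SnapKV's published code. The neighbor expansion approximates the paper's pooling-based clustering by keeping a small window around each selected index.
import torch

def expand_with_neighbors(indices, kernel_size=5, max_index=None):
    # Keep a window of kernel_size positions centered on each selected token
    half = kernel_size // 2
    neighbors = indices.unsqueeze(1) + torch.arange(-half, half + 1, device=indices.device)
    neighbors = neighbors.clamp(min=0)
    if max_index is not None:
        neighbors = neighbors.clamp(max=max_index)
    return torch.unique(neighbors.flatten())   # sorted, de-duplicated indices

def concat(a, b):
    # Stack kept prefix entries and the observation window along the sequence axis
    return torch.cat([a, b], dim=0)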
5. Cache What Lasts: Which Tokens Truly Survive?
SnapKV solves "how to compress". The token‑retention paper pushes further, asking why some tokens are always important.
It finds that certain tokens receive high attention across all layers and heads throughout generation – they are global, stable heavy hitters.
The insight: the model already computes these scores during every attention operation. Token‑retention makes this implicit ranking explicit and uses it for eviction decisions.
Intuition: like remembering the title, key names, and surprising numbers after reading a long document, the brain (or model) keeps high‑signal parts.
Recent ≠ important: a token 60 k positions ago can score higher than one 10 k positions ago.
Importance is measurable: cumulative attention weight is stable and reliable.
Eviction should use this score, not position, age, or random sampling.
Budget is fixed, but what fills it should be chosen intelligently.
def compute_token_importance(attention_weights):
    # attention_weights shape: (layers, heads, seq_len, seq_len)
    cumulative = attention_weights.sum(dim=(0, 1, 2))   # shape: (seq_len,)
    return cumulative

scores = compute_token_importance(attn_weights)
keep = scores.topk(budget).indices
evict = scores.topk(seq_len - budget, largest=False).indices

6. Memory Sparse Attention: Handling Long‑Term Memory in the Attention Layer
Memory Sparse Attention (MSA) approaches the problem from another angle. While earlier papers ask "what to keep in the cache?", MSA asks "what should the model attend to when computing each new token?"
Full attention is O(n²) and infeasible for 100 M tokens.
MSA combines top‑k token selection with sparse‑attention patterns.
It achieves near‑linear complexity while remaining end‑to‑end trainable.
The part most relevant to agents is Memory Interleave: agents process not a single huge document but a sequence of cross‑session items (tool outputs, retrieved docs, previous user messages).
MSA can handle multi‑hop reasoning across discontinuous context fragments, solving the memory problem inside the attention layer rather than relying on external retrieval.
Top‑k selection: each generation step focuses only on the most relevant tokens (a generic sketch follows this list).
Document‑wise positional encoding: works across non‑contiguous memory pieces.
A two‑GPU setup reportedly sustains 100M‑token throughput, a real‑system claim.
Outperforms RAG systems on long‑context benchmarks: retrieval happens effectively inside the model.
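The sketch below is not MSA's published algorithm; it is a generic PyTorch illustration of the top-k idea referenced in the list above: each query attends only to the k highest-scoring cached tokens, so cost grows with k rather than with the full sequence length.
import torch

def topk_sparse_attention(query, keys, values, k=64):
    # query: (d,), keys / values: (seq_len, d); attend only to the k best-matching tokens
    scores = keys @ query / keys.shape[-1] ** 0.5            # (seq_len,)
    top = scores.topk(min(k, scores.shape[0]))
    weights = torch.softmax(top.values, dim=-1)              # softmax over the selected tokens only
    return weights @ values[top.indices]                     # attention output, shape (d,)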
7. Comparing the Methods
There are now dozens of important papers sharing the intuition that token importance is predictable; they differ in scoring and budgeting strategies. The author has benchmarked several methods in a notebook.
8. Why This Changes Agent Memory Design
Most discussions start at the retrieval layer (vector databases, situational memory, summarisation). Those inputs, however, come from a model that has already decided what to keep in its KV cache.
If a critical token is evicted before retrieval, it never returns. No RAG process can bring it back, and no memory manager can fix it. From that point on, the model behaves as if it had never seen the token, and the resulting false‑memory propagation (FMP) begins before any external system intervenes.
Imagine a SaaS‑product customer‑support agent. After a two‑hour troubleshooting session, the initial configuration details may be evicted before the 15th tool call, causing the model to suggest actions that ignore the original constraints. The error is not a bad retrieval; the correct token vanished earlier.
9. Problems That Appear in Real Systems
Middle‑information loss: most eviction strategies favour recent or earliest tokens, causing loss of intermediate context.
Tool‑output dilution: a large tool response pushes earlier instructions out of the effective window.
RAG injection waste: you retrieve many passages but sparse attention only processes a few, wasting budget.
System‑prompt amnesia: key constraints in the system prompt are evicted mid‑generation, leading the model to invent replacements.
FMP from attention layer: a hallucinated intermediate token receives high attention, survives eviction, and contaminates later generation.
Cross‑session cold start: KV state cannot persist across sessions, forcing each session to start from scratch and rely on lossy retrieval.
10. Conclusion: Token‑Level Memory Is the Foundation, Not the Ceiling
The ultimate goal is a Transformer that can dynamically and intelligently manage its memory budget, rather than offloading everything to external systems. This yields not only faster inference but also a stronger memory base for every agent.
Head‑wise retention: different heads care about different tokens; eviction should respect this.
Stage‑wise compression: pre‑fill and decoding have different memory profiles and should be handled separately.
Recoverability: eviction need not be permanent.
KV‑state persistence: save the cache at session end and reload it for the next session, achieving cross‑episode continuity without a vector database (a minimal sketch follows).
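A minimal sketch of the persistence idea, assuming the runtime exposes the cache as a per-layer sequence of (key, value) tensors, as HuggingFace-style implementations commonly do; the function names and file path are our own.
import torch

def save_kv_state(past_key_values, path="session_cache.pt"):
    # past_key_values: sequence of (key, value) tensor pairs, one per layer
    torch.save([(k.cpu(), v.cpu()) for k, v in past_key_values], path)

def load_kv_state(path="session_cache.pt", device="cuda"):
    # Reload the cached tensors so the next session resumes with the same KV state
    return tuple((k.to(device), v.to(device)) for k, v in torch.load(path))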
Agents that can retain the right information over long horizons will outperform those that merely boast fancy retrieval pipelines. The key is whether the model itself can keep the right tokens; external memory layers can only work with what the model feeds them.