Production-Grade Agent Memory: Compaction, Decay, and the Observation Engine

The article presents a comprehensive architecture for production‑grade autonomous agents, detailing failure modes, four distinct memory types, a nightly observation engine that turns patterns into procedural rules, tier‑aware decay scoring, context budgeting, GDPR‑compliant deletion, and a step‑by‑step maintenance pipeline.

AI Engineer Programming

Why Simple Storage Is Not Enough

Storing and retrieving vectors works for RAG but fails for agents that run for weeks or months and continuously accumulate real user decisions.

Three Failure Modes

Too much: injecting everything slows the model, raises cost, and reduces accuracy.

Too little: injecting nothing makes the agent repeat mistakes and ignore learned preferences.

Wrong: injecting outdated, rejected, or irrelevant memories causes the agent to act confidently on false information.

Four Memory Types

Working Memory

Live context for the current task, kept in‑process only and discarded after the task ends. Stores task content, intermediate reasoning steps, partial tool results, and approval status. Hard limit: 4,000 tokens. If exceeded, a low‑cost model compresses the steps into a summary instead of truncating.
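The overflow rule for working memory could be sketched as follows. This is a minimal illustration, not the article's implementation: `fitWorkingMemory`, the `summarise` callback (standing in for the low-cost model), the four-characters-per-token estimate, and the choice to keep the latest step verbatim are all assumptions.

```typescript
// Hypothetical sketch: every name here is illustrative, not from the article.
type Summarise = (steps: string[]) => Promise<string>;

const WORKING_MEMORY_LIMIT = 4_000; // hard limit from the text, in tokens

// Crude estimate (about 4 characters per token); a real system would use
// the target model's tokenizer instead.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

async function fitWorkingMemory(
  steps: string[],
  summarise: Summarise,
  limit: number = WORKING_MEMORY_LIMIT,
): Promise<string[]> {
  const total = steps.reduce((sum, step) => sum + estimateTokens(step), 0);
  if (total <= limit) return steps; // under budget: keep every step verbatim

  // Over budget: compress everything except the latest step into a summary
  // instead of truncating. Keeping the latest step is an assumption.
  const summary = await summarise(steps.slice(0, -1));
  return [`Summary of earlier steps: ${summary}`, ...steps.slice(-1)];
}
```

The point of the design is that nothing is dropped silently: older steps lose detail, but the agent still sees that they happened.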

Episodic Memory

Timestamped logs of raw events (e.g., "Fri 15:04 – agent drafted a reply, user edited before sending"). Stored in an episodes table (append‑only) with a vector embedding for semantic search. Retrieval is hybrid: BM25 exact matches for names, amounts, dates plus cosine similarity on embeddings, then re‑scored with a decay function.

Semantic Memory

Stable facts about users, contacts, companies, and agent configuration that change slowly (tone preferences, routing rules, etc.). Fetched with direct SQL queries and injected as a structured snippet into the system prompt; no vector search is used.

Procedural Memory

Learned workflow rules derived from repeated user corrections, expressed in natural language (e.g., "Never use semicolons – the user always deletes them."). Stored in a procedures table with fields for rule text, source observation ID, promotion status, and timestamps. All active rules are injected wholesale at the start of the prompt.

The Observation Engine

Agents write raw signals (e.g., "user edited draft", "user rejected approval", "agent threw exception", "user corrected X to Y") into a queue. Nightly, the engine reads the last 30 days of episodes, prompts an LLM to extract genuine repeated patterns, and enforces strict constraints:

You are the observation engine. Analyse user behavioral data and identify genuine, repeated patterns. DO NOT invent. DO NOT generalise from a single event.
An observation is only valid if it has at least 3 consistent occurrences.

Required format for each observation:
<observation>
{
  "category": "writing-style | rhythm | people | tools | decisions",
  "agent": "email | accounting | crm | relay | files | system",
  "quote": "direct statement, max 25 words",
  "evidence": "concise phrase with supporting numbers, max 40 words",
  "occurrences": 4,
  "confidence": "low | medium | high | very-high",
  "promotion_candidate": true | false
}
</observation>

Rules:
- confidence = high only if occurrences >= 5 AND pattern consistency > 80%
- promotion_candidate = true only if the observation implies a clear action rule
- Maximum 3 new observations per run

Each new observation is deduplicated against existing ones by embedding similarity: a pair counts as a near‑duplicate when their cosine distance is below 0.15. If a near‑duplicate exists with a lower occurrence count, the existing row is updated instead of a new one being inserted.
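A minimal sketch of that deduplication decision, assuming embeddings are plain number arrays of equal length. `decideDedup` and its return shape are hypothetical, and the `skip` branch (a near-duplicate that already has an equal or higher count) is an assumption, since the text only describes the update case.

```typescript
// Hypothetical dedup helper; names and the 'skip' behaviour are assumptions.
interface StoredObservation {
  id: string;
  embedding: number[];
  occurrences: number;
}

type DedupDecision =
  | { action: 'insert' }
  | { action: 'update'; targetId: string }
  | { action: 'skip'; targetId: string };

function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('embedding dimensions differ');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 1; // treat zero vectors as unrelated
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

function decideDedup(
  candidate: { embedding: number[]; occurrences: number },
  existing: StoredObservation[],
  maxDistance = 0.15, // threshold from the text
): DedupDecision {
  let closest: StoredObservation | undefined;
  let closestDistance = maxDistance;
  for (const row of existing) {
    const distance = cosineDistance(candidate.embedding, row.embedding);
    if (distance < closestDistance) {
      closest = row;
      closestDistance = distance;
    }
  }
  if (!closest) return { action: 'insert' };
  return closest.occurrences < candidate.occurrences
    ? { action: 'update', targetId: closest.id }
    : { action: 'skip', targetId: closest.id };
}
```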

Confidence Thresholds

low      → 3‑4 occurrences, consistency < 70%
medium   → 4‑5 occurrences, consistency 70‑80%
high     → 5+ occurrences, consistency > 80%
very-high → 8+ occurrences, consistency > 90%, no contradictions
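The bands above overlap (4 occurrences appears under both low and medium) and leave gaps (for example, 6 occurrences at 75% consistency). One way to make them deterministic is to test the strictest band first and fall through. This is a sketch under that assumption. `classifyConfidence` is a hypothetical helper, and consistency is assumed to be a fraction between 0 and 1.

```typescript
// Hypothetical classifier: strictest band first, fall through to the next.
type Confidence = 'low' | 'medium' | 'high' | 'very-high';

function classifyConfidence(
  occurrences: number,
  consistency: number, // fraction in [0, 1], e.g. 0.85 for 85%
  hasContradictions = false,
): Confidence | null {
  if (occurrences < 3) return null; // not a valid observation at all
  if (occurrences >= 8 && consistency > 0.9 && !hasContradictions) return 'very-high';
  if (occurrences >= 5 && consistency > 0.8) return 'high';
  if (occurrences >= 4 && consistency >= 0.7) return 'medium';
  return 'low';
}
```

With this ordering, an observation that misses one condition of a band simply lands in the next lower one, which matches the article's preference for erring on the strict side.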

Promotion to Procedural Rules

An observation is promoted only when all of the following hold:

Confidence is high or very‑high.

promotion_candidate = true.

The user has not marked the observation as "wrong".

The observation has not been rejected within 48 hours of its creation.

Promoted rules are injected on every agent call; false positives are considered more harmful than false negatives, so the thresholds are deliberately strict.
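The promotion conditions can be read as a single predicate. The sketch below assumes the 48-hour condition means a review window must elapse without a rejection before promotion; that reading, the field names, and `isPromotable` itself are assumptions rather than the article's code.

```typescript
// Hypothetical promotion gate; field names are illustrative.
interface PromotionInput {
  confidence: 'low' | 'medium' | 'high' | 'very-high';
  promotionCandidate: boolean;
  markedWrong: boolean;      // user pressed "You are wrong"
  rejectedAt: number | null; // epoch ms, null if never rejected
  createdAt: number;         // epoch ms
}

const REVIEW_WINDOW_MS = 48 * 3_600_000; // 48 hours

function isPromotable(obs: PromotionInput, now: number = Date.now()): boolean {
  const confident = obs.confidence === 'high' || obs.confidence === 'very-high';
  const reviewWindowElapsed = now - obs.createdAt >= REVIEW_WINDOW_MS;
  return (
    confident &&
    obs.promotionCandidate &&
    !obs.markedWrong &&
    obs.rejectedAt === null &&
    reviewWindowElapsed
  );
}
```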

Feedback Loop

Each observation is presented to the user with "You are right" / "You are wrong" buttons. "You are right" boosts confidence and may trigger promotion; "You are wrong" marks the observation as rejected, deletes its embedding, demotes any derived procedural rule, and writes a reverse signal so the nightly detector skips that pattern in future runs.

The No‑Delete Principle

When an episodic memory's decay score falls below a threshold, it is not deleted. Decay measures recency of access, not behavioral importance, so deleting a low‑score episode could erase a critical pattern that becomes relevant later. Instead, the episode is compressed into a summary record while the raw row remains for audit.

Hard deletion occurs only via:

User’s GDPR deletion request.

User marks an observation as wrong (and its source episode as rejected).

Administrative action.

The Compaction Pipeline

Memory is compressed through three tiers:

Tier 0 – Raw episodes: Full detail, less than 30 days old, used for recent retrieval and audit.

Tier 1 – Weekly compaction: After 30 days, group episodes by agent × contact × week (5‑15 rows) and summarise them into a single record. Storage ratio ≈ 8:1.

Tier 2 – Monthly compaction: After 90 days, group Tier 1 summaries by month and produce a monthly overview. Storage ratio ≈ 32:1. Tier 2 records become permanent behavior abstracts.
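The Tier 1 grouping step might look like the sketch below. It assumes each episode carries a `contact` field (the simplified schema does not show one) and buckets weeks as whole 7-day spans since the Unix epoch rather than calendar weeks. `groupForTier1` is a hypothetical name, and the summarisation call itself is left out.

```typescript
// Hypothetical grouping for weekly compaction; `contact` is an assumed field.
interface Episode {
  id: string;
  agent: string;
  contact: string;
  createdAt: number; // epoch ms
  summary: string;
}

const WEEK_MS = 7 * 86_400_000;

function groupForTier1(episodes: Episode[], cutoff: number): Map<string, Episode[]> {
  const groups = new Map<string, Episode[]>();
  for (const episode of episodes) {
    if (episode.createdAt >= cutoff) continue; // younger than 30 days: stays raw
    // Week index counts 7-day spans since the epoch (an assumption; a real
    // system might prefer ISO calendar weeks in the user's time zone).
    const weekIndex = Math.floor(episode.createdAt / WEEK_MS);
    const key = `${episode.agent}|${episode.contact}|${weekIndex}`;
    const bucket = groups.get(key);
    if (bucket) bucket.push(episode);
    else groups.set(key, [episode]);
  }
  return groups; // each bucket would then be summarised into one Tier 1 record
}
```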

Schema (simplified):

export const episodes = sqliteTable('episodes', {
  id: text('id').primaryKey().$defaultFn(() => crypto.randomUUID()),
  agent: text('agent').notNull(),
  eventType: text('event_type').notNull(),
  summary: text('summary'),
  outcome: text('outcome'),
  entities: text('entities'),
  compactSummary: text('compact_summary'),
  compactionTier: integer('compaction_tier').notNull().default(0),
  compactGroupId: text('compact_group_id'),
  status: text('status').notNull().default('raw'),
  lastAccessedAt: integer('last_accessed_at'),
  createdAt: integer('created_at').notNull().$defaultFn(() => Date.now()),
});

The virtual episodes_vec table holds embeddings only for rows with status='raw' or 'active', ensuring that retrieval automatically skips superseded rows.

Tier‑Aware Decay Scoring

Retrieval score = cosine_similarity(query, episode) × recency_weight × importance_weight.

recency_weight(t, tier) = e^(−λ_tier × t), where t = days since last access
λ values:
  tier‑0 (raw)      → 0.04  (≈17‑day half‑life)
  tier‑1 (weekly)  → 0.015 (≈46‑day half‑life)
  tier‑2 (monthly) → 0.005 (≈139‑day half‑life)
  procedural rules → 0 (no decay)
  observations      → 0.02 (≈35‑day half‑life)

importance_weight:
  tier‑0 → 1.0
  tier‑1 → 1.2 (repeated patterns get extra weight)
  tier‑2 → 1.1

Because tier‑1 summaries receive a higher importance weight than raw episodes, a well‑confirmed weekly pattern outranks a single recent event when semantic similarity is comparable.
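The scoring formula translates almost directly into code. In the sketch below the λ and importance values come from the tables above, while the function names are hypothetical. The half-life helper uses the standard relation half-life = ln 2 / λ, which is how the quoted half-lives follow from the λ values.

```typescript
// Sketch of tier-aware scoring; constants are from the text, names are not.
type Tier = 0 | 1 | 2;

const LAMBDA: Record<Tier, number> = { 0: 0.04, 1: 0.015, 2: 0.005 };
const IMPORTANCE: Record<Tier, number> = { 0: 1.0, 1: 1.2, 2: 1.1 };

// Days until the recency weight halves: ln(2) / lambda.
function halfLifeDays(lambda: number): number {
  return Math.LN2 / lambda;
}

function recencyWeight(daysSinceLastAccess: number, tier: Tier): number {
  return Math.exp(-LAMBDA[tier] * daysSinceLastAccess);
}

function retrievalScore(
  cosineSimilarity: number,
  daysSinceLastAccess: number,
  tier: Tier,
): number {
  return cosineSimilarity * recencyWeight(daysSinceLastAccess, tier) * IMPORTANCE[tier];
}
```

For two memories with the same similarity and the same time since last access, the weekly summary scores higher on both factors: it decays more slowly and carries the 1.2 importance weight.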

Context Budget

For a 32 000‑token window the system allocates tokens roughly as follows:

System prompt base: 800 tokens.

Semantic facts (SQL‑fetched): 600 tokens.

Active procedural rules: 400 tokens (usually 3‑8 rules).

Retrieved episodic memories (top‑5): 1 200 tokens.

Retrieved observations (top‑3): 600 tokens.

Current task / working memory: 4 000 tokens.

Tool‑call history: 2 000 tokens.

Response buffer: 2 000 tokens.

Total ≈ 11 600 tokens, leaving headroom for larger tasks. Semantic facts and procedural rules are the cheapest and most reliable memory; a few hundred tokens of verified rules outweigh thousands of tokens of raw episodic data.
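The arithmetic behind that total, plus a simple cap check, can be sketched as follows. The figures are the ones listed above. The key names and the `findOverBudget` helper are hypothetical.

```typescript
// Budget figures from the text; key names and helpers are illustrative.
const CONTEXT_WINDOW = 32_000;

const budget = {
  systemPromptBase: 800,
  semanticFacts: 600,
  proceduralRules: 400,
  episodicMemories: 1_200,
  observations: 600,
  workingMemory: 4_000,
  toolCallHistory: 2_000,
  responseBuffer: 2_000,
} as const;

function totalTokens(allocation: Readonly<Record<string, number>>): number {
  return Object.keys(allocation).reduce((sum, key) => sum + allocation[key], 0);
}

// Returns the sections whose actual usage exceeds their cap, so the caller
// can summarise them instead of truncating.
function findOverBudget(
  usage: Readonly<Record<string, number>>,
  caps: Readonly<Record<string, number>>,
): string[] {
  return Object.keys(usage).filter((key) => key in caps && usage[key] > caps[key]);
}

const allocated = totalTokens(budget);       // 11,600 tokens
const headroom = CONTEXT_WINDOW - allocated; // 20,400 tokens left
```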

Overflow Handling

Keep the last three rounds of email exchange.

Summarise earlier conversation with a low‑cost model into three sentences.

Attach the full transcript as a quoted block that the agent can fetch via a tool if needed.

Never silently truncate; truncation hides information from the agent.
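The first two overflow steps amount to splitting the conversation into a part that is kept verbatim and a part that is handed to the summariser. A minimal sketch, with `planOverflow` as a hypothetical name and the summarisation and transcript tool left out:

```typescript
// Hypothetical split of a conversation into "summarise" and "keep" parts.
interface OverflowPlan<T> {
  toSummarise: T[];  // earlier rounds, to be condensed by a low-cost model
  keptVerbatim: T[]; // most recent rounds, injected unchanged
}

function planOverflow<T>(rounds: T[], keepLast = 3): OverflowPlan<T> {
  const cut = Math.max(0, rounds.length - keepLast);
  return {
    toSummarise: rounds.slice(0, cut),
    keptVerbatim: rounds.slice(cut),
  };
}
```

Every round ends up in exactly one of the two lists, so nothing is lost without leaving a trace in the summary.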

Injection in Practice

Example for a mail‑handling agent:

User: Maria Rossi, Nico Rossi Ltd, fashion sector, formal‑concise tone, Italian
Contact: Marco Bertelli (Bertelli & Co, client): formal tone, no exclamation marks, reliable payments, primary contact for autumn/winter orders
Rules: never send without approval · emails containing 'urgent': high priority

Procedural rules derived from observations:

- Never use semicolons — the user always removes them
- With technical clients: direct, no opening pleasantries, get to the point in the first line
- Quotes above €10,000: don't draft, the user always rewrites them
- Friday after 14:30: defer to Monday, don't draft a response

Episodic snippets (top‑5) and observations are also injected, totalling about 1 800 tokens plus the current email (~400 tokens).

Nightly Maintenance Job

async function nightlyMemoryMaintenance() {
  const now = Date.now();
  const day30ago = now - 30 * 86_400_000;
  const day90ago = now - 90 * 86_400_000;

  // Tier‑1 weekly compaction (30‑day threshold)
  await runTier1Compaction(day30ago);

  // Tier‑2 monthly compaction (90‑day threshold)
  await runTier2Compaction(day90ago);

  // Merge near‑duplicate observations
  await consolidateObservations();

  // Pattern detection on recent raw + tier‑1 data (max 3 new observations)
  await runPatternDetection();

  // Promote qualified observations to procedural rules
  await checkPromotionCandidates();

  // Process GDPR deletion requests across all layers
  await processDeletionRequests();

  // No explicit decay‑refresh step – decay is computed at query time.
}

The order matters: compaction runs before pattern detection so the detector works on already‑summarised timelines, keeping prompts short and cheap.

Selecting an Embedding Model for Behavioral Memory

For most RAG workloads text-embedding-3-small is adequate, but behavioral memory needs robust negation handling. Static mean‑pooled embeddings treat "never" as low‑impact, causing "never do X" and "do X" to appear identical. Contextual encoders (E5, XLM‑RoBERTa) retain the effect of negation because attention encodes token relationships.

E5 requires a prefix on stored passages (passage:) and a different prefix on queries (query:); omitting them measurably degrades retrieval quality.

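// Assumes Transformers.js; the package and model name are illustrative, not from the original.
import { pipeline } from '@huggingface/transformers';
const extractor = await pipeline('feature-extraction', 'Xenova/multilingual-e5-small');

// Storing a document
const docEmbedding = await extractor(`passage: ${text}`, { pooling: 'mean', normalize: true });

// Querying
const queryEmbedding = await extractor(`query: ${text}`, { pooling: 'mean', normalize: true });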

GDPR and EU AI Act Compliance

Behavioral memory contains personal data (who wrote what, when, and how decisions were made). GDPR applies to it directly, and depending on the use case the EU AI Act may classify such a system as high‑risk, so it must provide traceability and deletion capabilities.

Namespace each user's memories from day one so that deletion is a single indexed statement per table, such as DELETE FROM episodes WHERE user_id = ?, rather than a search through mixed data.

Store only behavioural signals (e.g., "user edited draft, removed semicolon"); raw content stays in working memory and is discarded after the session.

Deletion must cascade through raw episodes, tier‑1/2 summaries, observations, procedural rules, both vector tables (episodes_vec, observations_vec), and the signal queue.

Exported memory files must be encrypted with AES‑256‑GCM; the key is derived from a user‑provided passphrase via scrypt and never written to disk.
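The export-encryption requirement could be met with Node's built-in `node:crypto` module, as in the sketch below. The byte layout (salt, then IV, then auth tag, then ciphertext), the salt and IV sizes, and the function names are assumptions. scrypt runs with Node's default cost parameters here, which a production system would want to review.

```typescript
// Hypothetical export encryption using node:crypto; layout is an assumption.
import { createCipheriv, createDecipheriv, randomBytes, scryptSync } from 'node:crypto';

const SALT_BYTES = 16;
const IV_BYTES = 12; // recommended nonce size for GCM
const TAG_BYTES = 16;

function encryptExport(plaintext: string, passphrase: string): Buffer {
  const salt = randomBytes(SALT_BYTES);
  const iv = randomBytes(IV_BYTES);
  // The key exists only in memory; only the salt is stored with the file.
  const key = scryptSync(passphrase, salt, 32);
  const cipher = createCipheriv('aes-256-gcm', key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, 'utf8'), cipher.final()]);
  const tag = cipher.getAuthTag();
  return Buffer.concat([salt, iv, tag, ciphertext]);
}

function decryptExport(blob: Buffer, passphrase: string): string {
  const salt = blob.subarray(0, SALT_BYTES);
  const iv = blob.subarray(SALT_BYTES, SALT_BYTES + IV_BYTES);
  const tag = blob.subarray(SALT_BYTES + IV_BYTES, SALT_BYTES + IV_BYTES + TAG_BYTES);
  const ciphertext = blob.subarray(SALT_BYTES + IV_BYTES + TAG_BYTES);
  const key = scryptSync(passphrase, salt, 32);
  const decipher = createDecipheriv('aes-256-gcm', key, iv);
  decipher.setAuthTag(tag);
  // final() throws if the passphrase is wrong or the data was tampered with.
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString('utf8');
}
```

Because GCM is authenticated, a wrong passphrase or a modified file fails loudly at decryption instead of yielding corrupted memory data.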

Production Checklist

Decay scores never trigger hard deletion; they only trigger compression.

Nightly compaction: tier‑1 at 30 days, tier‑2 at 90 days; source rows are retained (embeddings removed).

Feedback loop: "wrong" feedback cascades to observation rejection, embedding removal, rule demotion, and writes a reverse signal.

Reverse signal prevents re‑insertion of a rejected observation during the next pattern‑detection run.

Hybrid retrieval combines BM25 keyword matches with cosine similarity for robust recall.

Tier‑aware decay uses λ = 0.04 (raw), 0.015 (weekly), 0.005 (monthly), and 0 for procedural rules.

Context budget respects token caps for each memory type; overflow is handled by summarisation, not truncation.

Procedural rules are always injected directly (no vector search) and placed at the front of the system prompt.

Semantic facts are fetched via straight SQL, never via vector search.

Embedding models handle negation (use contextual encoders, not static mean‑pooling).

All stored documents use the passage: prefix; queries use query:.

Embedding calls run in background workers, never blocking the main event loop.

GDPR deletion covers raw episodes, compacted records, embeddings, observations, procedural rules, and signal queue entries.

Only behavioural signals are persisted; raw content lives solely in working memory.
