Why Bigger LLMs Still Forget Facts – DeepSeek’s Engram Memory Module Explained

This article analyzes DeepSeek's new Engram module, showing how a conditional memory component relieves large language models of their compute-only approach to factual knowledge: it improves knowledge retrieval, reasoning, long-context handling, and system efficiency while staying within strict parameter and FLOP budgets.

AI Insight Log

Memory limitation of current large language models

Because current large language models lack a dedicated memory component, they must re-compute answers to static factual queries on every forward pass, treating static knowledge retrieval and dynamic reasoning as a single, costly computation.

Engram architecture

Engram is a conditional memory module that works alongside Mixture‑of‑Experts (MoE) routing. It operates in three stages:

Smart compression: N-gram indices are token-normalized (NFKC, lower-casing) so that surface variants such as Apple and apple map to the same ID, shrinking the effective 128 k vocabulary by 23 %.
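The normalization step can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the function name and the dict-backed vocabulary are placeholders:

```python
import unicodedata

def normalized_ngram_id(tokens, vocab):
    """Map an N-gram to a single ID after NFKC normalization and
    lower-casing, so surface variants (case differences, full-width
    vs. ASCII forms) collapse into one memory slot."""
    canon = tuple(unicodedata.normalize("NFKC", t).lower() for t in tokens)
    return vocab.setdefault(canon, len(vocab))

vocab = {}
a = normalized_ngram_id(("Apple", "Inc."), vocab)
b = normalized_ngram_id(("apple", "inc."), vocab)
assert a == b  # both variants share one ID
```

Collapsing variants this way is what makes the 23 % vocabulary reduction possible: every merged pair of surface forms frees one slot in the index.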

Multi-head hashing: Hash embeddings map the astronomical number of possible N-grams into a fixed-size embedding table. Multiple hash functions per N-gram produce several vectors that are concatenated, reducing collision impact.
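A sketch of multi-head hash embedding follows. The table size, embedding width, head count, and the SHA-256-based hash are all illustrative assumptions; the key idea is that a collision in one head is unlikely to recur in the others:

```python
import hashlib
import numpy as np

TABLE_SIZE = 2**16   # fixed-size embedding table (assumed size)
DIM = 64             # per-head embedding width (assumed)
NUM_HEADS = 4        # independent hash functions per N-gram (assumed)

rng = np.random.default_rng(0)
table = rng.standard_normal((TABLE_SIZE, DIM)).astype(np.float32)

def hash_index(ngram: str, head: int) -> int:
    """Deterministic per-head hash into the fixed-size table."""
    digest = hashlib.sha256(f"{head}:{ngram}".encode()).digest()
    return int.from_bytes(digest[:8], "little") % TABLE_SIZE

def ngram_embedding(ngram: str) -> np.ndarray:
    """Concatenate one table lookup per hash head; two N-grams that
    collide under one head almost never collide under all heads."""
    return np.concatenate([table[hash_index(ngram, h)] for h in range(NUM_HEADS)])

vec = ngram_embedding("alexander the great")
assert vec.shape == (NUM_HEADS * DIM,)
```

Deterministic hashing is also what later enables prefetching: the indices for the next layer can be computed before the layer runs.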

Context-aware gating: The current hidden state serves as a query; retrieved memory provides key/value. A scalar gate α∈[0,1] modulates memory usage (α≈1 fully uses memory, α≈0 suppresses it), allowing adaptation to polysemy and hash conflicts.
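The gating step can be sketched as follows. The bilinear relevance score and sigmoid gate are assumptions about the functional form, and all weights here are random placeholders rather than trained parameters:

```python
import numpy as np

def gated_memory_readout(hidden, mem_key, mem_value, w_gate):
    """Scalar-gated memory injection: the hidden state queries the
    retrieved memory key, and a sigmoid gate alpha in [0, 1] scales
    how much of the memory value is added back into the stream."""
    score = hidden @ w_gate @ mem_key        # bilinear relevance score (assumed form)
    alpha = 1.0 / (1.0 + np.exp(-score))     # sigmoid keeps alpha in [0, 1]
    return hidden + alpha * mem_value, alpha

rng = np.random.default_rng(0)
d = 16
h = rng.standard_normal(d)
out, alpha = gated_memory_readout(h, rng.standard_normal(d),
                                  rng.standard_normal(d),
                                  rng.standard_normal((d, d)))
assert 0.0 <= alpha <= 1.0
```

When a hash collision or a polysemous N-gram makes the retrieved value unreliable, the model can learn to drive α toward 0 and fall back on pure computation.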

Engram Architecture Diagram

U‑shaped scaling law

Experiments varying the proportion ρ of parameters allocated to MoE experts versus Engram memory show a U‑shaped relationship. Pure MoE (ρ=100 %) lacks static knowledge storage; pure Engram (ρ=0 %) loses dynamic reasoning. The optimal allocation lies at ρ≈75‑80 %, where mixed models outperform the pure MoE baseline.

U‑Shaped Scaling Curve

Experimental results

Three models trained on the same 262 B-token dataset with identical activated parameters (3.8 B) were evaluated:

MoE‑27B: 26.7 B total parameters, 72 routing experts.

Engram‑27B: 26.7 B parameters, 55 experts + 5.7 B memory parameters.

Engram‑40B: 55 experts + 18.5 B memory parameters.

Knowledge‑intensive benchmarks improve consistently (MMLU +3.0, CMMLU +4.0, C‑Eval +4.7, TriviaQA +1.9). Reasoning and code‑math tasks show larger gains (BBH +5.0, ARC‑Challenge +3.7, DROP +3.3, HumanEval +3.0, MATH +2.4, GSM8K +2.2).

Mechanism analysis

Two interpretability tools were applied:

LogitLens reveals lower KL divergence in early layers of Engram models, indicating faster convergence toward the correct answer.

CKA similarity shows that shallow Engram layers align with much deeper MoE layers (e.g., Engram‑27B layer 5 ≈ MoE‑27B layer 12), effectively increasing representational depth without extra computation.
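For reference, linear CKA between two layers' activation matrices (samples × features) can be computed as below; this is the standard formulation from the interpretability literature, shown only to make the layer-alignment measurement concrete:

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear Centered Kernel Alignment between two activation
    matrices of shape (n_samples, n_features); 1.0 means the two
    representations are identical up to rotation and scaling."""
    X = X - X.mean(axis=0)   # center each feature
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return float(num / den)

rng = np.random.default_rng(0)
acts = rng.standard_normal((128, 32))
assert abs(linear_cka(acts, acts) - 1.0) < 1e-6  # a layer matches itself
```

A high CKA between a shallow Engram layer and a deep MoE layer is what supports the claim of "free" representational depth.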

Alignment and Convergence Analysis

Long‑context capability

By offloading static N-gram lookups, Engram frees attention heads to focus on global context. On 32 k-context evaluations, scores increase dramatically (Multi-Query NIAH 84.2→97.0, Variable Tracking 77.0→89.0, Frequent Words Extraction 73.0→99.3).

System efficiency

Deterministic addressing enables:

Prefetch & overlap: Next-layer embeddings are known in advance, allowing asynchronous fetch from host memory to GPU and overlapping communication with computation.

Multi-level caching: High-frequency embeddings reside in GPU HBM, medium-frequency in host DRAM, and low-frequency on NVMe SSD, exploiting the Zipfian distribution of N-gram accesses.

Near-zero throughput loss: Storing a 100 B-parameter Engram table entirely in host memory incurs less than 3 % throughput degradation.
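The multi-level cache can be pictured as a tiered lookup. The class below is a toy sketch of the idea, not Engram's system code; tier names mirror the hardware levels above, and contents are placeholders:

```python
class TieredEmbeddingStore:
    """Toy three-tier embedding store (HBM -> DRAM -> SSD), checked
    fastest-first. Under a Zipfian access pattern, most lookups hit
    the small hot tier and rarely fall through to slow storage."""

    def __init__(self, hbm: dict, dram: dict, ssd: dict):
        self.tiers = [("HBM", hbm), ("DRAM", dram), ("SSD", ssd)]

    def fetch(self, ngram_id):
        """Return (tier_name, embedding) from the fastest tier that
        holds the ID; deterministic IDs make prefetching possible."""
        for name, tier in self.tiers:
            if ngram_id in tier:
                return name, tier[ngram_id]
        raise KeyError(ngram_id)

store = TieredEmbeddingStore(hbm={1: "hot"}, dram={2: "warm"}, ssd={3: "cold"})
assert store.fetch(1)[0] == "HBM"
assert store.fetch(3)[0] == "SSD"
```

Because addresses are known ahead of time, fetches from the slower tiers can be issued asynchronously and overlapped with computation, which is why the reported throughput loss stays under 3 %.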

Engram System Implementation

Gating visualization

Heatmaps of the gating scalar α show strong activation for multi‑word entities (e.g., “Alexander the Great”, “the Milky Way”) and fixed phrases, confirming that Engram learns static language patterns, including Chinese idioms and historical figures.

Gating Mechanism Heatmap

Sensitivity analysis

Disabling Engram during inference reduces performance on knowledge‑heavy tasks (TriviaQA retains only 29 % of its original score) while reading‑comprehension tasks remain robust (C3 retains 93 %), demonstrating a clean division of labor between static memory and dynamic reasoning.

Technical comparison

Compared with prior approaches (OverEncoding, SCONE, SuperBPE), Engram uniquely provides:

Strict equal‑parameter and FLOP comparisons.

Co‑design with MoE for optimal sparsity allocation.

Algorithm‑system level engineering (deep insertion, prefetching).

Integration with multi‑branch architectures such as mHC.

Paradigm shift

From “pure compute” to “compute + memory”.

From a single sparsity axis to multi‑dimensional sparsity.

From algorithmic tweaks to algorithm‑system co‑design.

Code repository: https://github.com/deepseek-ai/Engram

Reference: Cheng, X. et al. (2026). Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models. arXiv preprint.

