Why Bigger LLMs Still Forget Facts – DeepSeek’s Engram Memory Module Explained
This article analyzes DeepSeek's new Engram module, showing how a conditional memory component relieves the compute‑only design of large language models and improves knowledge retrieval, reasoning, long‑context handling, and system efficiency, all while staying within strict parameter and FLOP budgets.
Memory limitation of current large language models
Current large language models lack a dedicated memory component, so even static factual queries must be answered through computation: static knowledge retrieval and dynamic reasoning are fused into a single costly forward pass.
Engram architecture
Engram is a conditional memory module that works alongside Mixture‑of‑Experts (MoE) routing. It operates in three stages:
Smart compression: N‑gram indices are token‑normalized (NFKC, lower‑casing) so that semantically identical tokens such as "Apple" and "apple" map to the same ID, reducing a 128 k vocabulary by 23 %.
Multi‑head hashing: Hash embeddings map the astronomical number of possible N‑grams into a fixed‑size embedding table. Multiple hash functions per N‑gram produce several vectors that are concatenated, reducing the impact of collisions.
Context‑aware gating: The current hidden state serves as a query; the retrieved memory provides key/value. A scalar gate α∈[0,1] modulates memory usage (α≈1 fully uses memory, α≈0 suppresses it), allowing adaptation to polysemy and hash conflicts. A code sketch of all three stages follows this list.
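To make the three stages concrete, here is a minimal PyTorch‑style sketch of the lookup path. This is an illustration under assumptions, not DeepSeek's implementation: the table size, the salted‑hash scheme, and the linear gate are all invented for clarity.

```python
import unicodedata
import torch
import torch.nn as nn

class EngramSketch(nn.Module):
    """Illustrative sketch of Engram's three stages; not the official code."""

    def __init__(self, d_model=1024, table_size=2**20, num_hashes=4, ngram=2):
        super().__init__()
        self.num_hashes = num_hashes
        self.table_size = table_size
        self.ngram = ngram
        # One table addressed by several hash functions; the per-hash
        # vectors are concatenated, so each slot is d_model // num_hashes wide.
        self.table = nn.Embedding(table_size, d_model // num_hashes)
        self.gate = nn.Linear(d_model, 1)   # hidden state -> scalar gate
        self.proj = nn.Linear(d_model, d_model)

    @staticmethod
    def normalize(token: str) -> str:
        # Stage 1, "smart compression": NFKC + lower-casing folds variants
        # such as "Apple" and "apple" onto the same ID.
        return unicodedata.normalize("NFKC", token).lower()

    def hash_ids(self, key: str):
        # Stage 2, multi-head hashing: several salted hashes per N-gram
        # map the huge N-gram space into a fixed-size table.
        return [hash((seed, key)) % self.table_size
                for seed in range(self.num_hashes)]

    def forward(self, tokens, hidden):
        # tokens: list[str] of length L; hidden: (L, d_model) hidden states.
        rows = []
        for i in range(len(tokens)):
            lo = max(0, i - self.ngram + 1)
            key = " ".join(self.normalize(t) for t in tokens[lo:i + 1])
            ids = torch.tensor(self.hash_ids(key))
            rows.append(self.table(ids).flatten())   # concat the hash heads
        memory = torch.stack(rows)                   # (L, d_model)
        # Stage 3, context-aware gating: alpha near 1 trusts the memory,
        # alpha near 0 suppresses it (e.g. under polysemy or hash collisions).
        alpha = torch.sigmoid(self.gate(hidden))     # (L, 1)
        return hidden + alpha * self.proj(memory)
```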
U‑shaped scaling law
Experiments varying the proportion ρ of parameters allocated to MoE experts versus Engram memory show a U‑shaped relationship. Pure MoE (ρ=100 %) lacks static knowledge storage; pure Engram (ρ=0 %) loses dynamic reasoning. The optimal allocation lies at ρ≈75‑80 %, where mixed models outperform the pure MoE baseline.
Experimental results
Three models trained on the same 262 B‑token dataset with identical activation parameters (3.8 B) were evaluated:
MoE‑27B: 26.7 B total parameters, 72 routing experts.
Engram‑27B: 26.7 B parameters, 55 experts + 5.7 B memory parameters.
Engram‑40B: 55 experts + 18.5 B memory parameters.
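A quick back‑of‑envelope check links these configurations to the scaling law above (assuming ρ is measured over the total sparse‑parameter budget): Engram‑27B allocates roughly (26.7 − 5.7)/26.7 ≈ 79 % of its parameters outside the memory table, which lands inside the ρ≈75‑80 % optimum.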
Knowledge‑intensive benchmarks improve consistently (MMLU +3.0, CMMLU +4.0, C‑Eval +4.7, TriviaQA +1.9). Reasoning and code‑math tasks show larger gains (BBH +5.0, ARC‑Challenge +3.7, DROP +3.3, HumanEval +3.0, MATH +2.4, GSM8K +2.2).
Mechanism analysis
Two interpretability tools were applied:
LogitLens shows that intermediate‑layer predictions of Engram models have lower KL divergence from the final output distribution at early layers, indicating faster convergence toward the correct answer (a sketch of this measurement follows the list).
CKA similarity shows that shallow Engram layers align with much deeper MoE layers (e.g., Engram‑27B layer 5 ≈ MoE‑27B layer 12), effectively increasing representational depth without extra computation.
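For readers unfamiliar with the technique, here is a generic LogitLens sketch: each layer's hidden state is decoded through the model's own final norm and unembedding, and its distribution is compared to the final layer's. This assumes access to per‑layer hidden states and those two modules; it is not tied to DeepSeek's code.

```python
import torch
import torch.nn.functional as F

def logit_lens_kl(hidden_per_layer, final_norm, unembed):
    """Per-layer KL divergence from the final output distribution.
    Lower KL at early layers = the model 'knows the answer' sooner."""
    with torch.no_grad():
        final_logp = F.log_softmax(unembed(final_norm(hidden_per_layer[-1])), dim=-1)
        kls = []
        for h in hidden_per_layer:
            layer_logp = F.log_softmax(unembed(final_norm(h)), dim=-1)
            # KL(final || layer): how far this layer is from the final answer.
            kls.append(F.kl_div(layer_logp, final_logp,
                                log_target=True, reduction="batchmean").item())
    return kls
```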
Long‑context capability
By offloading static N‑gram lookups, Engram frees attention heads for global context. On 32 k‑token context evaluations, scores increase dramatically (Multi‑Query NIAH 84.2→97.0, Variable Tracking 77.0→89.0, Frequent Words Extraction 73.0→99.3).
System efficiency
Deterministic addressing enables:
Prefetch & overlap : Next‑layer embeddings are known in advance, allowing asynchronous fetch from host memory to GPU and overlapping communication with computation.
Multi‑level caching : High‑frequency embeddings reside in GPU HBM, medium‑frequency in host DRAM, and low‑frequency on NVMe SSD, exploiting the Zipfian distribution of N‑gram accesses.
Near‑zero throughput loss : Storing a 100 B‑parameter Engram table entirely in host memory incurs less than 3 % throughput degradation.
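The prefetch‑and‑cache idea can be sketched in a few lines. The following is a simplified illustration assuming a host‑resident PyTorch table and a CUDA side stream; the class, tier sizes, and eviction policy are invented, and the NVMe tier is omitted.

```python
import torch

class TieredEngramCache:
    """Sketch of prefetch + multi-level caching for a deterministic-address
    memory table (illustrative; not DeepSeek's implementation)."""

    def __init__(self, table_cpu: torch.Tensor, hot_capacity: int = 100_000):
        # Host-DRAM tier; pinned memory enables async host->GPU copies.
        # (A real system would add an NVMe tier below this one.)
        self.table_cpu = table_cpu.pin_memory()
        self.hot = {}                      # GPU-HBM tier: id -> row vector
        self.hot_capacity = hot_capacity
        self.stream = torch.cuda.Stream()  # side stream for overlapped copies

    def prefetch(self, ids):
        # N-gram IDs depend only on the raw tokens, so the next layer's IDs
        # are known in advance; copies run on the side stream and overlap
        # with the current layer's compute on the default stream.
        with torch.cuda.stream(self.stream):
            for i in ids:
                if i not in self.hot and len(self.hot) < self.hot_capacity:
                    self.hot[i] = self.table_cpu[i].to("cuda", non_blocking=True)

    def gather(self, ids) -> torch.Tensor:
        # Make the compute stream wait for outstanding prefetch copies.
        torch.cuda.current_stream().wait_stream(self.stream)
        rows = [self.hot[i] if i in self.hot       # HBM hit (Zipf head)
                else self.table_cpu[i].to("cuda")  # DRAM fallback
                for i in ids]
        return torch.stack(rows)
```

Because accesses follow a Zipfian distribution, a small HBM tier absorbs most lookups, which is consistent with the reported sub‑3 % throughput loss for a fully host‑resident 100 B‑parameter table.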
Gating visualization
Heatmaps of the gating scalar α show strong activation for multi‑word entities (e.g., “Alexander the Great”, “the Milky Way”) and fixed phrases, confirming that Engram learns static language patterns, including Chinese idioms and historical figures.
Sensitivity analysis
Disabling Engram during inference reduces performance on knowledge‑heavy tasks (TriviaQA retains only 29 % of its original score) while reading‑comprehension tasks remain robust (C3 retains 93 %), demonstrating a clean division of labor between static memory and dynamic reasoning.
Technical comparison
Compared with prior approaches (OverEncoding, SCONE, SuperBPE), Engram uniquely provides:
Strict equal‑parameter and FLOP comparisons.
Co‑design with MoE for optimal sparsity allocation.
Algorithm‑system level engineering (deep insertion, prefetching).
Integration with multi‑branch architectures such as mHC.
Paradigm shift
From “pure compute” to “compute + memory”.
From a single sparsity axis to multi‑dimensional sparsity.
From algorithmic tweaks to algorithm‑system co‑design.
Code repository: https://github.com/deepseek-ai/Engram
Reference: Cheng, X. et al. (2026). Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models. arXiv preprint.