How Conditional Memory (Engram) Boosts Large Language Models Beyond MoE
DeepSeek's new paper introduces a conditional memory mechanism called Engram that complements Mixture‑of‑Experts: a deterministic O(1) lookup path that improves knowledge retrieval, reasoning, and long‑context performance while scaling efficiently on the same FLOPs budget.
Background and Problem
Large language models (LLMs) use Mixture‑of‑Experts (MoE) for conditional computation, but the standard Transformer lacks a native knowledge‑lookup primitive, forcing it to simulate retrieval inefficiently during the forward pass.
Conditional Memory and Engram Module
DeepSeek introduces conditional memory as a sparsity dimension complementary to MoE. The concrete implementation is the Engram module, which adds a deterministic O(1) lookup table for static knowledge to the Transformer.
Architecture Details
Engram modernizes classic N‑gram embeddings with three components:
Vocabulary projection: a projection layer canonicalizes token IDs (e.g., NFKC normalization, case folding), shrinking the 128k‑entry vocabulary by roughly 23%.
Multi‑head hashing: for each N‑gram order n, K independent hash heads map the intractably large N‑gram space into a fixed number of buckets, following Svenstrup et al. (2017); see the sketch after this list.
Context‑aware gating: retrieved static embeddings are modulated by the current hidden state through an attention‑style gate, mitigating hash collisions and token ambiguity.
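A minimal sketch of the multi‑head hashing step, assuming a PyTorch implementation; the class name, bucket count, head count, bigram key construction, and head averaging are illustrative assumptions rather than the paper's configuration.

import torch
import torch.nn as nn

class HashedNgramEmbedding(nn.Module):
    """Hash-embedding lookup in the spirit of Svenstrup et al. (2017) -- sketch only."""
    def __init__(self, num_buckets=1_000_000, dim=256, num_heads=4):
        super().__init__()
        self.table = nn.Embedding(num_buckets, dim)   # shared bucket table
        self.num_buckets = num_buckets
        self.num_heads = num_heads
        # One random odd multiplier per head -> K independent hash functions.
        mults = torch.randint(1, 1_000_000, (num_heads,), dtype=torch.long) * 2 + 1
        self.register_buffer("hash_mults", mults)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq) of projected (normalized) token IDs.
        # Form a suffix bigram key from each token and its predecessor.
        prev = torch.roll(token_ids, shifts=1, dims=1)
        prev[:, 0] = 0
        key = token_ids * 1_000_003 + prev            # (batch, seq) integer keys
        # Each hash head maps the same key to its own bucket; embeddings are averaged.
        out = 0
        for m in self.hash_mults:
            bucket = (key * m) % self.num_buckets
            out = out + self.table(bucket)
        return out / self.num_heads                   # (batch, seq, dim)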
For each token position, Engram performs two steps:
Retrieval: a hash‑based lookup of compressed suffix N‑grams.
Fusion: the retrieved vectors are dynamically gated against the current hidden state and passed through a lightweight convolution before being integrated with the main Transformer layers (sketched below).
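Under the same assumptions, a hedged sketch of the fusion step: a gate computed from the current hidden state scales the retrieved embedding (suppressing hash collisions), a causal depthwise convolution mixes nearby positions, and the result is added back to the residual stream. The gate form, projection shapes, and kernel size are guesses, not the paper's design.

import torch
import torch.nn as nn

class EngramFusion(nn.Module):
    def __init__(self, hidden_dim=1024, mem_dim=256, kernel_size=3):
        super().__init__()
        self.to_gate = nn.Linear(hidden_dim, mem_dim)     # gate from the hidden state
        self.proj = nn.Linear(mem_dim, hidden_dim)        # map memory into model width
        self.conv = nn.Conv1d(hidden_dim, hidden_dim, kernel_size,
                              padding=kernel_size - 1, groups=hidden_dim)  # depthwise

    def forward(self, hidden: torch.Tensor, retrieved: torch.Tensor) -> torch.Tensor:
        # hidden:    (batch, seq, hidden_dim) current activations
        # retrieved: (batch, seq, mem_dim) hash-retrieved static embeddings
        gate = torch.sigmoid(self.to_gate(hidden))        # context-aware gate
        fused = self.proj(gate * retrieved)               # gated memory read
        x = self.conv(fused.transpose(1, 2))              # (batch, hidden_dim, seq + k - 1)
        x = x[..., : hidden.size(1)]                      # trim right pad -> causal conv
        return hidden + x.transpose(1, 2)                 # residual integration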
System Efficiency
Engram decouples storage from computation. The deterministic index enables prefetching of embeddings from host memory during inference, eliminating extra GPU stalls. During training the large embedding table is sharded across GPUs with standard model‑parallel All‑to‑All communication, allowing linear scaling of memory capacity with accelerator count.
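A purely illustrative sketch (not the paper's code) of why deterministic addressing enables prefetching: the bucket indices depend only on the input token IDs, so the required embedding rows can be gathered on the host and copied to the GPU asynchronously while earlier layers run. The table layout and overlap scheme here are assumptions.

import torch

def prefetch_engram_rows(cpu_table: torch.Tensor,
                         bucket_ids: torch.Tensor,
                         device: str = "cuda") -> torch.Tensor:
    # cpu_table:  (num_buckets, dim) host-resident embedding table
    # bucket_ids: (batch, seq) indices precomputed from the token IDs
    rows = cpu_table[bucket_ids.cpu()]            # gather rows on the host
    rows = rows.pin_memory()                      # stage in pinned memory
    return rows.to(device, non_blocking=True)     # async H2D copy, overlaps compute

Because the indices are known before the forward pass starts, the copy can be issued up front and consumed only when the Engram layer executes.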
Experimental Setup
Four models were trained on the same 2.62‑trillion‑token corpus with identical token budgets and activation FLOPs:
Dense‑4B (4.1 B parameters)
MoE‑27B (26.7 B parameters)
Engram‑27B (26.7 B parameters)
Engram‑40B (39.5 B parameters)
All models used the DeepSeek‑v3 tokenizer (128 k vocab) and were evaluated on knowledge, reasoning, reading‑comprehension, code, and math benchmarks.
Results
Engram consistently outperformed the dense baseline and matched or exceeded MoE under iso‑FLOPs conditions. Notable gains include:
+3.4 MMLU
+4.0 CMMLU
+5.0 BBH
+3.7 ARC‑Challenge
+3.0 HumanEval
+2.4 MATH
Scaling Engram to 40 B further reduced pre‑training loss and improved most benchmarks, though some tasks showed diminishing returns due to the fixed token budget.
A U‑shaped trade‑off was observed between MoE capacity and Engram slots: allocating ~20‑25 % of the sparse budget to Engram yields optimal performance across model sizes. Increasing the number of Engram slots follows a strict power‑law improvement in validation loss, confirming Engram as a predictable scaling knob.
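One conventional way to write such a power-law trend, with S the number of Engram slots; the constants here are placeholders for fitted values, not numbers reported in the paper:

\mathcal{L}(S) \;\approx\; \mathcal{L}_{\infty} + a\,S^{-\alpha}, \qquad \alpha > 0

where L_infinity is the irreducible validation loss and a, alpha are fitted constants.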
Long‑context experiments showed that off‑loading local dependencies to Engram frees attention capacity, leading to higher Multi‑Query NIAH accuracy (97.0 vs 84.2) and better variable‑tracking scores.
Conclusions
Conditional memory provides a complementary sparsity axis to MoE, enabling efficient static knowledge retrieval without extra FLOPs. Engram’s deterministic addressing supports hardware‑algorithm co‑design, scalable memory capacity, and superior performance on diverse LLM tasks, positioning it as a core primitive for next‑generation sparse models.
Paper:
https://github.com/deepseek-ai/Engram/blob/main/Engram_paper.pdf
Code:
https://github.com/deepseek-ai/Engram