Goodbye H100: How DeepSeek’s Engram Uses CPU Memory to Scale LLM Knowledge Bases

DeepSeek’s Engram architecture adds a deterministic dictionary lookup to the Transformer, storing massive N‑gram tables in cheap CPU DRAM. This frees GPU memory, improves both knowledge‑heavy and reasoning benchmarks, and adds less than 3% to inference latency.


DeepSeek has open‑sourced a new Transformer architecture called Engram together with the paper “Conditional Memory via Scalable Lookup”. Engram equips the model with a “dictionary” that can retrieve static knowledge instead of recomputing it with attention and feed‑forward layers.

The motivation is that large language models waste valuable GPU compute on memorising facts such as country names or titles. By moving this memorisation to a massive N‑gram table, the expensive GPU memory (HBM) can be freed for genuine reasoning.

For example, to generate the entity “Diana, Princess of Wales”, a vanilla model spends roughly six layers composing the answer step by step: (1) recognise “Wales” as a place, (2) recognise “Princess of Wales” as a title, and (3) combine them into the name of a person. Engram replaces this costly chain with a single lookup.

The Engram mechanism consists of three steps:

Identify: the model automatically detects fixed phrases such as “artificial intelligence” or “DeepSeek”.

Lookup: a deterministic hash‑based search retrieves the corresponding vector from a gigantic N‑gram table. The lookup is extremely fast and consumes negligible compute.

Fuse: the retrieved vector is fed to the Transformer through a gating module, allowing the network to focus on higher‑level reasoning.

This is analogous to giving the model a cheat‑sheet: for rote facts it reads the sheet; for novel problems it thinks.
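To make the three steps concrete, here is a minimal PyTorch sketch of the identify → lookup → fuse path. It assumes a hashed N‑gram embedding table and a sigmoid gate; the names (EngramLookup, _hash_ngrams, gate_proj) and the rolling hash are illustrative placeholders, not DeepSeek’s released implementation.

```python
import torch
import torch.nn as nn

class EngramLookup(nn.Module):
    """Toy identify → lookup → fuse module (illustrative, not the official code)."""

    def __init__(self, d_model: int, table_size: int = 1_000_000, max_n: int = 3):
        super().__init__()
        self.max_n = max_n
        # Giant N-gram table; in the real system this would sit in CPU DRAM.
        self.ngram_table = nn.Embedding(table_size, d_model)
        # Gate deciding how much of the retrieved vector to mix into the hidden state.
        self.gate_proj = nn.Linear(2 * d_model, d_model)

    def _hash_ngrams(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Deterministic rolling hash over the trailing max_n token ids at each position
        # (sequence-boundary effects ignored for brevity).
        table_size = self.ngram_table.num_embeddings
        h = torch.zeros_like(token_ids)
        for k in range(self.max_n):
            h = (h * 1_000_003 + torch.roll(token_ids, shifts=k, dims=-1)) % table_size
        return h

    def forward(self, token_ids: torch.Tensor, hidden: torch.Tensor) -> torch.Tensor:
        # 1) Identify + 2) Lookup: hash the trailing N-gram and fetch its vector.
        mem = self.ngram_table(self._hash_ngrams(token_ids))        # (batch, seq, d_model)
        # 3) Fuse: a sigmoid gate mixes the retrieved vector into the hidden state.
        gate = torch.sigmoid(self.gate_proj(torch.cat([hidden, mem], dim=-1)))
        return hidden + gate * mem
```

Because the bucket index depends only on the input token IDs, the lookup is fully deterministic, which is what makes the prefetching described below possible.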

In experiments at the 27 B‑parameter scale under an iso‑FLOPs budget, Engram was compared against a standard Mixture‑of‑Experts (MoE) baseline. The results show:

Knowledge: MMLU improves by +3.4 points, CMMLU by +4.0 points.

Reasoning: BBH (+5.0), ARC‑Challenge (+3.7), and noticeable gains on MATH and HumanEval.

Why does a simple dictionary boost reasoning? LogitLens analysis (as reported in the paper) reveals that Engram enables the model to complete feature composition in shallower layers, freeing deeper layers to handle complex logic and long‑range dependencies. For long‑context tasks, local dependencies are satisfied by the dictionary, so attention can concentrate on global context.
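This kind of probe can be reproduced with a few lines of code. The sketch below is a generic logit‑lens pass, assuming a Hugging Face‑style causal LM that returns per‑layer hidden states; the attribute paths model.model.norm and model.lm_head follow LLaMA‑like models and may differ for other architectures. It illustrates the idea only, not the paper’s exact methodology.

```python
import torch

@torch.no_grad()
def logit_lens(model, tokenizer, text: str):
    """Decode the top token at every layer to see where the prediction settles."""
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    out = model(**inputs, output_hidden_states=True)
    readout = []
    for layer, h in enumerate(out.hidden_states):
        # Project the last position's hidden state through the final norm + unembedding.
        logits = model.lm_head(model.model.norm(h[:, -1]))
        readout.append((layer, tokenizer.decode(logits.argmax(dim=-1))))
    return readout  # earlier convergence => composition finished in shallower layers
```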

Engram stores the N‑gram table in cheap CPU DRAM. Because the lookup is deterministic, the system can pre‑fetch the required entries over PCIe while the GPU processes the first few layers. Even a 100 B‑parameter dictionary adds less than 3 % overhead to inference latency.
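A rough sketch of that overlap is shown below, assuming the table is a plain CPU tensor, a pinned staging buffer handles the PCIe copy, and a side CUDA stream carries the transfer; the sizes and names are illustrative, not DeepSeek’s serving code.

```python
import torch

D = 1024                                                          # embedding width (illustrative)
table = torch.empty(1_000_000, D, dtype=torch.float16)            # N-gram table in CPU DRAM
staging = torch.empty(4096, D, dtype=torch.float16).pin_memory()  # pinned buffer for async copies
copy_stream = torch.cuda.Stream()

def prefetch(ngram_indices: torch.Tensor) -> torch.Tensor:
    """Start moving the needed rows to the GPU while the early layers run."""
    n = ngram_indices.numel()
    # The indices depend only on the input tokens (deterministic hash), so this
    # gather and the host-to-device copy can begin before the vectors are needed.
    torch.index_select(table, 0, ngram_indices, out=staging[:n])
    with torch.cuda.stream(copy_stream):
        return staging[:n].to("cuda", non_blocking=True)

# Default stream: run the first few Transformer layers here, then call
# torch.cuda.current_stream().wait_stream(copy_stream) before fusing the vectors.
```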

The key implication is that future LLMs can expand their knowledge bases at very low cost by stacking CPU memory instead of buying more H100 GPUs. The Engram codebase is publicly available on GitHub.

