How Engram Lets Large Models Swap GPU Memory for Cheap RAM to ‘Look Up’ Knowledge

The article dissects DeepSeek’s new Engram architecture, which separates computation from memory: a large lookup table held in cheap host RAM stores factual knowledge, so the Transformer’s compute layers can focus on reasoning. The design sharply reduces GPU memory demand while improving performance on code, math, and long-context tasks.


DeepSeek, in collaboration with Peking University, released the paper “Conditional Memory via Scalable Lookup”, proposing a new architectural direction for large language models. Instead of increasing parameter counts or context windows, the authors revive the classic N‑gram idea and attach a “dictionary” module to the Transformer, aiming to reduce reliance on expensive GPU memory.

Core Pain Point – Computation Is Not Memory

Current AI models, whether dense or Mixture‑of‑Experts, perform every task through heavy matrix multiplications. Even for a purely factual query, such as recalling a line of poetry, the model computes the answer rather than retrieving it directly, wasting compute on memorized facts and leaving less capacity for genuine logical reasoning.

The paper argues that language models should operate in two modes:

Dynamic computation for logic, code, and complex reasoning.

Static lookup for facts, fixed expressions, and rote knowledge.

Present‑day Transformers only exhibit the first capability.
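
In code terms, the split the authors want looks something like the toy sketch below; the FACTS table and reason_about stub are purely hypothetical stand-ins for illustration, not anything from the paper:

```python
# Toy illustration of the two modes; FACTS and reason_about are hypothetical.
FACTS = {("capital", "of", "France"): "Paris"}  # static: rote knowledge

def reason_about(tokens: tuple[str, ...]) -> str:
    # Stand-in for a full forward pass (logic, code, multi-step reasoning).
    return f"<computed answer for: {' '.join(tokens)}>"

def answer(tokens: tuple[str, ...]) -> str:
    if tokens in FACTS:          # static mode: retrieve, don't compute
        return FACTS[tokens]
    return reason_about(tokens)  # dynamic mode: run the network

print(answer(("capital", "of", "France")))     # looked up
print(answer(("sum", "of", "primes", "<10")))  # computed
```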

Solution – Engram Architecture

Engram introduces a three‑step mechanism:

Rapid Lookup: When the model encounters a token sequence (e.g., “Alexander the Great”), it maps the phrase to a massive vector table and retrieves the corresponding embedding without deep network processing. The lookup speed is independent of table size.
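
A minimal sketch of what such a lookup could look like, assuming a hashed n‑gram key into a flat embedding table (table size, hash scheme, and embedding width here are illustrative, not taken from the paper):

```python
import numpy as np

NUM_SLOTS = 100_000  # illustrative table size; the real table is far larger
EMBED_DIM = 256      # illustrative embedding width

rng = np.random.default_rng(0)
table = rng.standard_normal((NUM_SLOTS, EMBED_DIM), dtype=np.float32)

def ngram_slot(token_ids: tuple[int, ...]) -> int:
    # Deterministic hash of the n-gram; cost is O(1) regardless of table size.
    return hash(token_ids) % NUM_SLOTS

def lookup(token_ids: tuple[int, ...]) -> np.ndarray:
    # A single indexed read stands in for deep network processing.
    return table[ngram_slot(token_ids)]

vec = lookup((4821, 907, 15))  # e.g., the token ids for "Alexander the Great"
```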

Context Gating: Retrieved vectors are filtered by a gate. If the current task needs the knowledge (e.g., idiom completion), the gate opens and the vector is fused; if the task requires logical inference, the gate closes and the retrieved content is ignored.
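
A plausible minimal form of such a gate, assuming a scalar sigmoid conditioned on the current hidden state (the paper’s exact gate design may differ):

```python
import torch
import torch.nn as nn

class ContextGate(nn.Module):
    """Fuse a retrieved memory vector into the hidden state, or ignore it."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(dim, 1)

    def forward(self, hidden: torch.Tensor, retrieved: torch.Tensor) -> torch.Tensor:
        # Gate near 1: the task needs the knowledge, so fuse it in.
        # Gate near 0: the task is inferential, so the memory is ignored.
        g = torch.sigmoid(self.gate_proj(hidden))
        return hidden + g * retrieved

gate = ContextGate(dim=1024)
fused = gate(torch.randn(2, 16, 1024), torch.randn(2, 16, 1024))
```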

Compute‑Memory Decoupling: Because the lookup is deterministic, the large parameter table (potentially up to 100 billion entries) can be stored in inexpensive host memory rather than GPU VRAM, freeing GPU resources for the remaining Transformer layers.
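
A rough sketch of the idea in PyTorch, assuming a host-resident table and on-demand row fetches (sizes here are toy-scale, far below the 100-billion-entry table described above):

```python
import torch

# The big table lives in host RAM; only the rows needed now go to the GPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
table = torch.randn(100_000, 1024, dtype=torch.float16)  # host-resident
if device == "cuda":
    table = table.pin_memory()  # pinned pages enable async host-to-GPU copies

def fetch_rows(indices: torch.Tensor) -> torch.Tensor:
    rows = table[indices]                      # gather in host memory
    return rows.to(device, non_blocking=True)  # copy only what is needed

rows = fetch_rows(torch.randint(0, table.shape[0], (16,)))
```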

Experimental Results and Significance

A 27‑billion‑parameter Engram model was trained and evaluated. Contrary to the expectation that a memory module would help only knowledge‑heavy tasks, the model showed larger gains on code (HumanEval), mathematics (MATH), and logical‑reasoning (BBH) benchmarks than on pure knowledge questions. The authors attribute this to the memory module offloading “rote‑matching” work from the early Transformer layers, freeing the deeper layers to concentrate on complex reasoning.

In the “Needle‑in‑a‑Haystack” (NIAH) long‑text test, Engram’s score rose from 84.2 to 97.0, indicating that the lookup reduces attention to trivial local dependencies and improves capture of global, long‑range relationships.

System‑level tests placed a 100‑billion‑parameter Engram table entirely in host memory. Thanks to pre‑fetching, inference speed dropped by less than 3%, effectively breaking the GPU memory bottleneck.
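
One plausible way to achieve that overlap is to issue the host-to-device copy for the next step on a side CUDA stream while the main stream computes. The sketch below assumes exactly that, with a stand-in layer and hypothetical index names; a production version would also stage the gathered rows through a pinned buffer so the copy is truly asynchronous:

```python
import torch
import torch.nn as nn

assert torch.cuda.is_available()  # the copy/compute overlap only exists on GPU
device = "cuda"

layer = nn.Linear(1024, 1024).to(device)         # stand-in for a real layer
table = torch.randn(100_000, 1024).pin_memory()  # host-resident memory table
copy_stream = torch.cuda.Stream()

def step(hidden: torch.Tensor, next_indices: torch.Tensor):
    with torch.cuda.stream(copy_stream):
        # N-gram keys are deterministic, so the rows for the next token can
        # be requested before the layer that consumes them runs.
        next_rows = table[next_indices].to(device, non_blocking=True)
    out = layer(hidden)  # main-stream compute overlaps the copy
    torch.cuda.current_stream().wait_stream(copy_stream)
    return out, next_rows

out, rows = step(torch.randn(4, 1024, device=device),
                 torch.randint(0, table.shape[0], (16,)))
```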

Conclusion

The Engram design demonstrates that separating computation (handled by GPU) from memory (handled by cheap RAM) can both cut GPU memory usage and boost performance on reasoning‑heavy tasks. This MoE‑plus‑Engram dual‑sparse architecture is likely to shape DeepSeek’s next‑generation models and opens a path for deploying ultra‑large models on consumer‑grade hardware.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: transformer, Large Language Model, N-gram, GPU Memory, Engram, Memory-Compute Architecture
Written by

Network Intelligence Research Center (NIRC)

NIRC is based at the National Key Laboratory of Network and Switching Technology at Beijing University of Posts and Telecommunications. It has built a technology matrix across four AI domains (intelligent cloud networking, natural language processing, computer vision, and machine learning systems) and is dedicated to solving real‑world problems, building top‑tier systems, publishing high‑impact papers, and contributing to the rapid advancement of China's network technology.
