How Engram’s ‘Lookup‑Compute Separation’ Boosts LLM Performance

DeepSeek’s newly open‑sourced Engram module introduces a scalable lookup‑based memory that separates knowledge retrieval from computation, enabling O(1) deterministic access and significantly improving large language model performance on knowledge‑heavy, reasoning, code, and math tasks without extra FLOPs.

Data Party THU

Motivation for Engram

Large language models (LLMs) conflate two distinct functions in their parameters: memorizing factual knowledge and performing logical computation. Scaling parameters to store more facts increases FLOPs, and even Mixture‑of‑Experts (MoE) models are inefficient for pure memorization. Engram separates "lookup" from "compute" to improve parameter efficiency.

Core Architecture

Engram introduces a scalable, searchable memory module that is placed early in the Transformer stack. The processing pipeline is:

1. Tokenize the input sequence.

2. Form overlapping N‑grams (contiguous groups of N tokens).

3. Hash each N‑gram with a deterministic hash function.

4. Use the hash as an index into a large learnable lookup table of embeddings.

5. Retrieve the embedding in O(1) time.

6. Condition the retrieval on the current hidden state, so only relevant memory entries are fetched.

The retrieved embeddings are injected into the model before the deeper reasoning layers (dense or MoE), providing "pattern reconstruction" or "background facts" for subsequent computation.
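The pipeline above can be sketched in a few lines of Python. This is a toy illustration, not DeepSeek's implementation: the hash function, the table size, and the lazily zero-initialized rows are all assumptions made for the sketch.

```python
import hashlib

def ngrams(tokens, n):
    """Overlapping contiguous N-grams of a token-ID sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_hash(gram, table_size):
    """Deterministic hash of an N-gram into a table index (toy stand-in)."""
    key = ",".join(map(str, gram)).encode()
    digest = hashlib.blake2b(key, digest_size=8).digest()
    return int.from_bytes(digest, "big") % table_size

TABLE_SIZE, DIM = 1 << 20, 4
table = {}  # sparse stand-in for a dense learnable embedding table

def lookup(gram):
    """O(1) retrieval: hash the N-gram and index the table directly."""
    idx = ngram_hash(gram, TABLE_SIZE)
    row = table.setdefault(idx, [0.0] * DIM)  # rows would be learned in training
    return idx, row

tokens = [101, 7, 7, 42, 9]
retrieved = [lookup(g) for g in ngrams(tokens, 2)]  # one table fetch per bigram
```

Note that nothing in `lookup` depends on `TABLE_SIZE` beyond the modulo: the access cost stays constant however large the table grows, which is the point of the design.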

Modernized Hashed N‑gram Embeddings

Traditional Transformers extract features through multiple self‑attention and MLP layers. Engram replaces this repeated extraction for static patterns with a hash‑based lookup:

Traditional: Repeated nonlinear transformations over the entire token sequence.

Engram: Direct mapping of hashed N‑grams to a learnable table, yielding deterministic constant‑time access regardless of table size.

This design offloads the "memory" responsibility from neural computation, allowing the model to allocate most parameters to reasoning while keeping memory lookup cost negligible.

Relationship to MoE

MoE provides a sparsity axis by activating only a subset of expert networks for compute‑intensive tasks. Engram adds a complementary sparsity axis by activating only a subset of static memory entries. The two axes work together:

Goal: MoE reduces active neural compute; Engram reduces neural reconstruction of known patterns.

Computation: MoE – sparse dense matrix operations; Engram – O(1) table lookup.

Placement: MoE – deep reasoning layers; Engram – early pattern reconstruction / memory retrieval.
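On toy numbers, the two sparsity axes can be compared directly. The figures below (64 experts, top‑2 routing, a 2^30‑entry table) are illustrative assumptions, not the paper's configuration.

```python
# MoE axis: activate k of E expert networks per token (sparse compute).
E, k = 64, 2
moe_active = k / E                   # fraction of expert parameters computed

# Engram axis: each N-gram touches exactly one row of a T-entry table
# (sparse memory). A 5-token context yields 4 bigrams -> 4 rows read.
T = 1 << 30
context_bigrams = 4
engram_active = context_bigrams / T  # fraction of memory rows read per step

# MoE sparsity still pays dense matmuls inside each active expert;
# an Engram access is a single indexed read, so T can grow without
# adding FLOPs to the forward pass.
print(moe_active, engram_active)
```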

Performance Highlights

In a 27‑billion‑parameter experiment, the Engram module occupied a large share of the parameter budget for memory yet contributed only a tiny fraction of inference FLOPs. This dramatically improved parameter efficiency while preserving or improving performance on knowledge‑intensive tasks such as factual QA, code generation, and mathematical reasoning.

Implementation Details

The full source code and paper are publicly available:

Paper URL: https://github.com/deepseek-ai/Engram/blob/main/Engram_paper.pdf

Code repository: https://github.com/deepseek-ai/Engram

Key implementation points:

Hash function: deterministic, collision‑resistant mapping of N‑grams to 64‑bit indices.

Lookup table size: configurable (e.g., 1‑2 trillion entries) and can be sharded across host memory.

Conditional gating: a lightweight neural gate evaluates the current hidden state and selects a small set of indices to read.

Integration point: the Engram layer is inserted after the first embedding layer and before any MoE or dense transformer blocks.
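A minimal sketch of the conditional-gating idea, assuming a per-entry key vector and a sigmoid relevance score; the actual gate architecture, scoring function, and table layout are not specified in this summary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_read(hidden, candidate_ids, table, threshold=0.5):
    """Score each candidate memory entry against the current hidden state
    and read only the entries the gate admits."""
    reads = []
    for idx in candidate_ids:
        key, value = table[idx]
        score = sigmoid(sum(h * k for h, k in zip(hidden, key)))
        if score >= threshold:           # gate: fetch only relevant rows
            reads.append((idx, value))
    return reads

# Toy table: index -> (key vector, stored embedding).
table = {
    3: ([1.0, 1.0], [0.2, 0.2]),    # aligned with the hidden state -> admitted
    7: ([-1.0, -1.0], [0.9, 0.9]),  # anti-aligned -> gated out
}
hidden = [1.0, 1.0]
selected = gated_read(hidden, [3, 7], table)
```

Because the gate reads only a handful of rows per step, the table itself can be sharded across host memory without the read path touching every shard.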

Representative Diagram

[Figure: Engram architecture diagram]
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: LLM, MoE, memory architecture, Lookup, Scalable Retrieval
Written by Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.