Can Lookup-Based Memory Revolutionize Transformers? Inside the STEM Architecture

The STEM architecture replaces the up‑projection of the Transformer feed‑forward network with a static token‑indexed embedding table, enabling lookup‑based memory that decouples capacity from compute, improves training stability, expands addressable memory, and delivers consistent performance gains on long‑context and knowledge‑intensive tasks.

Data Party THU

Background

Large language models store knowledge in the feed‑forward network (FFN) up‑projection matrices, which makes the memory implicit, hard to address or edit, and computationally expensive.

STEM Architecture

STEM (Scaling Transformers with Embedding Modules) replaces the up‑projection of each FFN layer with a token‑indexed embedding table. During the forward pass the model looks up a static vector from this table using the token ID; the gate and down‑projection modules are retained to modulate the retrieved vector.
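To make the replacement concrete, the forward pass can be written side by side. This is a minimal sketch assuming a SwiGLU‑style gated FFN; the symbol names W_gate, W_up, W_down and the gate activation σ are illustrative assumptions rather than details stated in the text. The only change is that the up‑projection product is swapped for a lookup keyed on the token ID t:

```latex
% Standard gated FFN applied to the hidden state x_t of token t at layer l
% (\sigma is the gate activation, \odot the elementwise product)
\mathrm{FFN}_l(x_t) = W_{\mathrm{down}}\bigl(\sigma(W_{\mathrm{gate}} x_t) \odot (W_{\mathrm{up}} x_t)\bigr)

% STEM: the up-projection product is replaced by a static, token-indexed lookup
\mathrm{STEM}_l(x_t) = W_{\mathrm{down}}\bigl(\sigma(W_{\mathrm{gate}} x_t) \odot E_l[t]\bigr)
```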

Figure: STEM architecture diagram.

Design Details

Remove the up‑projection matrix entirely.

Maintain a per‑layer embedding matrix E_l of shape (V, d), where V is the vocabulary size and d matches the intermediate dimension produced by the original up‑projection.

For token t at layer l, retrieve e_{l,t} = E_l[t]; the gate output modulates this vector before it passes through the down‑projection.

Embedding tables are learnable parameters during training, with gradients and optimizer state maintained like any other weight. At inference time the tables are static and can be offloaded to CPU memory.
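A minimal sketch of one such layer in PyTorch, following the design details above. The class and attribute names, the dimension names d_model and d_ff, and the SiLU gate activation are illustrative assumptions, not the authors' reference implementation:

```python
import torch
import torch.nn as nn

class STEMLayer(nn.Module):
    """FFN block whose up-projection is replaced by a token-indexed table."""

    def __init__(self, vocab_size: int, d_model: int, d_ff: int):
        super().__init__()
        # Per-layer table E_l of shape (V, d_ff): one memory vector per token ID.
        self.table = nn.Embedding(vocab_size, d_ff)
        # Gate and down-projection are retained from the original FFN.
        self.gate_proj = nn.Linear(d_model, d_ff, bias=False)
        self.down_proj = nn.Linear(d_ff, d_model, bias=False)
        self.act = nn.SiLU()  # assumed gate activation

    def forward(self, hidden: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model); token_ids: (batch, seq)
        e = self.table(token_ids)                 # static lookup e_{l,t} = E_l[t]
        gate = self.act(self.gate_proj(hidden))   # contextual modulation
        return self.down_proj(gate * e)           # gated memory, projected back
```

Because the table is an ordinary nn.Embedding, only the rows for token IDs seen in a batch receive gradients during training; at inference the module can be frozen and its weights moved off the GPU.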

Key Advantages

Plug‑and‑play knowledge editing: each token maps to a dedicated vector, so swapping or modifying a vector directly changes the model's factual output without any retraining (see the editing sketch after this list).

Training stability: the static lookup avoids the load imbalance and loss spikes typical of MoE routing and removes the need for all‑to‑all communication.

Expanded addressable memory: token embeddings exhibit a larger angular spread, making them closer to orthogonal, which reduces cross‑talk and allows more memory slots under the same compute budget.

Reduced compute and I/O: removing the up‑projection cuts one matrix multiplication per layer, and the large embedding tables can be offloaded to CPU memory with asynchronous prefetching, lowering GPU memory pressure.
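As noted in the first advantage above, a row of E_l is a dedicated, addressable memory slot, so an edit is simply a write into the table. A hypothetical sketch against the STEMLayer class above; the helper name and the donor-row idea are illustrative, since the source does not spell out an editing procedure:

```python
import torch

@torch.no_grad()
def edit_token_memory(stem_layer, token_id: int, new_vector: torch.Tensor) -> None:
    """Overwrite the memory slot of one token in a single STEM layer.

    Assumes the layer exposes its table as an nn.Embedding attribute `table`
    (as in the sketch above); new_vector must have shape (d_ff,).
    """
    weight = stem_layer.table.weight
    weight[token_id] = new_vector.to(dtype=weight.dtype, device=weight.device)

# Usage idea: copy a "donor" token's slot into a target token's slot so the
# target inherits the donor's stored association at this layer. Token IDs are
# placeholders; a real edit would typically touch the table at several layers.
# donor_vec = layer.table.weight[donor_id].clone()
# edit_token_memory(layer, target_id, donor_vec)
```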

Experimental Results

Evaluations on 350M‑ and 1B‑parameter models show average gains of 3–4% over dense baselines and 9–10% on knowledge‑intensive benchmarks. On long‑context tasks such as Needle‑in‑a‑Haystack and LongBench, the advantage grows with context length.

Practical Guidelines

Replace only the up‑projection; keep the gate projection unchanged to preserve contextual modulation.

Embedding tables may reside in CPU memory; during training, make sure their gradients still reach the optimizer so the tables are updated (a prefetching sketch follows this list).

For memory‑constrained deployments, consider partial‑layer replacement or hybrid variants that combine static embeddings with conventional FFN layers.
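For the CPU‑residency guideline above, one plausible pattern is to keep each frozen table in host memory and prefetch only the rows needed for the next chunk of tokens on a side CUDA stream, overlapping the copy with compute for the current chunk. This is a sketch of that idea under those assumptions, not the project's actual offloading code:

```python
import torch

class PrefetchedTable:
    """Frozen per-layer embedding table kept on CPU with asynchronous row prefetch."""

    def __init__(self, table_cpu: torch.Tensor, device: str = "cuda"):
        self.table = table_cpu          # (V, d_ff) on CPU, frozen at inference
        self.device = device
        self.stream = torch.cuda.Stream()
        self.pending = None             # (event, gpu_rows) for the in-flight copy

    def prefetch(self, token_ids: torch.Tensor) -> None:
        ids = token_ids.reshape(-1).cpu()
        # Gather the needed rows into a pinned staging buffer so the copy can be async.
        staging = torch.empty((ids.numel(), self.table.shape[1]),
                              dtype=self.table.dtype, pin_memory=True)
        torch.index_select(self.table, 0, ids, out=staging)
        with torch.cuda.stream(self.stream):
            gpu_rows = staging.to(self.device, non_blocking=True)
            event = torch.cuda.Event()
            event.record(self.stream)
        self.pending = (event, gpu_rows)

    def get(self) -> torch.Tensor:
        # Block the compute stream only until the prefetched rows have landed.
        event, gpu_rows = self.pending
        torch.cuda.current_stream().wait_event(event)
        gpu_rows.record_stream(torch.cuda.current_stream())
        return gpu_rows

# Usage idea: call prefetch() with the token IDs of chunk k+1 while the model
# computes chunk k, then call get() when the STEM layers for chunk k+1 run.
```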

Project Resources

Project homepage: https://infini-ai-lab.github.io/STEM/

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Transformer, Model Efficiency, Lookup Memory, STEM Architecture
Written by Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
