Can Lookup-Based Memory Revolutionize Transformers? Inside the STEM Architecture
The STEM architecture replaces the Transformer feed‑forward up‑projection with a static token‑indexed embedding table. This lookup‑based memory decouples capacity from compute, improves training stability, expands addressable memory, and delivers consistent gains on long‑context and knowledge‑intensive tasks.
Background
Large language models store knowledge in the feed‑forward network (FFN) up‑projection matrices, which makes the memory implicit, hard to address or edit, and computationally expensive.
STEM Architecture
STEM (Scaling Transformers with Embedding Modules) replaces the up‑projection of each FFN layer with a token‑indexed embedding table. During the forward pass the model looks up a static vector from this table using the token ID; the gate and down‑projection modules are retained to modulate the retrieved vector.
Design Details
Remove the up‑projection matrix entirely.
Maintain a per‑layer embedding matrix E_l of shape (V, d), where V is the vocabulary size and d matches the intermediate dimension produced by the original up‑projection.
For token t at layer l, retrieve e_{l,t}=E_l[t] and feed it to the gate and down‑projection.
Embedding tables are learnable parameters during training, with gradients and optimizer state maintained as usual. At inference time the tables are static and can be offloaded to CPU memory.
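The forward pass above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: function and parameter names (`stem_ffn`, `w_gate`, `w_down`) are ours, SiLU is assumed as the gate activation, and tiny dense lists stand in for real tensors.

```python
import math

def silu(x):
    """SiLU activation, assumed here as the gate nonlinearity."""
    return x / (1.0 + math.exp(-x))

def stem_ffn(hidden, token_id, table, w_gate, w_down):
    """Sketch of a STEM-style FFN layer.

    hidden : contextual hidden state, length d_model
    table  : per-layer embedding matrix, V rows of length d_ff
    w_gate : gate projection, d_ff rows of length d_model
    w_down : down projection, d_model rows of length d_ff
    """
    # 1. Lookup replaces the up-projection matmul: one static vector per token ID.
    up = table[token_id]
    # 2. The gate is still computed from the contextual hidden state.
    gate = [silu(sum(w * h for w, h in zip(row, hidden))) for row in w_gate]
    # 3. Elementwise modulation of the retrieved vector, then project back to d_model.
    inner = [g * u for g, u in zip(gate, up)]
    return [sum(w * v for w, v in zip(row, inner)) for row in w_down]
```

Note that the only token-dependent memory access is the table lookup in step 1, which is why the table can grow with V without adding any matrix multiplications.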
Key Advantages
Plug‑and‑play knowledge editing: each token maps to a dedicated vector, so swapping or modifying that vector directly changes the model's factual output without any retraining.
Training stability: the static lookup avoids the load skew and loss spikes typical of MoE routing and removes all‑to‑all communication.
Expanded addressable memory: token embeddings exhibit a larger angular spread, making them more nearly orthogonal; this reduces cross‑talk and allows more memory slots under the same compute budget.
Reduced compute and I/O: removing the up‑projection eliminates one matrix multiplication per layer, and the large embedding tables can be offloaded to CPU memory with asynchronous prefetching, lowering GPU memory pressure.
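The first advantage is concrete enough to sketch. This is a hypothetical illustration, not the paper's API: the helper name and the vectors are ours. Because each token ID owns one dedicated slot per layer, overwriting that slot changes what the token retrieves on the next forward pass, with no gradient step involved.

```python
def edit_token_memory(table, token_id, new_vector):
    """Swap the stored vector for `token_id`; return the old one for rollback."""
    old = table[token_id]
    table[token_id] = list(new_vector)
    return old

# Illustrative per-layer table with one memory slot for token ID 42.
layer_table = {42: [0.1, -0.3, 0.7]}
backup = edit_token_memory(layer_table, 42, [0.9, 0.0, -0.2])
assert layer_table[42] == [0.9, 0.0, -0.2]   # edit is visible immediately
edit_token_memory(layer_table, 42, backup)   # rollback is just another swap
```

Contrast this with editing FFN weights, where a fact is smeared across many rows of a dense matrix and changing it requires fine-tuning or specialized model-editing procedures.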
Experimental Results
Evaluations on 350M‑ and 1B‑parameter models show average gains of 3–4% over dense baselines and 9–10% on knowledge‑intensive benchmarks. On long‑context tasks such as Needle‑in‑a‑Haystack and LongBench, the advantage grows with context length.
Practical Guidelines
Only replace the up‑projection; keep the gate‑projection unchanged to preserve contextual modulation.
Embedding tables may reside in CPU memory; ensure gradients are written back to the optimizer during training.
For memory‑constrained deployments, consider partial‑layer replacement or hybrid variants that combine static embeddings with conventional FFN layers.
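A partial-layer replacement can be expressed as a simple per-layer plan. The helper name and layer indices below are illustrative assumptions, not the paper's configuration format; they only show the shape of a hybrid layout.

```python
def hybrid_layout(num_layers, stem_layers):
    """Return a per-layer plan: 'stem' where the up-projection is replaced
    by a lookup table, 'ffn' where the conventional block is kept."""
    return ["stem" if i in stem_layers else "ffn" for i in range(num_layers)]

# e.g. replace only the middle layers of a hypothetical 8-layer model
plan = hybrid_layout(8, {2, 3, 4, 5})
```

Such a plan trades GPU-resident FFN weights in some layers for CPU-resident table capacity in others, which is the knob memory-constrained deployments would tune.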
Project Resources
Project homepage: https://infini-ai-lab.github.io/STEM/
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
