How STEM Replaces MoE Routing with Simple Table Lookup for Faster Transformers
The article presents STEM, a method that transforms dense and MoE transformer architectures by converting the expert routing step into a static table‑lookup operation, achieving higher parameter efficiency, lower communication overhead, and improved interpretability while maintaining or boosting downstream task performance.
Core Idea: Turning Matrix Multiplication into Table Lookup
STEM modifies the standard transformer feed‑forward network (FFN) by replacing the dynamic routing of Mixture‑of‑Experts (MoE) with a static embedding‑table lookup. Only the up projection of each FFN layer is changed; the rest of the model is left as it is.
Technical Modification
The original FFN term SiLU(Wg·x) ⊙ (Wu·x) becomes SiLU(Wg·x) ⊙ U[t], where U[t] ∈ ℝ^{d_ff} is the row for token t, stored statically in an embedding table U.
No routing network and no all‑to‑all communication; the table can be offloaded to CPU memory and prefetched asynchronously to the GPU.
Communication volume depends only on the number of unique tokens in a batch, not on the number of experts.
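The substitution above can be sketched in a few lines of dependency‑free Python. This is an illustrative toy, not the authors' implementation: the down projection Wd, the toy dimensions, and the helper names (matvec, dense_ffn, stem_ffn) are assumptions added here to make the example complete; only the Wg, Wu, and U[t] terms come from the text.

```python
import math
import random

def silu(v):
    # SiLU(z) = z * sigmoid(z), applied elementwise.
    return [z / (1.0 + math.exp(-z)) for z in v]

def matvec(W, x):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def dense_ffn(x, Wg, Wu, Wd):
    # Gated FFN: Wd · (SiLU(Wg·x) ⊙ (Wu·x)). Wd is an assumed down projection.
    gate = silu(matvec(Wg, x))
    up = matvec(Wu, x)
    return matvec(Wd, [g * u for g, u in zip(gate, up)])

def stem_ffn(x, token_id, Wg, U, Wd):
    # Same as dense_ffn, but the Wu·x product is replaced by a row lookup U[t].
    gate = silu(matvec(Wg, x))
    up = U[token_id]
    return matvec(Wd, [g * u for g, u in zip(gate, up)])

# Toy sizes, chosen only for illustration.
d_model, d_ff, vocab = 4, 8, 10
rng = random.Random(0)

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

Wg = rand_matrix(d_ff, d_model)
Wu = rand_matrix(d_ff, d_model)
Wd = rand_matrix(d_model, d_ff)
U = rand_matrix(vocab, d_ff)      # one d_ff-sized row per token ID

x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]
y = stem_ffn(x, token_id=3, Wg=Wg, U=U, Wd=Wd)
```

The gate still depends on the hidden state x, while the looked‑up row depends only on the token ID, so the lookup removes one d_ff × d_model matrix product per token.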
Experimental Results Overview
Two model scales were evaluated:
350 M parameters: +3.0 % average accuracy, +9.4 % on ARC‑C, +8.4 % on NIAH, FLOPs reduced by 22 %.
1 B parameters: +3.4 % average accuracy, +10 % on OpenBookQA, 13 % longer context window, parameter access reduced by 33 %.
Four Main Advantages
Stable Training – No loss spikes
Large Knowledge Capacity – Reduced embedding interference
Interpretability – Direct "table swap" changes model behavior
Long‑Context Gains – More unique embeddings activated as sequence length grows
When the context length increases, the number of distinct embeddings grows, leading to effective parameter scaling; NIAH improvement rises from 8 % to 13 %.
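A rough way to see why longer contexts touch more table rows is to count distinct token IDs in growing prefixes of a sequence. The sketch below uses a synthetic Zipf‑like token stream; the vocabulary size, the 1/rank weights, and the prefix lengths are assumptions for illustration and are not taken from the paper.

```python
import random

def unique_counts_by_prefix(tokens, lengths):
    # For each requested prefix length, count the distinct token IDs seen so far.
    seen = set()
    counts = {}
    targets = set(lengths)
    for i, t in enumerate(tokens, start=1):
        seen.add(t)
        if i in targets:
            counts[i] = len(seen)
    return counts

rng = random.Random(0)
vocab = 5000                                         # hypothetical vocabulary size
weights = [1.0 / r for r in range(1, vocab + 1)]     # Zipf-like: weight ∝ 1/rank
tokens = rng.choices(range(vocab), weights=weights, k=8192)

counts = unique_counts_by_prefix(tokens, [512, 2048, 8192])
for length in sorted(counts):
    print(length, counts[length])
```

Each distinct ID corresponds to one table row used in a layer, so the printed counts approximate how many rows a sequence of that length would activate.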
Ablation Study: Where and How Much to Replace?
Replacing only the up projection (STEM) yields the highest average score (50.60). Adding a supplemental table (STEM†) gives a comparable score (50.58). Replacing the gate projection hurts performance, which supports the intuition that the gate must depend on the current hidden state rather than on a static per‑token row.
Increasing the replacement ratio improves both accuracy and ROI (accuracy per FLOP): replacing 1/3 of the layers gives +1.8 % and 1.08× ROI, 1/2 of the layers gives +4.5 % and 1.20× ROI, and full replacement gives +4.9 % and 1.33× ROI.
System‑Level Implementation Tricks
CPU‑offload: store the embedding table in host memory and prefetch the needed rows asynchronously to the GPU.
Token deduplication: transmit only unique token IDs within a batch, reducing communication by 30‑50 %.
LFU cache: achieve >80 % hit rate under Zipf‑distributed token frequencies.
Training parallelism: shard the embedding table by vocabulary, decoupling it from model tensor‑parallel (TP) and pipeline‑parallel (PP) partitions.
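Two of the tricks above, token deduplication and the LFU cache, can be sketched with the Python standard library. This is a minimal illustration under stated assumptions: the names dedup_token_ids, LFUCache, and fetch_row are hypothetical, fetch_row stands in for a host‑memory read, and the eviction rule (drop the cached row with the lowest access count) is one simple LFU variant rather than the project's actual policy.

```python
from collections import Counter

def dedup_token_ids(batch):
    # Return the sorted unique token IDs in the batch, plus for every position
    # the index of its token in that unique list (to scatter rows back).
    unique = sorted({t for seq in batch for t in seq})
    index = {t: i for i, t in enumerate(unique)}
    inverse = [[index[t] for t in seq] for seq in batch]
    return unique, inverse

class LFUCache:
    """Keeps up to `capacity` table rows; evicts the least frequently used."""

    def __init__(self, capacity, fetch_row):
        self.capacity = capacity
        self.fetch_row = fetch_row   # hypothetical slow path, e.g. a host-memory read
        self.rows = {}
        self.freq = Counter()        # access counts, kept even after eviction
        self.hits = 0
        self.misses = 0

    def get(self, token_id):
        self.freq[token_id] += 1
        if token_id in self.rows:
            self.hits += 1
            return self.rows[token_id]
        self.misses += 1
        row = self.fetch_row(token_id)
        if self.capacity > 0:
            if len(self.rows) >= self.capacity:
                victim = min(self.rows, key=lambda t: self.freq[t])
                del self.rows[victim]
            self.rows[token_id] = row
        return row

batch = [[5, 7, 5], [7, 9, 5]]
unique, inverse = dedup_token_ids(batch)   # only `unique` needs to be transferred

cache = LFUCache(capacity=2, fetch_row=lambda t: [float(t)] * 4)
rows = [cache.get(t) for t in unique]
```

Only the rows for the unique IDs are fetched; the inverse indices let each position in the batch pick up its row afterwards. The hit rate a cache like this reaches depends on the token distribution and capacity, so the sketch makes no attempt to reproduce the figure quoted above.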
One‑Line Summary
STEM shows that a simple static‑sparse lookup can replace complex MoE routing, delivering parameter‑efficient, interpretable, and high‑performing transformers.
https://github.com/Infini-AI-Lab/STEM
https://arxiv.org/pdf/2601.10639