How STEM Replaces MoE Routing with Simple Table Lookup for Faster Transformers

The article presents STEM, a method that transforms dense and MoE transformer architectures by converting the expert routing step into a static table‑lookup operation, achieving higher parameter efficiency, lower communication overhead, and improved interpretability while maintaining or boosting downstream task performance.

PaperAgent

Core Idea: Turning Matrix Multiplication into Table Lookup

STEM modifies the standard transformer feed‑forward network (FFN): instead of the dynamic routing used by Mixture‑of‑Experts (MoE), it uses a static embedding‑table lookup that replaces only the up‑projection of each FFN layer and leaves the rest of the model unchanged.

Technical Modification

SiLU(Wg·x) ⊙ (Wu·x)

(original FFN) becomes SiLU(Wg·x) ⊙ U[t], where U[t] ∈ R^{d_ff} is a token‑specific row vector stored statically in an embedding table.
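As a minimal sketch of this substitution, the pure‑Python example below contrasts the two gated activations. The toy sizes, random weights, and function names are assumptions made for illustration; the down‑projection and everything else in the layer are omitted, and this is not the authors' implementation.

```python
import math
import random

def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def dense_ffn_hidden(Wg, Wu, x):
    """Original gated activation: SiLU(Wg·x) ⊙ (Wu·x)."""
    gate = [silu(g) for g in matvec(Wg, x)]
    up = matvec(Wu, x)  # input-dependent matrix multiplication
    return [g * u for g, u in zip(gate, up)]

def stem_ffn_hidden(Wg, U, x, token_id):
    """STEM-style gated activation: SiLU(Wg·x) ⊙ U[t].

    The up-projection matmul is replaced by reading row `token_id` of U.
    """
    gate = [silu(g) for g in matvec(Wg, x)]
    return [g * u for g, u in zip(gate, U[token_id])]

# Toy sizes, assumed for illustration only.
d_model, d_ff, vocab = 4, 6, 10
rng = random.Random(0)
Wg = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
Wu = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
U = [[rng.uniform(-1, 1) for _ in range(d_ff)] for _ in range(vocab)]
x = [rng.uniform(-1, 1) for _ in range(d_model)]

h_dense = dense_ffn_hidden(Wg, Wu, x)
h_stem = stem_ffn_hidden(Wg, U, x, token_id=3)
```

Both functions return a vector of length d_ff, so the rest of the layer can consume either one. The only difference is where the second factor comes from: a matmul on x in the dense case, a row read keyed by the token ID in the STEM case.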

No routing network and no all‑to‑all communication are needed; the table can be offloaded to CPU memory and prefetched asynchronously to the GPU.

Communication volume depends only on the number of unique tokens in a batch, not on the number of experts.
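To illustrate why the volume tracks unique tokens, here is a small sketch of deduplicating token IDs before fetching rows. The helper names `dedup_token_ids` and `gather_rows`, and the dictionary standing in for the embedding table, are hypothetical; the paper's actual transfer path is not shown in this summary.

```python
def dedup_token_ids(batch):
    """Return (unique_ids, inverse) for a batch of token-ID sequences.

    unique_ids lists each distinct token once, in first-seen order.
    inverse has the batch's shape and maps each position to an index
    into unique_ids.
    """
    index_of = {}
    unique_ids = []
    inverse = []
    for seq in batch:
        row = []
        for tok in seq:
            if tok not in index_of:
                index_of[tok] = len(unique_ids)
                unique_ids.append(tok)
            row.append(index_of[tok])
        inverse.append(row)
    return unique_ids, inverse

def gather_rows(table, batch):
    """Fetch one row per unique token, then scatter back to every position."""
    unique_ids, inverse = dedup_token_ids(batch)
    fetched = [table[t] for t in unique_ids]  # the only "transferred" rows
    rows = [[fetched[i] for i in row] for row in inverse]
    return rows, len(unique_ids)

# Toy batch: 8 token positions, 4 distinct token IDs.
batch = [[5, 2, 5, 7], [2, 2, 9, 5]]
table = {t: [float(t)] * 3 for t in range(10)}  # stand-in embedding table
rows, n_fetched = gather_rows(table, batch)
```

In this toy batch only 4 rows are fetched for 8 positions, and the count does not depend on any expert count.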

Experimental Results Overview

Two model scales were evaluated:

350 M parameters: +3.0 % average accuracy, +9.4 % on ARC‑C, +8.4 % on NIAH, FLOPs reduced by 22 %.

1 B parameters: +3.4 % average accuracy, +10 % on OpenBookQA, 13 % longer context window, parameter access reduced by 33 %.

Four Main Advantages

Stable Training – No loss spikes

Large Knowledge Capacity – Reduced embedding interference

Interpretability – Direct "table swap" changes model behavior

Long‑Context Gains – More unique embeddings activated as sequence length grows

As the context length increases, a sequence touches more distinct embedding rows, so the number of parameters actually used scales with it; NIAH improvement rises from 8 % to 13 %.

Ablation Study: Where and How Much to Replace?

Replacing only the up‑projection (STEM) yields the highest average score (50.60). Adding a supplemental table (STEM†) gives a comparable score (50.58). Replacing the gate projection hurts performance, consistent with the intuition that the gate must depend on the current hidden state x rather than on the token ID alone.

Increasing the replacement ratio improves both accuracy and ROI (accuracy per FLOP): replacing 1/3 of the layers (+1.8 %, 1.08× ROI), 1/2 of the layers (+4.5 %, 1.20× ROI), full replacement (+4.9 %, 1.33× ROI).

System‑Level Implementation Tricks

CPU‑offload: store the embedding table in host memory and prefetch rows asynchronously to the GPU.

Token deduplication: transmit only unique token IDs within a batch, reducing communication by 30‑50 %.

LFU cache: achieve >80 % hit rate under Zipf‑distributed token frequencies.

Training parallelism: shard the embedding table by vocabulary, decoupling it from model tensor‑parallel (TP) and pipeline‑parallel (PP) partitions.
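The LFU cache mentioned above can be sketched as follows. This is a deliberately simple variant written under assumptions: the class name `LFUCache`, the `fetch_row` callback (a stand‑in for a host‑to‑GPU copy), the always‑admit policy, and keeping frequency counts after eviction are all illustrative choices, not details taken from the paper.

```python
class LFUCache:
    """Toy least-frequently-used cache for embedding rows (capacity >= 1)."""

    def __init__(self, capacity, fetch_row):
        self.capacity = capacity
        self.fetch_row = fetch_row  # hypothetical slow path, e.g. host memory
        self.rows = {}              # token_id -> cached row
        self.freq = {}              # token_id -> access count (kept after eviction)
        self.hits = 0
        self.misses = 0

    def get(self, token_id):
        self.freq[token_id] = self.freq.get(token_id, 0) + 1
        if token_id in self.rows:
            self.hits += 1
            return self.rows[token_id]
        self.misses += 1
        row = self.fetch_row(token_id)
        if len(self.rows) >= self.capacity:
            # Evict the resident row with the lowest access count.
            victim = min(self.rows, key=lambda t: self.freq[t])
            del self.rows[victim]
        self.rows[token_id] = row
        return row

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

# Toy usage: token 1 is accessed most often, so it stays resident.
cache = LFUCache(capacity=2, fetch_row=lambda t: [float(t)] * 3)
for tok in [1, 1, 2, 3, 1, 2]:
    cache.get(tok)
```

In this short trace the frequent token 1 is never evicted, while the rarer tokens 2 and 3 displace each other. Under a skewed (Zipf‑like) token distribution the same effect keeps hot rows on the fast side; the hit rate reached in practice depends on cache size and workload.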

One‑Line Summary

STEM shows that a simple static‑sparse lookup can replace complex MoE routing, delivering parameter‑efficient, interpretable, and high‑performing transformers.

https://github.com/Infini-AI-Lab/STEM
https://arxiv.org/pdf/2601.10639