Can Linear‑Time LSTMs Beat Transformers? Scaling Laws Reveal the Answer
The paper presents a systematic scaling‑law study of the linear‑time xLSTM architecture against quadratic‑time Transformers. It evaluates parameter‑data loss surfaces, the optimal model size under equal FLOP budgets, and the components of inference latency, and reports that xLSTM is consistently more cost‑effective across the context lengths and compute budgets tested.
The paper "xLSTM Scaling Laws: Competitive Performance with Linear Time‑Complexity" (2025) investigates the Extended LSTM (xLSTM) architecture, which replaces the quadratic‑time self‑attention of Transformers with linear‑time recurrence while retaining modern training techniques such as stable normalization, deep stacking, feed‑forward MLPs, and dimension‑wise parallelism.
Linear‑time recurrence vs. attention
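Transformer attention costs grow with context length T in both inference stages. Prefill compute (compute‑bound) grows quadratically in T, and the KV state that must be read for every generated token (memory‑bound) grows linearly in T, so generating a long sequence is again quadratic in total. Extending the context from 2k to 8k or 16k therefore inflates the attention cost far faster than the context itself grows. xLSTM delegates sequence mixing to the recurrent dynamics of mLSTM, giving a computational complexity that grows linearly with T, which makes scaling to long contexts more predictable.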
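The gap between quadratic and linear growth can be made concrete with a few lines of arithmetic. This is only a sketch of the T-dependent term in isolation: the function name is illustrative, and real models also carry terms (projections, MLPs) that are linear in T for both architectures.

```python
def cost_growth(t_old: int, t_new: int, exponent: int) -> float:
    """Factor by which a cost term proportional to T**exponent grows
    when the context length goes from t_old to t_new."""
    return (t_new / t_old) ** exponent


BASE_CONTEXT = 2048  # the "2k" setting

for new_context in (8192, 16384):  # the "8k" and "16k" settings
    quadratic = cost_growth(BASE_CONTEXT, new_context, exponent=2)
    linear = cost_growth(BASE_CONTEXT, new_context, exponent=1)
    print(f"{BASE_CONTEXT} -> {new_context}: "
          f"quadratic term x{quadratic:g}, linear term x{linear:g}")
```

Going from 2k to 8k multiplies a quadratic term by 16 but a linear term by only 4; at 16k the factors are 64 and 8.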
Scaling‑law framework
The authors formulate a unified validation‑loss surface L(N,D) that depends on model parameter count N and training token count D:
L(N,D) = E + γ₁·N^{‑α₁} + γ₂·D^{‑α₂}
Here E is a floor term, and the two decay terms capture the marginal gains from increasing parameters or data. This surface enables simultaneous analysis of "scale‑up" (more parameters) and "add‑data" (more tokens).
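A minimal sketch of how this surface behaves, written as a plain function. The coefficient values below are placeholders chosen for illustration, not the fitted values from the paper.

```python
def loss_surface(n_params: float, n_tokens: float,
                 E: float, g1: float, a1: float,
                 g2: float, a2: float) -> float:
    """L(N, D) = E + g1 * N**(-a1) + g2 * D**(-a2)."""
    return E + g1 * n_params ** (-a1) + g2 * n_tokens ** (-a2)


# Placeholder coefficients (assumed for illustration only).
COEFFS = dict(E=1.7, g1=400.0, a1=0.34, g2=410.0, a2=0.28)

small = loss_surface(1e9, 2e10, **COEFFS)       # 1B parameters, 20B tokens
more_params = loss_surface(1e10, 2e10, **COEFFS)  # "scale-up"
more_data = loss_surface(1e9, 2e11, **COEFFS)     # "add-data"

print(f"baseline:    {small:.4f}")
print(f"10x params:  {more_params:.4f}")
print(f"10x tokens:  {more_data:.4f}")
```

Both moves lower the loss, but neither can push it below the floor E, since each decay term stays positive.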
Optimal allocation under a compute budget
Given a fixed compute budget H, the iso‑compute curve C(N,D)=H is traced. The loss minima on each curve are identified and fitted to power‑law relationships:
N ∝ H^{β}
D ∝ H^{γ}
These "growth laws" specify how the parameter count and the training token count should grow when the budget doubles. The exponent γ here is distinct from the coefficients γ₁ and γ₂ of the loss surface.
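The link between the loss surface and the growth laws can be sketched under one simplifying assumption that is not taken from the paper: training compute is approximated as C(N,D) ≈ k·N·D with k = 6, which ignores any context‑length‑dependent attention cost. Minimizing E + g1·N^(‑a1) + g2·D^(‑a2) along k·N·D = H then has a closed form, and the exponents can be recovered with a log‑log fit. The coefficients are placeholders, not the paper's fitted values.

```python
import math


def optimal_allocation(H, g1, a1, g2, a2, k=6.0):
    """Closed-form minimiser of E + g1*N**-a1 + g2*D**-a2 on k*N*D = H.

    Setting dL/dN = 0 with D = H / (k*N) gives
    N**(a1 + a2) = (a1*g1) / (a2*g2) * (H/k)**a2.
    """
    n_opt = ((a1 * g1) / (a2 * g2)) ** (1.0 / (a1 + a2)) \
        * (H / k) ** (a2 / (a1 + a2))
    d_opt = H / (k * n_opt)
    return n_opt, d_opt


def fitted_exponent(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den


# Placeholder coefficients (assumed for illustration only).
G1, A1, G2, A2 = 400.0, 0.34, 410.0, 0.28

budgets = [10.0 ** e for e in range(18, 23)]  # FLOP budgets H
allocs = [optimal_allocation(H, G1, A1, G2, A2) for H in budgets]
beta = fitted_exponent(budgets, [n for n, _ in allocs])
gamma = fitted_exponent(budgets, [d for _, d in allocs])
print(f"beta ~ {beta:.3f}, gamma ~ {gamma:.3f}")
```

Under this assumed cost model, beta equals a2/(a1+a2) and gamma equals a1/(a1+a2), so the two exponents sum to 1. A cost model with a context‑dependent term would change these values.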
Inference latency model
Inference is split into two stages:
Prefill: approximately compute‑bound.
Token‑by‑token generation: approximately memory‑bound, limited by the bandwidth needed to read the model state (the KV state for Transformers) at every step.
A latency model separates a compute term from a bandwidth term, allowing clear identification of the dominant bottleneck for each architecture.
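A roofline‑style sketch of such a two‑term model follows. It is not the paper's latency model: the hardware figures, the 7B model, the 2·N FLOPs‑per‑token approximation, and the choice to count only weight traffic (ignoring KV or recurrent state) are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class StageLatency:
    compute_s: float  # time if limited only by arithmetic throughput
    memory_s: float   # time if limited only by memory bandwidth

    @property
    def bottleneck(self) -> str:
        return "compute" if self.compute_s >= self.memory_s else "memory"


def stage_latency(flops: float, bytes_moved: float,
                  peak_flops: float, mem_bandwidth: float) -> StageLatency:
    """Split one inference stage into a compute term and a bandwidth term."""
    return StageLatency(flops / peak_flops, bytes_moved / mem_bandwidth)


# Hypothetical accelerator and model (assumed values).
PEAK_FLOPS = 1.0e15      # FLOP/s
MEM_BANDWIDTH = 3.0e12   # bytes/s
N_PARAMS = 7.0e9
BYTES_PER_PARAM = 2      # 16-bit weights
CONTEXT = 8192

weight_bytes = N_PARAMS * BYTES_PER_PARAM

# Prefill processes all CONTEXT tokens in one pass over the weights.
prefill = stage_latency(2 * N_PARAMS * CONTEXT, weight_bytes,
                        PEAK_FLOPS, MEM_BANDWIDTH)
# Each generation step processes one token but still reads every weight.
decode = stage_latency(2 * N_PARAMS, weight_bytes,
                       PEAK_FLOPS, MEM_BANDWIDTH)

print("prefill bottleneck:", prefill.bottleneck)
print("decode bottleneck: ", decode.bottleneck)
```

Keeping the two terms separate, rather than collapsing them into one number, is what lets the dominant bottleneck be read off per stage; how the terms are combined into a final latency (sum or maximum) is left open here.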
Experimental methodology
The study conducts 672 training runs covering:
Two architectures: Transformer and xLSTM.
Two training configurations: IsoFLOP (fixed FLOPs) and Token‑Param scaling.
Three context lengths (≈2k, 8k, 16k).
Model sizes from 80 M to 7 B parameters.
Compute budgets ranging from 2.8×10¹⁸ to 8.5×10²² FLOPs.
Training token counts from 2 B to 2 T.
Key empirical findings
On a "budget‑loss" plane (loss vs. FLOPs), xLSTM consistently occupies the lower‑left region, i.e., lower loss for the same FLOPs, especially for long contexts.
Loss‑budget curves for both architectures remain approximately parallel across token/parameter regimes, indicating stable power‑law exponents; differences lie mainly in the coefficient terms.
When the context length increases, the compute‑optimal model size for Transformers shifts sharply downward because the quadratic attention term consumes a growing share of the budget, whereas xLSTM’s linear scaling yields a gentler decline.
Inference latency analysis shows that xLSTM’s prefilling and per‑token step times scale more favorably with context length, confirming the theoretical advantage of linear‑time recurrence.
Limitations
The loss‑surface fit is reliable near the optimal region and within typical training regimes, but may require recalibration for out‑of‑distribution tasks or extreme configurations. The latency model abstracts hardware specifics; actual performance depends on compiler optimizations, operator fusion, and memory‑system characteristics.
Reproducibility
All code, data, and training scripts are released at https://github.com/NX-AI/xlstm_scaling_laws. The paper provides the loss‑surface equation, the growth‑law exponents, and the latency model so that other researchers can reproduce the scaling analysis on different hardware or datasets.
Conclusion
When evaluated under equal compute, the linear‑time xLSTM lies closer to the Pareto frontier than attention‑only Transformers, shows smoother scaling with longer contexts, and offers a quantifiable cost‑effectiveness advantage. The work shifts the discussion from "which architecture is universally better" to "which design yields the best trade‑off between compute, data, and performance for a given budget and context length."
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
