Can 99% Sparse Transformers Run Faster? Insights from the ‘Attention Is All You Need’ Authors
The paper shows that a lightweight L1 regularizer can drive more than 99% of FFN activations to zero. Combined with a new tile-wise ELLPACK (TwELL) format and a hybrid routing scheme, this sparsity improves inference speed by up to 30%, cuts memory usage by more than 24% and reduces energy consumption, all with negligible impact on downstream task performance.
Motivation
Since the 2017 “Attention Is All You Need” paper, Transformer models have grown to billions of parameters, driving up the compute, memory and energy costs of both training and inference. In large language models, most feed-forward network (FFN) hidden activations are near zero for any given token, which suggests that much of the computation is redundant.
L1‑induced activation sparsity
Adding a lightweight L1 regularizer to the FFN activations drives the proportion of exactly zero activations above 99 % without changing the Transformer architecture.
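A minimal sketch of how such a penalty can be added to a training loss, in plain Python. It assumes a ReLU gate (so that zeros are exact) and averages the penalty over tokens; the function names, the toy weights and the normalization are illustrative assumptions, not details taken from the paper.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]

def relu(x: float) -> float:
    # ReLU yields exact zeros for non-positive inputs (assumed gate nonlinearity).
    return x if x > 0.0 else 0.0

def gate_activations(x: Vector, w_gate: Matrix) -> Vector:
    # g = ReLU(W_gate @ x) for a single token.
    return [relu(sum(w * xi for w, xi in zip(row, x))) for row in w_gate]

def l1_penalty(gates: List[Vector], lam: float) -> float:
    # lam * (sum of |g| over all hidden units), averaged over tokens.
    total = sum(abs(g) for token in gates for g in token)
    return lam * total / len(gates)

def sparsity(gates: List[Vector]) -> float:
    # Fraction of gate activations that are exactly zero.
    flat = [g for token in gates for g in token]
    return sum(1 for g in flat if g == 0.0) / len(flat)

# Toy example: two tokens, three hidden units (hypothetical values).
x_batch = [[1.0, -2.0], [0.5, 0.5]]
w_gate = [[1.0, 1.0], [-1.0, 0.0], [0.0, 1.0]]
gates = [gate_activations(x, w_gate) for x in x_batch]

task_loss = 2.0  # stand-in for the usual language-modeling loss
total_loss = task_loss + l1_penalty(gates, lam=0.01)
```

Because the penalty only adds a term to the loss, the forward computation and the model architecture stay unchanged; the coefficient `lam` controls how strongly the gates are pushed toward zero.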
GPU challenges
Standard dense GPU kernels assume regular, contiguous data. Skipping zero activations naively introduces extra kernel launches, global memory traffic and synchronization, which can erase theoretical FLOP savings.
TwELL: Tile‑wise ELLPACK format
To match sparse storage with tiled matrix-multiply kernels, the authors propose TwELL. Instead of a global, row-aligned ELLPACK layout, TwELL partitions the matrix columns into small 1-D tiles that line up with the tiled-matmul pattern, so no separate format-conversion kernel is needed. The sparse gate activations are produced on the fly in the epilogue of the gated-FFN operator, and the up-projection and down-projection multiplications are merged into a single pass, which avoids writing intermediate activations.
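To make the tile-wise idea concrete, here is a simplified CPU sketch in plain Python. It is one plausible reading of "ELLPACK per column tile", not the actual TwELL layout or kernel from the paper's repository: each row is split into fixed-width column tiles, and every tile stores its non-zeros in a fixed number of padded slots. The names `pack_tile_ell`, `tile_width` and `slots` are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]

def pack_tile_ell(row: Vector, tile_width: int, slots: int
                  ) -> Tuple[List[Vector], List[List[int]]]:
    """Pack one row into per-tile (values, local column) slots.

    Padding uses value 0.0 and column -1. A tile with more non-zeros
    than `slots` is rejected here; a real system would need a fallback.
    """
    values, cols = [], []
    for start in range(0, len(row), tile_width):
        tile = row[start:start + tile_width]
        nz = [(v, j) for j, v in enumerate(tile) if v != 0.0]
        if len(nz) > slots:
            raise ValueError("tile has more non-zeros than slots")
        pad = slots - len(nz)
        values.append([v for v, _ in nz] + [0.0] * pad)
        cols.append([j for _, j in nz] + [-1] * pad)
    return values, cols

def unpack_tile_ell(values, cols, tile_width: int, n_cols: int) -> Vector:
    # Rebuild the dense row from the packed tiles.
    row = [0.0] * n_cols
    for t, (vs, cs) in enumerate(zip(values, cols)):
        for v, j in zip(vs, cs):
            if j >= 0:
                row[t * tile_width + j] = v
    return row

def sparse_down_projection(values, cols, tile_width: int,
                           w_down: List[Vector]) -> Vector:
    # y = sum_k h[k] * w_down[k], visiting only the stored non-zeros.
    y = [0.0] * len(w_down[0])
    for t, (vs, cs) in enumerate(zip(values, cols)):
        for v, j in zip(vs, cs):
            if j >= 0:
                k = t * tile_width + j  # global hidden index
                for d in range(len(y)):
                    y[d] += v * w_down[k][d]
    return y

# Toy example: 8 hidden units, tiles of width 4, 2 slots per tile.
h = [0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0, 1.5]
vals, idx = pack_tile_ell(h, tile_width=4, slots=2)
w_down = [[float(k), 1.0] for k in range(8)]
y = sparse_down_projection(vals, idx, 4, w_down)
```

Storing column indices relative to the tile keeps them small and lets each tile be consumed by the same thread block that would process that tile in a dense tiled matmul, which is the alignment the article describes.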
Hybrid routing for non‑uniform sparsity
During training, memory becomes a bottleneck and token‑wise activation sparsity varies widely. The hybrid routing mechanism sends low‑activation tokens to a highly compressed ELL matrix, while occasional high‑activation tokens are dynamically routed to a dense backup path that leverages Tensor Cores. This reduces dense computation and intermediate activation storage, lowering peak memory pressure.
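The routing decision can be sketched as follows, again in plain Python. This is an illustrative assumption about how such a split might work, not the authors' implementation: tokens whose number of non-zero activations fits a fixed budget `max_nnz` go to a padded ELL representation, and the rest fall back to an ordinary dense product. All names are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]

def route_tokens(gates: List[Vector], max_nnz: int
                 ) -> Tuple[List[int], List[int]]:
    # Split token indices by their non-zero count.
    sparse_ids, dense_ids = [], []
    for i, g in enumerate(gates):
        nnz = sum(1 for v in g if v != 0.0)
        (sparse_ids if nnz <= max_nnz else dense_ids).append(i)
    return sparse_ids, dense_ids

def to_ell(g: Vector, max_nnz: int) -> Tuple[Vector, List[int]]:
    # Fixed-width (values, indices) row; padding uses 0.0 and -1.
    nz = [(v, k) for k, v in enumerate(g) if v != 0.0]
    pad = max_nnz - len(nz)
    return [v for v, _ in nz] + [0.0] * pad, [k for _, k in nz] + [-1] * pad

def hybrid_down_projection(gates: List[Vector], w_down: List[Vector],
                           max_nnz: int):
    d_model = len(w_down[0])
    out = [[0.0] * d_model for _ in gates]
    sparse_ids, dense_ids = route_tokens(gates, max_nnz)
    for i in sparse_ids:  # sparse path: only stored non-zeros are visited
        vals, idx = to_ell(gates[i], max_nnz)
        for v, k in zip(vals, idx):
            if k >= 0:
                for d in range(d_model):
                    out[i][d] += v * w_down[k][d]
    for i in dense_ids:  # dense fallback: full row-times-matrix product
        for k, v in enumerate(gates[i]):
            for d in range(d_model):
                out[i][d] += v * w_down[k][d]
    return out, sparse_ids, dense_ids

# Toy example: three tokens, four hidden units, budget of one non-zero.
gates = [[0.0, 0.0, 2.0, 0.0], [1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]]
w_down = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
out, sparse_ids, dense_ids = hybrid_down_projection(gates, w_down, max_nnz=1)
```

Bounding the ELL width by `max_nnz` keeps the compressed buffer small and regular, while the dense fallback guarantees that the rare dense tokens are still computed exactly.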
Experimental setup
Models from 0.5 B to 2 B parameters were trained on 10 B–40 B tokens. The sparsity regularizer took the form λ·‖g‖₁, applied to the gate activations g. FFN layers account for more than ⅔ of parameters and more than 80 % of total FLOPs in modern LLMs.
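The FFN share quoted above puts a ceiling on what activation sparsity can deliver end to end. The back-of-the-envelope calculation below is an illustration, not a result from the paper. It assumes that skipped FLOPs cost nothing, that the FFN uses three equally sized matrices (gate, up, down), and that the gate projection must still be computed densely to discover which activations are zero; the function name and default values are hypothetical.

```python
def max_speedup(ffn_flop_share: float, sparsity: float,
                dense_ffn_fraction: float = 1.0 / 3.0) -> float:
    """Idealized end-to-end speedup bound, in the style of Amdahl's law.

    ffn_flop_share:     fraction of total FLOPs spent in FFN layers
    sparsity:           fraction of gate activations that are exactly zero
    dense_ffn_fraction: part of the FFN that stays dense (assumed: the gate
                        projection, one of three equally sized matmuls)
    """
    # Fraction of FFN FLOPs that still has to be executed.
    remaining_ffn = dense_ffn_fraction + (1.0 - dense_ffn_fraction) * (1.0 - sparsity)
    # Non-FFN work is untouched; FFN work shrinks to remaining_ffn.
    remaining_total = (1.0 - ffn_flop_share) + ffn_flop_share * remaining_ffn
    return 1.0 / remaining_total

# Illustrative inputs: 80 % FFN share, 99 % sparsity.
bound = max_speedup(0.80, 0.99)
```

Under these assumptions the bound comes out at roughly 2.1x. Real kernels land below such a ceiling because of indexing, memory traffic and load imbalance, which is the gap the GPU-oriented techniques in this article aim to narrow.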
Results
Forward‑pass speedup up to 20.5 % and training‑step speedup up to 21.9 % on a billion‑parameter model.
Inference speedup ≈ 30 % and memory demand reduction > 24 %.
Peak memory usage and energy consumption decreased proportionally with sparsity.
Down‑stream task performance remained comparable to dense baselines under conservative L1 settings.
Scaling experiments showed larger models benefit more, with greater throughput gains and memory savings.
Activation distribution analysis
Early layers are largely silent, while middle layers carry most of the computational load. Tokens with low activation are often common URL fragments or highly predictable word pieces; high‑activation tokens contain richer contextual information such as verbs, nouns, locations and material names.
Conclusion
High‑sparsity FFN activations can be turned into measurable speed, energy and memory benefits on modern GPUs without redesigning the Transformer architecture. By combining L1‑induced sparsity, the TwELL storage format and hybrid routing, theoretical FLOP reductions become practical performance gains.
Paper: http://arxiv.org/abs/2603.23198
Code: https://github.com/SakanaAI/sparser-faster-llms
Machine Learning Algorithms & Natural Language Processing
Focused on frontier AI technologies, empowering AI researchers' progress.
