Can 99% Sparse Transformers Run Faster? Insights from the ‘Attention Is All You Need’ Authors

The paper shows that a lightweight L1 regularizer can drive more than 99% of FFN activations to zero. Combined with a new tile‑wise ELLPACK (TwELL) format and a hybrid routing scheme, this speeds up inference by up to 30%, cuts memory usage by more than 24% and lowers energy consumption, all with negligible impact on downstream task performance.

Machine Learning Algorithms & Natural Language Processing

May 20, 2026

Motivation

Since the 2017 “Attention Is All You Need” paper, Transformer models have grown to billions of parameters, driving up the compute, memory and energy costs of both training and inference. In large language models, most feed‑forward network (FFN) hidden activations are near zero for any given token, which suggests that much of the computation is redundant.

L1‑induced activation sparsity

Adding a lightweight L1 regularizer to the FFN activations drives the proportion of exactly zero activations above 99 % without changing the Transformer architecture.
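
As a rough illustration, the penalty and the resulting sparsity can be computed as in the sketch below. It assumes a ReLU gate (which is what makes activations exactly zero) and a hypothetical coefficient `lam`; neither the nonlinearity nor the coefficient value is stated in this summary, and the function names are illustrative, not taken from the authors' code.

```python
def relu(x):
    """ReLU: negative pre-activations become exactly 0.0."""
    return x if x > 0.0 else 0.0

def gate_activations(token, w_gate):
    """Compute g = relu(W_gate @ token) for one token (assumed ReLU gate)."""
    return [relu(sum(w * x for w, x in zip(row, token))) for row in w_gate]

def l1_penalty(gates, lam):
    """lam * ||g||_1 over the gate activations of one token."""
    return lam * sum(abs(g) for g in gates)

def zero_fraction(gates):
    """Fraction of activations that are exactly zero."""
    return sum(1 for g in gates if g == 0.0) / len(gates)

# Toy example: one 2-dim token, 4 hidden units (all numbers are illustrative).
token = [1.0, -2.0]
w_gate = [[0.5, 0.5], [1.0, 0.0], [-1.0, -1.0], [0.0, 1.0]]

gates = gate_activations(token, w_gate)
penalty = l1_penalty(gates, lam=0.01)   # lam is a hypothetical value
sparsity = zero_fraction(gates)
```

In training, a term like `penalty` would be added to the language‑modeling loss, so the optimizer is nudged toward weights that leave fewer gate units active per token.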

GPU challenges

Standard dense GPU kernels assume regular, contiguous data. Skipping zero activations naively introduces extra kernel launches, global memory traffic and synchronization, which can erase theoretical FLOP savings.

TwELL: Tile‑wise ELLPACK format

To match sparse storage with tiled matrix‑multiply kernels, the authors propose TwELL. Instead of global row‑aligned ELLPACK, TwELL partitions matrix columns into small 1‑D tiles that align with the tiled‑matmul pattern, eliminating a separate format‑conversion kernel. During the epilogue of the gated‑FFN operator, TwELL generates sparse gate activations on‑the‑fly and merges the up‑projection and down‑projection multiplications into a single pass, avoiding intermediate activation writes.
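
This summary does not give TwELL's exact memory layout, so the following is only a CPU‑side sketch of the general idea of tile‑wise ELL storage: columns are split into fixed‑width tiles, and each tile pads its rows only to that tile's own maximum nonzero count. The function names, the dictionary layout and the `-1` padding marker are assumptions for illustration; the actual format is produced inside a GPU matmul epilogue.

```python
def pack_tilewise_ell(matrix, tile_width):
    """Pack a dense row-major matrix into per-column-tile ELL blocks.

    Each tile covers `tile_width` consecutive columns. Inside a tile, every
    row keeps its nonzero values and tile-local column offsets, padded to the
    largest nonzero count of any row in that tile.
    """
    n_cols = len(matrix[0])
    tiles = []
    for start in range(0, n_cols, tile_width):
        rows = [
            [(c, v) for c, v in enumerate(row[start:start + tile_width]) if v != 0.0]
            for row in matrix
        ]
        width = max(len(r) for r in rows)
        tiles.append({
            "start": start,   # first global column covered by this tile
            "width": width,   # padded nonzeros per row in this tile
            # -1 marks a padding slot (an assumed convention)
            "cols": [[c for c, _ in r] + [-1] * (width - len(r)) for r in rows],
            "vals": [[v for _, v in r] + [0.0] * (width - len(r)) for r in rows],
        })
    return tiles

def tilewise_ell_matvec(tiles, n_rows, x):
    """Compute y = A @ x by visiting only stored (non-padding) entries."""
    y = [0.0] * n_rows
    for tile in tiles:
        for r in range(n_rows):
            for c, v in zip(tile["cols"][r], tile["vals"][r]):
                if c >= 0:
                    y[r] += v * x[tile["start"] + c]
    return y

a = [
    [0.0, 2.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 3.0],
]
tiles = pack_tilewise_ell(a, tile_width=2)
y = tilewise_ell_matvec(tiles, n_rows=2, x=[1.0, 1.0, 1.0, 1.0])
```

Here each tile needs only one slot per row, whereas a single global ELLPACK layout would pad every row to two slots because the second row has two nonzeros. Per‑tile padding is what lets the layout follow the tiling of the matmul kernel.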

Hybrid routing for non‑uniform sparsity

During training, memory becomes a bottleneck and token‑wise activation sparsity varies widely. The hybrid routing mechanism sends low‑activation tokens to a highly compressed ELL matrix, while occasional high‑activation tokens are dynamically routed to a dense backup path that leverages Tensor Cores. This reduces dense computation and intermediate activation storage, lowering peak memory pressure.
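
The routing decision described above can be sketched as a per‑token threshold on the nonzero count. This is a simplified illustration, not the authors' kernel: `ell_width` is a hypothetical cap on nonzeros per token in the sparse path, the `-1` padding marker is an assumed convention, and the real system makes this decision on the GPU rather than in a Python loop.

```python
def route_tokens(activations, ell_width):
    """Split token rows into a fixed-width ELL batch and a dense fallback batch.

    Tokens with at most `ell_width` nonzeros go to the compressed path;
    the rest keep their dense row for a dense (Tensor Core style) path.
    """
    sparse_ids, ell_cols, ell_vals, dense_ids = [], [], [], []
    for token_id, row in enumerate(activations):
        nonzeros = [(c, v) for c, v in enumerate(row) if v != 0.0]
        if len(nonzeros) <= ell_width:
            padding = ell_width - len(nonzeros)
            sparse_ids.append(token_id)
            ell_cols.append([c for c, _ in nonzeros] + [-1] * padding)
            ell_vals.append([v for _, v in nonzeros] + [0.0] * padding)
        else:
            dense_ids.append(token_id)  # too many nonzeros for the ELL width
    return sparse_ids, ell_cols, ell_vals, dense_ids

acts = [
    [0.0, 0.0, 1.5, 0.0],   # 1 nonzero  -> sparse path
    [0.7, 0.2, 0.9, 0.4],   # 4 nonzeros -> dense path
    [0.0, 0.0, 0.0, 0.0],   # 0 nonzeros -> sparse path
]
sparse_ids, ell_cols, ell_vals, dense_ids = route_tokens(acts, ell_width=2)
```

Capping the ELL width keeps padding small for the common low‑activation tokens, while the rare dense tokens do not force every row to be padded to their size.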

Experimental setup

Models from 0.5 B to 2 B parameters were trained on 10 B–40 B tokens. The sparsity regularizer took the form λ·‖g‖₁, the L1 norm of the gate activations g scaled by a coefficient λ. FFN layers account for >⅔ of parameters and >80 % of total FLOPs in modern LLMs.

Results

Forward‑pass speedup up to 20.5 % and training‑step speedup up to 21.9 % on a billion‑parameter model.

Inference speedup ≈ 30 % and memory demand reduction > 24 %.

Peak memory usage and energy consumption decreased proportionally with sparsity.

Downstream task performance remained comparable to dense baselines under conservative L1 settings.

Scaling experiments showed larger models benefit more, with greater throughput gains and memory savings.

Activation distribution analysis

Early layers are largely silent, while middle layers carry most of the computational load. Tokens with low activation are often common URL fragments or highly predictable word pieces; high‑activation tokens contain richer contextual information such as verbs, nouns, locations and material names.

Conclusion

High‑sparsity FFN activations can be turned into measurable speed, energy and memory benefits on modern GPUs without redesigning the Transformer architecture. By combining L1‑induced sparsity, the TwELL storage format and hybrid routing, theoretical FLOP reductions become practical performance gains.

Paper: http://arxiv.org/abs/2603.23198

Code: https://github.com/SakanaAI/sparser-faster-llms
