Can 99% Sparse Transformers Run Faster? Insights from the Original Authors

A new ICML 2026 paper from Sakana AI and NVIDIA shows that lightweight L1 regularization can make Feed‑Forward Network (FFN) activations in Transformers more than 99% sparse. Combined with the TwELL storage format and a hybrid routing scheme, this sparsity translates into up to 20.5% faster inference and 21.9% faster training steps, along with lower energy consumption and reduced peak memory, across models with 0.5 B to 2 B parameters while preserving downstream performance.

Machine Learning Algorithms & Natural Language Processing

In 2017, the paper "Attention Is All You Need" introduced the Transformer architecture, and today most large language models (LLMs) are built on it. As models grow, so do the costs of training and inference in compute, memory and energy.

Sparsity in Feed‑Forward Networks

Within a Transformer, the Feed‑Forward Network (FFN) accounts for more than two‑thirds of parameters and over 80% of FLOPs, yet only a small fraction of its hidden activations contribute meaningfully for a given token; many activations are near zero.
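
To give a feel for where the parameter share comes from, here is a rough count of the weight matrices in a single Transformer block. The configuration below is hypothetical and not taken from the paper; the sketch assumes a gated FFN and grouped-query attention, and it ignores embeddings, biases and normalization weights.

```python
def block_param_counts(d_model, d_ff, n_heads, n_kv_heads, gated=True):
    """Count weight-matrix parameters in one Transformer block."""
    head_dim = d_model // n_heads
    kv_dim = head_dim * n_kv_heads
    # Attention: query and output projections are d_model x d_model;
    # key and value projections shrink when heads share K/V.
    attn = 2 * d_model * d_model + 2 * d_model * kv_dim
    # FFN: up and down projections, plus a gate projection in gated variants.
    ffn = (3 if gated else 2) * d_model * d_ff
    return attn, ffn


# Hypothetical configuration, chosen only for illustration.
attn, ffn = block_param_counts(d_model=2048, d_ff=8192, n_heads=16, n_kv_heads=4)
ffn_share = ffn / (attn + ffn)
print(f"attention: {attn:,}  ffn: {ffn:,}  ffn share: {ffn_share:.1%}")
```

With a non-gated FFN of width 4 x d_model and full multi-head attention, the same count gives exactly two-thirds; gated FFNs and shared key/value heads push the FFN share higher.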

Inducing >99% Sparsity with L1 Regularization

The ICML 2026 work from Sakana AI and NVIDIA adds a lightweight L1 regularizer on the FFN activations, driving more than 99% of them to zero. This sparsity alone does not guarantee speed gains on GPUs, because modern GPU kernels are optimized for dense, regular computation.
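
A minimal sketch of what an L1 penalty on FFN activations can look like. The article does not give the activation function or the exact form of the regularizer, so the ReLU gate, the mean-absolute-value penalty, the coefficient `l1_coeff` and the placeholder `task_loss` are all assumptions made for illustration.

```python
def relu(x):
    return x if x > 0.0 else 0.0


def gate_activations(token, w_gate):
    """Hidden activations for one token; w_gate holds one weight list per hidden unit."""
    return [relu(sum(t * w for t, w in zip(token, unit))) for unit in w_gate]


def l1_penalty(batch_acts):
    """Mean absolute activation over all tokens and hidden units."""
    total = sum(abs(a) for acts in batch_acts for a in acts)
    count = sum(len(acts) for acts in batch_acts)
    return total / count


def sparsity(batch_acts):
    """Fraction of activations that are exactly zero."""
    zeros = sum(1 for acts in batch_acts for a in acts if a == 0.0)
    count = sum(len(acts) for acts in batch_acts)
    return zeros / count


# Toy data: 2 tokens, d_model = 3, 4 hidden units.
tokens = [[1.0, -2.0, 0.5], [0.0, 1.0, -1.0]]
w_gate = [[0.5, 0.1, 0.0], [-1.0, 0.2, 0.3], [0.0, -0.5, 1.0], [0.2, 0.2, 0.2]]

acts = [gate_activations(t, w_gate) for t in tokens]
task_loss = 2.31   # placeholder for the usual language-modeling loss
l1_coeff = 1e-3    # hypothetical regularization strength
total_loss = task_loss + l1_coeff * l1_penalty(acts)
print(f"sparsity: {sparsity(acts):.3f}  total loss: {total_loss:.6f}")
```

Because the penalty grows with every non-zero activation, gradient descent is nudged toward solutions where most hidden units stay at zero for any given token.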

Why Skipping Zeros Can Hurt on GPUs

Generating a full gate activation matrix and then converting it to a sparse format incurs extra kernel launches, global memory reads/writes and synchronization overhead, which can offset the theoretical FLOP reduction.
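
The theoretical FLOP reduction mentioned above can be illustrated by counting multiply-accumulate operations in a toy down projection. This is only an arithmetic count written for this summary: it models none of the launch, memory-traffic or synchronization costs that decide real GPU performance, and the function name and toy weights are hypothetical.

```python
def down_proj(acts, w_down, skip_zeros):
    """Compute out[j] = sum_i acts[i] * w_down[i][j] and count multiply-accumulates."""
    out = [0.0] * len(w_down[0])
    macs = 0
    for a, row in zip(acts, w_down):
        if skip_zeros and a == 0.0:
            continue  # a zero activation contributes nothing to the output
        for j, w in enumerate(row):
            out[j] += a * w
            macs += 1
    return out, macs


# Toy hidden vector with 2 non-zeros out of 8, and an 8 x 2 weight matrix.
acts = [0.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.3, 0.0]
w_down = [[float(i), float(-i)] for i in range(8)]

dense_out, dense_macs = down_proj(acts, w_down, skip_zeros=False)
sparse_out, sparse_macs = down_proj(acts, w_down, skip_zeros=True)
print(dense_macs, sparse_macs, dense_out == sparse_out)
```

Both paths produce the same output, and the skipped work scales with the fraction of zeros. On a GPU, finding and compacting those non-zeros is where the extra cost appears.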

TwELL: Tile‑wise ELLPACK Format

To eliminate conversion overhead, the authors propose TwELL (Tile‑wise ELLPACK), a storage format that abandons global row alignment and instead partitions matrix columns into local 1‑D tiles that match the tiled matrix‑multiply kernels used on GPUs. TwELL allows gate activations to be written in sparse form directly in the kernel epilogue, avoiding a separate conversion kernel and reducing global memory traffic.
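
The article does not spell out the exact TwELL layout, so the following is only an illustration of the general idea: apply ELLPACK-style fixed-size slots (values plus column indices, padded) to each column tile independently. The slot count, the padding convention (local column -1) and the overflow behavior are assumptions of this sketch.

```python
def pack_tile(tile, slots):
    """Pack one 1-D tile into fixed-size (values, local_cols) slot lists."""
    entries = [(c, v) for c, v in enumerate(tile) if v != 0.0]
    if len(entries) > slots:
        # Overflow policy is an assumption here; a real kernel needs its own.
        raise ValueError("tile has more non-zeros than slots")
    pad = slots - len(entries)
    values = [v for _, v in entries] + [0.0] * pad
    local_cols = [c for c, _ in entries] + [-1] * pad
    return values, local_cols


def to_tile_ell(matrix, tile_width, slots):
    """Pack each row tile by tile; no tile needs information from another tile."""
    packed = []
    for row in matrix:
        tiles = [row[s:s + tile_width] for s in range(0, len(row), tile_width)]
        packed.append([pack_tile(tile, slots) for tile in tiles])
    return packed


def from_tile_ell(packed, tile_width, n_cols):
    """Rebuild the dense matrix, skipping padded slots (local column -1)."""
    dense = [[0.0] * n_cols for _ in packed]
    for r, row_tiles in enumerate(packed):
        for t, (values, local_cols) in enumerate(row_tiles):
            for v, c in zip(values, local_cols):
                if c >= 0:
                    dense[r][t * tile_width + c] = v
    return dense


acts = [
    [0.0, 0.0, 1.5, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.3, 0.0, 0.0, 0.2, 0.0, 0.1, 0.0, 0.0],
]
packed = to_tile_ell(acts, tile_width=4, slots=2)
restored = from_tile_ell(packed, tile_width=4, n_cols=8)
print(packed[1], restored == acts)
```

Because each tile is packed using only its own contents, the packing step can in principle run inside whatever code produces that tile, which is the property the paragraph above attributes to TwELL.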

TwELL format diagram

Hybrid Routing for Non‑Uniform Sparsity

During training, memory becomes the bottleneck and the number of non‑zero activations varies widely across tokens. The authors introduce a Hybrid Routing mechanism: low‑activation tokens are routed to a highly compressed ELL matrix, while occasional high‑activation tokens are dynamically sent to a dense fallback path that leverages Tensor Cores.
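
The routing decision can be sketched as a per-token threshold on the number of non-zero activations. The threshold `max_nnz`, the function names and the choice to show the down projection are assumptions for illustration; the plain Python `dense_path` merely stands in for a Tensor Core matrix multiply.

```python
def route_tokens(batch_acts, max_nnz):
    """Split token indices by how many hidden units are non-zero."""
    sparse_ids, dense_ids = [], []
    for i, acts in enumerate(batch_acts):
        nnz = sum(1 for a in acts if a != 0.0)
        (sparse_ids if nnz <= max_nnz else dense_ids).append(i)
    return sparse_ids, dense_ids


def sparse_path(acts, w_down):
    """Down projection that visits only non-zero hidden units."""
    out = [0.0] * len(w_down[0])
    for a, row in zip(acts, w_down):
        if a != 0.0:
            for j, w in enumerate(row):
                out[j] += a * w
    return out


def dense_path(acts, w_down):
    """Plain dense down projection, standing in for the dense fallback."""
    return [sum(a * row[j] for a, row in zip(acts, w_down))
            for j in range(len(w_down[0]))]


def hybrid_down_proj(batch_acts, w_down, max_nnz):
    sparse_ids, dense_ids = route_tokens(batch_acts, max_nnz)
    outputs = [None] * len(batch_acts)
    for i in sparse_ids:
        outputs[i] = sparse_path(batch_acts[i], w_down)
    for i in dense_ids:
        outputs[i] = dense_path(batch_acts[i], w_down)
    return outputs, sparse_ids, dense_ids


# Toy batch: token 1 is unusually active and falls back to the dense path.
batch_acts = [[0.0, 0.0, 1.5, 0.0], [0.5, 1.0, 2.0, 0.25], [0.0, 0.0, 0.0, 0.0]]
w_down = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [0.5, 0.5]]
outputs, sparse_ids, dense_ids = hybrid_down_proj(batch_acts, w_down, max_nnz=1)
print(sparse_ids, dense_ids, outputs)
```

Capping the sparse path at a small `max_nnz` keeps its storage compact, while the rare dense tokens avoid forcing every row to be padded to their size.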

Hybrid routing diagram

Experimental Results

The team trained models ranging from 0.5 B to 2 B parameters on 10 B to 40 B tokens. With modest L1 regularization, the average number of non‑zero activations dropped by several orders of magnitude, while downstream task performance remained comparable to dense baselines.

Key gains observed on a 1 B‑parameter model:

Forward‑pass speedup: 20.5%

Training‑step speedup: 21.9%

Inference energy consumption reduced proportionally

Peak memory usage during training decreased noticeably

Further benchmarks showed inference speed improvements up to 30% and memory savings exceeding 24% on real workloads, with larger models benefiting even more.

Inference speed and energy savings

Activation Distribution Insights

Analyzing the sparse activations reveals that early layers are relatively quiet, while middle layers are the most active, handling core reasoning and knowledge retrieval. Tokens with low activity tend to be common URL fragments or highly predictable sub‑words, whereas high‑activity tokens carry richer contextual information, such as verbs, nouns, locations, and material names.

Layer‑wise non‑zero activation distribution

Conclusion

The work neither replaces the Transformer architecture nor relies on complex structural changes. Its contribution lies in integrating FFN activation sparsity into the actual GPU execution pipeline through a sparse storage format and custom CUDA kernels, thereby converting theoretical FLOP reductions into measurable speed, energy and memory benefits.
