Machine Learning Algorithms & Natural Language Processing
Apr 22, 2026 · Artificial Intelligence

Turning Transformers into Mamba: A Cross‑Architecture Distillation That Linearizes Inference Cost

The article presents a two‑step cross‑architecture distillation method that replaces the quadratic softmax attention of Transformers with a learned linear attention and then maps it onto a Mamba backbone, achieving near‑teacher performance while reducing inference cost to linear time.
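
For readers unfamiliar with the mechanism being distilled into, here is a minimal, non‑causal sketch of kernelized linear attention in PyTorch. The ELU+1 feature map and tensor shapes are illustrative assumptions; Hedgehog learns its own feature map, and the actual recipe is the one described in the article.

```python
import torch
import torch.nn.functional as F

def feature_map(x):
    # Illustrative kernel feature map (ELU + 1); Hedgehog learns its own map,
    # so treat this as a stand-in rather than the method's actual feature map.
    return F.elu(x) + 1.0

def linear_attention(q, k, v):
    """Non-causal kernelized attention: O(T * d^2) instead of O(T^2 * d).

    q, k, v: (batch, seq_len, dim) tensors.
    """
    q, k = feature_map(q), feature_map(k)
    # Accumulate K^T V and the normalizer once, then reuse them for every query.
    kv = torch.einsum("btd,bte->bde", k, v)                          # (batch, dim, dim)
    z = 1.0 / (torch.einsum("btd,bd->bt", q, k.sum(dim=1)) + 1e-6)   # normalizer
    return torch.einsum("btd,bde,bt->bte", q, kv, z)

q, k, v = (torch.randn(2, 128, 64) for _ in range(3))
out = linear_attention(q, k, v)   # (2, 128, 64); cost grows linearly with seq_len
```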

Cross‑Architecture Distillation · Linear Attention
8 min read
Machine Heart
Apr 22, 2026 · Artificial Intelligence

Apple Turns Transformers into Mamba with Linear‑Cost Distillation

Apple proposes a two‑step cross‑architecture distillation that converts expensive, high‑performing Transformers into cheaper, nearly equally strong Mamba models by first replacing softmax attention with learned linear attention (Hedgehog) and then embedding this intermediate form into Mamba, achieving comparable perplexity and downstream task performance with far lower inference cost.
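
As a rough illustration of the first distillation step, the sketch below supervises a student's linear‑attention weights against the teacher's softmax attention maps with a cross‑entropy objective. The function name, feature‑map argument, and loss form are assumptions made for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def attention_matching_loss(q, k, student_feature_map):
    """Illustrative 'step one' objective: train the student's kernelized
    attention weights to match the teacher's softmax attention weights.
    The actual distillation recipe may use a different loss or supervision.

    q, k: (batch, seq_len, dim) activations taken from the teacher layer.
    """
    d = q.shape[-1]
    teacher = F.softmax(q @ k.transpose(-2, -1) / d**0.5, dim=-1)    # (B, T, T)

    phi_q, phi_k = student_feature_map(q), student_feature_map(k)
    scores = phi_q @ phi_k.transpose(-2, -1)          # non-negative if the map is
    student = scores / (scores.sum(dim=-1, keepdim=True) + 1e-6)

    # Cross-entropy between the two row-wise attention distributions.
    return -(teacher * torch.log(student + 1e-9)).sum(dim=-1).mean()
```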

Artificial Intelligence · Cross-Architecture Distillation · Linear Attention
7 min read
Machine Learning Algorithms & Natural Language Processing
Apr 21, 2026 · Artificial Intelligence

Can Linear Attention Make Prefill-as-a-Service Work for Cross‑Datacenter Heterogeneous PD Separation?

The article analyzes why the massive KVCache bandwidth required by heterogeneous prefill/decode (PD) separation cannot be solved at the system level, proposes a Prefill‑as‑a‑Service architecture that leverages linear‑attention models to cut KVCache generation, and validates the design with a 1‑trillion‑parameter Kimi Linear deployment that achieves 54% higher throughput and 64% lower P90 TTFT across a 100 Gbps inter‑datacenter link.
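
To see why a linear‑attention state is so much cheaper to move across a datacenter link than a softmax KVCache, here is a back‑of‑the‑envelope comparison. Every dimension and the sequence length below are made‑up illustrative values, not Kimi Linear's actual configuration.

```python
# Back-of-the-envelope comparison (illustrative numbers, not Kimi Linear's real config).
layers, kv_heads, head_dim = 64, 8, 128
dtype_bytes = 2            # fp16/bf16
seq_len = 128_000          # prefilled context length

# Softmax attention: K and V are retained for every prefilled token.
kv_cache_bytes = layers * 2 * kv_heads * head_dim * seq_len * dtype_bytes

# Linear attention: per head, only a fixed d x d state matrix survives prefill.
linear_state_bytes = layers * kv_heads * head_dim * head_dim * dtype_bytes

print(f"KV cache        : {kv_cache_bytes / 1e9:.1f} GB")
print(f"linear state    : {linear_state_bytes / 1e6:.1f} MB")
print(f"reduction factor: {kv_cache_bytes / linear_state_bytes:.0f}x")
```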

Heterogeneous PD · KVCache · Linear Attention
7 min read
SuanNi
Mar 16, 2026 · Artificial Intelligence

How NaLaFormer Revives Linear Attention with Query‑Norm Awareness

NaLaFormer introduces a norm‑aware linear attention mechanism that restores the query‑norm‑driven sharpness of softmax attention, achieving up to 7.5% higher ImageNet accuracy and 92% memory reduction in super‑resolution, while delivering strong results across classification, detection, segmentation, and language modeling tasks.
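
The sketch below conveys the general idea of a norm‑aware feature map: standard linear attention drops the query norm that effectively acts as a temperature in softmax, so the norm is re‑injected to control sharpness. This is a hedged illustration only; NaLaFormer's actual operator may be constructed quite differently.

```python
import torch
import torch.nn.functional as F

def norm_aware_feature_map(q):
    """Illustrative 'norm-aware' kernel (not NaLaFormer's exact construction).
    Re-injects the query norm as a per-token exponent so that larger ||q||
    yields sharper (more peaked) attention weights after normalization.
    """
    q_norm = q.norm(dim=-1, keepdim=True)   # (B, T, 1)
    phi = F.relu(q) + 1e-6                  # non-negative base feature map
    return phi ** q_norm                    # sharper weights for large-norm queries
```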

AI · Linear Attention · NaLaFormer
13 min read
PaperAgent
Feb 15, 2026 · Artificial Intelligence

How MiniCPM‑SALA Merges Sparse and Linear Attention to Break Long‑Context Limits

MiniCPM‑SALA introduces a hybrid sparse‑linear attention architecture that reduces quadratic compute and memory costs, achieves state‑of‑the‑art performance on long‑context benchmarks, and delivers up to 3.5× faster inference than full‑attention models on sequences up to 1 million tokens.
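
A hybrid stack like this can be pictured as interleaving a few sparse‑attention layers (which keep a pruned KV cache for precise retrieval) with many linear‑attention layers (which keep only constant‑size state). The layer ratio and placement in the sketch are illustrative assumptions, not MiniCPM‑SALA's published layout.

```python
from dataclasses import dataclass

@dataclass
class LayerSpec:
    index: int
    kind: str   # "sparse" keeps a (pruned) KV cache; "linear" keeps constant-size state

def hybrid_layout(num_layers: int, sparse_every: int = 4) -> list:
    """Illustrative hybrid stack: a few sparse-attention layers for precise
    long-range retrieval, linear-attention layers everywhere else. The real
    MiniCPM-SALA ratio and placement may differ from this sketch.
    """
    return [
        LayerSpec(i, "sparse" if i % sparse_every == sparse_every - 1 else "linear")
        for i in range(num_layers)
    ]

layout = hybrid_layout(32)
print(sum(spec.kind == "sparse" for spec in layout), "sparse layers out of", len(layout))
```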

LLM · Linear Attention · Sparse Attention
17 min read
Machine Learning Algorithms & Natural Language Processing
Feb 12, 2026 · Artificial Intelligence

Is the Transformer Paradigm Shifting? SALA Handles Million‑Token Context on RTX 5090

The article presents SALA, a sparse‑linear hybrid attention architecture that replaces full attention in 9B‑parameter models, achieving comparable accuracy while cutting compute and memory costs, enabling million‑token inference on a single RTX 5090 and delivering up to 3.5× speed‑up over Qwen3‑8B.

Hybrid Position Encoding · LLM efficiency · Linear Attention
18 min read
Baobao Algorithm Notes
Feb 4, 2026 · Artificial Intelligence

Efficient Long-Sequence Modeling: Linear & Sparse Attention, MegaKernels, RL Tricks

This article reviews recent 2025 advances in long‑sequence LLM inference, covering Kimi Linear attention, DuoAttention and DeepSeek Sparse Attention, MegaKernel and MPK designs for kernel‑level efficiency, reinforcement‑learning rollout optimizations, and the Tawa deep‑learning compiler framework.

Deep Learning Compiler · LLM optimization · Linear Attention
22 min read
JavaScript
Mar 20, 2025 · Artificial Intelligence

How MiniMax’s Linear‑Attention Architecture Is Redefining Long‑Context AI Models

MiniMax’s rapid 2025 releases, including a video model, an open‑source LLM, and a high‑fidelity voice model, showcase its multimodal linear‑attention architecture, which handles up to 4 million tokens of context, has earned recognition from a16z, and signals China’s growing influence in open‑source AI innovation.

Artificial Intelligence · Linear Attention · large language models
8 min read
AIWalker
Feb 26, 2025 · Artificial Intelligence

Why Linear Attention Lags Behind Softmax and How Two Simple Tweaks Close the Gap

The paper analytically identifies injectivity and local modeling as the two key factors causing the performance gap between linear and Softmax attention, proposes the InLine attention modifications to restore these properties, and demonstrates through extensive Vision Transformer experiments that the enhanced linear attention matches or surpasses Softmax while retaining linear computational cost.
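
One way to picture the "local modeling" fix is to pair global linear attention with an explicit local token‑mixing branch, as in the sketch below. The depthwise‑convolution branch and ReLU feature map are stand‑ins chosen for illustration; InLine's actual injectivity and locality corrections are defined in the paper.

```python
import torch
import torch.nn as nn

class LinearAttnWithLocalBranch(nn.Module):
    """Illustrative combination of global linear attention with an explicit
    local-modeling branch (a depthwise conv over tokens). InLine's real
    injectivity fix and local term may be formulated differently.
    """
    def __init__(self, dim: int, kernel_size: int = 5):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.local = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (B, T, C)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = torch.relu(q) + 1e-6, torch.relu(k) + 1e-6
        # Global branch: kernelized (linear) attention.
        kv = torch.einsum("btd,bte->bde", k, v)
        z = 1.0 / (torch.einsum("btd,bd->bt", q, k.sum(1)) + 1e-6)
        global_out = torch.einsum("btd,bde,bt->bte", q, kv, z)
        # Local branch: depthwise conv mixes each token with its neighbors.
        local_out = self.local(v.transpose(1, 2)).transpose(1, 2)
        return self.proj(global_out + local_out)
```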

Attention Mechanism · Efficient Transformers · Linear Attention
24 min read
AIWalker
Jan 17, 2025 · Artificial Intelligence

How CLEAR Cuts Attention Compute by 99.5% and Enables Efficient On‑Device Text‑to‑Image Diffusion

The CLEAR method linearizes pretrained Diffusion Transformers by restricting attention to a local window, reducing attention FLOPs by 99.5%, accelerating 8K image generation 6.3× while preserving quality, and supporting multi‑GPU patch‑wise inference for high‑resolution text‑to‑image synthesis.
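
Structurally, this kind of linearization amounts to masking attention so that each query only sees nearby tokens. The simplified 1‑D sketch below shows the masking pattern; CLEAR itself defines the window over 2‑D image‑token coordinates, and a real kernel would compute only the in‑window scores rather than masking a full score matrix.

```python
import torch

def local_window_attention(q, k, v, window: int = 64):
    """Simplified 1-D local-window attention: each query attends only to keys
    within `window` positions. For clarity this builds the full T x T score
    matrix; an efficient implementation computes only in-window scores.
    """
    b, t, d = q.shape
    scores = (q @ k.transpose(-2, -1)) / d**0.5              # (B, T, T)
    idx = torch.arange(t)
    mask = (idx[None, :] - idx[:, None]).abs() > window      # True = outside window
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```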

CLEAR · Diffusion Transformers · Efficient Attention
21 min read
AI Code to Success
Jan 16, 2025 · Industry Insights

How MiniMax’s Open‑Source Linear‑Attention Model Is Shaking Up the Global AI Landscape

MiniMax, a Shanghai‑based AI unicorn, has open‑sourced its MiniMax‑01 series featuring large‑scale linear attention, secured $600 million in funding, launched multimodal products like Talkie and Hailuo AI, and is positioning itself as a competitive force amid rising geopolitical tensions in the global artificial‑intelligence market.

AI · China AI · Linear Attention
4 min read