Machine Learning Algorithms & Natural Language Processing
Apr 22, 2026 · Artificial Intelligence

Turning Transformers into Mamba: A Cross‑Architecture Distillation That Linearizes Inference Cost

The article presents a two‑step cross‑architecture distillation method that replaces the quadratic softmax attention of Transformers with a learned linear attention and then maps it onto a Mamba backbone, achieving near‑teacher performance while reducing inference cost to linear time.
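
For readers unfamiliar with the mechanism being distilled into, here is a minimal, non‑causal sketch of kernelized linear attention in PyTorch. The ELU+1 feature map and tensor shapes are illustrative assumptions; Hedgehog learns its own feature map, and the actual recipe is the one described in the article.

```python
import torch
import torch.nn.functional as F

def feature_map(x):
    # Illustrative kernel feature map (ELU + 1); Hedgehog learns its own map,
    # so treat this as a stand-in rather than the method's actual feature map.
    return F.elu(x) + 1.0

def linear_attention(q, k, v):
    """Non-causal kernelized attention: O(T * d^2) instead of O(T^2 * d).

    q, k, v: (batch, seq_len, dim) tensors.
    """
    q, k = feature_map(q), feature_map(k)
    # Accumulate K^T V and the normalizer once, then reuse them for every query.
    kv = torch.einsum("btd,bte->bde", k, v)                          # (batch, dim, dim)
    z = 1.0 / (torch.einsum("btd,bd->bt", q, k.sum(dim=1)) + 1e-6)   # normalizer
    return torch.einsum("btd,bde,bt->bte", q, kv, z)

q, k, v = (torch.randn(2, 128, 64) for _ in range(3))
out = linear_attention(q, k, v)   # (2, 128, 64); cost grows linearly with seq_len
```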

Cross‑Architecture Distillation · Linear Attention
8 min read
Machine Heart
Apr 22, 2026 · Artificial Intelligence

Apple Turns Transformers into Mamba with Linear‑Cost Distillation

Apple proposes a two‑step cross‑architecture distillation that converts expensive, high‑performing Transformers into cheaper, nearly equally strong Mamba models by first replacing softmax attention with learned linear attention (Hedgehog) and then embedding this intermediate form into Mamba, achieving comparable perplexity and downstream task performance with far lower inference cost.
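
As a rough illustration of the first distillation step, the sketch below supervises a student's linear‑attention weights against the teacher's softmax attention maps with a cross‑entropy objective. The function name, feature‑map argument, and loss form are assumptions made for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def attention_matching_loss(q, k, student_feature_map):
    """Illustrative 'step one' objective: train the student's kernelized
    attention weights to match the teacher's softmax attention weights.
    The actual distillation recipe may use a different loss or supervision.

    q, k: (batch, seq_len, dim) activations taken from the teacher layer.
    """
    d = q.shape[-1]
    teacher = F.softmax(q @ k.transpose(-2, -1) / d**0.5, dim=-1)    # (B, T, T)

    phi_q, phi_k = student_feature_map(q), student_feature_map(k)
    scores = phi_q @ phi_k.transpose(-2, -1)          # non-negative if the map is
    student = scores / (scores.sum(dim=-1, keepdim=True) + 1e-6)

    # Cross-entropy between the two row-wise attention distributions.
    return -(teacher * torch.log(student + 1e-9)).sum(dim=-1).mean()
```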

Artificial Intelligence · Cross-Architecture Distillation · Linear Attention
7 min read
Machine Learning Algorithms & Natural Language Processing
Apr 21, 2026 · Artificial Intelligence

Can Linear Attention Make Prefill-as-a-Service Work for Cross‑Datacenter Heterogeneous PD Separation?

The article analyzes why the massive KVCache bandwidth required by heterogeneous prefill/decode (PD) separation cannot be solved at the system level, proposes a Prefill‑as‑a‑Service architecture that leverages linear‑attention models to cut KVCache generation, and validates the design with a 1‑trillion‑parameter Kimi Linear deployment that achieves 54% higher throughput and 64% lower P90 TTFT across a 100 Gbps inter‑datacenter link.
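
To see why a linear‑attention state is so much cheaper to move across a datacenter link than a softmax KVCache, here is a back‑of‑the‑envelope comparison. Every dimension and the sequence length below are made‑up illustrative values, not Kimi Linear's actual configuration.

```python
# Back-of-the-envelope comparison (illustrative numbers, not Kimi Linear's real config).
layers, kv_heads, head_dim = 64, 8, 128
dtype_bytes = 2            # fp16/bf16
seq_len = 128_000          # prefilled context length

# Softmax attention: K and V are retained for every prefilled token.
kv_cache_bytes = layers * 2 * kv_heads * head_dim * seq_len * dtype_bytes

# Linear attention: per head, only a fixed d x d state matrix survives prefill.
linear_state_bytes = layers * kv_heads * head_dim * head_dim * dtype_bytes

print(f"KV cache        : {kv_cache_bytes / 1e9:.1f} GB")
print(f"linear state    : {linear_state_bytes / 1e6:.1f} MB")
print(f"reduction factor: {kv_cache_bytes / linear_state_bytes:.0f}x")
```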

Heterogeneous PD · KVCache · Linear Attention
7 min read
SuanNi
Mar 16, 2026 · Artificial Intelligence

How NaLaFormer Revives Linear Attention with Query‑Norm Awareness

NaLaFormer introduces a norm‑aware linear attention mechanism that restores the query‑norm‑driven sharpness of softmax attention, achieving up to 7.5% higher ImageNet accuracy and 92% memory reduction in super‑resolution, while delivering strong results across classification, detection, segmentation, and language modeling tasks.
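
The sketch below conveys the general idea of a norm‑aware feature map: standard linear attention drops the query norm that effectively acts as a temperature in softmax, so the norm is re‑injected to control sharpness. This is a hedged illustration only; NaLaFormer's actual operator may be constructed quite differently.

```python
import torch
import torch.nn.functional as F

def norm_aware_feature_map(q):
    """Illustrative 'norm-aware' kernel (not NaLaFormer's exact construction).
    Re-injects the query norm as a per-token exponent so that larger ||q||
    yields sharper (more peaked) attention weights after normalization.
    """
    q_norm = q.norm(dim=-1, keepdim=True)   # (B, T, 1)
    phi = F.relu(q) + 1e-6                  # non-negative base feature map
    return phi ** q_norm                    # sharper weights for large-norm queries
```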

AI · Linear Attention · NaLaFormer
13 min read
PaperAgent
Feb 15, 2026 · Artificial Intelligence

How MiniCPM‑SALA Merges Sparse and Linear Attention to Break Long‑Context Limits

MiniCPM‑SALA introduces a hybrid sparse‑linear attention architecture that reduces quadratic compute and memory costs, achieves state‑of‑the‑art performance on long‑context benchmarks, and delivers up to 3.5× faster inference than full‑attention models on sequences up to 1 million tokens.
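
A hybrid stack like this can be pictured as interleaving a few sparse‑attention layers (which keep a pruned KV cache for precise retrieval) with many linear‑attention layers (which keep only constant‑size state). The layer ratio and placement in the sketch are illustrative assumptions, not MiniCPM‑SALA's published layout.

```python
from dataclasses import dataclass

@dataclass
class LayerSpec:
    index: int
    kind: str   # "sparse" keeps a (pruned) KV cache; "linear" keeps constant-size state

def hybrid_layout(num_layers: int, sparse_every: int = 4) -> list:
    """Illustrative hybrid stack: a few sparse-attention layers for precise
    long-range retrieval, linear-attention layers everywhere else. The real
    MiniCPM-SALA ratio and placement may differ from this sketch.
    """
    return [
        LayerSpec(i, "sparse" if i % sparse_every == sparse_every - 1 else "linear")
        for i in range(num_layers)
    ]

layout = hybrid_layout(32)
print(sum(spec.kind == "sparse" for spec in layout), "sparse layers out of", len(layout))
```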

LLM · Linear Attention · Sparse Attention
17 min read
Machine Learning Algorithms & Natural Language Processing
Feb 12, 2026 · Artificial Intelligence

Is the Transformer Paradigm Shifting? SALA Handles Million‑Token Context on RTX 5090

The article presents SALA, a sparse‑linear hybrid attention architecture that replaces full attention in 9B‑parameter models, achieving comparable accuracy while cutting compute and memory costs, enabling million‑token inference on a single RTX 5090 and delivering up to 3.5× speed‑up over Qwen3‑8B.

Hybrid Position Encoding · LLM efficiency · Linear Attention
18 min read
Baobao Algorithm Notes
Feb 4, 2026 · Artificial Intelligence

Efficient Long-Sequence Modeling: Linear & Sparse Attention, MegaKernels, RL Tricks

This article reviews recent 2025 advances in long‑sequence LLM inference, covering Kimi Linear attention, DuoAttention and DeepSeek Sparse Attention, MegaKernel and MPK designs for kernel‑level efficiency, reinforcement‑learning rollout optimizations, and the Tawa deep‑learning compiler framework.

Deep Learning Compiler · LLM optimization · Linear Attention
22 min read
JavaScript
Mar 20, 2025 · Artificial Intelligence

How MiniMax’s Linear‑Attention Architecture Is Redefining Long‑Context AI Models

MiniMax’s rapid 2025 releases, including a video model, an open‑source LLM, and a high‑fidelity voice model, showcase its multimodal linear‑attention architecture, which handles up to 4 million tokens of context, has earned recognition from a16z, and signals China’s growing influence in open‑source AI innovation.

Artificial Intelligence · Linear Attention · large language models
8 min read
AIWalker
Feb 26, 2025 · Artificial Intelligence

Why Linear Attention Lags Behind Softmax and How Two Simple Tweaks Close the Gap

The paper analytically identifies injectivity and local modeling as the two key factors causing the performance gap between linear and Softmax attention, proposes the InLine attention modifications to restore these properties, and demonstrates through extensive Vision Transformer experiments that the enhanced linear attention matches or surpasses Softmax while retaining linear computational cost.
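
One way to picture the "local modeling" fix is to pair global linear attention with an explicit local token‑mixing branch, as in the sketch below. The depthwise‑convolution branch and ReLU feature map are stand‑ins chosen for illustration; InLine's actual injectivity and locality corrections are defined in the paper.

```python
import torch
import torch.nn as nn

class LinearAttnWithLocalBranch(nn.Module):
    """Illustrative combination of global linear attention with an explicit
    local-modeling branch (a depthwise conv over tokens). InLine's real
    injectivity fix and local term may be formulated differently.
    """
    def __init__(self, dim: int, kernel_size: int = 5):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.local = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (B, T, C)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = torch.relu(q) + 1e-6, torch.relu(k) + 1e-6
        # Global branch: kernelized (linear) attention.
        kv = torch.einsum("btd,bte->bde", k, v)
        z = 1.0 / (torch.einsum("btd,bd->bt", q, k.sum(1)) + 1e-6)
        global_out = torch.einsum("btd,bde,bt->bte", q, kv, z)
        # Local branch: depthwise conv mixes each token with its neighbors.
        local_out = self.local(v.transpose(1, 2)).transpose(1, 2)
        return self.proj(global_out + local_out)
```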

Attention Mechanism · Efficient Transformers · Linear Attention
24 min read
AIWalker
Jan 17, 2025 · Artificial Intelligence

How CLEAR Cuts Attention Compute by 99.5% and Enables Efficient On‑Device Text‑to‑Image Diffusion

The CLEAR method linearizes pretrained Diffusion Transformers by restricting attention to a local window, reducing attention FLOPs by 99.5%, accelerating 8K image generation 6.3× while preserving quality, and supporting multi‑GPU patch‑wise inference for high‑resolution text‑to‑image synthesis.
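
Structurally, this kind of linearization amounts to masking attention so that each query only sees nearby tokens. The simplified 1‑D sketch below shows the masking pattern; CLEAR itself defines the window over 2‑D image‑token coordinates, and a real kernel would compute only the in‑window scores rather than masking a full score matrix.

```python
import torch

def local_window_attention(q, k, v, window: int = 64):
    """Simplified 1-D local-window attention: each query attends only to keys
    within `window` positions. For clarity this builds the full T x T score
    matrix; an efficient implementation computes only in-window scores.
    """
    b, t, d = q.shape
    scores = (q @ k.transpose(-2, -1)) / d**0.5              # (B, T, T)
    idx = torch.arange(t)
    mask = (idx[None, :] - idx[:, None]).abs() > window      # True = outside window
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```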

CLEAR · Diffusion Transformers · Efficient Attention
21 min read
AI Code to Success
Jan 16, 2025 · Industry Insights

How MiniMax’s Open‑Source Linear‑Attention Model Is Shaking Up the Global AI Landscape

MiniMax, a Shanghai‑based AI unicorn, has open‑sourced its MiniMax‑01 series featuring large‑scale linear attention, secured $600 million in funding, launched multimodal products like Talkie and Hailuo AI, and is positioning itself as a competitive force amid rising geopolitical tensions in the global artificial‑intelligence market.

AI · China AI · Linear Attention
4 min read