Efficient Long-Sequence Modeling: Linear & Sparse Attention, MegaKernels, RL Tricks
This article reviews recent 2025 advances in long‑sequence LLM inference, covering Kimi Linear attention, DuoAttention and DeepSeek Sparse Attention, MegaKernel and MPK designs for kernel‑level efficiency, reinforcement‑learning rollout optimizations, and the Tawa deep‑learning compiler framework.
Linear Attention (Kimi Linear)
Kimi Linear Attention, introduced by Moonshot AI, aims to achieve full‑attention quality with linear‑time efficiency on long sequences. It builds on Kimi Delta Attention (KDA), an optimized linear attention mechanism that replaces the ever‑growing KV‑cache with a fixed‑size state update, preserving global context while keeping computation linear in sequence length. By applying a positive feature map to queries and keys, KDA lets the transformed K and V interact before Q does, enabling streaming updates of a global state S for each new token.
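The streaming state update above can be sketched in a few lines of NumPy. This is a generic causal linear‑attention recurrence, not KDA itself: the `elu+1`‑style feature map and the plain additive state update are illustrative assumptions; the actual KDA kernel uses its own learned gating and decay.

```python
import numpy as np

def feature_map(x):
    # Assumed positive feature map (ELU + 1), a common choice for
    # linear attention; the real KDA kernel differs.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_stream(qs, ks, vs):
    """Process tokens one at a time, keeping an O(d_k * d_v) state S
    instead of an O(L) KV cache."""
    d_k, d_v = ks.shape[1], vs.shape[1]
    S = np.zeros((d_k, d_v))   # global state: running sum of phi(k) v^T
    z = np.zeros(d_k)          # normalizer: running sum of phi(k)
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        phi_k = feature_map(k)
        S += np.outer(phi_k, v)            # streaming state update
        z += phi_k
        phi_q = feature_map(q)
        outputs.append(phi_q @ S / (phi_q @ z + 1e-9))
    return np.stack(outputs)
```

Each step costs O(d_k · d_v) regardless of how many tokens came before, which is exactly why the per‑token cost stays flat while a softmax KV cache grows with L.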
Sparse Attention
The article highlights two sparse‑attention schemes: DuoAttention, which splits attention heads into retrieval heads (keeping the full KV cache) and streaming heads (attending only to recent tokens and attention sinks), and DeepSeek‑V3.2's DeepSeek Sparse Attention (DSA). DSA learns its sparse patterns during training and consists of a lightning indexer and a fine‑grained token selector. The lightning indexer projects queries and keys, applies RoPE, and computes cheap relevance scores; the token selector then picks the top‑K tokens per query based on content, reducing complexity from O(L²) to O(L·K) where K ≪ L.
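A toy version of the indexer‑plus‑selector pipeline can be sketched as follows. This is purely illustrative: the real lightning indexer uses small learned projections and RoPE before scoring, whereas here a raw dot product stands in for the cheap relevance score.

```python
import numpy as np

def topk_indices(q, keys, k):
    """Indexer stand-in: cheap O(L) relevance scores, then select top-k."""
    scores = keys @ q                          # one score per cached token
    idx = np.argpartition(scores, -k)[-k:]     # top-k indices, unsorted
    return idx[np.argsort(scores[idx])[::-1]]  # sorted by score, descending

def sparse_attention(q, keys, values, k):
    """Full attention restricted to the k selected tokens: O(L*k) overall."""
    idx = topk_indices(q, keys, k)
    sel_k, sel_v = keys[idx], values[idx]
    logits = sel_k @ q / np.sqrt(q.shape[0])
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ sel_v
```

The point of the two‑stage design is that the expensive softmax attention only ever sees K tokens, while the O(L) indexing pass is cheap enough to run over the entire cache.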
MegaKernel and Mirage Persistent Kernel (MPK)
Traditional LLM inference frameworks such as vLLM and SGLang suffer from kernel barriers and tail effects that create pipeline bubbles and under‑utilized SMs. MegaKernel eliminates repeated kernel launches by fusing the entire model into a single kernel and distributes work at SM granularity. MPK extends this idea with a tGraph representation that splits computation into SM‑level tasks and events, enabling fine‑grained dependency scheduling, shared‑memory paging, cross‑task prefetching, and software pipelines. Optimizations include intra‑node compute/communication streams, NVSHMEM‑based inter‑node communication, and dynamic kernel‑level scheduling, achieving 1–1.7× performance gains, especially in low‑latency, small‑batch scenarios.
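The tGraph idea of SM‑level tasks unblocked by counting events can be modeled with a tiny event‑driven scheduler. The class and method names below are hypothetical stand‑ins; MPK implements this inside a single persistent GPU kernel, not in host Python.

```python
from collections import deque

class Event:
    """Fires once all of its producer tasks have completed."""
    def __init__(self, n_producers):
        self.remaining = n_producers
        self.waiting = []  # tasks blocked until this event fires

class Scheduler:
    def __init__(self):
        self.ready = deque()
        self.done = []

    def add_task(self, name, triggers=None, waits_on=None):
        task = {"name": name, "triggers": triggers or []}
        if waits_on is None:
            self.ready.append(task)        # no dependency: runnable now
        else:
            waits_on.waiting.append(task)  # parked until the event fires
        return task

    def run(self):
        while self.ready:
            task = self.ready.popleft()
            self.done.append(task["name"])  # "execute" the task on an SM
            for ev in task["triggers"]:
                ev.remaining -= 1
                if ev.remaining == 0:       # last producer finished:
                    self.ready.extend(ev.waiting)  # release dependents
        return self.done
```

Because dependencies are tracked per task rather than per kernel, a consumer (say, attention over one head) can start as soon as its specific producers finish, instead of waiting at a kernel-wide barrier.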
Reinforcement Learning for LLMs
The RL workflow for LLMs consists of preparation, rollout, reward assignment, experience collection, and weight synchronization. Rollout dominates runtime (63‑87% of RL time) and suffers from memory spikes and tail latency. Seer addresses these bottlenecks with Divided Rollout (splitting long generations into independently scheduled chunks), a global KV‑cache backed by DRAM/SSD, context‑aware scheduling that predicts request lengths, and adaptive grouped speculative decoding, which aggregates similar responses within a group to generate draft tokens dynamically. Evaluations on Moonlight, Qwen2‑VL‑72B, and Kimi‑K2 show 74‑97% throughput improvement and 75‑93% tail‑latency reduction.
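The divided‑rollout scheduling idea can be sketched as a simple round‑robin over fixed‑size chunks. This is a host‑side caricature with a hypothetical `step_fn` standing in for the decode engine; the real Seer system additionally persists KV blocks to DRAM/SSD between a request's chunks.

```python
from collections import deque

def divided_rollout(requests, chunk_size, step_fn):
    """Split each generation into chunks and re-enqueue unfinished
    requests, so one long-tail response cannot monopolize the engine
    (and its KV memory) for its entire length."""
    queue = deque(requests)  # each request: {"id": ..., "remaining": n_tokens}
    schedule = []
    while queue:
        req = queue.popleft()
        n = min(chunk_size, req["remaining"])
        step_fn(req["id"], n)            # decode n tokens (placeholder)
        schedule.append((req["id"], n))
        req["remaining"] -= n
        if req["remaining"] > 0:
            queue.append(req)            # unfinished: back of the queue
    return schedule
```

Chunking bounds how long any one request holds resources between scheduling decisions, which is what smooths memory spikes and shortens the tail.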
Deep Learning Compiler – Tawa
Tawa targets automatic warp specialization for Triton on Hopper GPUs. It introduces an asynchronous reference (aref) IR that abstracts inter‑warp‑group communication, providing operations such as put(a, v), get(a), and consumed(a). By converting Triton code to MLIR, inserting aref primitives, and mapping them to hardware primitives (SMEM allocation, mbarrier synchronization), Tawa enables fine‑grained producer‑consumer pipelines. Benchmarks reveal modest gains on short sequences but significant overlap benefits on long sequences, where data movement dominates.
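The put/get/consumed protocol can be modeled as a single‑slot buffer guarded by two semaphores, a toy stand‑in for the SMEM buffer plus mbarrier pair that the aref lowers to. The class below is an illustrative assumption, not Tawa's actual IR:

```python
import threading

class Aref:
    """Toy model of an asynchronous reference (aref): a one-slot buffer
    with put/get/consumed, standing in for SMEM + mbarrier hardware."""
    def __init__(self):
        self._full = threading.Semaphore(0)   # signaled by put()
        self._empty = threading.Semaphore(1)  # signaled by consumed()
        self._value = None

    def put(self, v):
        self._empty.acquire()   # producer waits for a free slot
        self._value = v
        self._full.release()    # signal: data ready

    def get(self):
        self._full.acquire()    # consumer waits for data
        return self._value

    def consumed(self):
        self._empty.release()   # hand the slot back to the producer
```

A producer thread (the load warp group) can then `put` tiles while a consumer thread (the compute warp group) alternates `get` and `consumed`, overlapping data movement with compute; the explicit `consumed` call is what lets the producer safely reuse the buffer.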
References
Linear Attention: https://arxiv.org/pdf/2510.26692
DuoAttention: https://arxiv.org/pdf/2410.10819
DeepSeek‑V3.2 DSA: https://arxiv.org/pdf/2512.02556
Attention Survey: https://attention-survey.github.io/files/Attention_Survey.pdf
vLLM: https://github.com/vllm-project/vllm
SGLang: https://github.com/sgl-project/sglang
Mirage Persistent Kernel: https://arxiv.org/pdf/2512.22219
TileRT: https://github.com/tile-ai/TileRT
SonicMoE: https://arxiv.org/pdf/2512.14080
Tawa: https://arxiv.org/pdf/2510.14719
Seer: https://arxiv.org/pdf/2511.14617
DeepEP: https://github.com/deepseek-ai/DeepEP
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
