Efficient Long-Sequence Modeling: Linear & Sparse Attention, MegaKernels, RL Tricks

This article reviews recent 2025 advances in long‑sequence LLM inference, covering Kimi Linear attention, DuoAttention and DeepSeek Sparse Attention, MegaKernel and MPK designs for kernel‑level efficiency, reinforcement‑learning rollout optimizations, and the Tawa deep‑learning compiler framework.

Linear Attention (Kimi Linear)

Kimi Linear Attention, introduced by Moonshot AI, aims to match full‑attention quality while scaling linearly with sequence length. It builds on Kimi Delta Attention (KDA), an optimized linear‑attention mechanism that replaces the ever‑growing KV‑cache with a fixed‑size state update, preserving global context while keeping per‑token computation constant. By applying a positive kernel feature map to queries and keys, KDA lets the transformed K and V interact first, accumulating into a global state S that is updated in a streaming fashion for each new token and then read out by Q.
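
As an illustration, below is a minimal sketch of the generic kernelized linear‑attention recurrence that KDA builds on, with an ELU+1 feature map standing in for the actual kernel; KDA's real update additionally applies a delta‑rule correction and gating, which are omitted here. The point is that each token costs O(d²) regardless of context length, since only the state S (plus a normalizer) is carried forward.

```python
import numpy as np

def elu_plus_one(x):
    # A common positive feature map phi(x) = ELU(x) + 1; KDA's actual kernel differs.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_stream(Q, K, V):
    """Streaming linear attention over a sequence.

    Q, K, V: arrays of shape (seq_len, d). Instead of caching all past K/V,
    we maintain a running state S (d x d) and a normalizer z (d,), so each
    step costs O(d^2) regardless of how long the context grows.
    """
    seq_len, d = Q.shape
    S = np.zeros((d, d))          # global state: sum of phi(k_t) v_t^T
    z = np.zeros(d)               # normalizer: sum of phi(k_t)
    outputs = np.zeros_like(V)
    for t in range(seq_len):
        phi_k = elu_plus_one(K[t])
        phi_q = elu_plus_one(Q[t])
        S += np.outer(phi_k, V[t])                     # fold the new token into the state
        z += phi_k
        outputs[t] = (phi_q @ S) / (phi_q @ z + 1e-6)  # read-out for the current query
    return outputs

# Example: 128 tokens, head dimension 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 64)) for _ in range(3))
out = linear_attention_stream(Q, K, V)
print(out.shape)  # (128, 64)
```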

Sparse Attention

The article highlights two sparse‑attention schemes: DuoAttention, a pattern‑matching based method, and DeepSeek‑V3.2's DeepSeek Sparse Attention (DSA). DSA learns its sparse patterns during training and consists of a lightning indexer and a fine‑grained token selector. The lightning indexer projects queries and keys, applies RoPE, and computes cheap scores used to pick the top‑K tokens, while the token selector adjusts the selection dynamically based on content, reducing complexity from O(L²) to O(L·K) with K ≪ L.
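
The sketch below illustrates the top‑K selection idea for a single decode step: a cheap low‑dimensional indexer scores all past tokens, and full attention is then computed only over the selected subset. The indexer projections, dimensions, and the omission of RoPE are simplifications for illustration, not DeepSeek's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sparse_attention_step(q, K, V, q_idx, K_idx, top_k):
    """One decode step of top-K sparse attention.

    q, K, V:      full-dimension query (d,) and cached keys/values (L, d).
    q_idx, K_idx: low-dimensional indexer projections (d_idx,) and (L, d_idx),
                  standing in for the lightning indexer; the learned projections
                  and RoPE from the paper are omitted here.
    """
    L, d = K.shape
    index_scores = K_idx @ q_idx                           # cheap scores over all L tokens
    k = min(top_k, L)
    selected = np.argpartition(-index_scores, k - 1)[:k]   # top-K token indices
    # Full attention restricted to the selected tokens:
    # roughly O(L*d_idx + K*d) work instead of O(L*d).
    attn = softmax((K[selected] @ q) / np.sqrt(d))
    return attn @ V[selected]

rng = np.random.default_rng(0)
L, d, d_idx = 4096, 128, 16
q, K, V = rng.standard_normal(d), rng.standard_normal((L, d)), rng.standard_normal((L, d))
q_idx, K_idx = rng.standard_normal(d_idx), rng.standard_normal((L, d_idx))
out = sparse_attention_step(q, K, V, q_idx, K_idx, top_k=256)
print(out.shape)  # (128,)
```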

MegaKernel and Mirage Persistent Kernel (MPK)

Traditional LLM inference frameworks such as vLLM and SGLang suffer from kernel‑launch barriers and tail effects that create pipeline bubbles and leave SMs under‑utilized. MegaKernel eliminates repeated kernel launches by fusing the entire model into a single kernel and distributing work at SM granularity. MPK extends this idea with a tGraph representation that splits computation into SM‑level tasks and events, enabling fine‑grained dependency scheduling, shared‑memory paging, cross‑task prefetching, and software pipelining. Optimizations include separate intra‑node compute/communication streams, NVSHMEM‑based inter‑node communication, and dynamic kernel‑level scheduling, yielding 1–1.7× speedups, especially in low‑latency, small‑batch scenarios.
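
The following toy Python sketch mimics the task/event structure of a tGraph in software: tasks declare which events they wait on and which event they fire on completion, and become runnable once all their dependencies are satisfied. Task and event names here are made up for illustration; real MPK executes such tasks on SMs inside one persistent CUDA kernel, not on a host‑side scheduler.

```python
from collections import defaultdict, deque

class TaskGraph:
    """Toy event-driven task graph, loosely mirroring MPK's tGraph idea:
    a task fires an event when it finishes, and a task becomes runnable once
    every event it waits on has fired."""

    def __init__(self):
        self.waits = {}                    # task -> set of events it still waits on
        self.triggers = defaultdict(list)  # event -> tasks waiting on it
        self.emits = {}                    # task -> event fired on completion

    def add_task(self, name, waits_on, emits):
        self.waits[name] = set(waits_on)
        self.emits[name] = emits
        for ev in waits_on:
            self.triggers[ev].append(name)

    def run(self):
        ready = deque(t for t, w in self.waits.items() if not w)
        order = []
        while ready:
            task = ready.popleft()
            order.append(task)                     # "execute" the SM-level task
            fired = self.emits[task]
            for waiter in self.triggers[fired]:    # wake tasks that depended on it
                self.waits[waiter].discard(fired)
                if not self.waits[waiter]:
                    ready.append(waiter)
        return order

g = TaskGraph()
g.add_task("attn_tile_0", waits_on=[], emits="attn_done_0")
g.add_task("attn_tile_1", waits_on=[], emits="attn_done_1")
g.add_task("mlp_tile_0", waits_on=["attn_done_0"], emits="mlp_done_0")
g.add_task("allreduce", waits_on=["mlp_done_0", "attn_done_1"], emits="layer_done")
print(g.run())  # ['attn_tile_0', 'attn_tile_1', 'mlp_tile_0', 'allreduce']
```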

Reinforcement Learning for LLMs

The RL workflow for LLMs consists of preparation, rollout, reward assignment, experience collection, and weight synchronization. Rollout dominates runtime (63–87% of total RL time) and suffers from memory spikes and tail latency. Seer addresses these bottlenecks with Divided Rollout (splitting each prompt's response generation into independently scheduled chunks), a global KV‑cache backed by DRAM/SSD, context‑aware scheduling that predicts request lengths, and adaptive grouped speculative decoding, which aggregates similar responses within a group to build dynamic draft tokens. Evaluations on Moonlight, Qwen2‑VL‑72B, and Kimi‑K2 show 74–97% throughput improvements and 75–93% tail‑latency reductions.
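
The sketch below simulates divided rollout in miniature: each request generates at most one chunk of tokens per scheduling round and is then re‑queued, so the scheduler can re‑order or re‑pack requests between chunks instead of letting a few long generations monopolize memory. The `generate_chunk` stub and the length‑based sort are placeholders, not Seer's actual length predictor or engine interface.

```python
import random

def generate_chunk(req, chunk_size):
    """Stand-in for one decode call on an inference engine:
    pretend to produce up to chunk_size tokens and randomly finish early."""
    produced = random.randint(1, chunk_size)
    finished = random.random() < 0.3
    return produced, finished

def divided_rollout(prompts, chunk_size=256, max_new_tokens=1024):
    """Toy divided rollout: generation is split into chunks, and requests
    are re-queued between chunks so the scheduler can rebalance work."""
    queue = [{"prompt": p, "generated": 0, "done": False} for p in prompts]
    while any(not r["done"] for r in queue):
        # Scheduler hook: re-order pending requests between chunks
        # (a real system would use predicted remaining lengths here).
        pending = sorted((r for r in queue if not r["done"]),
                         key=lambda r: r["generated"])
        for req in pending:
            produced, finished = generate_chunk(req, chunk_size)
            req["generated"] += produced
            req["done"] = finished or req["generated"] >= max_new_tokens
    return queue

random.seed(0)
for r in divided_rollout(["p1", "p2", "p3"], chunk_size=128):
    print(r["prompt"], r["generated"], r["done"])
```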

Deep Learning Compiler – Tawa

Tawa targets automatic warp specialization for Triton on Hopper GPUs. It introduces an asynchronous‑reference (aref) IR that abstracts communication between producer and consumer warp groups, exposing operations such as put(a, v), get(a), and consumed(a). By lowering Triton code to MLIR, inserting aref primitives, and mapping them onto hardware primitives (shared‑memory buffer allocation, mbarrier synchronization), Tawa builds fine‑grained producer‑consumer pipelines. Benchmarks show modest gains on short sequences but significant overlap benefits on long sequences, where data movement dominates.
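
To give a feel for the aref abstraction, here is a rough CPU‑side analogue in Python: put/get/consumed coordinate a producer and a consumer over a small ring of buffer slots, much as Tawa lowers arefs to shared‑memory buffers guarded by mbarriers. The class, thread setup, and tile names are purely illustrative and are not Tawa's API.

```python
import threading
from collections import deque

class ARef:
    """Toy software analogue of an asynchronous reference (aref):
    a producer calls put() to publish a value, a consumer calls get() to read
    it and consumed() to release the slot so the producer can reuse the buffer.
    On a GPU this maps to shared-memory buffers plus mbarrier arrive/wait;
    here ordinary condition variables stand in for the hardware primitives."""

    def __init__(self, depth=2):
        self.depth = depth                 # number of in-flight buffer slots
        self.slots = deque()
        self.cv = threading.Condition()

    def put(self, value):                  # producer: wait for a free slot, publish value
        with self.cv:
            self.cv.wait_for(lambda: len(self.slots) < self.depth)
            self.slots.append(value)
            self.cv.notify_all()

    def get(self):                         # consumer: wait until a value is available
        with self.cv:
            self.cv.wait_for(lambda: len(self.slots) > 0)
            return self.slots[0]

    def consumed(self):                    # consumer: release the slot for reuse
        with self.cv:
            self.slots.popleft()
            self.cv.notify_all()

def producer(aref):
    for tile in range(4):
        aref.put(f"tile-{tile}")           # e.g. the result of an async copy

def consumer(aref, n=4):
    for _ in range(n):
        tile = aref.get()
        print("compute on", tile)          # e.g. a matmul over the fetched tile
        aref.consumed()

a = ARef(depth=2)
threads = [threading.Thread(target=producer, args=(a,)),
           threading.Thread(target=consumer, args=(a,))]
for t in threads:
    t.start()
for t in threads:
    t.join()
```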

References

Linear Attention: https://arxiv.org/pdf/2510.26692

DuoAttention: https://arxiv.org/pdf/2410.10819

DeepSeek‑V3.2 DSA: https://arxiv.org/pdf/2512.02556

Attention Survey: https://attention-survey.github.io/files/Attention_Survey.pdf

vLLM: https://github.com/vllm-project/vllm

SGLang: https://github.com/sgl-project/sglang

Mirage Persistent Kernel: https://arxiv.org/pdf/2512.22219

TileRT: https://github.com/tile-ai/TileRT

SonicMoE: https://arxiv.org/pdf/2512.14080

Tawa: https://arxiv.org/pdf/2510.14719

Seer: https://arxiv.org/pdf/2511.14617

DeepEP: https://github.com/deepseek-ai/DeepEP
