DeepSeek’s NSA Attention Cuts Inference Time 11× – CEO Liang Wenfeng Is a Co‑author
DeepSeek introduces NSA, a natively trainable sparse attention mechanism whose dynamic hierarchical sparsity strategy combines coarse‑grained token compression with fine‑grained token selection, achieving up to 11.6× faster decoding, lower pre‑training cost, and superior benchmark performance across general, long‑context, and chain‑of‑thought tasks.
Motivation: the need for faster long‑context modeling
As large language models tackle ever longer contexts (entire codebases, lengthy documents, multi‑turn agent sessions), standard softmax attention becomes the dominant bottleneck, accounting for roughly 70–80% of decoding latency at 64k tokens. Reducing this cost while preserving model quality is critical.
NSA architecture: dynamic hierarchical sparsity
The paper proposes NSA, a natively trainable sparse attention mechanism built from three core components:
Dynamic hierarchical sparsity strategy
Coarse‑grained token compression
Fine‑grained token selection
Together, these techniques preserve global context awareness and local precision, enabling up to an 11.6× speedup during decoding without sacrificing accuracy.
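To make the hierarchy concrete, here is a minimal single‑step PyTorch sketch of the coarse‑to‑fine pattern. It is an illustration under simplifying assumptions, not the paper’s implementation: mean‑pooling stands in for NSA’s learned block compressor, the gate is a fixed scalar rather than a learned function of the query, the sliding‑window branch and batching are omitted, and all names are ours.

```python
import torch
import torch.nn.functional as F

def hierarchical_sparse_attention(q, K, V, block=64, top_k=4):
    # q: (d,) query for one decoding step; K, V: (T, d) cached keys/values.
    T, d = K.shape
    n_blocks = T // block                        # ignore the ragged tail for brevity
    Kb = K[: n_blocks * block].reshape(n_blocks, block, d)
    Vb = V[: n_blocks * block].reshape(n_blocks, block, d)

    # Coarse branch: attend over per-block summaries (mean-pooling stands in
    # for NSA's learned compressor).
    K_cmp, V_cmp = Kb.mean(dim=1), Vb.mean(dim=1)          # (n_blocks, d)
    scores_cmp = (K_cmp @ q) / d ** 0.5                    # (n_blocks,)
    out_cmp = F.softmax(scores_cmp, dim=0) @ V_cmp         # (d,)

    # Fine branch: reuse the coarse scores to pick the top-k blocks, then
    # attend over the raw tokens inside just those blocks.
    idx = torch.topk(scores_cmp, k=min(top_k, n_blocks)).indices
    K_sel = Kb[idx].reshape(-1, d)                         # (top_k*block, d)
    V_sel = Vb[idx].reshape(-1, d)
    out_sel = F.softmax((K_sel @ q) / d ** 0.5, dim=0) @ V_sel

    # Gated combination; NSA learns the gate per query, fixed 0.5 here.
    return 0.5 * out_cmp + 0.5 * out_sel

q, K, V = torch.randn(128), torch.randn(4096, 128), torch.randn(4096, 128)
out = hierarchical_sparse_attention(q, K, V)
# Attends to 64 block summaries + 4x64 selected tokens instead of all 4096.
```

The property the sketch preserves is the division of labor: cheap coarse scores over block summaries keep global awareness, and the expensive fine attention touches only the few blocks those scores nominate.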
Hardware‑friendly implementation
NSA is built with Triton kernels optimized for modern GPUs. The implementation includes three key optimizations:
Group‑centric data loading: each inner loop loads all query heads of a GQA group at a given position, together with the group’s shared sparse KV block indices.
Shared KV fetching: the selected key/value blocks are loaded sequentially and reused by every head in the group, reducing memory traffic.
Grid‑loop scheduling: because the inner‑loop length is nearly uniform across query groups, the query/output loops can be handed to Triton’s grid scheduler, streamlining kernel execution.
Together, these steps balance arithmetic intensity with memory access and achieve near‑optimal hardware utilization.
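The actual kernels are written in Triton; the Python sketch below (shapes and names are ours, not the paper’s API) merely emulates the data‑access pattern of one grid step, to show why the three optimizations fit together: all query heads of a GQA group are processed jointly, the group’s shared selected KV blocks are fetched once and reused by every head, and each group does the same amount of work.

```python
import torch

def sparse_decode_group_step(Q, K, V, block_idx, block=64):
    # Q: (G, H, d) - H query heads for each of G GQA groups at one position.
    # K, V: (T, d) - the shared KV cache (one KV head per group, simplified).
    # block_idx: (G, k) - indices of the k KV blocks each group selected; the
    # indices are shared by all H heads, which is what lets a kernel load
    # queries and KV blocks once per group (group-centric loading).
    G, H, d = Q.shape
    out = torch.empty(G, H, d)
    for g in range(G):                       # one "grid program" per group
        q_grp = Q[g]                         # load ALL heads of the group at once
        rows = torch.cat([torch.arange(b * block, (b + 1) * block)
                          for b in block_idx[g].tolist()])
        k_sel, v_sel = K[rows], V[rows]      # shared KV fetching: one gather
                                             # reused by every head in the group
        attn = torch.softmax(q_grp @ k_sel.T / d ** 0.5, dim=-1)  # (H, k*block)
        out[g] = attn @ v_sel                # (H, d)
    return out

# Every group selects the same number of blocks (k), so each grid step does the
# same work: the uniform inner-loop length that grid-loop scheduling exploits.
Q = torch.randn(4, 16, 128)                       # 4 groups x 16 heads x d=128
K, V = torch.randn(8192, 128), torch.randn(8192, 128)
block_idx = torch.randint(0, 8192 // 64, (4, 8))  # 8 selected blocks per group
out = sparse_decode_group_step(Q, K, V, block_idx)
```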
Benchmark evaluation
NSA was evaluated against full‑attention baselines and state‑of‑the‑art sparse methods on three fronts:
General pre‑training loss: NSA’s loss curve is smoother and consistently lower than that of full attention.
Long‑context tasks: on a 64k “needle‑in‑a‑haystack” benchmark, NSA’s hierarchical design yields high retrieval precision (a construction sketch for this kind of probe follows below).
Chain‑of‑thought reasoning: a 27B model (3B active parameters) fine‑tuned on 10B tokens of 32k‑length math reasoning trajectories yields NSA‑R, which outperforms Full‑Attention‑R by 0.075 in accuracy at an 8k context and by 0.054 at 16k.
On the LongBench suite, NSA achieves the highest average score of 0.469, surpassing all competitors.
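For readers unfamiliar with the needle‑in‑a‑haystack setup referenced above, here is a minimal sketch of how such a retrieval probe is typically constructed; the prompt wording, helper name, and model call are illustrative placeholders, not taken from the paper.

```python
def make_haystack_prompt(needle: str, filler: str, n_words: int, depth: float) -> str:
    # Bury `needle` at a relative `depth` (0.0 = start, 1.0 = end)
    # inside roughly n_words words of repeated filler text.
    base = filler.split()
    words = (base * (n_words // len(base) + 1))[:n_words]
    words.insert(int(depth * len(words)), needle)
    return " ".join(words) + "\n\nQuestion: What is the magic number? Answer:"

needle = "The magic number is 7481."
prompt = make_haystack_prompt(needle, "The sky was clear and the day was long.",
                              60_000, 0.5)
# Sweep depths and context lengths, then check whether the model's output
# contains "7481"; NSA's coarse scores locate the needle's block and the
# fine selection branch retrieves the exact tokens.
# answer = model.generate(prompt)   # placeholder: any long-context LLM API
```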
Comparison with prior work
Previous sparse‑attention approaches (KV‑cache eviction, block selection, sampling/hashing) focus mainly on inference and lack training support. NSA addresses both phases, delivering end‑to‑end speed gains and lower pre‑training compute.
Additionally, the paper corroborates an earlier Tsinghua Yao‑class study on complex arithmetic: on a four‑digit multiplication task, NSA reduces the required tokens from 9,392 to 2,275 while producing the correct answer where the baseline fails.
Conclusion and outlook
NSA demonstrates that a well‑designed sparse attention mechanism can outperform dense attention across multiple metrics while remaining hardware‑friendly. Future DeepSeek research is expected to further refine long‑text and codebase analysis to boost practical reasoning capabilities.
AIWalker
Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.