DeepSeek’s NSA Attention Cuts Inference Time 11× – CEO Liang Wenfeng Is a Co‑author
DeepSeek introduces NSA, a natively trainable sparse attention mechanism whose dynamic hierarchical sparsity strategy combines coarse‑grained token compression with fine‑grained token selection, achieving up to 11.6× faster decoding, lower pre‑training cost, and superior benchmark performance across general, long‑context, and chain‑of‑thought tasks.
Motivation: the need for faster long‑context modeling
As large language models tackle ever longer contexts (entire codebases, lengthy documents, multi‑turn agent sessions), standard softmax attention becomes the dominant bottleneck, accounting for roughly 70–80% of decoding latency at 64k tokens. Reducing this cost while preserving model quality is critical.
NSA architecture: dynamic hierarchical sparsity
The paper proposes NSA, a natively trainable sparse attention mechanism built from three core components:
Dynamic hierarchical sparsity strategy
Coarse‑grained token compression
Fine‑grained token selection
Together, these techniques preserve global context awareness and local precision, enabling up to an 11.6× speedup during decoding without sacrificing accuracy.
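To make the hierarchy concrete, here is a minimal single‑step PyTorch sketch of the coarse‑to‑fine pattern. It is an illustration under simplifying assumptions, not the paper’s implementation: mean‑pooling stands in for NSA’s learned block compressor, the gate is a fixed scalar rather than a learned function of the query, the sliding‑window branch and batching are omitted, and all names are ours.

```python
import torch
import torch.nn.functional as F

def hierarchical_sparse_attention(q, K, V, block=64, top_k=4):
    # q: (d,) query for one decoding step; K, V: (T, d) cached keys/values.
    T, d = K.shape
    n_blocks = T // block                        # ignore the ragged tail for brevity
    Kb = K[: n_blocks * block].reshape(n_blocks, block, d)
    Vb = V[: n_blocks * block].reshape(n_blocks, block, d)

    # Coarse branch: attend over per-block summaries (mean-pooling stands in
    # for NSA's learned compressor).
    K_cmp, V_cmp = Kb.mean(dim=1), Vb.mean(dim=1)          # (n_blocks, d)
    scores_cmp = (K_cmp @ q) / d ** 0.5                    # (n_blocks,)
    out_cmp = F.softmax(scores_cmp, dim=0) @ V_cmp         # (d,)

    # Fine branch: reuse the coarse scores to pick the top-k blocks, then
    # attend over the raw tokens inside just those blocks.
    idx = torch.topk(scores_cmp, k=min(top_k, n_blocks)).indices
    K_sel = Kb[idx].reshape(-1, d)                         # (top_k*block, d)
    V_sel = Vb[idx].reshape(-1, d)
    out_sel = F.softmax((K_sel @ q) / d ** 0.5, dim=0) @ V_sel

    # Gated combination; NSA learns the gate per query, fixed 0.5 here.
    return 0.5 * out_cmp + 0.5 * out_sel

q, K, V = torch.randn(128), torch.randn(4096, 128), torch.randn(4096, 128)
out = hierarchical_sparse_attention(q, K, V)
# Attends to 64 block summaries + 4x64 selected tokens instead of all 4096.
```

The property the sketch preserves is the division of labor: cheap coarse scores over block summaries keep global awareness, and the expensive fine attention touches only the few blocks those scores nominate.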
Hardware‑friendly implementation
NSA is built with Triton kernels optimized for modern GPUs. The implementation includes three key optimizations:
Group‑centric data loading: each inner loop loads all query heads of a GQA group at a given position, together with the group’s shared sparse KV block indices.
Shared KV fetching: the selected key/value blocks are loaded sequentially and reused by every head in the group, reducing memory traffic.
Grid‑loop scheduling: because the inner‑loop length is nearly uniform across query groups, the query/output loops can be handed to Triton’s grid scheduler, streamlining kernel execution.
Together, these steps balance arithmetic intensity with memory access and achieve near‑optimal hardware utilization.
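The actual kernels are written in Triton; the Python sketch below (shapes and names are ours, not the paper’s API) merely emulates the data‑access pattern of one grid step, to show why the three optimizations fit together: all query heads of a GQA group are processed jointly, the group’s shared selected KV blocks are fetched once and reused by every head, and each group does the same amount of work.

```python
import torch

def sparse_decode_group_step(Q, K, V, block_idx, block=64):
    # Q: (G, H, d) - H query heads for each of G GQA groups at one position.
    # K, V: (T, d) - the shared KV cache (one KV head per group, simplified).
    # block_idx: (G, k) - indices of the k KV blocks each group selected; the
    # indices are shared by all H heads, which is what lets a kernel load
    # queries and KV blocks once per group (group-centric loading).
    G, H, d = Q.shape
    out = torch.empty(G, H, d)
    for g in range(G):                       # one "grid program" per group
        q_grp = Q[g]                         # load ALL heads of the group at once
        rows = torch.cat([torch.arange(b * block, (b + 1) * block)
                          for b in block_idx[g].tolist()])
        k_sel, v_sel = K[rows], V[rows]      # shared KV fetching: one gather
                                             # reused by every head in the group
        attn = torch.softmax(q_grp @ k_sel.T / d ** 0.5, dim=-1)  # (H, k*block)
        out[g] = attn @ v_sel                # (H, d)
    return out

# Every group selects the same number of blocks (k), so each grid step does the
# same work: the uniform inner-loop length that grid-loop scheduling exploits.
Q = torch.randn(4, 16, 128)                       # 4 groups x 16 heads x d=128
K, V = torch.randn(8192, 128), torch.randn(8192, 128)
block_idx = torch.randint(0, 8192 // 64, (4, 8))  # 8 selected blocks per group
out = sparse_decode_group_step(Q, K, V, block_idx)
```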
Benchmark evaluation
NSA was evaluated against full‑attention baselines and state‑of‑the‑art sparse methods on three fronts:
General pre‑training loss: NSA’s loss curve is smoother and consistently lower than that of full attention.
Long‑context tasks: on a 64k “needle‑in‑a‑haystack” benchmark, NSA’s hierarchical design yields high retrieval precision (a construction sketch for this kind of probe follows below).
Chain‑of‑thought reasoning: a 27B model (3B active parameters) fine‑tuned on 10B tokens of 32k‑length math reasoning trajectories yields NSA‑R, which outperforms Full‑Attention‑R by 0.075 in accuracy at an 8k context and by 0.054 at 16k.
On the LongBench suite, NSA achieves the highest average score of 0.469, surpassing all competitors.
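For readers unfamiliar with the needle‑in‑a‑haystack setup referenced above, here is a minimal sketch of how such a retrieval probe is typically constructed; the prompt wording, helper name, and model call are illustrative placeholders, not taken from the paper.

```python
def make_haystack_prompt(needle: str, filler: str, n_words: int, depth: float) -> str:
    # Bury `needle` at a relative `depth` (0.0 = start, 1.0 = end)
    # inside roughly n_words words of repeated filler text.
    base = filler.split()
    words = (base * (n_words // len(base) + 1))[:n_words]
    words.insert(int(depth * len(words)), needle)
    return " ".join(words) + "\n\nQuestion: What is the magic number? Answer:"

needle = "The magic number is 7481."
prompt = make_haystack_prompt(needle, "The sky was clear and the day was long.",
                              60_000, 0.5)
# Sweep depths and context lengths, then check whether the model's output
# contains "7481"; NSA's coarse scores locate the needle's block and the
# fine selection branch retrieves the exact tokens.
# answer = model.generate(prompt)   # placeholder: any long-context LLM API
```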
Comparison with prior work
Previous sparse‑attention approaches (KV‑cache eviction, block selection, sampling/hashing) focus mainly on inference and lack training support. NSA addresses both phases, delivering end‑to‑end speed gains and lower pre‑training compute.
Additionally, the paper corroborates an earlier Tsinghua Yao‑class study on complex arithmetic: on a four‑digit multiplication task, NSA reduces the required tokens from 9,392 to 2,275 while producing the correct answer where the baseline fails.
Conclusion and outlook
NSA demonstrates that a well‑designed sparse attention mechanism can outperform dense attention across multiple metrics while remaining hardware‑friendly. Future DeepSeek research is expected to further refine long‑text and codebase analysis to boost practical reasoning capabilities.
AIWalker
Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.