How DeepSeek V4’s CSA + HCA Break the Million‑Token Barrier

Traditional full attention cannot handle million‑token contexts because compute grows quadratically and memory linearly with context length. DeepSeek V4’s Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) instead compress tokens, sparsely index them, and compute precisely only where it matters, cutting the KV cache to 10% and inference FLOPs to 27% of baseline while enabling a 1M‑token window on a single GPU.


Why Traditional Attention Fails at Million‑Token Contexts

Full attention computes a relationship between every pair of tokens. That works for short contexts (4K–32K tokens) but grows quadratically with length: expanding the context from 4K to 1M tokens multiplies compute by 65,536× and KV‑cache memory by 256×, beyond the reach of even top‑tier GPUs.
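The scaling figures can be checked with a few lines of arithmetic, assuming the 4K‑token baseline the numbers imply:

```python
# Growing the context from a 4K-token baseline to 1M tokens multiplies the
# token count by 256x. Quadratic attention compute therefore grows by
# 256^2 = 65,536x, while the KV cache (linear in length) grows by 256x.
baseline_tokens = 4_096
target_tokens = 1_048_576  # 1M tokens

length_ratio = target_tokens // baseline_tokens  # 256
compute_ratio = length_ratio ** 2                # quadratic in length
kv_cache_ratio = length_ratio                    # linear in length

print(compute_ratio, kv_cache_ratio)  # 65536 256
```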

CSA + HCA: DeepSeek V4’s Paradigm Shift

On 24 April 2026 DeepSeek released the V4 preview, introducing Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) into the mainstream LLM architecture. The core idea is no longer “compute every pair”; instead the pipeline is “compress → select → precise compute”.

CSA – The “Meeting Recorder”

Intuitive Analogy

Imagine a 1,000‑person meeting in which every participant’s remarks must be weighed against everyone else’s. Traditional attention lets everyone cross‑talk; CSA instead appoints “recorders” who summarise adjacent groups of m tokens into weighted semantic compressions.

Three‑Step Process

Step 1: Compress

[Token 1, Token 2, ..., Token m] → Recorder A → Compressed Summary 1
[Token m+1, ..., Token 2m] → Recorder A & B → Summary 1' + Summary 2

The same token group is thus recorded twice; the two perspectives are preserved and combined via a Hadamard (element‑wise) product.

Step 2: Lightning Indexing – each recorder quickly scores the relevance of its summary to the current token and selects the top‑k highest‑scoring summaries (a “lightning‑fast” sparse top‑k sampling).

Step 3: Precise Compute – standard attention is performed only on the selected k summaries.
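The three steps above can be sketched in a few lines of NumPy. This is a minimal single‑head illustration, not DeepSeek’s implementation: mean‑pooling stands in for the learnable compression, and the sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 64, 16, 4                       # head dim, group size, top-k summaries
tokens = rng.standard_normal((512, d))    # 512 cached tokens

# Step 1 (Compress): pool each group of m tokens into one summary vector.
summaries = tokens.reshape(-1, m, d).mean(axis=1)  # (32, d)

# Step 2 (Lightning indexing): cheap dot-product relevance scores between
# the current query and every summary, then sparse top-k selection.
query = rng.standard_normal(d)
scores = summaries @ query                # (32,)
topk = np.argsort(scores)[-k:]            # indices of the k best summaries

# Step 3 (Precise compute): standard softmax attention restricted to the
# k selected summaries only.
sel = summaries[topk]
logits = sel @ query
w = np.exp(logits - logits.max())
w /= w.sum()
output = w @ sel                          # (d,) attended representation
print(output.shape)
```

The point of the sketch is the shape of the computation: the expensive softmax runs over k = 4 summaries rather than 512 raw tokens.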

Learnable Compression Weights

The compression matrix S is a learnable parameter consisting of four W matrices and two bias matrices, allowing the model to discover which semantic dimensions are important and which can be discarded.

HCA – The “Shorthand Note‑Taker”

Intuitive Analogy

If CSA is a meeting recorder, HCA is a shorthand note‑taker handling far larger groups (m′ ≫ m). It focuses on a single large block of the meeting and produces a concise summary without needing top‑k selection.

Two‑Step Process

Step 1: Coarse Compression

[Token 1, ..., Token m'] → Shorthand Taker → Shorthand Summary

Step 2: Exhaustive Lookup – because the number of summaries is already small, HCA scans each one directly.
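A matching sketch for HCA’s two steps, under the same simplifying assumptions (mean‑pooling in place of learned compression, arbitrary sizes), shows why no top‑k is needed: the coarse group size m′ leaves so few summaries that scanning all of them is cheap.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m_big = 64, 128                        # head dim, coarse group size m' >> m
tokens = rng.standard_normal((1024, d))

# Step 1 (Coarse compression): one summary per large block of m' tokens.
block_summaries = tokens.reshape(-1, m_big, d).mean(axis=1)  # (8, d)

# Step 2 (Exhaustive lookup): only 8 summaries remain, so attend over all
# of them directly -- no sparse top-k selection is required.
query = rng.standard_normal(d)
logits = block_summaries @ query
w = np.exp(logits - logits.max())
w /= w.sum()
coarse_output = w @ block_summaries
print(block_summaries.shape[0], coarse_output.shape)
```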

Key Differences Between CSA and HCA

Group size: CSA uses fine‑grained groups of m tokens; HCA uses coarser groups of m′ tokens (m′ > m).

Compression range: CSA concatenates two adjacent groups; HCA compresses a single group independently.

Indexing strategy: CSA needs top‑k sparse sampling; HCA performs direct exhaustive lookup.

Applicable stage: CSA refines during the precise‑compute phase; HCA quickly locates relevant large blocks.

How CSA and HCA Work Together

DeepSeek V4 adopts a three‑layer architecture:

HCA first: rapidly locates large‑scale relevant information in the million‑token context (coarse filtering).

CSA next: finely compresses the blocks identified by HCA and performs top‑k sparse sampling.

Standard attention finally: computes exact attention on the high‑value token pairs.

In search‑engine terms, the workflow is:

HCA = keyword‑matching search engine.

CSA = relevance‑ranking engine.

Standard attention = precise answer generation.
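The three‑stage pipeline can be put together in one toy example. As before this is a hedged sketch with mean‑pooling standing in for learned compression and illustrative sizes; the `pool` helper and all dimensions are assumptions, not DeepSeek’s actual configuration.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 64
tokens = rng.standard_normal((4096, d))
query = rng.standard_normal(d)

def pool(x, g):
    """Mean-pool groups of g tokens into summaries (stand-in for learned compression)."""
    return x.reshape(-1, g, d).mean(axis=1)

# Stage 1 -- HCA: exhaustively score coarse block summaries, keep the best 2 blocks.
m_big = 512
blocks = pool(tokens, m_big)                         # (8, d)
keep_blocks = np.argsort(blocks @ query)[-2:]

# Stage 2 -- CSA: inside the surviving blocks, fine summaries + top-k selection.
m, k = 32, 8
survivors = np.concatenate([tokens[b * m_big:(b + 1) * m_big] for b in keep_blocks])
fine = pool(survivors, m)                            # (32, d)
keep_fine = np.argsort(fine @ query)[-k:]

# Stage 3 -- standard attention over only the tokens behind the top-k fine groups.
selected = np.concatenate([survivors[i * m:(i + 1) * m] for i in keep_fine])
logits = selected @ query
w = np.exp(logits - logits.max())
w /= w.sum()
out = w @ selected
print(selected.shape[0], "tokens attended instead of", tokens.shape[0])
```

Exact attention ends up running over 256 tokens instead of 4,096, which is the whole economy of the coarse‑then‑fine design.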

Measured Impact: KV Cache Down to 10 %

The CSA + HCA hybrid architecture yields dramatic improvements compared with DeepSeek V3.2:

KV‑cache size reduced to 10 % of the V3.2 baseline.

Inference FLOPs reduced to 27 % of the baseline.

Context window expanded to 1 M tokens (no prior baseline).

Consequences:

Running a million‑token context now requires a single H100 GPU instead of ten.

KV‑cache memory drops from 256 GB to 25.6 GB.

Inference time for a million‑token prompt falls from 65 s to under 20 s.
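The memory figure follows directly from the compression ratio; a quick back‑of‑the‑envelope check (assuming an 80 GB H100, a published spec):

```python
# 10% KV-cache compression turns the 256 GB uncompressed cache into
# 25.6 GB, which fits within a single 80 GB H100's memory.
baseline_gb = 256
compressed_gb = baseline_gb * 0.10
h100_gb = 80

print(compressed_gb, compressed_gb <= h100_gb)  # 25.6 True
```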

Engineering Details: Quantization & Heterogeneous KV Cache

Per‑Module Quantization

V4 applies module‑wise quantization:

MoE expert weights: FP4 (largest parameters, biggest compression gain).

CSA lightning‑index query/key: FP4 (critical long‑context hot path).

CSA lightning‑index scores: BF16 (sensitive to numeric precision for ranking).
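The module‑wise scheme can be pictured as a precision map. The dictionary keys here are illustrative labels invented for this sketch, not DeepSeek’s real configuration names; the bit widths come from the list above.

```python
# Per-module bit widths (module keys are hypothetical labels for the sketch).
precision_bits = {
    "moe_expert_weights": 4,    # FP4: largest parameters, biggest gain
    "csa_index_query_key": 4,   # FP4: long-context hot path
    "csa_index_scores": 16,     # BF16: ranking is precision-sensitive
}
bf16_bits = 16

# Memory footprint of each module relative to an all-BF16 baseline.
for module, bits in precision_bits.items():
    print(module, f"{bits / bf16_bits:.2f}x memory vs BF16")
```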

Heterogeneous KV Cache Architecture

Traditional PagedAttention assumes identical KV shapes across layers, an assumption broken by V4 because CSA/HCA compression, sliding‑window attention, and tail‑state KV structures differ.

DeepSeek redesigns KV storage by:

Separating KV tensors by type for independent storage.

Introducing three‑tier trade‑off schemes for sliding‑window attention.

Ensuring persistence mechanisms remain functional under the mixed architecture.
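“Separating KV tensors by type” can be sketched as routing each attention variant to its own store instead of one uniform paged pool. The store names and shapes below are assumptions made for illustration only.

```python
# One independent store per KV type, each free to hold its own tensor shape
# (hypothetical names; real systems would hold GPU tensors, not shape tuples).
kv_stores = {
    "csa_summaries": [],    # compressed fine-grained summaries
    "hca_summaries": [],    # coarse block summaries
    "sliding_window": [],   # raw KV for the local attention window
}

def append_kv(kind, shape):
    """Route a new KV entry to the store for its attention type."""
    kv_stores[kind].append(shape)

append_kv("sliding_window", (128, 64))
append_kv("csa_summaries", (8, 64))
print({k: len(v) for k, v in kv_stores.items()})
```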

Is This a Real Innovation?

While sparse attention is not new, CSA/HCA differ in three ways:

Learnable compression weights replace hand‑crafted sparsity rules.

Dual‑layer cooperation (coarse‑fine) offers more flexibility than a single sparse method.

Integration with mHC (manifold‑constrained hyper‑connectivity) addresses signal amplification in trillion‑parameter training.

The authors acknowledge that the V4 preview is a step “towards” this goal; community validation of sparse‑training performance across tasks is still pending.

Who Should Pay Attention?

AI developers – to understand next‑generation attention design.

Long‑context application developers – million‑token contexts become engineering‑feasible.

Efficiency engineers – 90 % KV‑cache compression methodology.

Academic researchers – learnable compression and dual‑layer cooperation as a new paradigm.

Summary

CSA + HCA is the core innovation that lets DeepSeek V4 handle million‑token contexts:

Theoretical level: transition from O(L²) full attention to hierarchical sparse‑compressed attention.

Algorithmic level: CSA (fine‑grained + top‑k) + HCA (coarse‑grained + exhaustive) cooperate.

Engineering level: FP4 quantization, heterogeneous KV cache, and TileLang‑optimized operators.

Result: 1 M token context that is both computable and storable.

This is not a minor tweak but a paradigm‑level reconstruction of the attention mechanism, shifting the competitive focus from model size to attention efficiency.

References

DeepSeek V4 technical report (arXiv).

DeepSeek official Hugging Face page.

Various technical blog analyses.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: large language models, Attention Mechanism, Sparse Attention, million-token context, CSA, HCA, KV cache compression
Written by

Lao Guo's Learning Space

AI learning, discussion, and hands‑on practice with self‑reflection
