How DeepSeek V4’s CSA + HCA Break the Million‑Token Barrier
Traditional full attention cannot handle million‑token contexts because compute grows quadratically and KV‑cache memory linearly with sequence length. DeepSeek V4's Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) compress, sparsely index, and then precisely compute over tokens, cutting the KV cache to 10% and inference FLOPs to 27% of the V3.2 baseline while enabling a 1 M‑token window on a single GPU.
Why Traditional Attention Fails at Million‑Token Contexts
Full attention computes the relationship between every pair of tokens, which is tractable at short contexts (4K–32K tokens) but explodes at larger scales. Relative to a 4K baseline, growing the context to 1 M tokens multiplies compute by 65,536× and KV‑cache memory by 256×, which is infeasible even on top‑tier GPUs.
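The arithmetic is easy to verify: attention cost scales as O(L²) while the KV cache scales as O(L), so relative to a 4K baseline:

```python
# Back-of-envelope scaling of full attention relative to a 4K-token baseline.
base_len, target_len = 4_096, 1_048_576

growth = target_len / base_len      # 256x more tokens
attn_flops_growth = growth ** 2     # attention cost is O(L^2) -> 65,536x
kv_cache_growth = growth            # KV cache is O(L)         -> 256x

print(f"tokens: {growth:.0f}x, attention FLOPs: {attn_flops_growth:,.0f}x, "
      f"KV cache: {kv_cache_growth:.0f}x")
```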
CSA + HCA: DeepSeek V4’s Paradigm Shift
On 24 April 2026 DeepSeek released the V4 preview, introducing Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) into the mainstream LLM architecture. The core idea is no longer “compute every pair”; instead the pipeline is “compress → select → precise compute”.
CSA – The “Meeting Recorder”
Intuitive Analogy
Imagine a 1,000‑person meeting in which every participant's remarks must be cross‑referenced with everyone else's. Traditional attention lets everyone cross‑talk; CSA instead appoints "recorders", each of which summarises an adjacent group of m tokens into a weighted semantic compression.
Three‑Step Process
Step 1: Compress
[Token 1, Token 2, ..., Token m] → Recorder A → Compressed Summary 1
[Token m+1, ..., Token 2m] → Recorder A & B → Summary 1' + Summary 2
The same token group is recorded twice, by two adjacent recorders, and the two perspectives are combined via a Hadamard product.
Step 2: Lightning Indexing – each recorder quickly scores its summary's relevance to the current query token, and only the top‑k highest‑scoring summaries are selected (a fast, sparse top‑k sampling, hence "lightning").
Step 3: Precise Compute – standard attention is performed only on the selected k summaries (sketched below).
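To make the three steps concrete, here is a minimal NumPy sketch. The mean‑pooling compressor, the group size m, the top‑k value, and all shapes are illustrative assumptions, not the paper's actual design (the learnable compression described next is more involved):

```python
import numpy as np

def csa_sketch(keys, values, query, m=64, k=8):
    """keys/values: (L, d); query: (d,). Returns an attention output (d,)."""
    L, d = keys.shape
    n_groups = L // m

    # Step 1 (compress): each group of m tokens becomes one summary vector.
    # Mean pooling is an illustrative stand-in for the learned compression.
    k_sum = keys[: n_groups * m].reshape(n_groups, m, d).mean(axis=1)
    v_sum = values[: n_groups * m].reshape(n_groups, m, d).mean(axis=1)

    # Step 2 (lightning indexing): score every summary against the query
    # and keep only the top-k.
    scores = k_sum @ query
    top = np.argsort(scores)[-k:]

    # Step 3 (precise compute): standard softmax attention over the k
    # selected summaries only.
    logits = k_sum[top] @ query / np.sqrt(d)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ v_sum[top]

rng = np.random.default_rng(0)
out = csa_sketch(rng.standard_normal((4096, 64)),
                 rng.standard_normal((4096, 64)),
                 rng.standard_normal(64))
```

The cost structure is the point: indexing touches L/m summaries instead of L tokens, and exact attention touches only k of them.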
Learnable Compression Weights
The compression matrix S is a learnable parameter set consisting of four weight (W) matrices and two bias terms, letting the model discover which semantic dimensions matter and which can be discarded.
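The exact wiring of S is not spelled out here, but a hypothetical compressor with that parameter count (four weight matrices, two biases), combining two adjacent‑group views via a Hadamard product as described above, might look like this:

```python
import numpy as np

# Hypothetical arrangement only: the parameter count (four W matrices,
# two biases) and the Hadamard combination come from the article; the
# wiring itself is an assumption for illustration.
class LearnableCompressor:
    def __init__(self, d, rng=np.random.default_rng(0)):
        self.W1, self.W2 = rng.standard_normal((2, d, d)) * 0.02
        self.W3, self.W4 = rng.standard_normal((2, d, d)) * 0.02
        self.b1, self.b2 = np.zeros(d), np.zeros(d)

    def __call__(self, group_a, group_b):
        """Compress two adjacent (m, d) token groups into one (d,) summary."""
        view_a = np.tanh(group_a.mean(0) @ self.W1 + self.b1) @ self.W2
        view_b = np.tanh(group_b.mean(0) @ self.W3 + self.b2) @ self.W4
        return view_a * view_b  # Hadamard product of the two perspectives
```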
HCA – The “Shorthand Note‑Taker”
Intuitive Analogy
If CSA is a meeting recorder, HCA is a shorthand note‑taker working at a much coarser granularity (m′ ≫ m): it compresses a single large block of the meeting into one concise summary, with no top‑k selection required.
Two‑Step Process
Step 1: Coarse Compression
[Token 1, ..., Token m'] → Shorthand Taker → Shorthand Summary
Step 2: Exhaustive Lookup – because so few summaries remain, HCA simply scans each one directly (sketched below).
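Under the same illustrative assumptions as the CSA sketch above (mean pooling standing in for the learned compressor), HCA's two steps reduce to:

```python
import numpy as np

def hca_sketch(keys, values, query, m_prime=1024):
    """Coarse compression followed by exhaustive lookup over summaries."""
    L, d = keys.shape
    n = L // m_prime

    # Step 1 (coarse compression): one summary per large block.
    k_sum = keys[: n * m_prime].reshape(n, m_prime, d).mean(axis=1)
    v_sum = values[: n * m_prime].reshape(n, m_prime, d).mean(axis=1)

    # Step 2 (exhaustive lookup): only L/m' summaries remain (1,024 for a
    # 1M-token context at m' = 1024), so attend over all of them directly;
    # no top-k selection is needed.
    logits = k_sum @ query / np.sqrt(d)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ v_sum
```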
Key Differences Between CSA and HCA
Group size: CSA uses fine‑grained groups of m tokens; HCA uses coarser groups of m′ tokens (m′ > m).
Compression range: CSA concatenates two adjacent groups; HCA compresses a single group independently.
Indexing strategy: CSA needs top‑k sparse sampling; HCA performs direct exhaustive lookup.
Applicable stage: CSA refines during the precise‑compute phase; HCA quickly locates relevant large blocks.
How CSA and HCA Work Together
DeepSeek V4 adopts a three‑layer architecture:
HCA first: rapidly locates large‑scale relevant information in the million‑token context (coarse filtering).
CSA next: finely compresses the blocks identified by HCA and performs top‑k sparse sampling.
Standard attention finally: computes exact attention on the remaining high‑value token pairs.
The division of labour maps onto a search‑engine analogy:
HCA = keyword‑matching search engine.
CSA = relevance‑ranking engine.
Standard attention = precise answer generation.
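Gluing the pieces together, the three layers might compose as follows. The block‑ranking step is an illustrative stand‑in for HCA's coarse filter, csa_sketch is reused from the earlier sketch, and none of the names reflect DeepSeek's actual interfaces:

```python
import numpy as np

def v4_pipeline_sketch(keys, values, query, m_prime=1024, keep=4, m=64, k=8):
    """Coarse filter (HCA role) -> fine sparse pass (CSA role) -> exact attention."""
    L, d = keys.shape
    n = L // m_prime

    # HCA's role: score one summary per large block, keep only the best blocks.
    k_sum = keys[: n * m_prime].reshape(n, m_prime, d).mean(axis=1)
    top_blocks = np.argsort(k_sum @ query)[-keep:]

    # Gather the surviving blocks back into one contiguous span.
    idx = np.concatenate([np.arange(b * m_prime, (b + 1) * m_prime)
                          for b in top_blocks])

    # CSA's role + standard attention: fine-grained compression, top-k
    # selection, and exact softmax over what survives (csa_sketch above).
    return csa_sketch(keys[idx], values[idx], query, m=m, k=k)
```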
Measured Impact: KV Cache Down to 10 %
The CSA + HCA hybrid architecture yields dramatic improvements compared with DeepSeek V3.2:
KV‑cache size reduced to 10 % of the V3.2 baseline.
Inference FLOPs reduced to 27 % of the baseline.
Context window expanded to 1 M tokens (no prior baseline).
Consequences:
Running a million‑token context now requires a single H100 GPU instead of ten.
KV‑cache memory drops from 256 GB to 25.6 GB (a quick sanity check follows this list).
Inference time for a million‑token prompt falls from 65 s to under 20 s.
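These figures are easy to sanity‑check; note that the per‑token footprint below is inferred from the reported 256 GB, not published separately:

```python
# Back-of-envelope check of the reported memory numbers.
tokens = 1_048_576
baseline_gb = 256

per_token_kb = baseline_gb * 1024**2 / tokens   # ~256 KB/token implied for V3.2
compressed_gb = baseline_gb * 0.10              # 10% of baseline -> 25.6 GB

print(f"{per_token_kb:.0f} KB/token, compressed cache: {compressed_gb} GB")
```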
Engineering Details: Quantization & Heterogeneous KV Cache
Per‑Module Quantization
V4 applies module‑wise quantization (a configuration sketch follows the list):
MoE expert weights: FP4 (largest parameters, biggest compression gain).
CSA lightning‑index query/key: FP4 (critical long‑context hot path).
CSA lightning‑index scores: BF16 (sensitive to numeric precision for ranking).
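Such a policy could be expressed as a simple module‑to‑dtype map. The module names below are made up for illustration; only the precision assignments come from the report:

```python
# Hypothetical module names; dtype assignments as described above.
QUANT_CONFIG = {
    "moe.experts.weights":       "fp4",   # largest params, biggest compression gain
    "csa.lightning_index.query": "fp4",   # long-context hot path
    "csa.lightning_index.key":   "fp4",
    "csa.lightning_index.score": "bf16",  # ranking is precision-sensitive
}

def dtype_for(module_name: str) -> str:
    """Resolve a module's dtype, defaulting to bf16 when unlisted."""
    return QUANT_CONFIG.get(module_name, "bf16")
```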
Heterogeneous KV Cache Architecture
Traditional PagedAttention assumes identical KV shapes across layers; V4 breaks that assumption because its CSA/HCA compressed entries, sliding‑window attention, and tail‑state KV structures all have different shapes.
DeepSeek redesigns KV storage (see the sketch after this list) by:
Separating KV tensors by type for independent storage.
Introducing three‑tier trade‑off schemes for sliding‑window attention.
Ensuring persistence mechanisms remain functional under the mixed architecture.
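One plausible shape for such a type‑separated store is per‑layer KV metadata; all field and kind names below are hypothetical:

```python
from dataclasses import dataclass

# Sketch of per-layer KV metadata once the "identical shapes across layers"
# assumption is dropped. Field and kind names are hypothetical.
@dataclass
class KVSpec:
    kind: str              # "full" | "csa_compressed" | "hca_compressed" | "sliding"
    tokens_per_entry: int  # 1 for full KV; m or m' for compressed entries
    head_dim: int
    dtype: str

# Each layer declares its own shape and can live in its own storage pool
# instead of one uniform PagedAttention layout.
layer_specs = [
    KVSpec("full", 1, 128, "bf16"),
    KVSpec("csa_compressed", 64, 128, "fp8"),
    KVSpec("hca_compressed", 1024, 128, "fp8"),
    KVSpec("sliding", 1, 128, "bf16"),
]
```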
Is This a Real Innovation?
While sparse attention is not new, CSA/HCA differ in three ways:
Learnable compression weights replace hand‑crafted sparsity rules.
Dual‑layer cooperation (coarse‑fine) offers more flexibility than a single sparse method.
Integration with mHC (manifold‑constrained hyper‑connections) addresses signal amplification in trillion‑parameter training.
The authors frame the V4 preview as a step "towards" million‑token attention; community validation of sparse‑training performance across tasks is still pending.
Who Should Pay Attention?
AI developers – to understand next‑generation attention design.
Long‑context application developers – million‑token contexts become engineering‑feasible.
Efficiency engineers – 90 % KV‑cache compression methodology.
Academic researchers – learnable compression and dual‑layer cooperation as a new paradigm.
Summary
CSA + HCA is the core innovation that lets DeepSeek V4 handle million‑token contexts:
Theoretical level: transition from O(L²) full attention to hierarchical sparse‑compressed attention.
Algorithmic level: CSA (fine‑grained + top‑k) + HCA (coarse‑grained + exhaustive) cooperate.
Engineering level: FP4 quantization, heterogeneous KV cache, and TileLang‑optimized operators.
Result: 1 M token context that is both computable and storable.
This is not a minor tweak but a paradigm‑level reconstruction of the attention mechanism, shifting the competitive focus from model size to attention efficiency.