How DeepSeek‑V4 Achieves Million‑Token Context via Aggressive KV‑Cache Compression

DeepSeek‑V4 reaches a million‑token context window by aggressively compressing its KV cache and using a hybrid attention scheme. The scheme combines Compressed Sparse Attention (CSA), which retrieves selectively through top‑k selection, with Heavily Compressed Attention (HCA), which applies full attention over heavily merged entries. Mixed‑precision storage and other engineering optimizations complement the two.

Network Intelligence Research Center (NIRC)

Jun 4, 2026

KV Cache Overview

To understand the KV cache, first recall the classic multi‑head self‑attention mechanism used in autoregressive generation: each new token is produced from the current query and the keys and values of all previous tokens. Caching those keys and values avoids recomputing them at every step, but the cache then grows linearly with context length, which makes a million‑token window infeasible without compression.
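The linear growth is easy to quantify. The sketch below is only an illustration: the layer count, head count, head dimension, and 2‑byte values are hypothetical and are not DeepSeek‑V4's actual configuration. It counts the bytes an uncompressed cache would need under standard multi‑head attention.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for n_tokens without compression."""
    # Factor 2: one key vector and one value vector per token, head, and layer.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return n_tokens * per_token

# Hypothetical model shape, chosen only for illustration.
cfg = dict(n_layers=60, n_kv_heads=8, head_dim=128, bytes_per_value=2)

for n in (8_192, 131_072, 1_000_000):
    gib = kv_cache_bytes(n, **cfg) / 2**30
    print(f"{n:>9,} tokens -> {gib:8.1f} GiB")
```

Under these assumed numbers a single million‑token sequence needs well over 200 GiB for the cache alone, which is why the rest of the article focuses on shrinking it.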

CSA: Light Compression + Sparse Attention

CSA (Compressed Sparse Attention) first compresses the KV cache by merging every m consecutive tokens into a single compressed KV entry, reducing the effective length from n to roughly n/m. Because compression alone is insufficient for a million‑token context, CSA adds a sparse selection step: a lightweight indexer picks the top‑k most relevant compressed entries for the current query, and attention is computed only on those. To preserve fine‑grained recent information, a small sliding window of uncompressed KV entries is kept.
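The three steps (compress, select top‑k, add a recent window) can be sketched for a single query as follows. This is a minimal NumPy illustration under stated assumptions, not DeepSeek‑V4's implementation: mean pooling stands in for the model's compression, a plain dot product stands in for the lightweight indexer, and the names mean_pool and csa_attend are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mean_pool(x, m):
    """Merge every m consecutive rows into one (mean pooling is an assumption)."""
    n_blocks = x.shape[0] // m
    return x[: n_blocks * m].reshape(n_blocks, m, x.shape[1]).mean(axis=1)

def csa_attend(q, keys, values, m=4, top_k=2, window=4):
    """One query attends to top_k compressed entries plus a recent raw window."""
    n, d = keys.shape
    # Compress only whole blocks of older tokens; the remainder stays in the
    # uncompressed window, so the window holds at least `window` tokens.
    old = (max(n - window, 0) // m) * m
    ck, cv = mean_pool(keys[:old], m), mean_pool(values[:old], m)

    # Stand-in indexer: rank compressed entries by dot product with the query.
    k = min(top_k, ck.shape[0])
    picked = np.argsort(ck @ q)[ck.shape[0] - k:]

    # Attend over the selected compressed entries and the uncompressed window.
    K = np.concatenate([ck[picked], keys[old:]])
    V = np.concatenate([cv[picked], values[old:]])
    weights = softmax(K @ q / np.sqrt(d))
    return weights @ V, K.shape[0]

rng = np.random.default_rng(0)
n, d = 64, 8
keys, values = rng.normal(size=(n, d)), rng.normal(size=(n, d))
q = rng.normal(size=d)
out, attended = csa_attend(q, keys, values, m=4, top_k=3, window=8)
print(out.shape, attended)  # the query touches 3 + 8 = 11 entries, not 64
```

The attention cost per query depends on top_k and the window size, not on n; only the indexer still scans all n/m compressed entries, which is why it has to be cheap.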

HCA: Heavy Compression + Full Attention

HCA (Heavily Compressed Attention) compresses the KV cache even more aggressively: a much larger block of m' tokens (where m' >> m) is merged into a single KV entry. Unlike CSA, HCA does not perform top‑k selection; the compression is so strong that full attention can be applied directly on the compressed entries, sacrificing some granularity for lower storage and compute cost while still covering the entire history.
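A comparable single‑query sketch of HCA needs no selection step: it pools large blocks and runs ordinary softmax attention over everything that remains. As before, this is an illustration under assumptions. Mean pooling is a stand‑in for the model's compression, hca_attend and m_heavy (playing the role of m') are hypothetical names, and tokens that do not fill a complete block are simply ignored.

```python
import numpy as np

def hca_attend(q, keys, values, m_heavy=16):
    """Full attention over heavily pooled entries (no top-k selection)."""
    n, d = keys.shape
    n_blocks = n // m_heavy
    if n_blocks == 0:
        raise ValueError("need at least m_heavy tokens to form one entry")
    # Mean pooling is an assumption; a trailing partial block is ignored here.
    used = n_blocks * m_heavy
    ck = keys[:used].reshape(n_blocks, m_heavy, d).mean(axis=1)
    cv = values[:used].reshape(n_blocks, m_heavy, d).mean(axis=1)

    # Ordinary softmax attention, but over n / m_heavy entries instead of n.
    scores = ck @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ cv, n_blocks

rng = np.random.default_rng(0)
keys, values = rng.normal(size=(1024, 8)), rng.normal(size=(1024, 8))
out, n_entries = hca_attend(rng.normal(size=8), keys, values, m_heavy=128)
print(out.shape, n_entries)  # 1024 tokens are summarized by 8 entries
```

Every part of the history contributes to the output, but only through block averages, which is the loss of granularity the paragraph above describes.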

Why Combine CSA and HCA?

CSA and HCA address different needs. CSA acts like fine‑grained retrieval, keeping detailed long‑range information through top‑k selection, while HCA provides a coarse global summary by covering the whole history with heavily merged entries. DeepSeek‑V4 interleaves the two into a hybrid attention structure, enabling efficient extraction of important distant information and low‑cost retention of a global context.
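To make the trade‑off concrete, the following back‑of‑the‑envelope count compares how many entries one query touches in each layer type. All parameter values are hypothetical and chosen only for illustration; the article does not state the actual m, m', k, or window size.

```python
def attended_entries(n, m, m_heavy, top_k, window):
    """Entries one query touches per layer type, following the description above."""
    n_compressed = n // m
    return {
        "full_attention": n,                  # uncompressed baseline
        "csa_indexer_scored": n_compressed,   # cheap relevance scores only
        "csa_attended": min(top_k, n_compressed) + min(window, n),
        "hca_attended": n // m_heavy,         # everything, but heavily merged
    }

# Hypothetical settings for a million-token context.
counts = attended_entries(n=1_000_000, m=4, m_heavy=128, top_k=1024, window=128)
for name, value in counts.items():
    print(f"{name:>20}: {value:>9,}")
```

With these assumed values, a CSA layer attends to about a thousand entries picked from anywhere in the history, while an HCA layer attends to a few thousand coarse summaries of all of it. Both are orders of magnitude below the million entries of the uncompressed baseline.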

Additional engineering optimizations include mixed‑precision storage of KV entries (RoPE‑related dimensions in BF16, other parts in FP8), indexer attention computed in FP4, a heterogeneous KV‑cache layout, and optional disk‑backed KV storage for shared prefixes.
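The storage effect of the mixed‑precision layout can be estimated with simple arithmetic. The sketch assumes 2 bytes per BF16 value and 1 byte per FP8 value, which follows from the formats themselves; the entry width of 512 with 64 RoPE dimensions is a hypothetical split, not a figure from the source.

```python
def entry_bytes(total_dim, rope_dim, rope_bytes=2, other_bytes=1):
    """Bytes for one KV entry: RoPE dims in BF16 (2 B), the rest in FP8 (1 B)."""
    if not 0 <= rope_dim <= total_dim:
        raise ValueError("rope_dim must lie between 0 and total_dim")
    return rope_dim * rope_bytes + (total_dim - rope_dim) * other_bytes

total_dim, rope_dim = 512, 64  # hypothetical entry layout
mixed = entry_bytes(total_dim, rope_dim)
all_bf16 = entry_bytes(total_dim, rope_dim, other_bytes=2)
print(f"mixed: {mixed} B, all BF16: {all_bf16} B, saved: {1 - mixed / all_bf16:.1%}")
```

The smaller the share of RoPE‑related dimensions, the closer the saving gets to the 50% that storing the whole entry in FP8 would give.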

Other New Features

Beyond the hybrid attention, DeepSeek‑V4 introduces two notable updates:

Manifold‑Constrained Hyper‑Connections (mHC): an enhanced residual‑connection design that allows more flexible inter‑layer information flow and improves training stability through constrained residual mappings.

Muon optimizer: replaces AdamW in most modules to accelerate convergence and stabilize large‑model training.

Details About KV Cache Entries

DeepSeek‑V4 stores neither separate traditional keys and values nor the latent cache used in previous models. Instead, it uses a redesigned shared KV entry that serves as both key and value. The million‑token context therefore rests on a new cache structure, not merely on shrinking the original KV structures.
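What "serving as both key and value" means for the attention computation can be sketched in a few lines. This is a simplified illustration only: the function name is hypothetical, and projections, positional encoding, and multi‑head structure are all omitted.

```python
import numpy as np

def shared_entry_attention(q, entries):
    """Attention in which each cached row acts as both key and value."""
    scores = entries @ q / np.sqrt(q.shape[0])  # rows used as keys
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ entries                    # the same rows used as values

rng = np.random.default_rng(0)
n, d = 16, 8
entries = rng.normal(size=(n, d))
out = shared_entry_attention(rng.normal(size=d), entries)

# One shared array replaces two: n*d stored values instead of 2*n*d.
print(out.shape, entries.size, 2 * n * d)
```

The output is a weighted average of the cached entries themselves, so a single stored array covers both roles.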
