How Offloading Latent Cache Boosts DeepSeek‑V3.2‑Exp Decoding Throughput
This report analyzes the memory bottleneck of DeepSeek‑V3.2‑Exp’s sparse‑attention decoder, proposes the Expanded Sparse Server (ESS) to offload the latent cache to CPU memory, and demonstrates through high‑fidelity simulation that the approach dramatically improves decode throughput while keeping latency within acceptable limits.
Background and Motivation
DeepSeek‑V3.2‑Exp employs sparse attention to lower inference latency for long contexts. In the PD‑separated architecture the KV‑latent cache grows linearly with sequence length, quickly exhausting GPU memory. The resulting memory bottleneck caps the batch size at about 52, yielding a decode throughput of ~9,647 tokens/s.
Because the latent cache is accessed with strong temporal locality, a portion of it can be moved to CPU memory, increasing the batch size and overall throughput while keeping latency within acceptable bounds.
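The batch‑size benefit follows from simple capacity arithmetic. The sketch below is a back‑of‑envelope model, not the report's sizing: the GPU budget and per‑sequence cache size are hypothetical numbers chosen only to illustrate that keeping a fraction `1 - offload_ratio` of each sequence's latent cache resident scales the feasible batch size roughly by `1 / (1 - offload_ratio)`.

```python
# Back-of-envelope capacity model (all sizes are illustrative assumptions,
# not measured values from the report).

def max_batch_size(gpu_cache_bytes, per_seq_cache_bytes, offload_ratio):
    """Sequences that fit when (1 - offload_ratio) of each cache stays on GPU."""
    resident = per_seq_cache_bytes * (1.0 - offload_ratio)
    return int(gpu_cache_bytes // resident)

gpu_cache = 40 * 2**30   # hypothetical 40 GiB GPU budget for the latent cache
per_seq = 768 * 2**20    # hypothetical 768 MiB latent cache per sequence

base = max_batch_size(gpu_cache, per_seq, 0.0)   # everything on GPU
off = max_batch_size(gpu_cache, per_seq, 0.5)    # half the cache on CPU
print(base, off)  # 53 106 -- offloading half roughly doubles the batch
```

Under these assumed sizes, offloading half of each sequence's cache doubles the number of concurrent sequences, which is where the throughput headroom comes from.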
Key Contributions
Systematic evaluation of latent‑cache offload feasibility and performance bounds for DeepSeek‑V3.2‑Exp.
Design of the Expanded Sparse Server (ESS), which offloads the latent cache and preserves decode throughput.
Development of a high‑fidelity simulator that models computation, communication, and offload‑prefetch overheads for realistic industrial workloads.
ESS Architecture
ESS offloads only the latent cache; the indexer cache (≈ 16.8 % of total computation) remains on‑GPU. The offload‑prefetch trigger follows the PD‑separated workflow (see Fig 4).
Small‑Block Data Transfer
Latent‑cache blocks are 656 B, leading to fragmented PCIe traffic. ESS introduces FlashTrans, a CUDA operator that uses Unified Virtual Addressing (UVA) to access pinned CPU memory directly, eliminating frequent cudaMemcpyAsync calls. Measured effective bandwidths are ~37 GB/s host‑to‑device and ~43 GB/s device‑to‑host.
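A rough cost model shows why one‑call‑per‑block copies lose to a single UVA gather kernel: at 656 B, per‑call launch overhead dwarfs the wire time of each block. The per‑call overhead and kernel‑launch latency below are assumed values for illustration, not measurements from the report; only the 656 B block size and ~37 GB/s figure come from the text.

```python
# Why 656 B blocks fragment PCIe traffic: per-call overhead dominates tiny
# copies. Overhead/launch numbers are assumptions chosen for illustration.

BLOCK_BYTES = 656          # latent-cache block size from the report
H2D_BPS = 37e9             # ~37 GB/s effective host-to-device bandwidth
CALL_OVERHEAD_S = 5e-6     # assumed ~5 us launch cost per cudaMemcpyAsync

def memcpy_per_block_s(n_blocks):
    """One cudaMemcpyAsync per block: n launches plus the data's wire time."""
    return n_blocks * CALL_OVERHEAD_S + n_blocks * BLOCK_BYTES / H2D_BPS

def uva_gather_s(n_blocks, kernel_launch_s=10e-6):
    """One FlashTrans-style kernel reads all blocks over UVA: single launch."""
    return kernel_launch_s + n_blocks * BLOCK_BYTES / H2D_BPS

n = 10_000
# For many small blocks, the per-block scheme is orders of magnitude slower.
print(memcpy_per_block_s(n), uva_gather_s(n))
```

The gather kernel pays one launch for the whole transfer, so its cost approaches the pure bandwidth limit as the block count grows.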
Cache‑Hit Guarantees
An LRU engine manages GPU‑side cache evictions. LRU‑Warmup pre‑loads the top‑2K latent‑cache indices from the prefilling stage into the GPU LRU, dramatically reducing early‑stage cache misses (Fig 5‑6).
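The warmup mechanism can be sketched with a toy LRU. This is not the report's implementation; it only illustrates the LRU‑Warmup idea: seeding the GPU‑side cache with the hot indices identified during prefilling avoids the miss burst that an initially empty cache would suffer on early decode steps.

```python
from collections import OrderedDict

class LatentLRU:
    """Toy GPU-side LRU over latent-cache block indices (capacity in slots).
    A sketch of the LRU-Warmup idea, not the report's implementation."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()  # block index -> resident flag
        self.misses = 0

    def warmup(self, prefill_top_indices):
        """Pre-load hot indices observed during the prefilling stage."""
        for idx in prefill_top_indices[:self.capacity]:
            self.slots[idx] = True

    def access(self, idx):
        if idx in self.slots:
            self.slots.move_to_end(idx)     # hit: refresh recency
            return True
        self.misses += 1                    # miss: fetch block over PCIe
        if len(self.slots) >= self.capacity:
            self.slots.popitem(last=False)  # evict least-recently-used block
        self.slots[idx] = True
        return False

cold, warm = LatentLRU(2048), LatentLRU(2048)
warm.warmup(list(range(2048)))  # top-2K indices from prefilling
for idx in range(2048):         # early decode steps touch the hot set
    cold.access(idx)
    warm.access(idx)
print(cold.misses, warm.misses)  # -> 2048 0
```

With warmup, every early access hits; without it, every one misses and must be fetched from CPU memory.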
Cache‑Miss Analysis
Both inter‑layer and intra‑layer accesses show high similarity, confirming strong temporal locality. Layer‑wise miss statistics (Fig 7‑9) guide selective prefetching and the choice of sparse‑memory ratios.
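One simple way to quantify that access similarity (a plausible measurement, not necessarily the report's exact metric) is the Jaccard overlap between the sets of latent‑cache indices selected by two layers or two consecutive decode steps; high overlap means a block fetched for one layer is likely reused by the next.

```python
# Jaccard similarity between the index sets two layers select.
# The index lists below are toy data for illustration.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

layer_i = [3, 5, 8, 13, 21]  # indices layer i selected (toy data)
layer_j = [3, 5, 8, 13, 34]  # layer j reuses most of them
print(jaccard(layer_i, layer_j))  # -> 0.6666666666666666
```

Values near 1.0 across layers would confirm the strong temporal locality that makes selective prefetching pay off.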
Computation‑Communication Overlap
Without Overlap: the baseline SGLang implementation; H2D/D2H transfers block the GPU.
DA Overlap: separates pre‑attention from the indexer, allowing host‑to‑device prefetch to overlap with attention computation.
DBA Overlap: splits the indexer across the batch dimension, enabling larger portions of computation to overlap with data transfer, especially when cache‑miss counts are high.
Layerwise Overlap: selects the most effective strategy per layer based on miss count and context length (Fig 10‑11).
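The layerwise selection above amounts to picking, per layer, whichever mode minimizes a step‑time model. The sketch below is a hedged illustration: the overlappable fractions (60 % for DA, 90 % for DBA) and the cost model are assumptions chosen to show the selection logic, not the paper's calibrated numbers.

```python
# Toy per-layer overlap-mode selection. The overlappable compute fractions
# are illustrative assumptions, not values from the report.

def step_time_s(compute_s, transfer_s, mode):
    """Decode-step time for one layer under a given overlap mode."""
    if mode == "none":  # baseline: transfers block the GPU entirely
        return compute_s + transfer_s
    if mode == "da":    # prefetch overlaps attention only (assume 60% of compute)
        return max(transfer_s, 0.6 * compute_s) + 0.4 * compute_s
    if mode == "dba":   # batch-split indexer overlaps most compute (assume 90%)
        return max(transfer_s, 0.9 * compute_s) + 0.1 * compute_s
    raise ValueError(mode)

def pick_mode(compute_s, transfer_s):
    """Layerwise choice: the mode with the smallest modeled step time."""
    return min(("none", "da", "dba"),
               key=lambda m: step_time_s(compute_s, transfer_s, m))

# Miss-heavy layer (transfer dominates) -> deeper overlap pays off.
print(pick_mode(1.0, 2.0))  # -> dba
```

In this model, layers with heavy miss traffic favor DBA because more compute is available to hide the transfer, matching the report's observation that DBA helps most when cache‑miss counts are high.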
Scalability Across Context Lengths
Simulation shows that with a sparse‑memory ratio ≥ 0.2, average cache misses stay stable as context length grows. Longer contexts permit smaller ratios and larger batch sizes, yielding up to a 123 % throughput improvement at a 128 K context length (Fig 13).
Simulation Validation
The high‑fidelity simulator, calibrated with real‑machine metadata, evaluates end‑to‑end performance. Results indicate that increasing MTP (multi‑token prediction) from 2 to 4 yields a 69 % throughput gain, and that ESS offload‑prefetch contributes an additional ~70 % of the total gain (Table 2).
Conclusion and Outlook
ESS demonstrates that offloading the latent cache can substantially increase batch size and decode throughput without loss of accuracy. Future work includes integrating ESS into production inference frameworks, extending the approach to other KV‑cache‑based models, and exploring combinations with lossy compression techniques such as SnapKV.
