How Offloading Latent Cache Boosts DeepSeek‑V3.2‑Exp Decoding Throughput
This report analyzes the memory bottleneck of DeepSeek‑V3.2‑Exp’s sparse‑attention decoder, proposes the Expanded Sparse Server (ESS) to offload the latent cache to CPU memory, and demonstrates through high‑fidelity simulation that the approach dramatically improves decode throughput while keeping latency within acceptable limits.
Background and Motivation
DeepSeek‑V3.2‑Exp employs sparse attention to lower inference latency for long contexts. In the PD‑separated architecture the KV‑latent cache grows linearly with sequence length, quickly exhausting GPU memory. The resulting memory bottleneck caps the batch size at about 52, yielding a decode throughput of ~9,647 tokens/s.
Because the latent cache is accessed with strong temporal locality, a portion of it can be moved to CPU memory, increasing the batch size and overall throughput while keeping latency within acceptable bounds.
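The batch‑size benefit follows from simple capacity arithmetic. The sketch below is a back‑of‑envelope model, not the report's sizing: the GPU budget and per‑sequence cache size are hypothetical numbers chosen only to illustrate that keeping a fraction `1 - offload_ratio` of each sequence's latent cache resident scales the feasible batch size roughly by `1 / (1 - offload_ratio)`.

```python
# Back-of-envelope capacity model (all sizes are illustrative assumptions,
# not measured values from the report).

def max_batch_size(gpu_cache_bytes, per_seq_cache_bytes, offload_ratio):
    """Sequences that fit when (1 - offload_ratio) of each cache stays on GPU."""
    resident = per_seq_cache_bytes * (1.0 - offload_ratio)
    return int(gpu_cache_bytes // resident)

gpu_cache = 40 * 2**30   # hypothetical 40 GiB GPU budget for the latent cache
per_seq = 768 * 2**20    # hypothetical 768 MiB latent cache per sequence

base = max_batch_size(gpu_cache, per_seq, 0.0)   # everything on GPU
off = max_batch_size(gpu_cache, per_seq, 0.5)    # half the cache on CPU
print(base, off)  # 53 106 -- offloading half roughly doubles the batch
```

Under these assumed sizes, offloading half of each sequence's cache doubles the number of concurrent sequences, which is where the throughput headroom comes from.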
Key Contributions
Systematic evaluation of latent‑cache offload feasibility and performance bounds for DeepSeek‑V3.2‑Exp.
Design of the Expanded Sparse Server (ESS), which offloads the latent cache and preserves decode throughput.
Development of a high‑fidelity simulator that models computation, communication, and offload‑prefetch overheads for realistic industrial workloads.
ESS Architecture
ESS offloads only the latent cache; the indexer cache (≈ 16.8 % of total computation) remains on‑GPU. The offload‑prefetch trigger follows the PD‑separated workflow (see Fig 4).
Small‑Block Data Transfer
Latent‑cache blocks are 656 B, leading to fragmented PCIe traffic. ESS introduces FlashTrans, a CUDA operator that uses Unified Virtual Addressing (UVA) to access pinned CPU memory directly, eliminating frequent cudaMemcpyAsync calls. Measured effective bandwidths are ~37 GB/s host‑to‑device and ~43 GB/s device‑to‑host.
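A rough cost model shows why one‑call‑per‑block copies lose to a single UVA gather kernel: at 656 B, per‑call launch overhead dwarfs the wire time of each block. The per‑call overhead and kernel‑launch latency below are assumed values for illustration, not measurements from the report; only the 656 B block size and ~37 GB/s figure come from the text.

```python
# Why 656 B blocks fragment PCIe traffic: per-call overhead dominates tiny
# copies. Overhead/launch numbers are assumptions chosen for illustration.

BLOCK_BYTES = 656          # latent-cache block size from the report
H2D_BPS = 37e9             # ~37 GB/s effective host-to-device bandwidth
CALL_OVERHEAD_S = 5e-6     # assumed ~5 us launch cost per cudaMemcpyAsync

def memcpy_per_block_s(n_blocks):
    """One cudaMemcpyAsync per block: n launches plus the data's wire time."""
    return n_blocks * CALL_OVERHEAD_S + n_blocks * BLOCK_BYTES / H2D_BPS

def uva_gather_s(n_blocks, kernel_launch_s=10e-6):
    """One FlashTrans-style kernel reads all blocks over UVA: single launch."""
    return kernel_launch_s + n_blocks * BLOCK_BYTES / H2D_BPS

n = 10_000
# For many small blocks, the per-block scheme is orders of magnitude slower.
print(memcpy_per_block_s(n), uva_gather_s(n))
```

The gather kernel pays one launch for the whole transfer, so its cost approaches the pure bandwidth limit as the block count grows.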
Cache‑Hit Guarantees
An LRU engine manages GPU‑side cache evictions. LRU‑Warmup pre‑loads the top‑2K latent‑cache indices from the prefilling stage into the GPU LRU, dramatically reducing early‑stage cache misses (Fig 5‑6).
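The warmup mechanism can be sketched with a toy LRU. This is not the report's implementation; it only illustrates the LRU‑Warmup idea: seeding the GPU‑side cache with the hot indices identified during prefilling avoids the miss burst that an initially empty cache would suffer on early decode steps.

```python
from collections import OrderedDict

class LatentLRU:
    """Toy GPU-side LRU over latent-cache block indices (capacity in slots).
    A sketch of the LRU-Warmup idea, not the report's implementation."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()  # block index -> resident flag
        self.misses = 0

    def warmup(self, prefill_top_indices):
        """Pre-load hot indices observed during the prefilling stage."""
        for idx in prefill_top_indices[:self.capacity]:
            self.slots[idx] = True

    def access(self, idx):
        if idx in self.slots:
            self.slots.move_to_end(idx)     # hit: refresh recency
            return True
        self.misses += 1                    # miss: fetch block over PCIe
        if len(self.slots) >= self.capacity:
            self.slots.popitem(last=False)  # evict least-recently-used block
        self.slots[idx] = True
        return False

cold, warm = LatentLRU(2048), LatentLRU(2048)
warm.warmup(list(range(2048)))  # top-2K indices from prefilling
for idx in range(2048):         # early decode steps touch the hot set
    cold.access(idx)
    warm.access(idx)
print(cold.misses, warm.misses)  # -> 2048 0
```

With warmup, every early access hits; without it, every one misses and must be fetched from CPU memory.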
Cache‑Miss Analysis
Both inter‑layer and intra‑layer accesses show high similarity, confirming strong temporal locality. Layer‑wise miss statistics (Fig 7‑9) guide selective prefetching and the choice of sparse‑memory ratios.
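One simple way to quantify that access similarity (a plausible measurement, not necessarily the report's exact metric) is the Jaccard overlap between the sets of latent‑cache indices selected by two layers or two consecutive decode steps; high overlap means a block fetched for one layer is likely reused by the next.

```python
# Jaccard similarity between the index sets two layers select.
# The index lists below are toy data for illustration.

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

layer_i = [3, 5, 8, 13, 21]  # indices layer i selected (toy data)
layer_j = [3, 5, 8, 13, 34]  # layer j reuses most of them
print(jaccard(layer_i, layer_j))  # -> 0.6666666666666666
```

Values near 1.0 across layers would confirm the strong temporal locality that makes selective prefetching pay off.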
Computation‑Communication Overlap
Without Overlap: the baseline SGLang implementation; H2D/D2H transfers block the GPU.
DA Overlap: separates pre‑attention from the indexer, allowing host‑to‑device prefetch to overlap with attention computation.
DBA Overlap: splits the indexer across the batch dimension, enabling larger portions of computation to overlap with data transfer, especially when cache‑miss counts are high.
Layerwise Overlap: selects the most effective strategy per layer based on miss count and context length (Fig 10‑11).
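The layerwise selection above amounts to picking, per layer, whichever mode minimizes a step‑time model. The sketch below is a hedged illustration: the overlappable fractions (60 % for DA, 90 % for DBA) and the cost model are assumptions chosen to show the selection logic, not the paper's calibrated numbers.

```python
# Toy per-layer overlap-mode selection. The overlappable compute fractions
# are illustrative assumptions, not values from the report.

def step_time_s(compute_s, transfer_s, mode):
    """Decode-step time for one layer under a given overlap mode."""
    if mode == "none":  # baseline: transfers block the GPU entirely
        return compute_s + transfer_s
    if mode == "da":    # prefetch overlaps attention only (assume 60% of compute)
        return max(transfer_s, 0.6 * compute_s) + 0.4 * compute_s
    if mode == "dba":   # batch-split indexer overlaps most compute (assume 90%)
        return max(transfer_s, 0.9 * compute_s) + 0.1 * compute_s
    raise ValueError(mode)

def pick_mode(compute_s, transfer_s):
    """Layerwise choice: the mode with the smallest modeled step time."""
    return min(("none", "da", "dba"),
               key=lambda m: step_time_s(compute_s, transfer_s, m))

# Miss-heavy layer (transfer dominates) -> deeper overlap pays off.
print(pick_mode(1.0, 2.0))  # -> dba
```

In this model, layers with heavy miss traffic favor DBA because more compute is available to hide the transfer, matching the report's observation that DBA helps most when cache‑miss counts are high.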
Scalability Across Context Lengths
Simulation shows that with a sparse‑memory ratio ≥ 0.2, average cache misses stay stable as context length grows. Longer contexts permit smaller ratios and larger batch sizes, yielding up to a 123 % throughput improvement at a 128 K context length (Fig 13).
Simulation Validation
The high‑fidelity simulator, calibrated with real‑machine metadata, evaluates end‑to‑end performance. Results indicate that increasing MTP (multi‑token prediction) from 2 to 4 yields a 69 % throughput gain, and that ESS offload‑prefetch contributes an additional ~70 % of the total gain (Table 2).
Conclusion and Outlook
ESS demonstrates that offloading the latent cache can substantially increase batch size and decode throughput without loss of accuracy. Future work includes integrating ESS into production inference frameworks, extending the approach to other KV‑cache‑based models, and exploring combinations with lossy compression techniques such as SnapKV.
