How Offloading Latent Cache to CPU Boosts DeepSeek‑V3.2‑Exp Decoding Throughput
This report analyzes the memory bottleneck in DeepSeek‑V3.2‑Exp, proposes the Expanded Sparse Server (ESS), which offloads the latent cache to CPU memory, and demonstrates through high‑fidelity simulation that the approach, combined with cache‑warmup and computation‑communication overlap, can double decoding throughput for long‑context inference.
Introduction
DeepSeek‑V3.2‑Exp uses sparse attention to lower inference latency for long contexts, but in a prefill‑decode (PD) separated serving architecture the latent cache still grows linearly with sequence length. This exhausts GPU memory, limits batch size, and throttles decode throughput.
Problem Background
GPU Memory Bottleneck
Simulation shows that with the current hardware the batch size can only reach 52, yielding 9,647 tokens/s—far below the theoretical ceiling. The limited GPU memory directly caps decode throughput.
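The arithmetic behind this cap is simple: every sequence's latent cache must stay resident in GPU memory, so batch size is bounded by free memory divided by per‑sequence cache size. A minimal sketch of that bound (the 656‑byte per‑token‑per‑layer block size comes from the FlashTrans discussion later in this article; the 61‑layer count and 100 GiB of free cache memory are illustrative assumptions, not measured values):

```python
def max_batch_size(free_mem_gib, context_len, latent_bytes, layers):
    """Upper bound on decode batch size when every sequence's latent
    cache must stay resident in GPU memory (illustrative model only)."""
    per_seq = context_len * latent_bytes * layers
    return int(free_mem_gib * 1024**3 // per_seq)

# 656 B per token per layer (block size from the FlashTrans section);
# 61 layers and 100 GiB free for cache are illustrative assumptions.
print(max_batch_size(100, 128 * 1024, 656, 61))  # → 20
```

Under this toy model, halving the context roughly doubles the feasible batch, which is exactly the lever that offloading the cache to CPU memory pulls from the other direction.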
Temporal Locality of Latent Cache
Analysis of latent‑cache access traces shows strong inter‑layer and intra‑layer temporal locality, indicating that offloading to CPU memory is feasible without saturating PCIe bandwidth.
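One way to quantify such locality is to measure how many of the sparse‑attention indices selected at one decode step are selected again at the next: a high reuse fraction means most latent‑cache blocks are already resident and only a small delta must cross PCIe. A toy metric (the trace below is made up; the article's analysis is over real access traces):

```python
def stepwise_overlap(index_sets):
    """Average fraction of selected latent-cache indices reused between
    consecutive decode steps -- a simple temporal-locality metric."""
    overlaps = []
    for prev, cur in zip(index_sets, index_sets[1:]):
        overlaps.append(len(prev & cur) / max(len(cur), 1))
    return sum(overlaps) / max(len(overlaps), 1)

# Toy trace: each step reselects most of the previous step's indices.
trace = [{1, 2, 3, 4}, {2, 3, 4, 5}, {3, 4, 5, 6}]
print(stepwise_overlap(trace))  # → 0.75
```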
Contributions
Systematic evaluation of latent‑cache offload feasibility and benefit boundaries for DeepSeek‑V3.2‑Exp.
Design of the Expanded Sparse Server (ESS) that offloads the latent cache while preserving decode throughput.
Construction of a high‑fidelity simulator to evaluate optimization strategies in realistic industrial settings.
ESS Design and Analysis
ESS offloads only the latent cache, keeping the indexer cache on GPU, since the indexer accounts for only 16.8 % of compute. An offload‑prefetch mechanism decouples the storage path from the compute path, enabling significant throughput gains while remaining compatible with existing optimizations such as MTP (multi‑token prediction) and Two‑Batch Overlap.
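The decoupling can be pictured as a CPU‑side store plus a background worker that services prefetch requests while the compute stream proceeds. The sketch below is a pure‑Python stand‑in (dictionaries in place of pinned host memory and a GPU pool, a thread in place of a CUDA copy stream); the class and method names are invented for illustration:

```python
import threading
import queue

class OffloadPrefetchEngine:
    """Pure-Python stand-in for decoupled storage/compute paths:
    blocks are offloaded to a CPU-side store, and a background worker
    copies requested blocks back into a GPU-side buffer while the
    caller (the compute stream) keeps running."""

    def __init__(self):
        self.cpu_store = {}        # block_id -> payload ("CPU memory")
        self.gpu_buffer = {}       # block_id -> payload ("GPU pool")
        self.requests = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def offload(self, block_id, payload):
        """Move a block out of the GPU pool into the CPU store."""
        self.cpu_store[block_id] = payload
        self.gpu_buffer.pop(block_id, None)

    def prefetch(self, block_ids):
        """Request H2D copies; returns an event the compute waits on."""
        done = threading.Event()
        self.requests.put((block_ids, done))
        return done

    def _worker(self):
        while True:
            block_ids, done = self.requests.get()
            for b in block_ids:            # simulated H2D copies
                if b not in self.gpu_buffer:
                    self.gpu_buffer[b] = self.cpu_store[b]
            done.set()
```

Usage mirrors the decode loop: issue `prefetch` for the indices the next step needs, do unrelated work, then wait on the returned event just before attention consumes the blocks.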
Small‑Block Data Transfer
Latent‑cache blocks are only 656 bytes and scattered across the pool, so naive PCIe transfers are badly fragmented. FlashTrans, built on Unified Virtual Addressing (UVA), performs address‑driven on‑demand transfers, achieving 37 GB/s host‑to‑device (H2D) and 43 GB/s device‑to‑host (D2H) bandwidth on latent‑cache data.
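The core idea of coalescing scattered small blocks can be illustrated host‑side: gather the 656‑byte blocks into one contiguous staging buffer so a single large copy replaces hundreds of tiny ones. This is only a schematic of the gather step; FlashTrans itself performs the address‑driven copies on the GPU through UVA rather than staging on the host like this:

```python
BLOCK = 656  # latent-cache block size in bytes, per the text above

def pack_blocks(cpu_pool, block_ids, block_size=BLOCK):
    """Gather scattered small blocks into one contiguous staging buffer
    so a single large copy can replace many fragmented transfers.
    Host-side schematic only; not how FlashTrans moves the bytes."""
    staging = bytearray(len(block_ids) * block_size)
    for i, b in enumerate(block_ids):
        src = b * block_size
        staging[i * block_size:(i + 1) * block_size] = \
            cpu_pool[src:src + block_size]
    return bytes(staging)
```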
Cache‑Hit Guarantees
ESS uses an LRU replacement engine and introduces LRU‑Warmup, which pre‑loads the top‑2K latent‑cache indices observed during prefill into the LRU, dramatically reducing early‑stage cache misses.
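A minimal sketch of such a warmed LRU, with invented names (`WarmLRU`, `warmup`): hot indices from prefill are inserted before decoding starts, so early accesses hit instead of missing against an empty cache:

```python
from collections import OrderedDict

class WarmLRU:
    """LRU slot manager with the warmup step described above: indices
    that were hot during prefill are made resident up front so the
    first decode steps do not start from a cold cache."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()   # index -> payload; most recent last
        self.misses = 0

    def warmup(self, hot_indices):
        """Pre-load prefill-hot indices (e.g. the top-2K ranking)."""
        for idx in hot_indices[:self.capacity]:
            self.slots[idx] = None

    def access(self, idx):
        """Return True on hit; on miss, evict the LRU slot if full."""
        if idx in self.slots:
            self.slots.move_to_end(idx)
            return True
        self.misses += 1
        if len(self.slots) >= self.capacity:
            self.slots.popitem(last=False)   # evict least-recently used
        self.slots[idx] = None
        return False
```

In ESS the warmup set is the top‑2K indices from the prefill stage; here any ordered list stands in for that ranking.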
Computation‑Communication Overlap
Three overlap strategies are evaluated:
Without Overlap: the baseline SGLang implementation with fully serial execution.
DA Overlap: splits forward_prepare into an independent pre‑attention phase and a dependent indexer phase, and divides SparseMLA into Attn0 (using already‑resident cache) and Attn1 (run after the H2D transfer completes), allowing the two to proceed in parallel.
DBA Overlap: further partitions the indexer along the batch dimension, letting half of the indexer work overlap with the data transfer and improving scalability for long contexts.
Layer‑wise analysis shows that the optimal strategy depends on cache‑miss count and context length.
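The trade‑off between the three schedules can be captured with a deliberately crude per‑layer latency model; the exact overlap structure encoded below is a simplifying assumption, and every timing is an invented number rather than a measurement:

```python
def step_latency(strategy, t_idx, t_h2d, t_attn0, t_attn1):
    """Toy per-layer latency model for the three schedules above.
    The overlap structure is an assumed simplification."""
    if strategy == "without":
        # Fully serial: indexer, then transfer, then both attention parts.
        return t_idx + t_h2d + t_attn0 + t_attn1
    if strategy == "da":
        # Attn0 runs on already-resident cache while H2D is in flight.
        return t_idx + max(t_h2d, t_attn0) + t_attn1
    if strategy == "dba":
        # Half of the indexer (split along the batch dim) also overlaps.
        half = t_idx / 2
        return half + max(half, t_h2d, t_attn0) + t_attn1
    raise ValueError(strategy)

# A transfer-heavy step: each extra overlap level shortens the critical path.
for s in ("without", "da", "dba"):
    print(s, step_latency(s, 100, 80, 60, 40))
```

When misses are few and H2D is short, `max(...)` is dominated by compute and the extra split buys little, which is consistent with the observation that the best strategy varies with cache‑miss count and context length.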
Scalability Across Context Lengths
With a sparse‑memory ratio ≥ 0.2, average cache misses remain stable as context length grows. Very small GPU buffers cause severe misses at 32 K context. A minimum sparse‑memory pool of 6.4 K slots keeps average misses below 200, enabling effective overlap.
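Assuming the sparse‑memory ratio denotes the fraction of context tokens whose latent blocks stay GPU‑resident (an interpretation consistent with the 6.4 K figure for a 32 K context), the pool size is a single multiplication:

```python
def sparse_pool_slots(context_len, sparse_mem_ratio):
    """GPU-resident latent-cache slots implied by a sparse-memory
    ratio, read as the fraction of context tokens kept on GPU."""
    return int(context_len * sparse_mem_ratio)

print(sparse_pool_slots(32 * 1024, 0.2))  # → 6553, i.e. the 6.4 K minimum above
```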
Simulation Verification
Simulator
A high‑fidelity internal simulator, calibrated with real‑machine metadata, models compute and transfer flows, including MTP and dual‑stream optimizations, allowing accurate performance prediction without extensive physical experiments.
End‑to‑End Performance Evaluation
For 32 K context, MTP = 2 yields a 69.4 % throughput increase, and MTP = 4 with an average acceptance length of 3.4 gives a 45.8 % improvement. For 128 K context (MTP = 2, acceptance length = 1.7, sparse‑memory ratio = 0.1), ESS achieves a 123 % throughput boost.
Conclusion and Outlook
ESS demonstrates that offload‑prefetch can safely enlarge batch size and dramatically improve decode throughput for large language models. Future work includes integrating ESS into production frameworks, extending it to other KV‑cache compression schemes, and exploring hybrid approaches with lossy compression such as SnapKV.
Baidu Intelligent Cloud Tech Hub