How Offloading Latent Cache to CPU Boosts DeepSeek‑V3.2‑Exp Decoding Throughput

This report analyzes the memory bottlenecks of DeepSeek‑V3.2‑Exp, proposes the Expanded Sparse Server (ESS), which offloads the latent cache to CPU memory, and demonstrates through high‑fidelity simulation that the approach, combined with cache‑warmup and computation‑communication overlap, can double decoding throughput for long‑context inference.

Baidu Intelligent Cloud Tech Hub

Introduction

DeepSeek‑V3.2‑Exp uses sparse attention to lower inference latency for long contexts, but in a prefill‑decode (PD) separated architecture the latent cache still grows linearly with sequence length. This exhausts GPU memory, limits the decode batch size, and throttles decode throughput.

Problem Background

GPU Memory Bottleneck

Simulation shows that on the current hardware, GPU memory caps the decode batch size at 52, yielding 9,647 tokens/s, far below the theoretical ceiling. Limited GPU memory directly caps decode throughput.
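A back‑of‑envelope footprint calculation makes the cap concrete. The 656‑byte latent block size is quoted later in this report; the 61‑layer depth, the 128 K context, and the HBM left for cache are illustrative assumptions, not measured values:

```python
# Rough check of the memory cap. BLOCK_BYTES comes from this report's
# FlashTrans section; LAYERS, CONTEXT, and the free-HBM figure are
# illustrative assumptions.
BLOCK_BYTES, LAYERS, CONTEXT = 656, 61, 128 * 1024

per_request = BLOCK_BYTES * LAYERS * CONTEXT
print(f"latent cache per request: {per_request / 2**30:.1f} GiB")  # ~4.9 GiB

free_hbm = 260 * 2**30   # assumed HBM left for cache across the node
print(f"memory-capped batch: {free_hbm // per_request}")           # ~53
```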

Temporal Locality of Latent Cache

Trace analysis demonstrates strong inter‑layer and intra‑layer temporal locality in latent‑cache accesses, indicating that offloading to CPU memory is feasible without saturating PCIe bandwidth.
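To make this concrete, the hypothetical sketch below measures how many of a decode step's selected indices were already selected at the previous step; `selected` and its contents are assumed inputs, not data from the report:

```python
# Hypothetical locality probe: selected[t] is the set of latent-cache
# token indices the sparse attention picked at decode step t.
def step_overlap(selected: list[set[int]]) -> list[float]:
    """Fraction of step t's indices already resident from step t-1."""
    return [len(prev & curr) / max(len(curr), 1)
            for prev, curr in zip(selected, selected[1:])]

# Overlap near 1.0 means only a small difference set must cross PCIe
# each step, which is what makes CPU offload viable.
```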

Contributions

Systematic evaluation of latent‑cache offload feasibility and benefit boundaries for DeepSeek‑V3.2‑Exp.

Design of the Expanded Sparse Server (ESS) that offloads the latent cache while preserving decode throughput.

Construction of a high‑fidelity simulator to evaluate optimization strategies in realistic industrial settings.

ESS Design and Analysis

ESS offloads only the latent cache and keeps the indexer cache on GPU, since the indexer accounts for only 16.8 % of compute. An offload‑prefetch mechanism decouples the storage and compute paths, enabling significant throughput gains while remaining compatible with existing optimizations such as MTP (multi‑token prediction) and Two‑Batch Overlap.
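The sketch below illustrates the general shape of such an offload‑prefetch pipeline in PyTorch; every name in it (`prefetch`, `cpu_pool`, `staging`, and so on) is a hypothetical stand‑in, not an ESS internal:

```python
import torch

NUM_LAYERS = 61                      # DeepSeek-V3 depth, used illustratively
copy_stream = torch.cuda.Stream()    # storage path, decoupled from compute
copied = [torch.cuda.Event() for _ in range(NUM_LAYERS)]

def prefetch(layer, indices, cpu_pool, staging, gpu_buf):
    """Stage scattered blocks from pinned CPU memory and copy them H2D
    on the side stream while earlier layers are still computing."""
    with torch.cuda.stream(copy_stream):
        n = indices.numel()
        torch.index_select(cpu_pool, 0, indices, out=staging[:n])  # host gather
        gpu_buf[:n].copy_(staging[:n], non_blocking=True)          # async H2D
        copied[layer].record(copy_stream)

def sparse_attention(layer, gpu_buf):
    # The compute stream stalls only if this layer's prefetch is late.
    torch.cuda.current_stream().wait_event(copied[layer])
    # ... SparseMLA over gpu_buf goes here ...
```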

Small‑Block Data Transfer

Latent‑cache blocks are only 656 bytes each and scattered in memory, so naive per‑block copies fragment PCIe transfers. FlashTrans, built on Unified Virtual Addressing (UVA), performs address‑driven on‑demand transfers and reaches 37 GB/s H2D and 43 GB/s D2H bandwidth on latent‑cache data.
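To make the UVA mechanism concrete, here is a minimal Numba sketch of the idea (this is not FlashTrans's code; the pool shape and float32 word layout are assumptions): a GPU kernel dereferences pinned, device‑mapped host memory directly, so one launch gathers all scattered blocks instead of issuing one tiny copy per block.

```python
import numpy as np
from numba import cuda

BLOCK_WORDS = 656 // 4   # one 656-byte block as 164 float32 words (assumption)

# Pinned host memory mapped into the GPU's address space (UVA zero-copy).
pool = cuda.mapped_array((1 << 20, BLOCK_WORDS), dtype=np.float32)

@cuda.jit
def uva_gather(indices, pool, out):
    i = cuda.grid(1)
    if i < out.size:
        blk, word = i // BLOCK_WORDS, i % BLOCK_WORDS
        # Each thread reads host memory by address; the driver streams
        # exactly the touched bytes over PCIe, on demand.
        out[blk, word] = pool[indices[blk], word]

idx = cuda.to_device(np.random.randint(0, 1 << 20, 4096))
out = cuda.device_array((4096, BLOCK_WORDS), dtype=np.float32)
uva_gather[(out.size + 255) // 256, 256](idx, pool, out)
```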

Cache‑Hit Guarantees

ESS uses an LRU replacement engine and introduces LRU‑Warmup, which preloads the top‑2K latent‑cache indices from the prefill stage into the LRU, dramatically reducing early‑stage cache misses.
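A minimal sketch of what such an LRU with warm‑up can look like, assuming a flat slot pool and using Python's OrderedDict for recency order (illustrative names, not ESS code):

```python
from collections import OrderedDict

class WarmLRU:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.resident = OrderedDict()        # token index -> GPU slot id
        self.free = list(range(capacity))    # unassigned GPU slots

    def warmup(self, hot_indices):
        """LRU-Warmup: preload prefill's top-K indices before decoding,
        so the first decode steps hit instead of missing."""
        for idx in hot_indices[: self.capacity]:
            if idx not in self.resident:
                self.resident[idx] = self.free.pop()

    def access(self, idx):
        """Return (slot, hit); a miss implies an H2D copy for the block."""
        if idx in self.resident:
            self.resident.move_to_end(idx)   # refresh recency
            return self.resident[idx], True
        if not self.free:                    # evict the least-recently used
            _, slot = self.resident.popitem(last=False)
            self.free.append(slot)
        slot = self.free.pop()
        self.resident[idx] = slot
        return slot, False
```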

Computation‑Communication Overlap

Three overlap strategies are evaluated:

Without Overlap: the baseline SGLang implementation with serial execution.

DA Overlap: splits forward_prepare into an independent pre‑attention phase and a dependent indexer phase, and divides SparseMLA into Attn0 (over blocks already in GPU cache) and Attn1 (after the H2D transfer), allowing the two to run in parallel.

DBA Overlap: further partitions the indexer along the batch dimension, enabling half of the indexer work to overlap with data transfer and improving scalability for long contexts.

Layer‑wise analysis shows that the optimal strategy depends on the cache‑miss count and the context length; a minimal sketch of the DA split follows.
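The sketch below shows the Attn0/Attn1 half of the DA split in PyTorch, with the two partial attentions merged flash‑attention‑style via their log‑sum‑exps; the helper names and the toy shared‑K/V attention are assumptions for illustration:

```python
import torch

h2d_stream = torch.cuda.Stream()

def attn_partial(q, kv):
    # Toy single-head attention (K and V shared) returning output + LSE.
    scores = (q @ kv.T) / kv.shape[-1] ** 0.5
    lse = torch.logsumexp(scores, dim=-1, keepdim=True)
    return (scores - lse).exp() @ kv, lse

def merge(o0, l0, o1, l1):
    # Softmax-consistent merge of two partitions, as in flash attention.
    l = torch.logaddexp(l0, l1)
    return (l0 - l).exp() * o0 + (l1 - l).exp() * o1

def sparse_mla_da(q, resident_kv, missed_cpu, landing):
    done = torch.cuda.Event()
    with torch.cuda.stream(h2d_stream):                 # H2D in flight...
        n = missed_cpu.shape[0]
        landing[:n].copy_(missed_cpu, non_blocking=True)
        done.record(h2d_stream)
    out0, lse0 = attn_partial(q, resident_kv)           # ...Attn0 overlaps it
    torch.cuda.current_stream().wait_event(done)
    out1, lse1 = attn_partial(q, landing[:n])           # Attn1 on arrivals
    return merge(out0, lse0, out1, lse1)
```

DBA applies the same idea one level up, splitting the batch so half of the indexer work runs while the other half's blocks are still in flight.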

Scalability Across Context Lengths

With a sparse‑memory ratio ≥ 0.2, average cache misses remain stable as context length grows. Very small GPU buffers cause severe misses at 32 K context. A minimum sparse‑memory pool of 6.4 K slots keeps average misses below 200, enabling effective overlap.
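The sizing rule is easy to verify arithmetically; the function below is just a worked restatement of the numbers above:

```python
# 0.2 * 32K context ≈ 6.5K GPU-resident slots, matching the ~6.4K-slot
# floor that keeps average misses below 200.
def min_pool_slots(context_len: int, sparse_memory_ratio: float) -> int:
    return int(context_len * sparse_memory_ratio)

print(min_pool_slots(32 * 1024, 0.2))   # -> 6553
```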

Simulation Verification

Simulator

A high‑fidelity internal simulator, calibrated with real‑machine metadata, models compute and transfer flows, including MTP and dual‑stream optimizations, allowing accurate performance prediction without extensive physical experiments.
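While the internal simulator is far richer, its core accounting can be sketched in a few lines. Everything below (per‑layer compute time, miss counts) is an assumed input; only the 656‑byte block and the 37 GB/s H2D figure come from earlier sections:

```python
# Toy per-step timing model: with perfect overlap a layer costs
# max(compute, transfer); without overlap, their sum.
def decode_step_us(layers, compute_us, misses_per_layer,
                   block_bytes=656, h2d_gbps=37.0, overlap=True):
    transfer_us = misses_per_layer * block_bytes / (h2d_gbps * 1e3)
    per_layer = (max(compute_us, transfer_us) if overlap
                 else compute_us + transfer_us)
    return layers * per_layer

# 61 layers, 30 us compute and 200 misses per layer (all assumptions):
print(decode_step_us(61, 30.0, 200))                  # overlap hides copies
print(decode_step_us(61, 30.0, 200, overlap=False))   # serial upper bound
```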

End‑to‑End Performance Evaluation

For a 32 K context, ESS with MTP = 2 yields a 69.4 % throughput increase, and MTP = 4 with an acceptance of 3.4 gives a 45.8 % improvement. For a 128 K context (MTP = 2, acceptance = 1.7, sparse‑memory ratio = 0.1), ESS achieves a 123 % throughput boost.
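For intuition about where such gains come from, the toy model below relates batch size, acceptance, and step latency to throughput; it reads "acceptance" as expected tokens emitted per forward step, and all concrete numbers are illustrative assumptions rather than the report's calibrated results:

```python
def decode_tps(batch: int, accepted_per_step: float, step_ms: float) -> float:
    # tokens/s = requests in flight * tokens emitted per step / step time
    return batch * accepted_per_step * 1000.0 / step_ms

baseline = decode_tps(52, 1.7, 12.0)    # memory-capped batch (assumed times)
with_ess = decode_tps(128, 1.7, 14.5)   # larger batch, slightly slower step
print(f"illustrative uplift: {with_ess / baseline - 1:.0%}")
```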

Conclusion and Outlook

ESS demonstrates that offload‑prefetch can safely enlarge batch size and dramatically improve decode throughput for large language models. Future work includes integrating ESS into production frameworks, extending it to other KV‑cache compression schemes, and exploring hybrid approaches with lossy compression such as SnapKV.

Tags: LLM inference, Performance Scaling, Sparse Attention, Cache offload, GPU‑CPU optimization