Baidu Geek Talk
Dec 10, 2025 · Artificial Intelligence
How Offloading Latent Cache Boosts DeepSeek‑V3.2‑Exp Decoding Throughput
This report analyzes the memory bottleneck of DeepSeek‑V3.2‑Exp’s sparse‑attention decoder, proposes the Expanded Sparse Server (ESS) to offload the latent cache to CPU memory, and demonstrates through high‑fidelity simulation that the approach dramatically improves decode throughput while keeping latency within acceptable limits.
Cache offload · GPU memory · LLM inference
20 min read
