Boost LLM Inference Speed with KV‑Cache Reuse and Speculative Sampling
This article explains two production‑grade optimization techniques for large language model inference—KV‑cache reuse across multi‑turn dialogues and speculative sampling with a small draft model—detailing their design, implementation, and performance impact.
Background
RTP-LLM is a high‑performance inference engine developed by Alibaba's large‑model prediction team. It is compatible with many mainstream models, relies on hand‑written CUDA kernels, and implements optimizations such as PagedAttention and continuous batching. It also supports multimodal inputs, LoRA, P‑Tuning, and weight‑only quantization.
KV‑Cache Reuse for Multi‑Turn Dialogues
In Taobao Q&A and LangChain scenarios, each turn resends the accumulated conversation, so request length grows with every turn and First Token Time (FTT), the latency until the first output token, grows with it. Because the attention mask is lower‑triangular, the keys and values of a token depend only on the tokens before it, so the KV cache for a shared prefix is identical across turns. Storing the KV cache from the previous turn and reusing it means that only the newly appended tokens need KV computation, which shortens FTT.
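The prefix matching described above can be sketched as follows. This is a minimal illustration, assuming the cache is indexed by the token ids of the previous turn; the function names are hypothetical and are not RTP‑LLM APIs.

```python
from typing import List, Sequence, Tuple


def shared_prefix_len(cached: Sequence[int], request: Sequence[int]) -> int:
    """Number of leading token ids that the cached turn and the new request share."""
    n = 0
    for a, b in zip(cached, request):
        if a != b:
            break
        n += 1
    return n


def split_request(cached: Sequence[int], request: Sequence[int]) -> Tuple[int, List[int]]:
    """Return (reused_tokens, tokens_that_still_need_kv_computation).

    Assumption of this sketch: at least one token is always recomputed so the
    model has an input position from which to produce the next-token logits.
    """
    reused = shared_prefix_len(cached, request)
    reused = max(0, min(reused, len(request) - 1))
    return reused, list(request[reused:])


# Turn 1 cached four tokens; turn 2 resends them plus two new ones.
reused, todo = split_request([1, 2, 3, 4], [1, 2, 3, 4, 5, 6])
print(reused, todo)  # 4 [5, 6]
```

Only the two new tokens go through prefill in this example, which is why FTT becomes largely independent of the history length.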
Reuse only works if a follow‑up request lands on the machine that already holds the cache, which is difficult in a clustered deployment. The solution adds a forwarding layer: each request carries a unique identifier, the forwarding layer hashes that identifier to a fixed machine, and subsequent turns of the same conversation therefore hit the same GPU. Distributed storage (e.g., a hash‑based key‑value store) can record the mapping.
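A minimal sketch of the hashing step in such a forwarding layer, assuming a static list of workers. The worker names and the `pick_worker` function are hypothetical; the source does not describe the actual routing code.

```python
import hashlib
from typing import Sequence


def pick_worker(session_id: str, workers: Sequence[str]) -> str:
    """Map a conversation identifier to one worker, deterministically.

    A cryptographic digest is used instead of Python's built-in hash(),
    which is salted per process and would route differently on every
    forwarding node.
    """
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(workers)
    return workers[index]


workers = ["gpu-node-0", "gpu-node-1", "gpu-node-2"]  # hypothetical names
first_turn = pick_worker("conversation-42", workers)
second_turn = pick_worker("conversation-42", workers)
print(first_turn == second_turn)  # True
```

Plain modulo hashing remaps most sessions when the worker list changes. Recording the mapping in a key‑value store, as the text suggests, or switching to consistent hashing would limit that churn.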
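The implementation reuses the operator written for P‑Tuning v2, which already feeds precomputed prefix key/value tensors into attention, so the cached KV of earlier turns can be passed in the same way.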
Experiments with Qwen‑13B/int8 on A10 GPUs show that KV‑cache reuse dramatically reduces FTT and that, with reuse enabled, FTT depends only weakly on the length of the conversation history. The technique also benefits P‑Tuning prefixes and long system prompts, lowering both latency and memory usage. When memory pressure is high, an LRU policy evicts the least recently used KV caches; future work may move evicted caches to host memory.
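The LRU eviction policy can be sketched with an ordered dictionary. This is an illustration only: the class name, the block‑count accounting, and the session keys are assumptions, not the engine's actual data structures.

```python
from collections import OrderedDict
from typing import List, Optional


class KVCacheLRU:
    """Tracks how many cache blocks each session holds and evicts the
    least recently used sessions when a new entry does not fit."""

    def __init__(self, capacity_blocks: int) -> None:
        self.capacity = capacity_blocks
        self.used = 0
        self.entries: "OrderedDict[str, int]" = OrderedDict()  # session -> blocks

    def get(self, session_id: str) -> Optional[int]:
        if session_id not in self.entries:
            return None
        self.entries.move_to_end(session_id)  # mark as most recently used
        return self.entries[session_id]

    def put(self, session_id: str, blocks: int) -> List[str]:
        """Insert or update a session; return the ids that were evicted."""
        if blocks > self.capacity:
            raise ValueError("entry is larger than the whole cache")
        if session_id in self.entries:
            self.used -= self.entries.pop(session_id)
        evicted = []
        while self.used + blocks > self.capacity:
            old_id, old_blocks = self.entries.popitem(last=False)  # oldest first
            self.used -= old_blocks
            evicted.append(old_id)
        self.entries[session_id] = blocks
        self.used += blocks
        return evicted


cache = KVCacheLRU(capacity_blocks=10)
cache.put("a", 4)
cache.put("b", 4)
cache.get("a")            # "a" is now more recent than "b"
print(cache.put("c", 4))  # ['b']
```

Moving evicted entries to host memory, the future work mentioned above, would replace the discard in `put` with a copy to a second tier.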
Speculative Sampling
Speculative sampling, introduced in 2022, exploits two facts: some tokens are easy to predict and can be produced by a small draft model, and the large model is memory‑bound during decoding, spending most of each step loading weights rather than computing. The workflow generates N tokens with the small model, then the large model scores all N positions in a single forward pass and accepts a prefix of them, discarding the rest. Because one weight load of the large model can now yield several tokens, this can multiply inference speed without degrading output quality.
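The verification step can be sketched as below. The source does not spell out the acceptance rule, so this sketch assumes the standard rejection rule from the speculative sampling literature: accept a draft token x with probability min(1, p(x)/q(x)), where p is the large model's distribution and q the draft model's, and on the first rejection resample from the normalized positive part of p - q. The function name and the list‑based distributions are illustrative only.

```python
import random
from typing import List, Sequence


def verify_draft(
    draft_tokens: Sequence[int],
    draft_probs: Sequence[Sequence[float]],   # q_i over the vocabulary, one per draft token
    target_probs: Sequence[Sequence[float]],  # p_i over the vocabulary, N + 1 entries
    rng: random.Random,
) -> List[int]:
    """Return the tokens emitted by one draft-and-verify cycle."""
    out: List[int] = []
    for i, tok in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        # q[tok] > 0 because tok was sampled from q.
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # First rejection: resample from max(0, p - q) and stop this cycle.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(rng.choices(range(len(p)), weights=residual)[0])
        return out
    # Every draft token was accepted: the large model's extra position
    # yields one bonus token at no additional forward pass.
    bonus = target_probs[len(draft_tokens)]
    out.append(rng.choices(range(len(bonus)), weights=bonus)[0])
    return out


rng = random.Random(0)
q = [[0.5, 0.5, 0.0], [0.5, 0.5, 0.0]]
p = [[0.5, 0.5, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]
print(verify_draft([0, 1], q, p, rng))  # [0, 1, 2]
```

Each cycle emits between 1 and N + 1 tokens, and under this rule the emitted tokens follow the large model's distribution, which is why output quality is preserved.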
In RTP‑LLM we wrapped the two‑model pipeline in an orchestration layer that provides a unified API while keeping the optimization orthogonal to other speed‑up techniques.
Profiling shows that the extra cost of speculative sampling comes from the small model's sequential token generation and from the additional sampling steps. On A10 GPUs, the small model's lm_head becomes a bottleneck as the model shrinks, because the lm_head scales with vocabulary size rather than with the number of layers. Optimized sampling kernels reduce the sampling overhead to one‑tenth of that of the original HuggingFace implementation.
Performance Evaluation
Benchmarks on shop‑name generation and copy‑writing tasks compare int8‑quantized Qwen‑13B against its speculative‑sampling variants. Speculative sampling accelerates generation in most token‑acceptance scenarios; only the worst case, in which every draft token is rejected, is slower than the baseline.
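A back‑of‑the‑envelope cost model helps explain why only the all‑rejected case falls behind. The sketch below is not taken from the benchmarks: it assumes each draft token is accepted independently with probability alpha, that one draft step costs a fixed fraction of one large‑model step, and it ignores sampling overhead. All names are illustrative.

```python
def expected_tokens_per_cycle(alpha: float, n_draft: int) -> float:
    """Expected tokens emitted per draft-and-verify cycle.

    Geometric sum 1 + alpha + ... + alpha**n_draft under the
    independent-acceptance assumption.
    """
    if alpha >= 1.0:
        return n_draft + 1.0
    return (1.0 - alpha ** (n_draft + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, n_draft: int, draft_cost_ratio: float) -> float:
    """Speedup over plain decoding, which emits one token per large-model step.

    One cycle costs n_draft draft steps plus one large-model step.
    """
    cycle_cost = n_draft * draft_cost_ratio + 1.0
    return expected_tokens_per_cycle(alpha, n_draft) / cycle_cost


for alpha in (0.0, 0.5, 0.8, 1.0):
    print(alpha, round(expected_speedup(alpha, n_draft=4, draft_cost_ratio=0.1), 3))
```

With alpha = 0 every cycle still emits one token but pays for the wasted draft steps, so the speedup is 1 / (N * c + 1), which is below 1. Any nonzero acceptance rate moves the ratio upward.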
Other Considerations
Beyond latency, the achievable concurrency is limited by GPU memory, which is shared by model weights, runtime buffers, and the KV cache. Techniques like FlashAttention reduce the memory needed for softmax buffers, but they were incompatible with KV‑cache reuse and speculative sampling, where the Q tensor and the K/V tensors have different dimensions. Speculative sampling also requires extra memory for the small model and its KV cache.
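To make the memory pressure concrete, the KV‑cache footprint can be estimated with simple arithmetic. The layer count, head count, and head size below are hypothetical round numbers chosen for illustration, not the configuration of any model named in this article.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    num_tokens: int,
    bytes_per_element: int = 2,  # fp16 cache entries
) -> int:
    """Bytes needed to cache keys and values for num_tokens tokens.

    The leading factor 2 accounts for one K tensor and one V tensor per layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * bytes_per_element


# Hypothetical 40-layer model with 40 KV heads of size 128.
per_token = kv_cache_bytes(40, 40, 128, num_tokens=1)
per_request = kv_cache_bytes(40, 40, 128, num_tokens=2048)
print(per_token)            # 819200
print(per_request / 2**30)  # 1.5625
```

At roughly 1.5 GiB per 2048‑token request under these assumptions, a handful of long conversations can rival the weights themselves, which is why eviction and the draft model's additional cache both matter.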
Conclusion
The presented optimizations—KV‑cache reuse and speculative sampling—provide measurable speed‑up for large‑model inference in production, though further work remains at both the operator and framework levels. RTP‑LLM continues to integrate advances from FasterTransformer, TensorRT‑LLM, FlashAttention2, Cutlass, vLLM, HuggingFace Transformers, Medusa, LLaVA, and Qwen‑VL.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
