Boost LLM Inference Speed with KV‑Cache Reuse and Speculative Sampling

This article explains two production‑grade optimization techniques for large language model inference—KV‑cache reuse across multi‑turn dialogues and speculative sampling with a small draft model—detailing their design, implementation, and performance impact.

Alibaba Cloud Developer

Background

RTP-LLM is a high‑performance inference engine developed by Alibaba's large‑model prediction team. It is compatible with many mainstream models and uses CUDA kernels to implement optimizations such as PagedAttention and Continuous Batching. It also supports multimodal models, LoRA, P‑Tuning, and weight‑only quantization.

KV‑Cache Reuse for Multi‑Turn Dialogues

In Taobao Q&A and LangChain scenarios, the request grows with every turn because it carries the conversation history, which increases First Token Time (FTT). Because the attention mask is causal (lower‑triangular), the keys and values computed for a token depend only on the tokens before it, so the KV cache for a prefix shared between turns is identical. Storing the KV cache from the previous turn and reusing it means that only the newly appended tokens need KV computation, which shortens FTT.
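The effect of a shared prefix on prefill work can be illustrated with a minimal sketch. The function names and token ids below are hypothetical and are not part of RTP‑LLM; the sketch only counts how many tokens would still need KV computation when the previous turn's cache is available.

```python
from typing import Sequence


def shared_prefix_len(cached: Sequence[int], request: Sequence[int]) -> int:
    """Length of the longest common token prefix of two token-id sequences."""
    n = 0
    for a, b in zip(cached, request):
        if a != b:
            break
        n += 1
    return n


def tokens_to_prefill(cached: Sequence[int], request: Sequence[int]) -> int:
    """Number of request tokens whose K/V entries still have to be computed."""
    reused = shared_prefix_len(cached, request)
    # Always leave at least one token to compute, so the model still produces
    # logits for the next token even when the whole request is already cached.
    reused = max(min(reused, len(request) - 1), 0)
    return len(request) - reused


# Hypothetical token ids: turn 2 repeats turn 1 and appends three new tokens.
turn_1 = [11, 12, 13, 14, 15]
turn_2 = turn_1 + [21, 22, 23]
print(tokens_to_prefill(turn_1, turn_2))  # 3 (only the new tokens)
print(tokens_to_prefill([], turn_2))      # 8 (no cache, full prefill)
```

With the cache available, prefill cost follows the number of new tokens rather than the total history length, which matches the FTT behavior described above.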

KV cache can only be reused if a follow‑up request lands on the machine that holds it, which is not guaranteed in a clustered deployment. The solution adds a forwarding layer: each request carries a unique identifier, the layer hashes that identifier to a fixed machine, and subsequent turns of the same conversation therefore hit the same GPU. Distributed storage (e.g., a hash‑based key‑value store) can record the mapping.
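A minimal sketch of the routing idea follows, assuming a static list of workers and a string session identifier. The names `pick_worker`, `session-42`, and `gpu-node-*` are hypothetical; the article does not specify the hash function or the forwarding layer's interface.

```python
import hashlib


def pick_worker(session_id: str, workers: list[str]) -> str:
    """Map a session identifier to one worker, deterministically."""
    # hashlib is used instead of the built-in hash(), whose value for strings
    # is salted per process and would differ between forwarding instances.
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(workers)
    return workers[index]


workers = ["gpu-node-0", "gpu-node-1", "gpu-node-2"]  # hypothetical names
first_turn = pick_worker("session-42", workers)
second_turn = pick_worker("session-42", workers)
print(first_turn == second_turn)  # True: both turns reach the same worker
```

Plain modulo hashing remaps most sessions whenever the worker list changes. Recording the mapping in a key‑value store, as the text suggests, or using consistent hashing are possible ways to limit that.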

The implementation reuses the P‑Tuning v2 operator, which already handles a precomputed key/value prefix, to share the cached KV parameters.

Experiments with int8‑quantized Qwen‑13B on A10 GPUs show that KV‑cache reuse substantially reduces FTT and that, with reuse enabled, the length of the conversation history has only a minor effect on FTT. The technique also benefits P‑Tuning prefixes and long system prompts, lowering both latency and memory usage. When memory pressure is high, an LRU policy evicts the least recently used KV caches; future work may move expired caches to host memory.
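The LRU eviction policy can be sketched as follows. This is an illustrative bookkeeping structure only, assuming cache size is counted in abstract "blocks"; the class name `KVCacheLRU` and its methods are hypothetical and do not reflect RTP‑LLM's actual memory manager.

```python
from collections import OrderedDict


class KVCacheLRU:
    """Tracks cached prefixes by key and evicts the least recently used."""

    def __init__(self, capacity_blocks: int) -> None:
        self.capacity_blocks = capacity_blocks
        self.used_blocks = 0
        self._entries: "OrderedDict[str, int]" = OrderedDict()  # key -> blocks

    def get(self, key: str) -> bool:
        """Return True on a hit and mark the entry as most recently used."""
        if key not in self._entries:
            return False
        self._entries.move_to_end(key)
        return True

    def put(self, key: str, blocks: int) -> list:
        """Insert or update an entry; return the keys evicted to make room."""
        evicted = []
        if blocks > self.capacity_blocks:
            return evicted  # too large to cache at all, leave others untouched
        if key in self._entries:
            self.used_blocks -= self._entries.pop(key)
        while self.used_blocks + blocks > self.capacity_blocks:
            old_key, old_blocks = self._entries.popitem(last=False)  # oldest
            self.used_blocks -= old_blocks
            evicted.append(old_key)
        self._entries[key] = blocks
        self.used_blocks += blocks
        return evicted


cache = KVCacheLRU(capacity_blocks=10)
cache.put("session-a", 4)
cache.put("session-b", 4)
cache.get("session-a")             # session-a is now the most recent entry
print(cache.put("session-c", 4))   # ['session-b']
```

Moving evicted entries to host memory instead of dropping them, as mentioned for future work, would replace the `evicted.append` step with a copy to a second tier.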

Speculative Sampling

Speculative sampling, introduced in 2022, exploits two facts: some tokens are easy to generate and can be produced by a small draft model, and the large model spends most of its decoding time loading weights, so checking several tokens in one pass costs little more than generating one. The workflow generates N tokens with the small model, then the large model verifies them and accepts the longest prefix that passes verification. This can multiply inference speed without degrading output quality.
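One draft‑then‑verify round can be sketched as below. The sketch uses the acceptance rule commonly described in the speculative sampling literature (accept a drafted token with probability min(1, p/q), otherwise resample from the normalized positive part of p − q); the article does not confirm that RTP‑LLM implements exactly this rule. The callables `draft_dist` and `target_dist` are hypothetical stand‑ins for the two models and return a probability list over a toy vocabulary.

```python
import random
from typing import Callable, List, Sequence

Dist = Sequence[float]  # probability of each token id


def sample(dist: Dist, rng: random.Random) -> int:
    return rng.choices(range(len(dist)), weights=dist, k=1)[0]


def speculative_step(
    prefix: List[int],
    draft_dist: Callable[[List[int]], Dist],
    target_dist: Callable[[List[int]], Dist],
    n_draft: int,
    rng: random.Random,
) -> List[int]:
    """Return the tokens produced by one draft-then-verify round."""
    # 1. The small model proposes n_draft tokens one after another.
    drafted, q_dists, ctx = [], [], list(prefix)
    for _ in range(n_draft):
        q = draft_dist(ctx)
        tok = sample(q, rng)
        drafted.append(tok)
        q_dists.append(q)
        ctx.append(tok)

    # 2. The large model scores every position. A real engine would do this
    #    in one batched forward pass; here it is a plain loop for clarity.
    p_dists = [target_dist(prefix + drafted[:i]) for i in range(n_draft + 1)]

    # 3. Accept drafted tokens left to right and stop at the first rejection.
    out: List[int] = []
    for tok, q, p in zip(drafted, q_dists, p_dists):
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # Rejected: resample from the positive part of (p - q).
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(sample(residual, rng))
        return out
    # Every draft was accepted: the large model's last position yields a bonus token.
    out.append(sample(p_dists[n_draft], rng))
    return out


rng = random.Random(0)
agree = lambda ctx: [1.0, 0.0, 0.0]           # both models always pick token 0
print(speculative_step([7], agree, agree, 4, rng))  # [0, 0, 0, 0, 0]
```

When the two models agree, one round yields N + 1 tokens for a single large‑model pass; when the first drafted token is rejected, the round still yields one token sampled according to the large model, which is why output quality is preserved.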

In RTP‑LLM we wrapped the two‑model pipeline in an orchestration layer that provides a unified API while keeping the optimization orthogonal to other speed‑up techniques.

Performance analysis shows that the extra cost of speculative sampling comes from the small model's sequential token generation and the additional sampling steps. On A10 GPUs, the small model's lm_head accounts for a growing share of its runtime as the model shrinks and becomes a bottleneck. Optimized sampling kernels reduce the sampling overhead to one‑tenth of that of the original HuggingFace implementation.

Performance Evaluation

Benchmarks on shop‑name generation and copy‑writing tasks compare int8‑quantized Qwen‑13B as the baseline against speculative‑sampling variants. Speculative sampling accelerates generation in most token‑acceptance scenarios; only the worst case, in which every draft token is rejected, is slower than the baseline.
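Why the all‑rejected case falls behind the baseline can be seen from a simple cost model. The sketch below is not taken from the benchmark: it assumes each draft token is accepted independently with probability `alpha`, ignores sampling overhead, and uses a hypothetical `cost_ratio` (time of one small‑model step divided by time of one large‑model step).

```python
def expected_tokens_per_round(alpha: float, n_draft: int) -> float:
    """Expected tokens emitted per round under independent acceptance."""
    if alpha >= 1.0:
        return float(n_draft + 1)
    # Geometric series 1 + alpha + ... + alpha**n_draft.
    return (1.0 - alpha ** (n_draft + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, n_draft: int, cost_ratio: float) -> float:
    """Tokens per unit time relative to plain large-model decoding."""
    # One round costs n_draft small-model steps plus one large-model pass.
    round_cost = n_draft * cost_ratio + 1.0
    return expected_tokens_per_round(alpha, n_draft) / round_cost


# Illustrative values only: 4 draft tokens, small model 10x cheaper per step.
print(f"{estimated_speedup(0.8, 4, 0.1):.2f}")  # 2.40
print(f"{estimated_speedup(0.0, 4, 0.1):.2f}")  # 0.71
```

With no drafted token accepted, each round still produces one token but pays for the draft steps as well, so the estimate drops below 1.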

Other Considerations

Beyond latency, the achievable concurrency is limited by GPU memory, which is shared among model weights, runtime buffers, and KV cache. Techniques like FlashAttention reduce softmax buffer memory but are incompatible with KV‑cache reuse and speculative sampling, because in those cases the Q and KV dimensions differ. Speculative sampling also requires extra memory for the small model and its KV cache.
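The memory trade‑off can be made concrete with a back‑of‑the‑envelope calculation. The per‑token KV size formula is the usual one for standard attention layers (one key and one value vector per layer); all numbers below are hypothetical and are not the actual configuration of Qwen‑13B or of the benchmark machines.

```python
def kv_bytes_per_token(
    num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_value: int
) -> int:
    # Factor 2: each layer stores one key vector and one value vector per token.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(
    gpu_bytes: int, weight_bytes: int, buffer_bytes: int, per_token_bytes: int
) -> int:
    """Tokens of KV cache that fit after weights and runtime buffers."""
    free = max(gpu_bytes - weight_bytes - buffer_bytes, 0)
    return free // per_token_bytes


GIB = 1024 ** 3
# Hypothetical model shape with fp16 KV cache (2 bytes per value).
per_token = kv_bytes_per_token(num_layers=40, num_kv_heads=40, head_dim=128,
                               bytes_per_value=2)
print(per_token)  # 819200 bytes, i.e. 800 KiB per token
# Hypothetical budget: 24 GiB GPU, 14 GiB weights, 2 GiB runtime buffers.
print(max_cached_tokens(24 * GIB, 14 * GIB, 2 * GIB, per_token))  # 10485
```

Any memory given to a draft model and its KV cache comes out of the same free pool, which reduces the number of tokens, and therefore concurrent requests, that the large model can hold.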

Conclusion

The presented optimizations—KV‑cache reuse and speculative sampling—provide measurable speed‑up for large‑model inference in production, though further work remains at both the operator and framework levels. RTP‑LLM continues to integrate advances from FasterTransformer, TensorRT‑LLM, FlashAttention2, Cutlass, vLLM, HuggingFace Transformers, Medusa, LLaVA, and Qwen‑VL.
