How Speculative Decoding Supercharges Large Language Model Inference
This survey examines speculative decoding—a draft‑then‑verify technique that parallelizes token generation to cut LLM inference latency, outlines its core components, compares independent and self‑drafting methods, discusses verification strategies, and highlights open research challenges.
Background
Large language model (LLM) inference latency is dominated by memory-bandwidth limits rather than compute: each decoding step must load billions of parameters from GPU high-bandwidth memory into on-chip cache. Because autoregressive decoding generates a single token per step, GPU utilization stays low and generation time grows linearly with sequence length.
Speculative Decoding
Speculative decoding follows a draft-then-verify paradigm. At each step a draft model (a cheaper predictor) proposes several future tokens, and the target LLM then validates all of them in a single parallel forward pass. Drafted tokens that match the target's own predictions are emitted, which reduces the total number of target decoding steps. Under exact-match verification the final output is guaranteed to be identical to what the target LLM would have produced on its own.
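The loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production implementation: a "model" is reduced to a hypothetical function that maps a token prefix to its greedy next token, and toy_target and toy_draft are made-up stand-ins for real networks. A real system would obtain all verification predictions from one batched forward pass of the target; the toy calls the target once per position only to keep the logic visible.

```python
from typing import Callable, List

# Assumption: a "model" is any function from a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], max_new_tokens: int) -> List[int]:
    """Baseline: one target call per generated token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(target(tokens))
    return tokens


def speculative_greedy_decode(target: NextToken, draft: NextToken, prompt: List[int],
                              max_new_tokens: int, k: int = 4) -> List[int]:
    """Draft-then-verify decoding with exact-match (greedy) verification."""
    tokens = list(prompt)
    limit = len(prompt) + max_new_tokens
    while len(tokens) < limit:
        # Draft stage: the cheap model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(tokens + proposal))
        # Verify stage: compare each drafted token with the target's greedy choice.
        accepted: List[int] = []
        for i in range(k + 1):
            expected = target(tokens + proposal[:i])
            accepted.append(expected)  # a match, a correction, or a bonus token
            if i == k or proposal[i] != expected:
                break  # the first mismatch invalidates the rest of the draft
        tokens.extend(accepted)
    return tokens[:limit]


# Hypothetical toy models: the draft agrees with the target about two times in three.
def toy_target(prefix: List[int]) -> int:
    return (31 * sum(prefix) + 7) % 50


def toy_draft(prefix: List[int]) -> int:
    guess = toy_target(prefix)
    return guess if len(prefix) % 3 else (guess + 1) % 50


spec_out = speculative_greedy_decode(toy_target, toy_draft, [1, 2, 3], max_new_tokens=12)
base_out = greedy_decode(toy_target, [1, 2, 3], max_new_tokens=12)
```

Every iteration emits at least one token chosen by the target, so the loop always makes progress, and spec_out equals base_out regardless of how good the draft is. Draft quality only changes how many target passes are needed.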
Key Components
Verifying several tokens in one parallel forward pass adds little latency over generating a single token, because decoding is limited by memory bandwidth rather than compute.
The efficiency and accuracy of the draft (speculation) stage.
The verification strategy that balances quality and speed.
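How these three factors interact can be estimated with a simple cost model, following the analysis in "Fast Inference from Transformers via Speculative Decoding" (see References). The sketch below assumes that each drafted token is accepted independently with probability alpha, that gamma tokens are drafted per step, that one draft step costs a fraction c of one target step, and that verifying gamma tokens costs the same as one ordinary target step. The function and parameter names are illustrative.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target forward pass.

    alpha: assumed per-token acceptance probability (0 <= alpha <= 1)
    gamma: number of tokens drafted before each verification
    """
    if alpha >= 1.0:
        return float(gamma + 1)  # every draft accepted, plus one bonus token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Expected wall-clock improvement over plain autoregressive decoding.

    c: cost of one draft step relative to one target step (e.g. 0.05)
    """
    # One iteration costs gamma draft steps plus one parallel target pass.
    return expected_tokens_per_step(alpha, gamma) / (gamma * c + 1.0)


speedup = expected_speedup(alpha=0.8, gamma=4, c=0.05)
```

Under these assumptions a draft that is accepted 80% of the time and costs 5% of the target gives a speed-up of a little under 3x with gamma = 4. Raising gamma helps only while the extra drafting cost is repaid by additional accepted tokens.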
Drafting Strategies
Independent Drafting
Use a smaller model from the same family as the target (e.g., OPT-125M drafting for OPT-66B, or T5-small drafting for T5-XXL). An off-the-shelf small model needs no extra training and shares the target's architecture, tokenizer, and training data distribution, which improves behavior alignment. Knowledge distillation can optionally align the small draft model more closely with the target, increasing the proportion of drafted tokens that pass verification.
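As a rough illustration of what distillation optimizes, the sketch below computes a token-level forward KL divergence between the target's and the draft's next-token distributions. The choice of forward KL, the temperature parameter, and the function names are assumptions made for illustration; the text above does not specify the objective, and other divergences are possible.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over one position's logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def forward_kl(target_logits: List[float], draft_logits: List[float],
               temperature: float = 1.0) -> float:
    """KL(target || draft) at a single token position.

    Minimizing this over the draft's parameters pulls the draft's
    distribution toward the target's, which is what raises acceptance.
    """
    p = softmax(target_logits, temperature)
    q = softmax(draft_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


aligned = forward_kl([2.0, 0.5, -1.0], [2.0, 0.5, -1.0])
misaligned = forward_kl([2.0, 0.5, -1.0], [-1.0, 0.5, 2.0])
```

In training, this quantity would be averaged over many positions and minimized with respect to the draft model only; the target stays frozen.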
Self‑Drafting
When a suitable external small model is unavailable, the target LLM can generate drafts itself by adding extra feed-forward heads on top of the final decoder layer (e.g., Blockwise Decoding, Medusa). These heads enable parallel generation of multiple tokens per step. Alternative approaches include early exiting from intermediate layers, layer skipping, or inserting multiple [PAD] tokens to create parallel draft streams. Methods that add heads require additional training of those heads.
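The head-based idea can be pictured with a toy sketch: several output heads read the same final hidden state, and head i proposes the token i + 1 positions ahead, so a whole draft comes out of a single pass. Everything below is hypothetical scaffolding (plain-Python matrices, random untrained weights, single linear heads); it shows only the data flow, not the architecture of Blockwise Decoding or Medusa.

```python
import random
from typing import List

Matrix = List[List[float]]  # shape: vocab_size x hidden_size


def make_head(vocab_size: int, hidden_size: int, rng: random.Random) -> Matrix:
    """Create one linear head with random weights (untrained, illustration only)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(hidden_size)]
            for _ in range(vocab_size)]


def head_argmax(head: Matrix, hidden: List[float]) -> int:
    """Project the hidden state onto the vocabulary and take the top token."""
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in head]
    return max(range(len(scores)), key=scores.__getitem__)


def draft_with_heads(hidden: List[float], heads: List[Matrix]) -> List[int]:
    """Head i proposes the token at offset i + 1 from the same hidden state.

    All heads share one forward pass of the backbone, so the draft needs
    no extra sequential decoding steps.
    """
    return [head_argmax(head, hidden) for head in heads]


rng = random.Random(0)
heads = [make_head(vocab_size=10, hidden_size=4, rng=rng) for _ in range(3)]
draft = draft_with_heads([0.5, -0.2, 0.1, 0.9], heads)
```

The drafted tokens are still only proposals: they go through the same verification step as tokens from an independent draft model.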
Verification Strategies
During verification the draft tokens are fed as a prefix to the target LLM. If the target generates the same token as the draft, the token is accepted; the first mismatch aborts the remaining draft tokens because the prefix assumption is broken. Exact‑match verification guarantees identical results to greedy decoding but may discard high‑quality tokens that differ from the top‑1 choice, reducing speed‑up. Several works relax the verification criterion (e.g., accepting tokens within a probability margin) to increase the number of accepted tokens while maintaining acceptable quality.
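The difference between exact-match and relaxed verification fits in one small function. This is a sketch under assumptions: target_probs[i] is taken to be the target's next-token distribution at drafted position i, computed in one parallel pass over the draft, and the margin rule (accept a token whose probability is within margin of the top-1 probability) is just one possible relaxation chosen for illustration.

```python
from typing import Dict, List


def verify_draft(draft_tokens: List[int], target_probs: List[Dict[int, float]],
                 margin: float = 0.0) -> List[int]:
    """Return the tokens to emit after verifying a draft.

    margin == 0.0 gives exact-match verification (same output as greedy
    decoding); margin > 0.0 also accepts near-top tokens, trading exactness
    for longer accepted runs.
    """
    emitted: List[int] = []
    for token, probs in zip(draft_tokens, target_probs):
        best = max(probs, key=probs.get)
        close_enough = margin > 0.0 and probs.get(token, 0.0) >= probs[best] - margin
        if token == best or close_enough:
            emitted.append(token)
        else:
            # First mismatch: emit the target's token and drop the rest,
            # since later distributions were conditioned on a rejected prefix.
            emitted.append(best)
            break
    return emitted


probs = [{1: 0.6, 2: 0.4}, {3: 0.5, 4: 0.45, 5: 0.05}, {6: 0.9, 7: 0.1}]
exact = verify_draft([1, 4, 6], probs)                # stops at position 1
relaxed = verify_draft([1, 4, 6], probs, margin=0.1)  # accepts the near-top token 4
```

With exact matching the second drafted token is rejected even though the target gave it 45% probability; the relaxed rule keeps it, and the third token with it.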
Speculative decoding also supports nucleus sampling and can verify multiple draft sequences in parallel, further increasing throughput.
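For sampling, exact matching is replaced by a probabilistic acceptance rule, as described in "Fast Inference from Transformers via Speculative Decoding" (see References): a token drafted from distribution q is accepted with probability min(1, p/q) under the target distribution p, and on rejection a replacement is drawn from the normalized positive part of p - q. The sketch below assumes p and q are already the distributions after any truncation such as nucleus sampling; the function names are illustrative.

```python
import random
from typing import List, Tuple


def residual_distribution(p: List[float], q: List[float]) -> List[float]:
    """Normalized max(0, p - q): where the target has more mass than the draft."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    # total == 0 only when p == q, in which case rejection never happens.
    return [d / total for d in diff] if total > 0.0 else list(p)


def accept_or_resample(token: int, p: List[float], q: List[float],
                       rng: random.Random) -> Tuple[int, bool]:
    """Verify one token that was sampled from the draft distribution q.

    Returns (emitted_token, accepted). The emitted token is distributed
    exactly according to the target distribution p.
    """
    if rng.random() < min(1.0, p[token] / q[token]):
        return token, True
    residual = residual_distribution(p, q)
    return rng.choices(range(len(p)), weights=residual)[0], False


rng = random.Random(0)
p = [0.5, 0.3, 0.2]   # target distribution (assumed, for illustration)
q = [0.2, 0.2, 0.6]   # draft distribution (assumed, for illustration)
drafted = rng.choices(range(3), weights=q)[0]
emitted, accepted = accept_or_resample(drafted, p, q, rng)
```

The probability that a drafted token is accepted is the sum over tokens of min(p, q), so the closer the draft distribution is to the target's, the more tokens survive verification.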
Challenges and Future Directions
Key open problems include improving behavior alignment between draft and target models (e.g., more effective knowledge distillation), designing task‑specific strategies for multimodal models, and reducing the training and deployment overhead of self‑drafting mechanisms.
References
Fast Transformer Decoding: One Write‑Head is All You Need – https://arxiv.org/abs/1911.02150
Blockwise Parallel Decoding for Deep Autoregressive Models – https://arxiv.org/pdf/1811.03115.pdf
Fast Inference from Transformers via Speculative Decoding – https://arxiv.org/abs/2211.17192
Assisted Generation – https://huggingface.co/blog/assisted-generation
Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads – https://github.com/FasterDecoding/Medusa
Lookahead Decoding – https://lmsys.org/blog/2023-11-21-lookahead-decoding/
DistillSpec: Improving Speculative Decoding via Knowledge Distillation – https://arxiv.org/abs/2310.08461
Predictive Pipelined Decoding: A Compute‑Latency Trade‑off for Exact LLM Decoding – https://arxiv.org/abs/2307.05908
Draft & Verify: Lossless Large Language Model Acceleration via Self‑Speculative Decoding – https://arxiv.org/abs/2309.08168
SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification – https://arxiv.org/abs/2305.09781
