How Speculative Decoding Supercharges Large Language Model Inference

This survey examines speculative decoding—a draft‑then‑verify technique that parallelizes token generation to cut LLM inference latency, outlines its core components, compares independent and self‑drafting methods, discusses verification strategies, and highlights open research challenges.

NewBeeNLP

Background

Inference latency in large language models (LLMs) is dominated by memory bandwidth rather than compute: each decoding step must move billions of parameters from GPU high-bandwidth memory (HBM) into on-chip cache. Because autoregressive decoding generates a single token per step, GPU utilization is low and generation time grows linearly with sequence length.
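To make the memory-bandwidth argument concrete, here is a rough back-of-the-envelope estimate. It assumes every weight is read exactly once per decoding step and ignores compute, KV-cache traffic, and kernel overhead. The model size, precision, and bandwidth figures are hypothetical and chosen only for illustration.

```python
def min_step_latency_ms(n_params: float, bytes_per_param: int, bandwidth_gb_s: float) -> float:
    """Lower bound on per-token latency if all weights are read once per step."""
    bytes_moved = n_params * bytes_per_param
    seconds = bytes_moved / (bandwidth_gb_s * 1e9)
    return seconds * 1e3


# Hypothetical setup: 7e9 parameters, 2 bytes each (fp16), 1000 GB/s of bandwidth.
latency = min_step_latency_ms(7e9, 2, 1000.0)
print(f"{latency:.1f} ms per token")  # prints "14.0 ms per token"
```

Under these assumptions the bound does not depend on how many tokens are scored in a step, which is why verifying several tokens at once can cost about the same as generating one.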

[Figure: Autoregressive vs. speculative decoding]

Speculative Decoding

Speculative decoding follows a draft-then-verify paradigm. At each step a lightweight draft model predicts several future tokens, and the target LLM then validates them in a single parallel forward pass. Draft tokens that match the target's predictions are emitted, and the target supplies its own token at the first mismatch, so every step yields at least one token. This reduces the total number of decoding steps while guaranteeing, under exact-match verification, that the final output is identical to what the target LLM would produce on its own.
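The draft-then-verify loop can be sketched with toy stand-ins for the two models. This is a minimal illustration under greedy decoding, not a production implementation: `target_next` and `draft_next` are hypothetical functions invented for this example, and the verification step loops over positions where a real system would use one batched forward pass of the target.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    # Toy "target model": a deterministic function of the context (hypothetical).
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[int]) -> int:
    # Toy "draft model": agrees with the target except at every 4th position.
    tok = target_next(ctx)
    return (tok + 1) % 50 if len(ctx) % 4 == 0 else tok


def autoregressive(prompt: List[int], n_new: int, model: NextToken) -> List[int]:
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative(prompt: List[int], n_new: int, draft: NextToken,
                target: NextToken, k: int = 4) -> Tuple[List[int], int]:
    out = list(prompt)
    goal = len(prompt) + n_new
    steps = 0
    while len(out) < goal:
        # Draft stage: propose k tokens autoregressively with the cheap model.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # Verify stage: target predictions at all k + 1 positions
        # (a single parallel pass in a real system; a loop in this toy).
        target_preds = [target(out + proposal[:i]) for i in range(k + 1)]
        steps += 1
        n_accept = 0
        while n_accept < k and proposal[n_accept] == target_preds[n_accept]:
            n_accept += 1
        # Keep the matching prefix, then the target's own token at that position.
        out.extend(proposal[:n_accept])
        out.append(target_preds[n_accept])
    return out[:goal], steps


prompt = [1, 2, 3]
reference = autoregressive(prompt, 20, target_next)
output, n_steps = speculative(prompt, 20, draft_next, target_next, k=4)
print(output == reference, n_steps)  # same tokens, fewer target steps than 20
```

The output matches plain autoregressive decoding with the target because a draft token is only kept when the target would have chosen it anyway.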

Key Components

Parallel verification: scoring several draft tokens in one target forward pass adds negligible extra latency compared with generating a single token.

Drafting: the efficiency and accuracy of the draft (speculation) stage.

Verification: the strategy that balances output quality and speed.
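How these components trade off can be estimated with the simplified analysis from Fast Inference from Transformers via Speculative Decoding. The sketch below assumes each draft token is accepted independently with the same probability `alpha`, which is an idealization. The names `alpha`, `gamma` (draft length), and `c` (cost of one draft step relative to one target step) are parameters of this illustration.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target forward pass (accepted drafts plus one)."""
    if alpha == 1.0:
        return float(gamma + 1)
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Expected wall-time improvement when one step costs gamma * c + 1 target passes."""
    return expected_tokens_per_step(alpha, gamma) / (gamma * c + 1)


# Hypothetical numbers: 80% acceptance, 4 draft tokens, draft step at 5% of target cost.
print(round(expected_tokens_per_step(0.8, 4), 4))  # 3.3616
print(round(expected_speedup(0.8, 4, 0.05), 4))    # 2.8013
```

A more accurate drafter raises `alpha`, a cheaper drafter lowers `c`, and a looser verification rule also raises `alpha` at some cost in fidelity.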

Drafting Strategies

Independent Drafting

Use a smaller model from the same family as the target (e.g., OPT-125M drafting for OPT-66B, or T5-small drafting for T5-XXL). An off-the-shelf small model can be used without extra training, and it benefits from shared architecture, tokenization, and data distribution, all of which improve behavior alignment. Knowledge distillation can optionally align the small draft model further with the target, increasing the proportion of tokens that pass verification.
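The link between alignment and acceptance can be illustrated numerically. The sketch below uses forward KL divergence as the distillation objective, which is one common choice and an assumption here rather than the method of any specific paper. It also uses the known result that, under sampling-based verification, the probability of accepting a draft token equals the overlap sum of min(p, q) between target distribution p and draft distribution q. All logits are made up for the example.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: List[float], q: List[float], eps: float = 1e-12) -> float:
    """Forward KL(p || q): a distillation loss that pulls draft q toward target p."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def acceptance_probability(p: List[float], q: List[float]) -> float:
    """Chance a token sampled from q is accepted under speculative sampling."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))


# Hypothetical next-token logits over a 3-token vocabulary.
target = softmax([2.0, 1.0, 0.1])
draft_far = softmax([0.1, 1.0, 2.0])    # poorly aligned draft
draft_near = softmax([1.8, 1.1, 0.2])   # draft after (imagined) distillation

for name, q in [("far", draft_far), ("near", draft_near)]:
    print(name, round(kl_divergence(target, q), 3), round(acceptance_probability(target, q), 3))
```

The better-aligned draft has both a lower KL divergence and a higher acceptance probability, which is the effect distillation aims for.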

[Figure: Independent drafting example]

Self‑Drafting

When a suitable external small model is unavailable, the target LLM can generate drafts itself by adding extra feed-forward heads on the final decoder layer (e.g., Blockwise Decoding, Medusa). These heads enable parallel generation of multiple tokens per step. Alternative approaches include early exiting, layer skipping, or inserting multiple [PAD] tokens to create parallel draft streams. Head-based methods require additional training of the extra heads, whereas layer-skipping approaches such as Draft & Verify reuse the target model's existing weights.
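The head-based idea can be sketched in a few lines. This toy uses plain linear heads with random weights on a made-up hidden state, so it shows only the data flow: one final hidden vector in, several future-token guesses out. The class name `DraftHeads` and its parameters are hypothetical, and real systems such as Medusa use trained heads with their own architecture.

```python
import random
from typing import List


def matvec(weights: List[List[float]], h: List[float]) -> List[float]:
    return [sum(w * x for w, x in zip(row, h)) for row in weights]


def argmax(values: List[float]) -> int:
    return max(range(len(values)), key=values.__getitem__)


class DraftHeads:
    """Extra heads on the final hidden state; each guesses a different future position."""

    def __init__(self, hidden: int, vocab: int, n_heads: int, seed: int = 0):
        rng = random.Random(seed)
        # One (vocab x hidden) matrix per head. Untrained random weights, for shape only.
        self.heads = [
            [[rng.gauss(0.0, 1.0) for _ in range(hidden)] for _ in range(vocab)]
            for _ in range(n_heads)
        ]

    def draft(self, h: List[float]) -> List[int]:
        # All heads read the same hidden state, so they can run in parallel.
        return [argmax(matvec(w, h)) for w in self.heads]


heads = DraftHeads(hidden=8, vocab=20, n_heads=3)
hidden_state = [0.1 * i for i in range(8)]  # stand-in for the last decoder layer's output
print(heads.draft(hidden_state))            # three draft token ids, one per head
```

Because every head consumes the same hidden state, drafting costs one pass through the target's decoder plus a few small projections.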

[Figure: Medusa architecture]

Verification Strategies

During verification the draft tokens are fed as a prefix to the target LLM. If the target generates the same token as the draft, the token is accepted. The first mismatch discards all remaining draft tokens, because they were conditioned on a prefix the target would not have produced. Exact-match verification guarantees results identical to greedy decoding, but it may discard high-quality tokens that differ from the top-1 choice, which reduces the speed-up. Several works relax the verification criterion (e.g., accepting tokens within a probability margin) to increase the number of accepted tokens while maintaining acceptable quality.
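The two verification styles can be compared side by side. In the sketch below, `target_probs[i]` is the target's distribution at draft position `i`, given the prefix plus the earlier draft tokens. The relaxed rule, which accepts a token whose probability is at least `ratio` times the top-1 probability, is a hypothetical example of a probability-margin criterion and not the rule of any specific paper.

```python
from typing import List, Sequence


def verify_exact(draft_tokens: Sequence[int], target_probs: Sequence[Sequence[float]]) -> int:
    """Count leading draft tokens that equal the target's top-1 choice."""
    accepted = 0
    for tok, probs in zip(draft_tokens, target_probs):
        top1 = max(range(len(probs)), key=probs.__getitem__)
        if tok != top1:
            break  # later drafts were conditioned on a rejected token
        accepted += 1
    return accepted


def verify_relaxed(draft_tokens: Sequence[int], target_probs: Sequence[Sequence[float]],
                   ratio: float = 0.5) -> int:
    """Count leading draft tokens whose probability is within `ratio` of the top-1 (assumed rule)."""
    accepted = 0
    for tok, probs in zip(draft_tokens, target_probs):
        if probs[tok] < ratio * max(probs):
            break
        accepted += 1
    return accepted


# Made-up distributions over a 3-token vocabulary for three draft positions.
draft = [0, 2, 1]
probs: List[List[float]] = [[0.7, 0.2, 0.1], [0.1, 0.5, 0.4], [0.2, 0.6, 0.2]]
print(verify_exact(draft, probs))         # 1: the second draft token is not top-1
print(verify_relaxed(draft, probs, 0.5))  # 3: 0.4 is within half of 0.5, so it passes
```

The relaxed rule accepts more tokens per step, but the output is no longer guaranteed to match greedy decoding.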

Speculative decoding also supports nucleus sampling and can verify multiple draft sequences in parallel, further increasing throughput.
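For sampling, exact matching is replaced by a probabilistic acceptance rule. The sketch below follows the rule described in Fast Inference from Transformers via Speculative Decoding: a token drawn from the draft distribution q is accepted with probability min(1, p/q), and on rejection a replacement is drawn from the normalized positive part of p - q. The distributions are made up. For nucleus sampling, p and q would be the truncated and renormalized distributions, which is an assumption of this illustration.

```python
import random
from typing import List, Tuple


def residual_distribution(p: List[float], q: List[float]) -> List[float]:
    """Normalized max(0, p - q): where the target has more mass than the draft."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    if total == 0.0:
        return list(p)  # p and q are identical; rejection cannot happen
    return [d / total for d in diff]


def accept_or_resample(token: int, p: List[float], q: List[float],
                       rng: random.Random) -> Tuple[int, bool]:
    """`token` was sampled from q. Returns (final_token, was_accepted)."""
    if rng.random() < min(1.0, p[token] / q[token]):
        return token, True
    replacement = rng.choices(range(len(p)), weights=residual_distribution(p, q))[0]
    return replacement, False


rng = random.Random(0)
p = [0.5, 0.3, 0.2]  # target distribution (made up)
q = [0.2, 0.3, 0.5]  # draft distribution (made up)
drafted = rng.choices(range(3), weights=q)[0]
print(accept_or_resample(drafted, p, q, rng))
```

With this rule the emitted token is distributed according to p, so the sampled output follows the target model's distribution even though the draft model proposed it.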

Challenges and Future Directions

Key open problems include improving behavior alignment between draft and target models (e.g., more effective knowledge distillation), designing task‑specific strategies for multimodal models, and reducing the training and deployment overhead of self‑drafting mechanisms.

References

Fast Transformer Decoding: One Write‑Head is All You Need – https://arxiv.org/abs/1911.02150

Blockwise Parallel Decoding for Deep Autoregressive Models – https://arxiv.org/pdf/1811.03115.pdf

Fast Inference from Transformers via Speculative Decoding – https://arxiv.org/abs/2211.17192

Assisted Generation – https://huggingface.co/blog/assisted-generation

Medusa: Simple Framework for Accelerating LLM Generation with Multiple Decoding Heads – https://github.com/FasterDecoding/Medusa

Lookahead Decoding – https://lmsys.org/blog/2023-11-21-lookahead-decoding/

DistillSpec: Improving Speculative Decoding via Knowledge Distillation – https://arxiv.org/abs/2310.08461

Predictive Pipelined Decoding: A Compute‑Latency Trade‑off for Exact LLM Decoding – https://arxiv.org/abs/2307.05908

Draft & Verify: Lossless Large Language Model Acceleration via Self‑Speculative Decoding – https://arxiv.org/abs/2309.08168

SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification – https://arxiv.org/abs/2305.09781
