Speculative Decoding Explained: Small Draft Model + One‑Shot Verification
The article details how speculative decoding—using a fast small model to draft tokens and a large model to verify them—overcomes the memory‑bandwidth bottleneck of autoregressive inference, introduces SSD’s self‑draft and tree‑verification stages, presents real‑world benchmark gains, and shows how to enable it in vLLM.
What Is Speculative Decoding?
Speculative decoding starts with a tiny, fast draft model that predicts the next K tokens. The large target model then validates all K tokens in a single forward pass. Each draft token is accepted with a probability determined by comparing the draft and target distributions; on a rejection, the token is resampled from a corrected distribution, guaranteeing exactly the same output distribution as the target model alone.
Three Core Steps
Step 1 – Draft Generation: A small model (e.g., Llama‑3 8B) generates K candidate tokens autoregressively. It occupies only a fraction of GPU memory and is roughly ten times faster per token than the large model.
Step 2 – Verification: The large model (e.g., Llama‑3 70B) processes all K draft tokens in parallel, computing probability distributions for each position in one forward pass, instead of K separate passes.
Step 3 – Accept/Reject: For each position, the target model's probability for the draft token is compared against the draft model's. The token is accepted with probability min(1, p_target/p_draft); on rejection, the target resamples from a corrected (residual) distribution and all subsequent draft tokens are discarded. The mathematical guarantee is that the final output distribution is identical to running the target model alone.
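As a concrete illustration, here is a minimal sketch of that accept/reject rule with toy NumPy distributions. The function name, shapes, and the example numbers are mine, not from vLLM or any paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def accept_or_reject(draft_tokens, q_dists, p_dists):
    """q_dists[i] / p_dists[i]: draft / target distributions at position i.
    Returns the accepted prefix, ending with one token sampled by the target."""
    out = []
    for i, tok in enumerate(draft_tokens):
        q, p = q_dists[i], p_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):  # accept w.p. min(1, p/q)
            out.append(tok)
        else:
            # Reject: sample from the residual max(p - q, 0), renormalized.
            # This correction is what keeps the output distribution equal to p.
            residual = np.maximum(p - q, 0.0)
            out.append(rng.choice(len(p), p=residual / residual.sum()))
            return out
    # All K drafts accepted: in the full algorithm, the target's (K+1)-th
    # distribution would also yield one "bonus" token here.
    return out

# Toy example: 4-token vocabulary, K = 2 draft tokens
q = np.array([[0.7, 0.1, 0.1, 0.1], [0.4, 0.3, 0.2, 0.1]])
p = np.array([[0.5, 0.2, 0.2, 0.1], [0.1, 0.6, 0.2, 0.1]])
print(accept_or_reject([0, 0], q, p))
```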
Speedup Dependence
The acceleration depends on the acceptance rate. Assuming an 80 % acceptance rate and K = 5, the average number of accepted tokens per verification is 4, reducing the large model’s forward‑pass cost to about one‑quarter. Theoretical speedup ≈ 4×; real‑world measurements show 2–2.5× because of draft‑model overhead.
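The arithmetic below treats all five draft tokens as if they were kept independently. A stricter estimate, added here as a sketch of my own (not from the article), accounts for the fact that one rejection discards every later draft token; under an i.i.d. per‑token acceptance probability it lands noticeably closer to the measured 2–2.5× than the back‑of‑the‑envelope figures that follow:

```python
alpha, K = 0.8, 5  # per-token acceptance probability and draft length

# A rejection at position i discards positions i+1..K, so the expected
# number of accepted draft tokens is a truncated geometric sum:
expected_accepted = sum(alpha**i for i in range(1, K + 1))

# Each verification pass also emits one token sampled by the target itself
# (the corrected or bonus token):
tokens_per_pass = expected_accepted + 1
print(f"{expected_accepted:.2f} accepted drafts per pass")      # ~2.69
print(f"{tokens_per_pass:.2f} tokens per target forward pass")  # ~3.69
```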
The simplified arithmetic:

```
speedup ≈ avg_accepted_tokens per target forward pass

if acceptance_rate = 0.8 and K = 5:
    avg_accepted ≈ 0.8 * 5 = 4
    speedup ≈ 4x in forward passes
    real-world speedup ≈ 2–2.5x (accounting for draft-model overhead)
```

SSD: Self‑Speculative Decoding
In 2026, Together AI and Stanford introduced SSD to address two pain points of the original approach:

1. Draft‑model selection required a separate model with a compatible vocabulary, wasting GPU memory.
2. Acceptance rates drop on difficult tokens, where the target model's distribution is flat.
SSD solves these with two stages:
Stage 1 – Self‑Draft Generation: The first N layers of the target model act as a lightweight draft generator, eliminating the need for an external draft model.
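To make Stage 1 concrete, here is a toy sketch of drafting from the first N layers of a single model. The dimensions, layer counts, and shared head are all made up for illustration (and KV caching and causal masking are omitted); this is not the SSD code:

```python
import torch
import torch.nn as nn

VOCAB, DIM, TOTAL_LAYERS, DRAFT_LAYERS = 1000, 64, 8, 2  # toy sizes

embed = nn.Embedding(VOCAB, DIM)
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
    for _ in range(TOTAL_LAYERS)
)
lm_head = nn.Linear(DIM, VOCAB)  # one head, shared by draft and full passes

def logits_after(tokens: torch.Tensor, n_layers: int) -> torch.Tensor:
    """Run the embedding, the first n_layers blocks, and the shared LM head."""
    h = embed(tokens)
    for layer in layers[:n_layers]:
        h = layer(h)
    return lm_head(h)  # (batch, seq, vocab) logits

tokens = torch.randint(0, VOCAB, (1, 10))
draft_logits = logits_after(tokens, DRAFT_LAYERS)   # cheap self-draft pass
full_logits = logits_after(tokens, TOTAL_LAYERS)    # full verification pass
```

Because the draft shares the target's weights, no second set of parameters competes for GPU memory.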
Stage 2 – Tree‑Based Speculative Verification: Instead of a linear K‑token draft, SSD proposes top‑2 or top‑3 candidates at each position, forming a branching tree. A specially designed attention mask lets the target model verify the entire tree in one forward pass, selecting the longest accepted path.
The tree structure improves acceptance because even if one draft token is wrong, other branches may still be correct. Empirical results show higher speedups across tasks.
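To see how one forward pass can verify a whole tree, consider the attention mask: each draft token may attend only to its ancestors in the tree (plus the committed prefix), so sibling branches never see each other. A minimal sketch of constructing such a mask; the tree shape and variable names are illustrative, not from the SSD paper:

```python
import numpy as np

# parents[i] is the index of draft-tree node i's parent (-1 = child of the
# committed prefix). This tree drafts two candidates at the first position
# and two continuations under candidate 0:
#
#   prefix ─┬─ node 0 ─┬─ node 2
#           │          └─ node 3
#           └─ node 1
parents = [-1, -1, 0, 0]
n = len(parents)

# mask[i, j] is True iff node i may attend to node j, i.e. j is an
# ancestor of i (or i itself); tokens on other branches stay invisible.
mask = np.zeros((n, n), dtype=bool)
for i in range(n):
    j = i
    while j != -1:
        mask[i, j] = True
        j = parents[j]

print(mask.astype(int))
# [[1 0 0 0]
#  [0 1 0 0]
#  [1 0 1 0]
#  [1 0 0 1]]
```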
Benchmark Numbers
Single‑request throughput on an H100 (80 GB) for Llama‑3 70B:
- Baseline autoregressive decoding: 125 tokens/s.
- With SSD: 250 tokens/s (2.0× improvement).
- First‑token latency (TTFT): 180 ms baseline → 195 ms with SSD (tree initialization adds ~15 ms).
- Memory usage: 68.2 GB baseline → 71.4 GB with SSD (+3.2 GB, 4.7%).
Task‑level acceptance rates and effective speedups:
| Task | Acceptance Rate | Effective Speedup |
|-----------------|-----------------|-------------------|
| Code generation | 87% | 2.3x |
| Translation | 82% | 2.0x |
| Summarization | 79% | 1.9x |
| Creative writing| 68% | 1.6x |
| Math reasoning  | 72%             | 1.7x              |

Code‑heavy workloads benefit most because the draft model predicts predictable syntax well; creative writing sees the smallest gains due to higher uncertainty.
At production scale (10 M tokens/day on an H100), the baseline needs four GPUs to keep TTFT < 200 ms, whereas SSD requires only two, saving roughly $150,000 annually.
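As a back‑of‑the‑envelope check of that figure (the hourly rate below is my assumption, chosen to be consistent with the article's claim, not a quoted price):

```python
GPU_HOURLY_USD = 8.50            # assumed on-demand H100 rate
baseline_gpus, ssd_gpus = 4, 2   # GPUs needed to keep TTFT < 200 ms

saved = (baseline_gpus - ssd_gpus) * GPU_HOURLY_USD * 24 * 365
print(f"annual savings ≈ ${saved:,.0f}")  # ≈ $148,920, i.e. roughly $150k
```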
Implementation in vLLM
vLLM added production‑grade speculative decoding in version 0.7 and SSD‑style self‑draft in 0.8. Example configuration for basic speculative decoding:
```python
from vllm import LLM, SamplingParams

# Initialize with a separate draft model
llm = LLM(
    model="meta-llama/Llama-3-70B-Instruct",
    speculative_model="meta-llama/Llama-3-8B-Instruct",
    num_speculative_tokens=5,
    tensor_parallel_size=4,
    gpu_memory_utilization=0.90,
    dtype="float16",
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=2048,
)

prompts = [
    "Explain the PagedAttention algorithm in detail.",
    "Write a Python function to merge two sorted linked lists.",
    "Summarize the key findings of the 2026 State of AI report.",
]

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    generated_text = output.outputs[0].text
    num_tokens = len(output.outputs[0].token_ids)
    print(f"Generated {num_tokens} tokens")
    print(generated_text[:200])
    print("---")
```

SSD‑style self‑draft (single model) configuration:
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3-70B-Instruct",
    speculative_model="[self]",  # use early layers as the draft
    speculative_draft_tensor_parallel_size=1,
    num_speculative_tokens=5,
    speculative_max_model_len=4096,
    tensor_parallel_size=4,
    gpu_memory_utilization=0.92,
    dtype="float16",
)

sampling_params = SamplingParams(
    temperature=0.0,  # greedy decoding for the highest acceptance rate
    max_tokens=2048,
)

prompts = load_prompts_from_queue()  # your batch of prompts
outputs = llm.generate(prompts, sampling_params)
```

Running the server is transparent to OpenAI‑compatible clients:
```python
# Start the vLLM server first:
# vllm serve meta-llama/Llama-3-70B-Instruct \
#     --speculative-model meta-llama/Llama-3-8B-Instruct \
#     --num-speculative-tokens 5 \
#     --tensor-parallel-size 4 \
#     --port 8000

import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Llama-3-70B-Instruct",
    messages=[{"role": "user", "content": "Explain speculative decoding."}],
    max_tokens=1024,
    temperature=0.7,
    stream=True,
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```

The acceleration happens entirely on the server side; existing client code requires no changes.
When to Use Speculative Decoding
Best suited for batch inference pipelines where the draft and target models belong to the same family (e.g., Llama‑3 8B + Llama‑3 70B). High acceptance rates are achieved with greedy or low‑temperature sampling. Scenarios such as RAG, summarization, and data extraction benefit most, often halving latency and GPU cost.
Avoid speculative decoding when:

- Strict sub‑100 ms TTFT is required (tree initialization adds 10–20 ms).
- High‑temperature (> 1.0) creative generation leads to low acceptance rates (< 60%).
- Cross‑family model pairs are used, causing vocabulary mismatch.
- Tasks produce very short outputs (< 50 tokens), where overhead outweighs gains.
- GPU memory is already saturated; adding draft overhead may cause OOM.
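When in doubt, measure on your own workload. A minimal A/B harness, as a sketch: the prompts and batch size are placeholders, and each configuration should run in its own process so two copies of the weights never coexist:

```python
import time
from vllm import LLM, SamplingParams

prompts = ["Summarize the following document: ..."] * 32  # your real prompts
params = SamplingParams(temperature=0.0, max_tokens=512)

def throughput(llm: LLM) -> float:
    """Generation throughput in tokens per second over the prompt batch."""
    start = time.perf_counter()
    outputs = llm.generate(prompts, params)
    elapsed = time.perf_counter() - start
    return sum(len(o.outputs[0].token_ids) for o in outputs) / elapsed

# Run once with the baseline configuration...
llm = LLM(model="meta-llama/Llama-3-70B-Instruct", tensor_parallel_size=4)
# ...and once (in a separate process) with the speculative settings above:
# llm = LLM(model="meta-llama/Llama-3-70B-Instruct",
#           speculative_model="meta-llama/Llama-3-8B-Instruct",
#           num_speculative_tokens=5, tensor_parallel_size=4)
print(f"{throughput(llm):.1f} tokens/s")
```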
Conclusion
Speculative decoding has moved from research to production. SSD’s self‑draft generation removes the need for a separate draft model, keeping only one set of weights in memory while preserving the full speedup. Enabling it in vLLM can double inference throughput with minimal engineering effort.