Red Hat AI Brings DSpark Speculative Decoding to GLM‑5.2, Doubling Speed
Red Hat AI released a DSpark speculative decoding model for GLM‑5.2, showing how a 3B draft model and a Markov logit‑bias head can boost token acceptance length and achieve up to 2.15× faster decoding on a 4×B300 GPU setup.
DSpark builds on a DFlash parallel draft backbone and adds a Markov logit‑bias head and a per‑position confidence head. A small 3B draft model proposes several candidate tokens at once, and the large GLM‑5.2‑FP8 model verifies them in a single batched pass instead of decoding serially, one token at a time.
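The draft-then-verify idea can be illustrated with a toy sketch. This is not DSpark or vLLM code: the two "models" are hypothetical integer functions, the draft runs sequentially rather than in parallel, and acceptance is greedy (exact match) rather than the probabilistic sampling used in the serving config further down. It only shows why the output matches plain target decoding while needing fewer target steps.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # greedy next-token function (toy stand-in)


def target_next(ctx: List[int]) -> int:
    """Hypothetical 'large model': counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_next(ctx: List[int]) -> int:
    """Hypothetical 'draft model': agrees with the target except after 3 or 7."""
    return 0 if ctx[-1] % 4 == 3 else (ctx[-1] + 1) % 10


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """One draft-then-verify step with greedy acceptance."""
    # 1. Draft k candidate tokens.
    drafted: List[int] = []
    for _ in range(k):
        drafted.append(draft(prefix + drafted))

    # 2. Verify. A real engine scores all k positions in one batched forward
    #    pass of the target; this toy calls the target once per position.
    emitted: List[int] = []
    for tok in drafted:
        expected = target(prefix + emitted)
        if tok != expected:
            emitted.append(expected)  # first mismatch: keep the target's token
            return emitted
        emitted.append(tok)
    emitted.append(target(prefix + emitted))  # all accepted: one bonus token
    return emitted


def generate(prefix: List[int], draft: NextToken, target: NextToken,
             k: int, max_new: int) -> Tuple[List[int], int]:
    """Generate max_new tokens; also return how many verify steps were needed."""
    out = list(prefix)
    steps = 0
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft, target, k))
        steps += 1
    return out[: len(prefix) + max_new], steps


def generate_baseline(prefix: List[int], target: NextToken, max_new: int) -> List[int]:
    """Plain serial decoding with the target only, for comparison."""
    out = list(prefix)
    for _ in range(max_new):
        out.append(target(out))
    return out


tokens, steps = generate([0], draft_next, target_next, k=7, max_new=12)
print(tokens == generate_baseline([0], target_next, 12), steps)
```

With greedy acceptance the speculative output is identical to the baseline by construction; the saving comes from needing fewer verification steps than generated tokens.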
The draft model was trained in two stages: a preview checkpoint (3 epochs on 50k UltraChat data), followed by an epoch‑1 checkpoint that uses a full‑vocabulary draft (154,880 tokens) and regenerated Magpie + UltraChat data, which raised the mean accepted length.
During epoch‑1 training the mean accepted length stabilized around 3.4 tokens, and per‑position acceptance dropped smoothly from 78% at position 1 to 38% at position 7, suggesting that the Markov head effectively suppresses suffix decay.
On a 4×B300 GPU setup, the base model without speculative decoding runs at 102 tok/s; the preview checkpoint reaches 139 tok/s (1.36×), and the epoch‑1 checkpoint achieves 219 tok/s (2.15×). The average accepted length increased from 2.18 to 3.49 tokens, meaning each forward pass of the large model confirms 3.49 tokens on average instead of just one.
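The reported speedups follow directly from the throughput figures. The short sketch below reproduces them and compares each to the mean accepted length. Two things are assumptions here: pairing 2.18 with the preview checkpoint and 3.49 with epoch‑1, and treating the accepted length as an ideal upper bound on speedup (which holds only if drafting were free and a verification pass cost the same as a normal decode step).

```python
BASELINE_TOK_S = 102.0  # base model, no speculative decoding (from the article)

# checkpoint -> (measured tok/s, mean accepted length)
# Assumption: 2.18 belongs to the preview checkpoint and 3.49 to epoch-1.
RESULTS = {
    "preview": (139.0, 2.18),
    "epoch-1": (219.0, 3.49),
}


def speedup(tok_s: float, baseline: float = BASELINE_TOK_S) -> float:
    """Throughput relative to the non-speculative baseline."""
    return tok_s / baseline


def realized_fraction(tok_s: float, accepted_len: float) -> float:
    """Share of the idealized gain (speedup == accepted length) actually measured."""
    return speedup(tok_s) / accepted_len


for name, (tok_s, accepted_len) in RESULTS.items():
    print(f"{name}: {speedup(tok_s):.2f}x speedup, "
          f"{realized_fraction(tok_s, accepted_len):.0%} of the idealized gain")
```

The gap between measured speedup and accepted length is the combined cost of running the draft model and of the wider verification pass.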
To use the model you need the vLLM nightly build. Install it, then start the server with a speculative config:
uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly
vllm serve zai-org/GLM-5.2-FP8 \
--tensor-parallel-size 4 \
--max-model-len 16384 \
--trust-remote-code \
--speculative-config '{
"model": "RedHatAI/GLM-5.2-speculator.dspark",
"num_speculative_tokens": 7,
"method": "dspark",
"draft_sample_method": "probabilistic"
}'
Note that the --speculative-config model path points to a Hugging Face repository; to try the preview checkpoint, replace the model name with RedHatAI/GLM-5.2-speculator.dspark-preview.
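Once the server is running, it can be queried like any other vLLM deployment through the OpenAI-compatible HTTP API; speculative decoding is transparent to the client. The sketch below uses only the Python standard library. The base URL assumes vLLM's default port 8000 on localhost, and the helper names `build_request` and `complete` are illustrative, not part of any library.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: default vLLM host and port
MODEL = "zai-org/GLM-5.2-FP8"          # the model name passed to `vllm serve`


def build_request(prompt: str, max_tokens: int = 128) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, max_tokens: int = 128, timeout: float = 120.0) -> str:
    """Send the request and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_request(prompt, max_tokens), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]
```

Calling `complete("Explain speculative decoding in one sentence.")` against the running server returns the generated text; the request is the same whether or not a draft model is configured.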
Speculative decoding has evolved from an academic toy into a practical tool. DSpark, previously tied only to DeepSeek, now demonstrates that the architecture is generic: given a well‑trained draft model, other large models can plausibly gain a similar speedup of roughly 2×. Generating the training data via self‑play also lowers the data‑preparation barrier.
Epoch‑2 and epoch‑3 checkpoints are still in training, and the trend suggests the accepted length could exceed 4 tokens. For teams running GLM‑5.2 inference, this is a highly cost‑effective optimization: adding a 3B draft model can roughly halve GPU time without modifying the base model. Both the model and the code are released under the MIT license, and the training pipeline, built on the speculators library, is open‑sourced on GitHub.
