Red Hat AI Brings DSpark Speculative Decoding to GLM‑5.2, Doubling Speed

Red Hat AI released a DSpark speculative decoding model for GLM‑5.2, showing how a 3B draft model and a Markov logit‑bias head can boost token acceptance length and achieve up to 2.15× faster decoding on a 4×B300 GPU setup.

AI Engineering

DSpark is built on a DFlash parallel draft backbone, augmented with a Markov logit‑bias head and a per‑position confidence head. A small 3 B draft model quickly generates several candidate tokens, and the large GLM‑5.2‑FP8 model then verifies them in a single batch, which avoids serial token‑by‑token decoding for every accepted token.
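The draft-then-verify idea can be illustrated with a toy sketch. This is generic speculative decoding with greedy acceptance, not DSpark's actual implementation: the two "models" are hypothetical stand-in functions, the Markov and confidence heads are not modeled, and a real server would run the verification as one batched forward pass (and may use probabilistic acceptance rather than exact matching).

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]


def speculative_step(prefix: List[Token], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[Token]:
    """Run one draft-then-verify round and return the tokens emitted."""
    # 1. The draft model proposes k candidate tokens.
    proposal: List[Token] = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    # 2. The target model checks each candidate in order.
    #    (Sequential here for clarity; conceptually one batched pass.)
    accepted: List[Token] = []
    for tok in proposal:
        expected = target_next(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every candidate matched, so the target adds one bonus token.
    accepted.append(target_next(prefix + accepted))
    return accepted


# Hypothetical toy models: the target counts upward modulo 10,
# and the draft agrees except right after the token 4.
def toy_target(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    return 0 if ctx[-1] == 4 else (ctx[-1] + 1) % 10


emitted = speculative_step([1], toy_draft, toy_target, k=7)
print(emitted, "accepted length:", len(emitted))
```

In this toy run the draft goes wrong at its fourth candidate, so the round emits three accepted draft tokens plus the target's correction. The output is the same sequence the target would have produced on its own, only in fewer target rounds.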

The project was trained in two stages: a preview checkpoint (3 epochs on 50 k UltraChat data) followed by a full epoch‑1 checkpoint that used a full‑vocabulary draft (154 880 tokens) and Magpie + UltraChat regenerated data, which raised the mean accepted length.

During epoch‑1 training the mean accepted length stabilized at around 3.4 tokens, and per‑position acceptance fell smoothly from 78 % at position 1 to 38 % at position 7, which suggests the Markov head is effective at suppressing suffix decay.

On a 4×B300 GPU setup, the base model without speculative decoding runs at 102 tok/s; the preview checkpoint reaches 139 tok/s (1.36×), and the epoch‑1 checkpoint achieves 219 tok/s (2.15×). The average accepted length increased from 2.18 to 3.49 tokens, so each forward pass of the large model confirms 3.49 tokens on average instead of just one.
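The reported figures can be checked with a few lines of arithmetic. The sketch below recomputes the speedups and also derives a rough per-step overhead. That second number is not in the source: it assumes that 2.18 is the preview checkpoint's accepted length and that "accepted length" equals tokens emitted per target forward pass.

```python
BASE_TPS = 102.0  # tok/s without speculative decoding, as reported

# (throughput in tok/s, mean accepted length); pairing 2.18 with the
# preview checkpoint is an assumption based on the article's wording.
RESULTS = {
    "preview": (139.0, 2.18),
    "epoch-1": (219.0, 3.49),
}


def speedup(tps: float, base: float = BASE_TPS) -> float:
    """Throughput relative to the non-speculative baseline."""
    return tps / base


def relative_step_cost(tps: float, accepted_len: float,
                       base: float = BASE_TPS) -> float:
    """Estimated cost of one draft+verify step, in baseline decode steps.

    The baseline emits one token per step, so it runs `base` steps/s.
    With speculation, steps/s = tps / accepted_len.
    """
    steps_per_second = tps / accepted_len
    return base / steps_per_second


for name, (tps, accepted_len) in RESULTS.items():
    print(f"{name}: speedup {speedup(tps):.2f}x, "
          f"step cost {relative_step_cost(tps, accepted_len):.2f}x baseline")
```

Under those assumptions, each speculative step costs roughly 1.6 baseline steps for both checkpoints, so the gain from preview to epoch‑1 comes almost entirely from the longer accepted length rather than from cheaper steps.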

To use the model you need the vLLM nightly build:

uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly

vllm serve zai-org/GLM-5.2-FP8 \
    --tensor-parallel-size 4 \
    --max-model-len 16384 \
    --trust-remote-code \
    --speculative-config '{
        "model": "RedHatAI/GLM-5.2-speculator.dspark",
        "num_speculative_tokens": 7,
        "method": "dspark",
        "draft_sample_method": "probabilistic"
    }'

Note that the model field in --speculative-config points to a Hugging Face repository; to try the preview checkpoint, replace the model name with RedHatAI/GLM-5.2-speculator.dspark-preview.
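Once the server is running, it can be queried through vLLM's OpenAI-compatible HTTP API. The following is a minimal client sketch using only the standard library. It assumes the server listens on vLLM's default address (http://localhost:8000) and serves the model under its repository name; the helper names build_payload and complete are illustrative and do not come from the release.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed: vLLM's default host and port
MODEL = "zai-org/GLM-5.2-FP8"       # served model name from the command above


def build_payload(prompt: str, max_tokens: int = 128) -> dict:
    """Build the JSON body for a /v1/completions request."""
    return {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }


def complete(prompt: str, base_url: str = BASE_URL) -> str:
    """Send one completion request and return the generated text."""
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

Calling complete("Explain speculative decoding in one sentence.") should return the generated text. Speculative decoding is transparent to the client, so the request looks the same with or without the draft model; only latency and throughput change.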

Speculative decoding has moved from an academic curiosity to a practical tool. DSpark was previously tied only to DeepSeek; this release suggests the architecture is generic, so a large model can gain roughly 2× decoding speed when a well‑trained draft model is available. Generating the training data through self‑play also lowers the data‑preparation barrier.

Epoch‑2 and epoch‑3 checkpoints are still in training, and the trend suggests the accepted length could exceed 4 tokens. For teams running GLM‑5.2 inference, this is presented as the most cost‑effective optimization: adding a 3 B draft model can roughly halve GPU time without modifying the base model. Both the model and code are released under the MIT license, and the training pipeline, based on the speculators library, is open‑sourced on GitHub.
