How Unsloth’s MTP Boosts Qwen3.6 Inference Speed on Consumer GPUs
Unsloth has added MTP to its Qwen3.6‑27B and 35B‑A3B GGUF models, delivering 1.5‑2× decoding speed gains on consumer‑grade GPUs with ~80% draft acceptance. This article covers installation, usage parameters, benchmark results, and guidance on suitable scenarios.
Introduction
Unsloth released Qwen3.6‑27B‑MTP‑GGUF and Qwen3.6‑35B‑A3B‑MTP‑GGUF. MTP (Multi‑Token Prediction) trains the model to predict several future tokens, uses those predictions as a draft during inference, and validates them with the main model, eliminating the need for a separate draft model and saving memory.
Core Highlights
Decoding speed improves ~1.5‑2× (Unsloth’s reported numbers).
Draft acceptance rate around 80%.
Prefill stage incurs a modest overhead, especially for long contexts.
Supports two model sizes: 27B dense and 35B‑A3B (256 experts, 8+1 active).
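The ~80% acceptance rate and the default draft length together predict the observed speedup. Under the standard speculative‑decoding model, where each drafted token is accepted independently with probability a (a simplifying assumption, not something the article states), the expected number of tokens emitted per verification step with draft length n is (1 − a^(n+1)) / (1 − a). A quick sanity check with a ≈ 0.8 and n = 3:

```shell
# Expected tokens per verification step in speculative decoding:
# E = (1 - a^(n+1)) / (1 - a), assuming each drafted token is
# accepted independently with probability a (a simplification).
awk 'BEGIN {
  a = 0.8                          # ~80% acceptance rate reported above
  n = 3                            # --spec-draft-n-max 3 (default used here)
  printf "%.2f tokens/step\n", (1 - a^(n+1)) / (1 - a)
}'
```

This yields about 2.95 tokens per step; after subtracting verification overhead, a 1.5‑2× wall‑clock gain is plausible.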
Installation
Prerequisite: use the mtp-clean branch of llama.cpp (PR #22673 at the time of writing).
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone -b mtp-clean https://github.com/am17an/llama.cpp.git
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp
CPU / Mac Metal users should replace -DGGML_CUDA=ON with -DGGML_CUDA=OFF.
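After the build, a quick check that the binaries landed where the run commands below expect them (this check is my addition, not part of Unsloth's instructions; paths match the clone location used above):

```shell
# Verify the freshly built binaries were copied next to the source tree.
for bin in llama-cli llama-server; do
  if [ -x "llama.cpp/$bin" ]; then
    echo "$bin OK"
  else
    echo "$bin missing - re-run the cmake build and cp steps"
  fi
done
```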
Usage
Run the 27B version (recommended configuration):
export LLAMA_CACHE="unsloth/Qwen3.6-27B-MTP-GGUF"
./llama.cpp/llama-server \
-hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
-ngl 99 -c 8192 -fa on -np 1 \
--spec-type mtp --spec-draft-n-max 3

Run the 35B‑A3B (MoE) version:
export LLAMA_CACHE="unsloth/Qwen3.6-35B-A3B-MTP-GGUF"
./llama.cpp/llama-server \
-hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
-ngl 99 -c 8192 -fa on -np 1 \
--spec-type mtp --spec-draft-n-max 3

Key parameters:
--spec-type mtp: enables MTP speculative decoding.
--spec-draft-n-max 3: drafts up to three tokens per step; beyond that the marginal gain diminishes.
Known limitations:
-np > 1 (parallel slots) is not yet supported.
--mmproj (multimodal) is not yet supported.
Consequently, MTP is currently best for single‑user, local, pure‑text scenarios.
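Once llama-server is running, it exposes llama.cpp's OpenAI-compatible HTTP API, so a test request can be sent with curl (port 8080 is llama-server's default; the prompt and max_tokens are just example values):

```shell
# Send a chat request to llama-server's OpenAI-compatible endpoint.
# MTP speculation happens server-side; the API is unchanged.
PAYLOAD='{"messages":[{"role":"user","content":"write a flappy bird clone"}],"max_tokens":256}'
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "$PAYLOAD" || echo "server not reachable - is llama-server running?"
```

The per-request timing and draft-acceptance statistics appear in the server's log output, which is where the benchmark numbers below come from.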
Benchmarks
Community test on a single RTX 5090 with Qwen3.6‑27B quantized to Q4_0 (KV cache also Q4_0) and prompt “write a flappy bird clone”.
With MTP enabled:
prompt eval: 253.34 tok/s
decode eval: 105.47 tok/s
draft acceptance rate: 79.7% (4169 / 5229)
total: 5929 tokens / 56.1 s

Without MTP (same model and config):
prompt eval: 174.20 tok/s
decode eval: 63.72 tok/s
total: 6587 tokens / 103.2 s

Decoding speed increased by 65%, and draft acceptance approached 80%, confirming that the MTP head is well trained. Prefill overhead is modest (≈10%) and may grow with very long prompts.
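The 65% figure follows directly from the two decode rates (simple arithmetic on the numbers above):

```shell
# Decode-speed ratio: MTP on vs. MTP off, from the benchmark above.
awk 'BEGIN {
  with_mtp = 105.47    # tok/s with MTP
  without  = 63.72     # tok/s without MTP
  printf "%.2fx (%.1f%% faster)\n", with_mtp/without, (with_mtp/without - 1)*100
}'
```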
Practical Opinions
For single‑user dialogue or code generation, the speed boost is essentially free.
For long‑document summarization or RAG where prompts reach tens of thousands of tokens, prefill cost becomes significant and should be weighed.
Even the 35B‑A3B MoE model activates only ~3B parameters; after 4‑bit quantization it fits in ~20 GB, allowing single‑card deployment on 24 GB GPUs.
Why Unsloth’s Release Matters
Previously, using GGUF involved only quantization and execution. This time Unsloth also quantized the MTP head, preserving it in the GGUF file and adapting llama.cpp kernels, so the entire speculative decoding pipeline works with a single command‑line flag, without requiring users to understand EAGLE, Medusa, or Lookahead.
Conclusion
If you run Qwen3.6 locally on a 24 GB+ GPU for single‑user chat or code tasks, the MTP‑enabled GGUF provides a noticeable 65% decoding speed improvement and is a straightforward upgrade. For multi‑user servers, long‑context RAG, or multimodal workloads, wait for upstream PRs that add concurrency and mmproj support.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.