How Unsloth’s MTP Boosts Qwen3.6 Inference Speed on Consumer GPUs

Unsloth has released MTP-enabled GGUF builds of Qwen3.6‑27B and 35B‑A3B that deliver 1.5‑2× decoding speed gains on consumer‑grade GPUs with ~80% draft acceptance. This article covers installation, usage parameters, benchmark results, and guidance on suitable scenarios.


Introduction

Unsloth released Qwen3.6‑27B‑MTP‑GGUF and Qwen3.6‑35B‑A3B‑MTP‑GGUF. MTP (Multi‑Token Prediction) trains the model to predict several future tokens, uses those predictions as a draft during inference, and validates them with the main model, eliminating the need for a separate draft model and saving memory.
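For contrast, classic speculative decoding in llama.cpp loads a separate small draft model alongside the main one. A sketch of that older setup (the model file names here are hypothetical; -md and --draft-max are mainline llama.cpp flags):

# Two-model speculative decoding: the draft model occupies extra memory
# and must be vocabulary-compatible with the main model.
./llama.cpp/llama-server -m qwen-main.gguf -md qwen-draft.gguf --draft-max 3

MTP folds the drafting into the main model's own prediction head, which is what removes this extra memory cost.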

Core Highlights

Decoding speed improves ~1.5‑2× (Unsloth’s reported numbers).

Draft acceptance rate around 80%.

Prefill stage incurs a modest overhead, especially for long contexts.

Supports two model sizes: 27B dense and 35B‑A3B (256 experts, 8+1 active).

Installation

Prerequisite: use the mtp-clean branch of llama.cpp (PR #22673 at the time of writing).

apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y

git clone -b mtp-clean https://github.com/am17an/llama.cpp.git
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp
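After the build finishes, a quick sanity check (assuming the standard llama.cpp --version flag is available on this branch) confirms the copied binaries work:

# Should print the build number and commit of the mtp-clean branch
./llama.cpp/llama-cli --version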

CPU / Mac Metal users should replace -DGGML_CUDA=ON with -DGGML_CUDA=OFF.
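For reference, the CPU-only configure and build steps then look like this (on Apple silicon, the Metal backend is enabled by default, so no extra flag is needed):

cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=OFF
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-server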

Usage

Run the 27B version (recommended configuration):

export LLAMA_CACHE="unsloth/Qwen3.6-27B-GGUF-MTP"
./llama.cpp/llama-server \
    -hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL \
    -ngl 99 -c 8192 -fa on -np 1 \
    --spec-type mtp --spec-draft-n-max 3

Run the 35B‑A3B (MoE) version:

export LLAMA_CACHE="unsloth/Qwen3.6-35B-A3B-GGUF-MTP"
./llama.cpp/llama-server \
    -hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
    -ngl 99 -c 8192 -fa on -np 1 \
    --spec-type mtp --spec-draft-n-max 3

Key parameters:

--spec-type mtp: enables MTP speculative decoding.

--spec-draft-n-max 3: drafts up to three tokens per step; beyond that, the marginal gain diminishes.

Known limitations:

-np > 1 (parallel slots) is not yet supported.

--mmproj (multimodal) is not yet supported.

Consequently, MTP is currently best for single‑user, local, pure‑text scenarios.
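Once either server is up, a quick request verifies end-to-end generation. llama-server exposes an OpenAI-compatible API on port 8080 by default; the prompt and token budget below are placeholders:

curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Write a short haiku about GPUs."}], "max_tokens": 128}'

With MTP active, the server log should also report drafted and accepted token counts, similar to the acceptance figures in the benchmarks below.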

Benchmarks

A community test ran Qwen3.6‑27B quantized to Q4_0 (KV cache also Q4_0) on a single RTX 5090, using the prompt “write a flappy bird clone”.

With MTP enabled:

prompt eval: 253.34 tok/s
decode eval: 105.47 tok/s
draft acceptance rate: 79.7% (4169 / 5229)
total: 5929 tokens / 56.1 s

Without MTP (same model and config):

prompt eval: 174.20 tok/s
decode eval: 63.72 tok/s
total: 6587 tokens / 103.2 s

Decoding speed increased by ~65% (105.47 vs 63.72 tok/s), and draft acceptance approached 80%, confirming that the MTP head is well trained. Notably, prefill was not slower in this short-prompt run; the modest (≈10%) prefill overhead noted earlier is expected to show up mainly with very long prompts.
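The headline numbers are easy to re-derive from the raw figures above (awk used purely as a calculator):

awk 'BEGIN { printf "decode speedup:  %.2fx\n", 105.47 / 63.72 }'     # ~1.66x, i.e. +65%
awk 'BEGIN { printf "acceptance rate: %.1f%%\n", 4169 / 5229 * 100 }' # ~79.7%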

Practical Opinions

For single‑user dialogue or code generation, the speed boost is essentially free.

For long‑document summarization or RAG where prompts reach tens of thousands of tokens, prefill cost becomes significant and should be weighed.

Even the 35B‑A3B MoE model activates only ~3B parameters; after 4‑bit quantization it fits in ~20 GB, allowing single‑card deployment on 24 GB GPUs.
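A back-of-the-envelope check supports the ~20 GB figure, assuming roughly 4.5 bits per weight for a Q4_K-class quant (the actual UD-Q4_K_XL file size may differ):

# 35B weights at ~4.5 bits/weight; KV cache and compute buffers come on top
awk 'BEGIN { printf "weights: %.1f GB\n", 35e9 * 4.5 / 8 / 1e9 }'   # ≈ 19.7 GB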

Why Unsloth’s Release Matters

Previous GGUF releases involved only quantizing weights and running them. This time Unsloth also quantized the MTP head, preserved it in the GGUF file, and adapted the llama.cpp kernels, so the entire speculative decoding pipeline works with a single command-line flag, without requiring users to understand EAGLE, Medusa, or Lookahead.

Conclusion

If you run Qwen3.6 locally on a 24 GB+ GPU for single‑user chat or code tasks, the MTP‑enabled GGUF provides a noticeable 65% decoding speed improvement and is a straightforward upgrade. For multi‑user servers, long‑context RAG, or multimodal workloads, wait for upstream PRs that add concurrency and mmproj support.

