Boost Qwen3.6 with MTP: 1.5× Faster Local Deployment for Claude Code
This article explains how to enable Multi-Token Prediction (MTP) in Qwen3.6 using a specific llama.cpp PR branch, achieving up to 1.5× faster local inference. It covers the compilation steps, optimal parameters, memory requirements, and how to integrate the accelerated model with Claude Code while avoiding common pitfalls.
Yesterday Daniel Han (founder of UnslothAI) finished the Qwen3.6 + MTP guide, covering the required PR branch, compilation steps, parameter settings, and official benchmarks.
Qwen3.6-27B can reach 140 tokens/s and 35B-A3B 220 tokens/s, a >1.4× speedup with unchanged accuracy.
MTP: What it is and why it’s faster
In short: MTP (Multi-Token Prediction) is draft-free speculative decoding built into Qwen3.6.
Ordinary speculative decoding needs a separate draft model. Qwen3.6 embeds an MTP head that predicts several tokens at once, validates them in parallel, and emits high‑acceptance tokens, reducing forward passes.
The model predicts several future tokens in one step.
The main model verifies those tokens in parallel.
High-acceptance tokens are output directly, saving forward passes.
Benchmarks:
Dense 27B model with draft tokens = 2 achieves ~1.4× speedup.
MoE 35B‑A3B model achieves 1.15–1.2× speedup.
The official guide includes the acceleration curve and throughput comparison charts.
Why not increase draft tokens beyond 2? The --spec-draft-n-max parameter is capped at 2 because acceptance drops from 83% to 50% when draft tokens increase to 4, erasing the speed benefit: at 83% most 2-token drafts land and each forward pass emits nearly two tokens, while at 50% half the drafted work is thrown away and the extra verification cost eats the gain.
Don't be greedy.
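If you want to verify that tradeoff on your own hardware, a quick A/B timing run is enough. This is only a sketch: it assumes the PR branch's --spec-type/--spec-draft-n-max flags and the 27B model from the run commands later in this guide, and the prompt and token count are arbitrary.

# Time the same generation at draft depth 2 vs. 4; compare wall-clock time
# and the eval-rate line llama-cli prints on exit.
for n in 2 4; do
  echo "=== spec-draft-n-max=$n ==="
  time ./llama.cpp/llama-cli \
    -hf unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL \
    --spec-type mtp --spec-draft-n-max "$n" \
    -p "Summarize speculative decoding in three sentences." -n 256 >/dev/null
done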
Compilation: Use a specific llama.cpp PR branch
The easiest mistake is building from the master branch of llama.cpp. MTP support is still being merged, so you must use Aman's PR branch (ggml-org/llama.cpp#22673).
Full compile command for Linux/WSL (CUDA enabled):
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone -b mtp-clean https://github.com/am17an/llama.cpp.git
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp
For macOS/Metal, set -DGGML_CUDA=OFF; Metal is enabled by default.
⚠️ Do not use CUDA 13.2 – NVIDIA confirms a bug that produces garbled output.
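Before downloading any models, it is worth confirming the toolchain and the freshly built binaries. These are standard commands, nothing specific to the PR branch:

nvcc --version                      # check the CUDA toolkit version; avoid 13.2 per the warning above
./llama.cpp/llama-cli --version     # confirm the new binaries run
./llama.cpp/llama-server --version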
Running: Commands for 27B and 35B‑A3B
27B MTP (thinking mode, general tasks):
export LLAMA_CACHE="unsloth/Qwen3.6-27B-MTP-GGUF"
./llama.cpp/llama-cli \
-hf unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL \
--temp 1.0 --top-p 0.95 --top-k 20 \
--presence-penalty 1.5 --min-p 0.00 \
--spec-type mtp --spec-draft-n-max 2
35B-A3B MTP (non-thinking mode, server for OpenAI-compatible clients):
export LLAMA_CACHE="unsloth/Qwen3.6-35B-A3B-MTP-GGUF"
./llama.cpp/llama-server \
-hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
--temp 0.7 --top-p 0.8 --top-k 20 \
--presence-penalty 1.5 --min-p 0.00 \
--spec-type mtp --spec-draft-n-max 2 \
--chat-template-kwargs '{"enable_thinking":false}'
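Once the server is up, one request against its OpenAI-compatible endpoint confirms the pipeline end to end. A minimal sketch, assuming the default port 8080 (the command above sets no --port) and an arbitrary prompt; llama-server serves the single loaded model, so no model field is needed:

curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Say hi in five words."}], "temperature": 0.7, "top_p": 0.8}'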
Important parameters:
Thinking mode: temperature=1.0, top_p=0.95, presence_penalty=1.5.
Non-thinking mode: temperature=0.7, top_p=0.8, presence_penalty=1.5.
For precise coding tasks, use temperature=0.6/1.0 and presence_penalty=0.0.
Disable thinking with --chat-template-kwargs '{"enable_thinking":false}'.
Garbage output often means the context length is too short; try --cache-type-k bf16 --cache-type-v bf16.
Memory requirements
Memory (VRAM + system RAM) needed for each quantization level:
Qwen3.6‑27B: 2‑bit 15 GB, 4‑bit 18 GB, 8‑bit 30 GB, BF16 55 GB.
Qwen3.6‑35B‑A3B: 2‑bit 17 GB, 4‑bit 23 GB, 8‑bit 38 GB, BF16 70 GB.
If memory is insufficient, llama.cpp can offload to SSD/HDD, though with slower performance.
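To see which row of that table you can afford, check free VRAM and system RAM first. These are standard tools, nothing specific to this setup:

nvidia-smi --query-gpu=name,memory.total,memory.free --format=csv    # GPU VRAM
free -h                                                              # system RAM llama.cpp can spill into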
⚠️ Ollama currently cannot run Qwen3.6 GGUF (visual mmproj files are separate); use the llama.cpp route instead.
One more thing: Driving Claude Code with local Qwen3.6
Running the model without an agent on top wastes its potential. Unsloth's docs provide a full pipeline; the key steps are highlighted below.
Step 1: Install Claude Code
curl -fsSL https://claude.ai/install.sh | bash
cd ~/projects/my-project
claude
Step 2: Start the local llama-server (using the 35B-A3B command above)
./llama.cpp/llama-server \
--model unsloth/Qwen3.6-35B-A3B-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
--alias "unsloth/Qwen3.6-35B-A3B" \
--temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 \
--ctx-size 16384 --port 8001
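Before wiring Claude Code in, confirm the server is actually listening; llama-server exposes a simple health endpoint:

curl http://127.0.0.1:8001/health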
Step 3: Disable the Claude Code Attribution Header
Claude Code recently added an attribution header that invalidates the KV cache, slowing local inference by ~90%.
Setting export CLAUDE_CODE_ATTRIBUTION_HEADER=0 does not work; you must edit ~/.claude/settings.json and add:
{
"env": {
"CLAUDE_CODE_ATTRIBUTION_HEADER": "0"
}
}
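If you'd rather not hand-edit the file, a jq one-liner performs the same merge. A sketch assuming jq is installed and ~/.claude/settings.json already exists:

# Merge the env setting into the existing JSON, then swap the file into place.
jq '.env.CLAUDE_CODE_ATTRIBUTION_HEADER = "0"' ~/.claude/settings.json > /tmp/settings.json \
  && mv /tmp/settings.json ~/.claude/settings.json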
Step 4: Point Claude Code to the local server
export ANTHROPIC_BASE_URL="http://127.0.0.1:8001"
export ANTHROPIC_AUTH_TOKEN="sk-no-key-required"
export ANTHROPIC_MODEL="unsloth/Qwen3.6-35B-A3B"
Now claude can run agents locally. Users who prefer a graphical UI can follow Unsloth's API guide, which runs on 24 GB GPUs.
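As a final sanity check, confirm the server advertises the model name Claude Code will request; with the --alias flag from Step 2, /v1/models should echo unsloth/Qwen3.6-35B-A3B, matching ANTHROPIC_MODEL:

curl -s http://127.0.0.1:8001/v1/models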
Summary
MTP is not a new model; it is a draft‑free speculative decoding head built into Qwen3.6.
Use --spec-draft-n-max 2; larger values slow down inference.
Compile with Aman’s PR branch; the master branch lacks MTP support.
Avoid CUDA 13.2 due to known bugs.
When using Claude Code, disable the Attribution Header in ~/.claude/settings.json to keep the speed gains.
Who should use this
Local users with 24 GB GPUs: 27B MTP + Q4 quantization is a sweet spot.
Small studios or private deployments: 35B‑A3B server works well with Claude Code for daily coding agents.
Anyone dissatisfied with Ollama’s limitations: use llama.cpp + MTP directly.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.