Boost Qwen3.6 with MTP: 1.5× Faster Local Deployment for Claude Code

This article explains how to enable Multi‑Token Prediction (MTP) in Qwen3.6 using a specific llama.cpp PR branch, achieving up to 1.5× faster local inference. It covers compilation steps, recommended parameters, memory requirements, and how to drive Claude Code with the accelerated model while avoiding common pitfalls.

Old Zhang's AI Learning

Yesterday, Daniel Han (founder of UnslothAI) completed the Qwen3.6 + MTP guide, covering the required PR branch, compilation steps, parameter settings, and official benchmarks.

Qwen3.6 27B can reach 140 tokens/s and 35B‑A3B 220 tokens/s, a >1.4× speedup with unchanged accuracy.

MTP: What it is and why it’s faster

In short: MTP (Multi‑Token Prediction) is a draft‑free form of speculative decoding built into Qwen3.6.

Ordinary speculative decoding needs a separate draft model. Qwen3.6 embeds an MTP head that predicts several tokens at once, validates them in parallel, and emits high‑acceptance tokens, reducing forward passes.

1. The model predicts several future tokens in one step.

2. The main model verifies those tokens in parallel.

3. High‑acceptance tokens are output directly, saving forward passes.

Benchmarks:

Dense 27B model with draft tokens = 2 achieves ~1.4× speedup.

MoE 35B‑A3B model achieves 1.15–1.2× speedup.

Official acceleration curve and throughput comparison are shown in the images below.

[Image: Qwen3.6 MTP acceleration curve]
[Image: Qwen3.6 MTP throughput comparison]

Why not increase draft tokens beyond 2? The --spec-draft-n-max parameter is capped at 2 because acceptance drops from 83% to 50% when draft tokens increase to 4, erasing the speed benefit.

Don’t be greedy.
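You can see the trade‑off with a back‑of‑the‑envelope model (my own simplification, not from the guide): if each draft token is accepted independently with probability p, the expected tokens emitted per main‑model forward pass is roughly 1 + p + p² + … + pᵏ for k draft tokens. Plugging in the 83% and 50% acceptance figures above:

```shell
# Expected tokens per forward pass under a simple independence model.
# (Treats the reported acceptance rates as per-token probabilities,
# which is an approximation.)
awk 'BEGIN {
  k = 2; p = 0.83; e = 1; t = 1
  for (i = 1; i <= k; i++) { t *= p; e += t }
  printf "draft=2, accept=0.83 -> %.2f tokens/pass\n", e

  k = 4; p = 0.50; e = 1; t = 1
  for (i = 1; i <= k; i++) { t *= p; e += t }
  printf "draft=4, accept=0.50 -> %.2f tokens/pass\n", e
}'
```

Under this model, doubling the draft length actually lowers the expected tokens per pass (about 2.52 vs 1.94), while each pass also has to verify twice as many drafts.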

Compilation: Use a specific llama.cpp PR branch

The easiest pitfall is trying to build with the master branch of llama.cpp. MTP support is still being merged, so you must use Aman’s PR branch (ggml‑org/llama.cpp#22673).

Full compile command for Linux/WSL (CUDA enabled):

apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone -b mtp-clean https://github.com/am17an/llama.cpp.git
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp

For macOS/Metal, set -DGGML_CUDA=OFF; Metal is enabled by default.

⚠️ Do not use CUDA 13.2 – NVIDIA confirms a bug that produces garbled output.

Running: Commands for 27B and 35B‑A3B

27B MTP (thinking mode, general tasks):

export LLAMA_CACHE="unsloth/Qwen3.6-27B-MTP-GGUF"
./llama.cpp/llama-cli \
    -hf unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL \
    --temp 1.0 --top-p 0.95 --top-k 20 \
    --presence-penalty 1.5 --min-p 0.00 \
    --spec-type mtp --spec-draft-n-max 2

35B‑A3B MTP (non‑thinking mode, server for OpenAI‑compatible clients):

export LLAMA_CACHE="unsloth/Qwen3.6-35B-A3B-MTP-GGUF"
./llama.cpp/llama-server \
    -hf unsloth/Qwen3.6-35B-A3B-MTP-GGUF:UD-Q4_K_XL \
    --temp 0.7 --top-p 0.8 --top-k 20 \
    --presence-penalty 1.5 --min-p 0.00 \
    --spec-type mtp --spec-draft-n-max 2 \
    --chat-template-kwargs '{"enable_thinking":false}'

Important parameters:

Thinking mode: temperature=1.0, top_p=0.95, presence_penalty=1.5.

Non‑thinking mode: temperature=0.7, top_p=0.8, presence_penalty=1.5.

For precise coding tasks, use temperature=0.6/1.0 and presence_penalty=0.0.

Disable thinking with --chat-template-kwargs '{"enable_thinking":false}'.

Garbage output often means the context length is too short; try:

--cache-type-k bf16 --cache-type-v bf16

Memory requirements

Memory (VRAM + system RAM) needed for each quantization level:

Qwen3.6‑27B: 2‑bit 15 GB, 4‑bit 18 GB, 8‑bit 30 GB, BF16 55 GB.

Qwen3.6‑35B‑A3B: 2‑bit 17 GB, 4‑bit 23 GB, 8‑bit 38 GB, BF16 70 GB.

If memory is insufficient, llama.cpp can offload to SSD/HDD, though with slower performance.
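If VRAM alone is the bottleneck, you can also keep only part of the model on the GPU. A sketch (the layer count of 20 is illustrative, not from the guide; tune it to your card, and note llama.cpp mmaps weights from disk by default, so layers left on the CPU side can spill to SSD when RAM runs short):

```shell
# Offload only 20 transformer layers to the GPU; the rest stay in
# system RAM and, via mmap, on disk.
./llama.cpp/llama-cli \
    -hf unsloth/Qwen3.6-27B-GGUF:UD-Q4_K_XL \
    --n-gpu-layers 20 \
    --spec-type mtp --spec-draft-n-max 2
```

These are setup commands to adapt interactively, so there is no fixed expected output; raise --n-gpu-layers until you run out of VRAM, then back off.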

⚠️ Ollama currently cannot run Qwen3.6 GGUF (visual mmproj files are separate); use the llama.cpp route instead.

One more thing: Driving Claude Code with local Qwen3.6

Running the model without an agent wastes its potential. Unsloth's docs provide a full pipeline; the key steps are highlighted below.

Step 1: Install Claude Code

curl -fsSL https://claude.ai/install.sh | bash
cd ~/projects/my-project
claude

Step 2: Start local llama‑server (using the 35B‑A3B command above)

./llama.cpp/llama-server \
  --model unsloth/Qwen3.6-35B-A3B-GGUF/Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf \
  --alias "unsloth/Qwen3.6-35B-A3B" \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 \
  --ctx-size 16384 --port 8001
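Before wiring up Claude Code, confirm the server is actually up. llama.cpp's server exposes a health endpoint and an OpenAI‑compatible model list (port 8001 as set above); these are interactive checks against a live server, so run them in a second terminal:

```shell
# Returns {"status":"ok"} once the model has finished loading.
curl -s http://127.0.0.1:8001/health

# Lists the served model under the name given by --alias.
curl -s http://127.0.0.1:8001/v1/models
```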

Step 3: Disable the Claude Code Attribution Header

Claude Code recently added an Attribution Header that invalidates the KV cache, slowing local inference by ~90%.

Setting export CLAUDE_CODE_ATTRIBUTION_HEADER=0 does not work; you must edit ~/.claude/settings.json and add:

{
  "env": {
    "CLAUDE_CODE_ATTRIBUTION_HEADER": "0"
  }
}
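If the file does not exist yet, a minimal way to create it from the shell (this only handles the file‑missing case; if you already have a settings.json with other keys, merge the "env" block in by hand instead of overwriting):

```shell
# Create ~/.claude/settings.json with the attribution-header override,
# but only if no settings file exists yet.
mkdir -p "$HOME/.claude"
if [ ! -f "$HOME/.claude/settings.json" ]; then
  cat > "$HOME/.claude/settings.json" <<'EOF'
{
  "env": {
    "CLAUDE_CODE_ATTRIBUTION_HEADER": "0"
  }
}
EOF
fi
```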

Step 4: Point Claude Code to the local server

export ANTHROPIC_BASE_URL="http://127.0.0.1:8001"
export ANTHROPIC_AUTH_TOKEN="sk-no-key-required"
export ANTHROPIC_MODEL="unsloth/Qwen3.6-35B-A3B"

Claude Code will now run its agents against the local model. Users who prefer a graphical UI can follow Unsloth's API guide, which runs on 24 GB GPUs.

Summary

MTP is not a new model; it is a draft‑free speculative decoding head built into Qwen3.6.

Use --spec-draft-n-max 2; larger values slow down inference.

Compile with Aman’s PR branch; the master branch lacks MTP support.

Avoid CUDA 13.2 due to known bugs.

When using Claude Code, disable the Attribution Header in ~/.claude/settings.json to keep the speed gains.

Who should use this

Local users with 24 GB GPUs: 27B MTP + Q4 quantization is a sweet spot.

Small studios or private deployments: 35B‑A3B server works well with Claude Code for daily coding agents.

Anyone dissatisfied with Ollama’s limitations: use llama.cpp + MTP directly.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: local deployment, MTP, llama.cpp, Claude Code, LLM acceleration, Qwen3.6
Written by Old Zhang's AI Learning

AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
