Ultimate Guide to Local Deployment of Qwen3.5 Models (27B‑397B)

This guide reviews the Qwen3.5 model lineup, explains mixed‑inference and MoE architecture, presents benchmark comparisons with GPT‑5.2, Claude 4.5 and Gemini‑3 Pro, evaluates 4‑bit and 3‑bit quantization loss, outlines hardware requirements, and provides step‑by‑step deployment options using llama.cpp or llama‑server.


Qwen3.5 Model Lineup

The latest Qwen3.5 family from Alibaba includes four variants: Dense 27B, MoE 35B‑A3B (3B active), MoE 122B‑A10B (10B active), and the flagship MoE 397B‑A17B (17B active). The tiers range from "steady" workhorse models to the flagship, positioned by total parameter count and the number of parameters activated per token.

Key Technologies: Mixed Inference and MoE

Mixed inference enables two modes: thinking, for deep reasoning, and non‑thinking, for fast dialogue. The MoE (Mixture‑of‑Experts) architecture activates only a fraction of the total parameters for each token (e.g., 17B of 397B), delivering high throughput at a fraction of the per‑token compute of a comparably capable dense model.
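
To make the activation ratios concrete, here is a short arithmetic sketch (using only the parameter counts from the lineup above) that prints the fraction of parameters active per token for each MoE variant.

# Active-parameter ratio per token for the three MoE variants in the Qwen3.5 lineup
moe_variants = {
    "Qwen3.5-35B-A3B": (35, 3),
    "Qwen3.5-122B-A10B": (122, 10),
    "Qwen3.5-397B-A17B": (397, 17),
}

for name, (total_b, active_b) in moe_variants.items():
    print(f"{name}: {active_b}B of {total_b}B parameters active per token ({active_b / total_b:.1%})")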

Benchmark Performance

On a suite of hard‑core benchmarks Qwen3.5‑397B matches or exceeds leading closed‑source models. Scores include MMLU‑Pro 87.8, GPQA Diamond 88.4, AIME26 91.3, SWE‑bench 76.4, TAU2‑Bench 86.7, IFBench 76.5 and BrowseComp 78.6, surpassing GPT‑5.2 and Claude Opus 4.5 on several tasks, especially multi‑language and agent‑oriented evaluations.

Quantization Accuracy

Third‑party testing by Benjamin Marie on 750 mixed tasks shows that the 4‑bit UD‑Q4_K_XL version loses only 0.8 percentage points (accuracy 80.5 % vs 81.3 % FP16) and the 3‑bit UD‑Q3_K_XL loses 0.6 points (80.7 %). Storage drops from ~807 GB (FP16) to ~214 GB (4‑bit) or ~160 GB (3‑bit), delivering roughly 99 % of the original performance with a quarter of the space.
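
The headline claim is easy to sanity-check from the figures above; the snippet below reproduces the "roughly 99 % of the performance in roughly a quarter of the space" arithmetic for the 4‑bit build.

# Relative accuracy and storage of the 4-bit UD-Q4_K_XL build vs FP16, using the figures cited above
fp16_acc, q4_acc = 81.3, 80.5    # accuracy on the 750-task suite, in percent
fp16_gb, q4_gb = 807, 214        # approximate storage in GB

print(f"Retained accuracy: {q4_acc / fp16_acc:.1%}")   # ~99.0%
print(f"Relative storage:  {q4_gb / fp16_gb:.1%}")     # ~26.5%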

Hardware Requirements

Qwen3.5‑27B: 4‑bit 17 GB, 8‑bit 30 GB, FP16 54 GB

Qwen3.5‑35B‑A3B: 4‑bit 22 GB, 8‑bit 38 GB, FP16 70 GB

Qwen3.5‑122B‑A10B: 4‑bit 70 GB, 8‑bit 132 GB, FP16 245 GB

Qwen3.5‑397B‑A17B: 4‑bit 214 GB, 8‑bit 512 GB, FP16 810 GB

Typical recommendations: a 24 GB GPU (e.g., RTX 4090) can run the 27B or 35B‑A3B; a Mac with roughly 70 GB of usable unified memory can handle the 122B‑A10B; a workstation with 256 GB or more of memory is needed for the 397B flagship.
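
As a rough planning aid, quantized weight storage is approximately total parameters × bits per weight ÷ 8. The helper below applies that rule of thumb; it is a lower bound that ignores the KV cache, activations, and runtime overhead, which is why the measured sizes in the table above come out somewhat higher.

def estimate_weight_size_gb(total_params_billion: float, bits_per_weight: float) -> float:
    """Rough lower bound on quantized weight storage; excludes KV cache and runtime overhead."""
    return total_params_billion * 1e9 * bits_per_weight / 8 / 1e9

# 4-bit estimates for the four variants; compare with the measured sizes listed above
for name, params_b in [("27B", 27), ("35B-A3B", 35), ("122B-A10B", 122), ("397B-A17B", 397)]:
    print(f"Qwen3.5-{name}: ~{estimate_weight_size_gb(params_b, 4):.0f} GB of weights at 4-bit")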

Model Selection Advice

Choose 27B for the highest accuracy at its size, 35B‑A3B for the fastest inference (the MoE activates only 3B parameters per token), 122B‑A10B as the sweet spot on memory‑constrained machines, and 397B‑A17B for top‑tier performance when ample RAM is available.

Minimal‑Effort Deployment Options

Option 1: Compile and Run with llama.cpp (recommended)

# Install dependencies and compile (CUDA=ON for GPU, OFF for CPU)
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
# Thinking mode (accurate coding)
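# LLAMA_CACHE overrides the directory used to cache models downloaded with -hf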
export LLAMA_CACHE="unsloth/Qwen3.5-35B-A3B-GGUF"
./llama.cpp/llama-cli -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE \
  --ctx-size 16384 --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.00
# General tasks (creative)
export LLAMA_CACHE="unsloth/Qwen3.5-35B-A3B-GGUF"
./llama.cpp/llama-cli -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE \
  --ctx-size 16384 --temp 1.0 --top-p 0.95 --top-k 20 --min-p 0.00
# Non‑thinking (fast response)
export LLAMA_CACHE="unsloth/Qwen3.5-35B-A3B-GGUF"
./llama.cpp/llama-cli -hf unsloth/Qwen3.5-35B-A3B-GGUF:MXFP4_MOE \
  --ctx-size 16384 --temp 0.7 --top-p 0.8 --top-k 20 --min-p 0.00 \
  --chat-template-kwargs "{\"enable_thinking\": false}"

Option 2: Download First, Then Run

# Install download tools (hf_transfer is an optional accelerated downloader)
pip install huggingface_hub hf_transfer
# Enable the accelerated downloader installed above
export HF_HUB_ENABLE_HF_TRANSFER=1
# Download 4‑bit Dynamic MXFP4_MOE (~22 GB)
hf download unsloth/Qwen3.5-35B-A3B-GGUF \
  --local-dir unsloth/Qwen3.5-35B-A3B-GGUF \
  --include "*MXFP4_MOE*"
# Run the model
./llama.cpp/llama-cli \
  --model unsloth/Qwen3.5-35B-A3B-GGUF/Qwen3.5-35B-A3B-MXFP4_MOE.gguf \
  --seed 3407 --temp 1.0 --top-p 0.95 --min-p 0.01 --top-k 40

Option 3: Deploy as an OpenAI‑compatible API Service (production)

# Start llama‑server (example with 397B)
./llama.cpp/llama-server \
  --model unsloth/Qwen3.5-397B-A17B-GGUF/MXFP4_MOE/Qwen3.5-397B-A17B-MXFP4_MOE-00001-of-00006.gguf \
  --alias "unsloth/Qwen3.5-397B-A17B" \
  --temp 0.6 --top-p 0.95 --ctx-size 16384 --top-k 20 --min-p 0.00 \
  --port 8001
Then call the service from any OpenAI‑compatible client, for example the official Python SDK:

from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8001/v1", api_key="sk-no-key-required")
completion = client.chat.completions.create(
    model="unsloth/Qwen3.5-397B-A17B",
    messages=[{"role": "user", "content": "Create a Snake game."}]
)
print(completion.choices[0].message.content)
⚠️ The maximum context length is 262,144 tokens; a practical output length is 32,768 tokens. For a 24 GB GPU, keep --ctx-size 16384.
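
Before wiring up clients you can check that the server has finished loading: llama-server exposes a /health endpoint, polled here with plain Python (the port matches the --port 8001 used above).

import urllib.request

# /health returns HTTP 200 with a small JSON status once the model has finished loading
with urllib.request.urlopen("http://127.0.0.1:8001/health") as resp:
    print(resp.status, resp.read().decode())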

Inference Parameter Settings

Thinking Mode (deep reasoning)

temperature: 0.6 (accurate coding) / 1.0 (general)

top_p: 0.95

top_k: 20

min_p: 0.0

presence_penalty: 0.0 (coding) / 1.5 (general)

Non‑Thinking Mode (fast response)

temperature: 0.7 (general) / 1.0 (reasoning)

top_p: 0.8 (general) / 0.95 (reasoning)

top_k: 20

min_p: 0.0

presence_penalty: 1.5
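
For reference, here is one way these settings map onto requests against the OpenAI‑compatible endpoint from Option 3. temperature, top_p, and presence_penalty are standard OpenAI fields; top_k and min_p are llama‑server sampling extensions passed through extra_body (field names assumed to match what llama-server accepts in its request schema).

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8001/v1", api_key="sk-no-key-required")

# Thinking mode, "accurate coding" profile from the table above
completion = client.chat.completions.create(
    model="unsloth/Qwen3.5-397B-A17B",
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
    temperature=0.6,
    top_p=0.95,
    presence_penalty=0.0,
    extra_body={"top_k": 20, "min_p": 0.0},  # non-standard fields forwarded to the server
)
print(completion.choices[0].message.content)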

Tool Calling and Local Agents

Qwen3.5 natively supports function calling, allowing the model to invoke Python scripts, run terminal commands, or query databases when used with llama-server. Compared with Ollama, the llama‑server + OpenAI SDK stack offers greater flexibility for production‑grade agents.
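
As an illustration of that stack, the sketch below registers a single hypothetical get_weather tool and lets the model decide whether to call it. The tool name, schema, and prompt are invented for the example, and tool‑call support generally assumes llama-server was launched with --jinja so the model's own chat template (including its tool‑call format) is applied.

from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8001/v1", api_key="sk-no-key-required")

# Hypothetical tool: a local weather lookup the agent is allowed to call
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Look up the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="unsloth/Qwen3.5-397B-A17B",
    messages=[{"role": "user", "content": "What is the weather in Hangzhou right now?"}],
    tools=tools,
)

# If the model chooses to call the tool, the arguments arrive as structured JSON
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)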

Pros and Cons

✅ MoE architecture activates only 17 B of 397 B parameters, delivering high inference efficiency.

✅ Unsloth Dynamic 2.0 quantization loses less than 1 % accuracy while cutting storage to a quarter.

✅ 256 K context window, support for 201 languages, and full multimodal coverage.

✅ Mixed‑inference mode lets users switch between deep reasoning and fast dialogue.

✅ Outperforms GPT‑5.2 and Claude Opus 4.5 on multiple benchmarks.

✅ Fully open‑source, supporting local deployment and fine‑tuning.

⚠️ Flagship 397 B requires 192 GB+ RAM, limiting accessibility.

⚠️ Still lags behind GPT‑5.2 on pure math reasoning and code‑generation benchmarks.

⚠️ MoE may be slower than dense models on pure‑CPU inference scenarios.

Who Should Use This Guide

Mac users with high‑end unified memory, Linux users with 24 GB+ GPUs, and enterprises needing on‑premise LLM deployment.

Tags: Quantization, Large Language Model, MoE, Local Deployment, Inference, llama.cpp, Qwen3.5
Written by Old Zhang's AI Learning

AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
