Running Qwen3.5 Locally: Step‑by‑Step Guide with Unsloth Dynamic Quantization
This article explains how to run the 397B Qwen3.5 model on a Mac using Unsloth Dynamic 2.0 quantization (2‑bit, 3‑bit, or 4‑bit). It outlines hardware requirements, walks through compiling llama.cpp and downloading the quantized weights, shows how to launch inference in thinking and non‑thinking modes, and compares deployment options including llama-server, Transformers, SGLang/vLLM, and MLX.
The 397‑billion‑parameter Qwen3.5 model normally needs about 807 GB of storage and 800 GB of GPU memory, but Unsloth’s Dynamic 2.0 quantization lets a 192 GB‑RAM Mac run a 3‑bit version and a 256 GB‑RAM Mac run a 4‑bit version.
Dynamic 2.0 follows the AngelSlim idea of layer‑wise precision: critical layers are promoted to 8‑bit or 16‑bit while less important layers stay at lower bits, so a 4‑bit overall quantization still preserves inference quality.
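To make the layer‑wise idea concrete, here is a toy sketch in Python (illustrative only, not Unsloth's or AngelSlim's actual algorithm): bit widths are assigned per layer from an importance score.
# Toy sketch: give sensitive layers more bits, the rest the base width.
# Layer names, scores, and the 0.9 threshold are made up for illustration.
def assign_bits(layers, base_bits=4, sensitive_bits=8, threshold=0.9):
    """layers: list of (name, importance) pairs, importance in [0, 1]."""
    return {
        name: sensitive_bits if importance > threshold else base_bits
        for name, importance in layers
    }

plan = assign_bits([("token_embd", 0.99), ("blk.0.attn", 0.95), ("blk.0.ffn", 0.40)])
print(plan)  # {'token_embd': 8, 'blk.0.attn': 8, 'blk.0.ffn': 4}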
Resource requirements per quantization version:
BF16/FP16 (original): ~807 GB disk, ≥800 GB VRAM – impractical for local use.
8‑bit: ~400 GB disk, 512 GB RAM/VRAM – needs a high‑end server.
4‑bit (MXFP4/UD‑Q4_K_XL): ~214 GB disk, 256 GB RAM – fits a 256 GB M3 Ultra Mac.
3‑bit: ~150 GB disk, 192 GB RAM – fits a 192 GB Mac Studio.
2‑bit (UD‑Q2_K_XL): ~100 GB disk, 128 GB RAM – the minimum recommended configuration.
Unsloth officially recommends at least the 2‑bit Dynamic quantization, which it positions as the sweet spot between accuracy and size.
Key practical notes:
VRAM + RAM must be greater than or equal to the model file size for reasonable inference speed (a quick fit check is sketched after this list).
If memory is insufficient, offloading to SSD is possible but slower.
A single 24 GB GPU plus 256 GB RAM with MoE offloading can achieve >25 tokens/s on the 4‑bit version.
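The rule of thumb above is easy to turn into a quick sanity check; a minimal sketch using the file sizes from the table (labels are informal, not official quant names):
# File sizes (GB) from the table above; keys are informal labels.
QUANT_SIZES_GB = {"bf16": 807, "q8": 400, "mxfp4": 214, "q3": 150, "q2": 100}

def fits_in_memory(quant: str, ram_gb: float, vram_gb: float = 0.0) -> bool:
    """Rule of thumb: VRAM + RAM should be >= the model file size."""
    return ram_gb + vram_gb >= QUANT_SIZES_GB[quant]

print(fits_in_memory("mxfp4", ram_gb=256))              # True: a 256 GB Mac fits 4-bit
print(fits_in_memory("q3", ram_gb=128))                 # False: 3-bit would spill to SSD
print(fits_in_memory("mxfp4", ram_gb=256, vram_gb=24))  # True: 24 GB GPU + 256 GB RAM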
Option 1: Local deployment with llama.cpp
Compile llama.cpp (Ubuntu/Debian example):
# Install dependencies (Ubuntu/Debian)
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev -y
# Clone repository
git clone https://github.com/ggml-org/llama.cpp
# Build (GPU: -DGGML_CUDA=ON, macOS Metal: -DGGML_METAL=ON, CPU only: -DGGML_CUDA=OFF)
cmake llama.cpp -B llama.cpp/build \
-DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j \
--clean-first \
--target llama-cli llama-mtmd-cli llama-server llama-gguf-split
cp llama.cpp/build/bin/llama-* llama.cpp
macOS M‑series users should replace -DGGML_CUDA=ON with -DGGML_METAL=ON to enable Metal GPU acceleration.
Download the GGUF model files (requires huggingface_hub and hf_transfer):
# Install download tools
pip install huggingface_hub hf_transfer
# 4‑bit version (≈214 GB) for a 256 GB Mac
hf download unsloth/Qwen3.5-397B-A17B-GGUF \
--local-dir unsloth/Qwen3.5-397B-A17B-GGUF \
--include "*MXFP4_MOE*"
# 2‑bit version (≈100 GB) for a 128 GB Mac or low‑memory machines
hf download unsloth/Qwen3.5-397B-A17B-GGUF \
--local-dir unsloth/Qwen3.5-397B-A17B-GGUF \
--include "*UD-Q2_K_XL*"Running inference
Qwen3.5 supports two generation modes, each with its own recommended sampling settings:
Thinking mode (complex reasoning, math, coding) – use temp=0.6, top_p=0.95.
Non‑Thinking mode (simple chat) – use temp=0.7, top_p=0.8.
Thinking mode command:
export LLAMA_CACHE="unsloth/Qwen3.5-397B-A17B-GGUF"
./llama.cpp/llama-cli \
-hf unsloth/Qwen3.5-397B-A17B-GGUF:MXFP4_MOE \
--ctx-size 16384 \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.00
Non‑Thinking mode command (adds --chat-template-kwargs "{\"enable_thinking\": false}"):
./llama.cpp/llama-cli \
-hf unsloth/Qwen3.5-397B-A17B-GGUF:MXFP4_MOE \
--ctx-size 16384 \
--temp 0.7 \
--top-p 0.8 \
--top-k 20 \
--min-p 0.00 \
--chat-template-kwargs "{\"enable_thinking\": false}"
Other useful flags:
--threads 32 – adjust to the number of CPU cores.
--ctx-size 16384 – context window (the model supports up to 262,144 tokens).
--n-gpu-layers 2 – number of layers offloaded to the GPU; reduce if GPU memory is limited.
Option 2: API service with llama-server
./llama.cpp/llama-server \
--model unsloth/Qwen3.5-397B-A17B-GGUF/MXFP4_MOE/Qwen3.5-397B-A17B-MXFP4_MOE-00001-of-00006.gguf \
--alias "unsloth/Qwen3.5-397B-A17B" \
--temp 0.6 \
--top-p 0.95 \
--ctx-size 16384 \
--top-k 20 \
--min-p 0.00 \
--port 8001
Clients can call the service with the standard OpenAI Python library:
from openai import OpenAI
client = OpenAI(base_url="http://127.0.0.1:8001/v1", api_key="sk-no-key-required")
completion = client.chat.completions.create(
    model="unsloth/Qwen3.5-397B-A17B",
    messages=[{"role": "user", "content": "Write a quicksort in Python"}]
)
print(completion.choices[0].message.content)
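For long generations you can stream tokens as they arrive; this is the standard OpenAI streaming interface, which llama-server also implements (a minimal sketch reusing the client above):
stream = client.chat.completions.create(
    model="unsloth/Qwen3.5-397B-A17B",
    messages=[{"role": "user", "content": "Write a quicksort in Python"}],
    stream=True,  # receive incremental deltas instead of one final message
)
for chunk in stream:
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
Option 3: Direct inference with Hugging Face Transformers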
# Serve as OpenAI‑compatible API
transformers serve --model Qwen/Qwen3.5-397B-A17B --port 8000 --continuous-batching
# Or chat directly from the command line
transformers chat Qwen/Qwen3.5-397B-A17B
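You can also load the model directly in Python. A minimal sketch, assuming Qwen3.5 keeps the same Transformers interface as earlier Qwen releases (the enable_thinking chat-template switch is carried over from Qwen3 and may differ here):
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3.5-397B-A17B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Write a quicksort in Python"}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs, max_new_tokens=1024, do_sample=True, temperature=0.6, top_p=0.95, top_k=20
)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
Option 4: High‑performance deployment with SGLang or vLLM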
These solutions target production environments and require at least eight A100/H100 GPUs.
# SGLang
python -m sglang.launch_server \
--model-path Qwen/Qwen3.5-397B-A17B \
--port 8000 \
--tensor-parallel-size 8 \
--context-length 262144 \
--reasoning-parser qwen3
# vLLM
vllm serve Qwen/Qwen3.5-397B-A17B \
--port 8000 \
--tensor-parallel-size 8 \
--max-model-len 262144 \
--reasoning-parser qwen3
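With --reasoning-parser enabled, the server separates the model's thinking trace from the final answer in its responses. A minimal client sketch (the reasoning_content field follows the vLLM convention; the exact field name may vary by version):
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="EMPTY")
completion = client.chat.completions.create(
    model="Qwen/Qwen3.5-397B-A17B",
    messages=[{"role": "user", "content": "How many primes are there below 30?"}],
)
message = completion.choices[0].message
print("reasoning:", message.reasoning_content)  # the parsed thinking trace
print("answer:", message.content)               # the final answer only
Option 5: Apple Silicon‑only deployment with MLX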
# Text‑only inference
pip install mlx-lm
mlx_lm.chat --model Qwen/Qwen3.5-397B-A17B
# Vision + text (native multimodal)
pip install mlx-vlm
mlx_vlm.chat --model Qwen/Qwen3.5-397B-A17B
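mlx-lm also exposes a small Python API if you prefer scripting over the CLI; a minimal sketch (whether this checkpoint loads as-is or requires an MLX-converted variant, e.g. from the mlx-community org, is not confirmed here):
from mlx_lm import load, generate

model, tokenizer = load("Qwen/Qwen3.5-397B-A17B")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a quicksort in Python"}],
    add_generation_prompt=True,
)
generate(model, tokenizer, prompt=prompt, max_tokens=512, verbose=True)
Tool‑calling (function calling) example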
import json
from openai import OpenAI
tools = [{
    "type": "function",
    "function": {
        "name": "add_numbers",
        "description": "Add two numbers together",
        "parameters": {
            "type": "object",
            "properties": {
                "a": {"type": "number", "description": "First number"},
                "b": {"type": "number", "description": "Second number"}
            },
            "required": ["a", "b"]
        }
    }
}]

def add_numbers(a, b):
    return a + b

MAP_FN = {"add_numbers": add_numbers}

client = OpenAI(base_url="http://127.0.0.1:8001/v1", api_key="sk-no-key-required")
response = client.chat.completions.create(
    model="unsloth/Qwen3.5-397B-A17B",
    messages=[{"role": "user", "content": "What is 123456 plus 789012?"}],
    tools=tools,
    tool_choice="auto",
)
for tool_call in response.choices[0].message.tool_calls:
    fn_name = tool_call.function.name
    args = json.loads(tool_call.function.arguments)
    result = MAP_FN[fn_name](**args)
    print(f"Tool call: {fn_name}{args} = {result}")
Thinking mode – temperature 0.6, top_p 0.95, top_k 20, min_p 0, repeat_penalty 1.0, max context 262,144, max output 32,768 tokens.
Non‑Thinking mode – temperature 0.7, top_p 0.8, top_k 20, min_p 0, repeat_penalty 1.0, max context 262,144, max output 32,768 tokens.
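These presets are easy to apply client-side. top_k and min_p are not part of the OpenAI request schema, but llama-server accepts them as extra body fields; a sketch, assuming the server from Option 2 is running:
from openai import OpenAI

# The two presets above as reusable dicts (names are illustrative).
SAMPLING = {
    "thinking": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0},
    "non_thinking": {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "min_p": 0.0},
}

client = OpenAI(base_url="http://127.0.0.1:8001/v1", api_key="sk-no-key-required")
preset = SAMPLING["thinking"]
completion = client.chat.completions.create(
    model="unsloth/Qwen3.5-397B-A17B",
    messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
    temperature=preset["temperature"],
    top_p=preset["top_p"],
    extra_body={"top_k": preset["top_k"], "min_p": preset["min_p"]},
)
print(completion.choices[0].message.content)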
Choosing the right solution
Mac hobbyist (128‑192 GB unified memory) – use llama.cpp with 2‑bit or 3‑bit GGUF.
Mac power user (256 GB M3/M4 Ultra) – use llama.cpp with MXFP4 (4‑bit).
Personal GPU server (1 × 24 GB GPU + 256 GB RAM) – use llama.cpp + llama-server.
Production environment – deploy with SGLang or vLLM (≥8 × A100/H100).
Pure‑CPU setup – run llama.cpp without GPU flags (requires 256 GB+ RAM, slower).
Running a model that can compete with a hypothetical GPT‑5.2 on a single high‑end Mac was unimaginable two years ago. Thanks to Unsloth Dynamic 2.0 quantization and the MoE sparse‑activation architecture, the barrier is now a "high‑spec Mac" rather than a multi‑node GPU cluster.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.