How to Deploy and Fine‑Tune Qwen3.5 Small Models (0.8B‑9B) Locally
This guide walks you through deploying Qwen3.5's 0.8B, 2B, 4B and 9B models on CPUs or modest GPUs using Unsloth's GGUF quantization, explains hardware requirements, shows how to run them with llama.cpp, llama‑server, vLLM or SGLang, and provides a free Colab fine‑tuning workflow with export options.
Introduction
The author presents a step‑by‑step tutorial for the Qwen3.5 small‑model series (0.8B, 2B, 4B, 9B). Because the official weights are distributed as full‑precision safetensors on Hugging Face, which are best suited to high‑end GPUs, the Unsloth team released GGUF‑quantized versions that run efficiently on ordinary CPUs and consumer‑grade GPUs.
Why GGUF?
Unsloth’s Dynamic 2.0 quantization keeps important layers (e.g., attention weights) at higher precision (8‑bit or 16‑bit) while aggressively compressing less critical layers. The result is a 4‑bit model whose performance is almost indistinguishable from the original FP16 checkpoint.
Memory Requirements (Quick Reference)
0.8B / 2B : runs on almost any device with ~3 GB RAM + VRAM.
4B (Q4 quant) : needs ~7 GB; a MacBook Air M1 with 8 GB RAM can handle it.
9B (Q4 quant) : needs ~9 GB; a 16 GB MacBook Pro or a GPU with 12 GB VRAM runs it comfortably.
Thus a 9B model in Q4 quant can outperform many 80B models while fitting on a laptop.
Choosing a Quantization Version
UD‑Q4_K_XL (recommended): best balance of size and accuracy, negligible precision loss.
Q4_K_M: classic 4‑bit quantization with the widest compatibility, small loss.
UD‑Q2_K_XL: extreme memory saving, acceptable loss for very constrained devices.
Q8_0: near‑full precision, requires more memory.
Unsloth’s KL‑divergence tests place UD‑Q4_K_XL on the size‑versus‑accuracy Pareto front, i.e., state of the art for its footprint.
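If you are unsure which variants exist for a given size, you can list the GGUF files in the repository before downloading. A minimal sketch using huggingface_hub (the repo name matches the download step in Method 1; swap 9B for the other sizes):
from huggingface_hub import list_repo_files

# List every GGUF variant published in the repo (name as used later in this guide).
files = [f for f in list_repo_files("unsloth/Qwen3.5-9B-GGUF") if f.endswith(".gguf")]
print("\n".join(files))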
Method 1 – Run Directly with llama.cpp (Recommended)
1. Compile llama.cpp
# Clone the latest repository
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
# macOS / CPU build
cmake -B build -DGGML_CUDA=OFF
cmake --build build --config Release -j
# If you have an NVIDIA GPU, enable CUDA:
# cmake -B build -DGGML_CUDA=ON
# cmake --build build --config Release -j
2. Download a GGUF model
pip install huggingface_hub hf_transfer
# Example: download the 9B Q4_K_M version
huggingface-cli download unsloth/Qwen3.5-9B-GGUF \
--include "Qwen3.5-9B-Q4_K_M.gguf" \
--local-dir ./models
Replace 9B with 0.8B, 2B or 4B to get other sizes.
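If you prefer to script the download instead of using the CLI, the same file can be fetched from Python. A minimal sketch with huggingface_hub (repo and file names as in the command above):
from huggingface_hub import hf_hub_download

# Downloads the 9B Q4_K_M file into ./models (same target as the CLI command).
path = hf_hub_download(
    repo_id="unsloth/Qwen3.5-9B-GGUF",
    filename="Qwen3.5-9B-Q4_K_M.gguf",
    local_dir="./models",
)
print(path)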
3. Interactive chat (Non‑Thinking mode, default)
./build/bin/llama-cli \
-m ./models/Qwen3.5-9B-Q4_K_M.gguf \
--ctx-size 16384 \
-cnv
That launches a simple REPL.
4. Enable Thinking mode
The small models ship with Thinking disabled. To enable it, start llama-server with an extra JSON argument:
./build/bin/llama-server \
-m ./models/Qwen3.5-9B-Q4_K_M.gguf \
--ctx-size 16384 \
--chat-template-kwargs '{"enable_thinking":true}'
Now the model can emit a <think>…</think> reasoning chain.
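With Thinking enabled, the reasoning chain arrives inline with the answer. If your application needs to show or strip it, a small parser is enough. A minimal sketch, assuming the reasoning is wrapped exactly in <think>…</think> tags:
import re

def split_think(reply: str):
    """Separate the <think>…</think> reasoning chain from the final answer."""
    match = re.search(r"<think>(.*?)</think>", reply, flags=re.DOTALL)
    if not match:
        return "", reply.strip()
    return match.group(1).strip(), reply[match.end():].strip()

thinking, answer = split_think("<think>Check the base case first.</think>Here is the function…")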
Method 2 – Deploy as an OpenAI‑compatible API with llama-server
Start the server (Non‑Thinking is recommended for everyday use):
# Non‑Thinking (default)
./build/bin/llama-server \
-m ./models/Qwen3.5-9B-Q4_K_M.gguf \
--ctx-size 16384 \
--port 8080 \
--n-gpu-layers 35
# Thinking mode (optional)
./build/bin/llama-server \
-m ./models/Qwen3.5-9B-Q4_K_M.gguf \
--ctx-size 16384 \
--port 8080 \
--n-gpu-layers 35 \
--chat-template-kwargs '{"enable_thinking":true}'
Consume the service from any OpenAI SDK. Example with the Python client:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")
response = client.chat.completions.create(
model="Qwen3.5-9B",
messages=[{"role": "user", "content": "Write a quicksort in Python"}],
temperature=0.7,
top_p=0.8,
max_tokens=4096,
)
print(response.choices[0].message.content)
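For interactive front ends you will usually want tokens as they are generated rather than one final message. The same OpenAI client supports streaming; a minimal sketch against the server started above:
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")
stream = client.chat.completions.create(
    model="Qwen3.5-9B",
    messages=[{"role": "user", "content": "Explain quicksort in two sentences"}],
    stream=True,  # receive incremental chunks instead of one final message
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()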
Method 3 – GPU‑only Path with vLLM or SGLang
If you have a discrete GPU (e.g., RTX 3060 12 GB), you can run the original FP16 weights without quantization:
# vLLM deployment
vllm serve Qwen/Qwen3.5-9B \
--port 8000 \
--tensor-parallel-size 1 \
--max-model-len 32768 \
--reasoning-parser qwen3
# SGLang deployment
python -m sglang.launch_server \
--model-path Qwen/Qwen3.5-9B \
--port 8000 \
--tp-size 1 \
--mem-fraction-static 0.8 \
--context-length 32768 \
--reasoning-parser qwen3
Advantages over GGUF: zero precision loss, faster GPU inference, higher concurrency, and multi‑GPU tensor parallelism. The prerequisite is a capable GPU.
Recommended Sampling Parameters
Unsloth and Qwen provide a recommended default sampling configuration (presented as an image in the original article, not reproduced here); start from those values and adjust as needed.
Advanced: Free Fine‑Tuning on Google Colab
Unsloth supplies ready‑to‑run Colab notebooks for each model size (0.8B, 2B, 4B, 9B). Opening a notebook launches a free T4 GPU, allowing you to train a personalized model without any local GPU.
Local fine‑tuning workflow (if you prefer your own machine)
pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo
Minimal SFT script (Python):
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
max_seq_length = 2048 # start small, then increase
url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
dataset = load_dataset("json", data_files={"train": url}, split="train")
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="Qwen/Qwen3.5-9B",
max_seq_length=max_seq_length,
load_in_4bit=True, # 4‑bit QLoRA saves VRAM
full_finetuning=False,
)
model = FastLanguageModel.get_peft_model(
model,
r=16,
target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
lora_alpha=16,
lora_dropout=0,
bias="none",
use_gradient_checkpointing="unsloth",
random_state=3407,
max_seq_length=max_seq_length,
)
trainer = SFTTrainer(
model=model,
train_dataset=dataset,
tokenizer=tokenizer,
args=SFTConfig(
max_seq_length=max_seq_length,
per_device_train_batch_size=1,
gradient_accumulation_steps=4,
warmup_steps=10,
max_steps=100, # quick sanity run
logging_steps=1,
output_dir="outputs_qwen35",
optim="adamw_8bit",
seed=3407,
),
)
trainer.train()
Tips for limited VRAM:
Set per_device_train_batch_size to 1.
Reduce max_seq_length (e.g., from 2048 to 1024).
Keep use_gradient_checkpointing="unsloth" enabled – it dramatically cuts memory usage while allowing longer contexts.
Even a single 12 GB T4 can fine‑tune the 9B model in 4‑bit mode.
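To check whether a configuration actually fits your card, print PyTorch's peak‑memory counter after a short run. A quick sketch (CUDA only):
import torch

if torch.cuda.is_available():
    peak_gb = torch.cuda.max_memory_reserved() / 1024**3
    print(f"Peak reserved VRAM during training: {peak_gb:.1f} GB")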
Visual Fine‑Tuning
Qwen3.5 is a multimodal model. Unsloth also supports vision fine‑tuning via FastVisionModel:
from unsloth import FastVisionModel
model = FastVisionModel.get_peft_model(
model,
finetune_vision_layers=True,
finetune_language_layers=True,
finetune_attention_modules=True,
finetune_mlp_modules=True,
r=16,
lora_alpha=16,
lora_dropout=0,
bias="none",
random_state=3407,
target_modules="all-linear",
)
You can choose to fine‑tune only vision layers, only language layers, or both.
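The snippet above assumes the model has already been loaded. Loading mirrors the text‑only script; a minimal sketch, assuming the same checkpoint can be opened with FastVisionModel (flags as in the SFT example):
from unsloth import FastVisionModel

# 4-bit load keeps VRAM usage low, matching the text-only fine-tuning setup.
model, tokenizer = FastVisionModel.from_pretrained(
    "Qwen/Qwen3.5-9B",
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",
)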
Exporting After Fine‑Tuning
GGUF for llama.cpp / Ollama / LM Studio
# Export as Q4_K_M GGUF
model.save_pretrained_gguf("my_model", tokenizer, quantization_method="q4_k_m")
# Or export as Q8_0 GGUF
model.save_pretrained_gguf("my_model", tokenizer, quantization_method="q8_0")
# Push to HuggingFace (optional)
model.push_to_hub_gguf("your-username/my_model", tokenizer, quantization_method="q4_k_m")
16‑bit merged checkpoint for vLLM
model.save_pretrained_merged("finetuned_model", tokenizer, save_method="merged_16bit")
model.push_to_hub_merged("your-username/model", tokenizer, save_method="merged_16bit", token="")
Save only the LoRA adapter (tiny size)
model.save_pretrained("finetuned_lora")
tokenizer.save_pretrained("finetuned_lora")
The full workflow is: free Colab fine‑tuning → export GGUF → run locally with llama.cpp. No cost.
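To test the saved adapter locally without merging, it can be loaded back on top of the base model with the standard peft API. A minimal sketch (base‑model name as in the fine‑tuning script above):
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3.5-9B", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("finetuned_lora")  # tokenizer saved with the adapter
model = PeftModel.from_pretrained(base, "finetuned_lora")    # attaches the LoRA weights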
Key Fine‑Tuning Pitfalls
To retain reasoning ability, make sure at least 75 % of your training samples contain a <think>…</think> block.
If the exported model’s quality drops, the most common cause is a mismatch between the chat template or EOS token used at inference and the one used during training; Unsloth warns about this automatically (see the sketch after this list).
vLLM version note: as of this writing, vLLM 0.16.0 does not support Qwen3.5; support arrives in 0.17.0 or the nightly builds.
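A simple way to avoid the template mismatch described above is to build inference prompts with the very tokenizer that was saved alongside the fine‑tuned model. A minimal sketch using the standard transformers API:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("finetuned_lora")  # the tokenizer saved during training

# Uses the exact chat template (and EOS token) the model was fine-tuned with.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Summarize this file"}],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)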
Advanced: Using the Model with Claude Code or OpenAI Codex
After launching llama-server, point the client to the local endpoint:
export OPENAI_BASE_URL=http://localhost:8080/v1
# Then configure Claude Code or OpenAI Codex to use that base URL.
A 9B model can power a local coding assistant without any API fees.
Advanced: Extending Context Length to One Million Tokens
Qwen3.5‑9B natively supports 262 k tokens. To process longer texts (e.g., whole books), enable YaRN in vLLM:
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve Qwen/Qwen3.5-9B \
--hf-overrides '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11, 11, 10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}' \
--max-model-len 1010000
A 9B model handling a million‑token context is remarkable for a single‑GPU setup.
Model‑Selection Cheat Sheet
Raspberry Pi / IoT : 0.8B + Q4_K_M, ~5 GB.
Phone / Light laptop : 2B + Q4_K_M, ~5 GB.
MacBook Air 8 GB : 4B + UD‑Q4_K_XL, ~7 GB.
MacBook Pro 16 GB / 12 GB GPU : 9B + UD‑Q4_K_XL, ~9 GB.
Extreme lightweight : 0.8B + UD‑Q2_K_XL, ~3 GB.
The author’s personal favorite is the 9B model quantized to Q4, which scores 81.7 on the GPQA Diamond benchmark and fits on an ordinary laptop.
Conclusion
Low barrier to entry: 3 GB of RAM runs the 0.8B model; 9 GB runs the 9B.
Reliable accuracy: Dynamic 2.0 Q4 quantization is virtually lossless.
Complete toolchain: llama.cpp, vLLM, SGLang, and Unsloth’s fine‑tuning suite.
Rich scenarios: chat, agents, code generation, million‑token documents.
Free fine‑tuning: Google Colab T4 GPU.
Closed‑loop export: fine‑tuned model → GGUF → local inference.
Relevant links:
Unsloth deployment guide: https://unsloth.ai/docs/models/qwen3.5
Unsloth fine‑tuning guide: https://unsloth.ai/docs/models/qwen3.5/fine-tune
GGUF collection: https://huggingface.co/collections/unsloth/qwen35
Qwen3.5‑9B model card: https://huggingface.co/Qwen/Qwen3.5-9B
llama.cpp repository: https://github.com/ggml-org/llama.cpp