
How to Deploy and Fine‑Tune Qwen3.5 Small Models (0.8B‑9B) Locally

This guide walks you through deploying Qwen3.5's 0.8B, 2B, 4B and 9B models on CPUs or modest GPUs using Unsloth's GGUF quantization, explains hardware requirements, shows how to run them with llama.cpp, llama‑server, vLLM or SGLang, and provides a free Colab fine‑tuning workflow with export options.


Introduction

The author presents a step‑by‑step tutorial for the Qwen3.5 small‑model series (0.8B, 2B, 4B, 9B). Because the official weights are distributed in HuggingFace safetensors format, which targets high‑end GPUs, the Unsloth team released GGUF‑quantized versions that run efficiently on ordinary CPUs and consumer‑grade GPUs.

Why GGUF?

Unsloth’s Dynamic 2.0 quantization keeps important layers (e.g., attention weights) at higher precision (8‑bit or 16‑bit) while aggressively compressing less critical layers. The result is a 4‑bit model whose performance is almost indistinguishable from the original FP16 checkpoint.

Memory Requirements (Quick Reference)

0.8B / 2B : runs on almost any device with ~3 GB of combined RAM and VRAM.

4B (Q4 quant) : needs ~7 GB; a MacBook Air M1 with 8 GB RAM can handle it.

9B (Q4 quant) : needs ~9 GB; a 16 GB MacBook Pro or a GPU with 12 GB VRAM runs it comfortably.

Thus a Q4‑quantized 9B model can outperform many 80B models while still fitting on a laptop.
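As a sanity check on the figures above: a quantized checkpoint is roughly parameters × bits-per-weight ÷ 8 bytes, plus overhead for embeddings and metadata, and the runtime adds KV cache on top. A minimal sketch (the overhead factor and the ~4.5 effective bits for Q4_K_M are assumptions, not measured values):

```python
def estimate_quantized_gb(params_billion: float, bits_per_weight: float,
                          overhead: float = 1.2) -> float:
    """Rough on-disk size of a quantized model in GB.

    params_billion:  parameter count in billions (9 for the 9B model)
    bits_per_weight: average bits after quantization (~4.5 for Q4_K_M)
    overhead:        fudge factor for embeddings and metadata (an assumption)
    """
    return params_billion * bits_per_weight / 8 * overhead

print(round(estimate_quantized_gb(9, 4.5), 1))    # file size; runtime adds KV cache
print(round(estimate_quantized_gb(0.8, 4.5), 1))
```

The gap between this file-size estimate and the ~9 GB figure quoted above is the context-dependent KV cache and runtime buffers.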

Choosing a Quantization Version

UD‑Q4_K_XL (recommended): best balance of size and accuracy, negligible precision loss.

Q4_K_M: classic 4‑bit quantization with the widest compatibility, small loss.

UD‑Q2_K_XL: extreme memory saving, acceptable loss for very constrained devices.

Q8_0: near‑full precision, requires more memory.

Unsloth’s KL‑divergence tests place UD‑Q4_K_XL on the Pareto front as SOTA.
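The KL‑divergence comparison behind that claim can be illustrated in a few lines: score how far the quantized model's next‑token distribution drifts from the full‑precision one, where lower is better (the distributions below are toy values, not real model outputs):

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for two discrete next-token distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

fp16  = [0.70, 0.20, 0.10]   # full-precision next-token probabilities (toy)
quant = [0.68, 0.21, 0.11]   # quantized model's probabilities (toy)
print(kl_divergence(fp16, quant))  # close to 0: quantization barely moved the output
```

Averaged over many prompts and positions, this is the kind of score that places UD‑Q4_K_XL on the Pareto front.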

Method 1 – Run Directly with llama.cpp (Recommended)

1. Compile llama.cpp

# Clone the latest repository
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
# macOS / CPU build
cmake -B build -DGGML_CUDA=OFF
cmake --build build --config Release -j
# If you have an NVIDIA GPU, enable CUDA:
# cmake -B build -DGGML_CUDA=ON
# cmake --build build --config Release -j

2. Download a GGUF model

pip install huggingface_hub hf_transfer
# Enable the fast Rust downloader installed above
export HF_HUB_ENABLE_HF_TRANSFER=1
# Example: download the 9B Q4_K_M version
huggingface-cli download unsloth/Qwen3.5-9B-GGUF \
    --include "Qwen3.5-9B-Q4_K_M.gguf" \
    --local-dir ./models

Replace 9B with 0.8B, 2B or 4B to get other sizes.

3. Interactive chat (Non‑Thinking mode, default)

./build/bin/llama-cli \
    -m ./models/Qwen3.5-9B-Q4_K_M.gguf \
    --ctx-size 16384 \
    -cnv

That launches a simple REPL.

4. Enable Thinking mode

The small models ship with Thinking disabled. To enable it, start llama-server with an extra JSON argument:

./build/bin/llama-server \
    -m ./models/Qwen3.5-9B-Q4_K_M.gguf \
    --ctx-size 16384 \
    --chat-template-kwargs '{"enable_thinking":true}'

Now the model can emit a <think>…</think> reasoning chain.
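Client code that consumes Thinking-mode output usually wants to separate the reasoning chain from the final answer. A minimal parser for the tag format described above:

```python
import re

def split_thinking(text: str) -> tuple[str, str]:
    """Split a response into (reasoning, answer) around a <think>...</think> block."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()          # Non-Thinking output: no reasoning block
    reasoning = match.group(1).strip()
    answer = text[match.end():].strip()  # everything after the closing tag
    return reasoning, answer

reasoning, answer = split_thinking("<think>2+2 is 4.</think>The answer is 4.")
print(answer)  # The answer is 4.
```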

Method 2 – Deploy as an OpenAI‑compatible API with llama-server

Start the server (Non‑Thinking is recommended for everyday use):

# Non‑Thinking (default)
./build/bin/llama-server \
    -m ./models/Qwen3.5-9B-Q4_K_M.gguf \
    --ctx-size 16384 \
    --port 8080 \
    --n-gpu-layers 35
# Thinking mode (optional)
./build/bin/llama-server \
    -m ./models/Qwen3.5-9B-Q4_K_M.gguf \
    --ctx-size 16384 \
    --port 8080 \
    --n-gpu-layers 35 \
    --chat-template-kwargs '{"enable_thinking":true}'

Consume the service from any OpenAI SDK. Example with the Python client:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Qwen3.5-9B",
    messages=[{"role": "user", "content": "Write a quicksort in Python"}],
    temperature=0.7,
    top_p=0.8,
    max_tokens=4096,
)
print(response.choices[0].message.content)
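For long generations, streaming avoids waiting for the full response. A sketch against the same llama-server endpoint (model name and port as in the example above); collect_stream is plain Python and works on any OpenAI-style chunk iterator:

```python
def collect_stream(chunks) -> str:
    """Concatenate the text deltas of an OpenAI-style streaming response."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:  # the final chunk may carry no content
            parts.append(delta)
    return "".join(parts)

def stream_demo() -> str:
    from openai import OpenAI  # pip install openai
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")
    stream = client.chat.completions.create(
        model="Qwen3.5-9B",
        messages=[{"role": "user", "content": "Write a quicksort in Python"}],
        stream=True,
    )
    return collect_stream(stream)
```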

Method 3 – GPU‑only Path with vLLM or SGLang

If you have a discrete GPU (e.g., RTX 3060 12 GB), you can run the original FP16 weights without quantization:

# vLLM deployment
vllm serve Qwen/Qwen3.5-9B \
    --port 8000 \
    --tensor-parallel-size 1 \
    --max-model-len 32768 \
    --reasoning-parser qwen3
# SGLang deployment
python -m sglang.launch_server \
    --model-path Qwen/Qwen3.5-9B \
    --port 8000 \
    --tp-size 1 \
    --mem-fraction-static 0.8 \
    --context-length 32768 \
    --reasoning-parser qwen3

Advantages over GGUF: zero precision loss, faster GPU inference, higher concurrency, and multi‑GPU tensor parallelism. The prerequisite is a capable GPU.

Recommended Sampling Parameters

Unsloth and Qwen publish recommended default sampling settings for each mode (e.g., temperature and top_p differ between Thinking and Non‑Thinking). Start from those defaults and adjust as needed.
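To build intuition for what top_p controls, here is a toy nucleus-sampling filter: keep the smallest set of most-likely tokens whose cumulative probability reaches top_p, then renormalize (illustrative only; the serving stack applies this per decoding step):

```python
def top_p_filter(probs: dict[str, float], top_p: float) -> dict[str, float]:
    """Keep the smallest set of top tokens whose mass reaches top_p, renormalized."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cum += p
        if cum >= top_p:
            break
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}

# With top_p=0.75 only "a" and "b" survive; "c" and "d" are cut off.
print(top_p_filter({"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}, top_p=0.75))
```

A lower top_p trims the long tail and makes output more deterministic; a higher value keeps more candidates and more variety.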

Advanced: Free Fine‑Tuning on Google Colab

Unsloth supplies ready‑to‑run Colab notebooks for each model size (0.8B, 2B, 4B, 9B). Opening a notebook launches a free T4 GPU, allowing you to train a personalized model without any local GPU.

Local fine‑tuning workflow (if you prefer your own machine)

pip install --upgrade --force-reinstall --no-cache-dir unsloth unsloth_zoo

Minimal SFT script (Python):

from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig

max_seq_length = 2048  # start small, then increase
url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
dataset = load_dataset("json", data_files={"train": url}, split="train")

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3.5-9B",
    max_seq_length=max_seq_length,
    load_in_4bit=True,   # 4‑bit QLoRA saves VRAM
    full_finetuning=False,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj","k_proj","v_proj","o_proj","gate_proj","up_proj","down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
    random_state=3407,
    max_seq_length=max_seq_length,
)

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    tokenizer=tokenizer,
    args=SFTConfig(
        dataset_text_field="text",
        max_seq_length=max_seq_length,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        max_steps=100,  # quick sanity run
        logging_steps=1,
        output_dir="outputs_qwen35",
        optim="adamw_8bit",
        seed=3407,
    ),
)
trainer.train()
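To see why LoRA at r=16 is so cheap, count the added weights: each adapted matrix W of shape d_out × d_in gains two low-rank factors, A (r × d_in) and B (d_out × r), and only those are trained. A sketch with an illustrative hidden size (not the real model's dimensions):

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    return r * d_in + d_out * r   # A: r x d_in, B: d_out x r

full  = 4096 * 4096                      # hypothetical full projection matrix
added = lora_params(4096, 4096, r=16)    # its rank-16 adapter
print(added / full)  # 0.0078125 -- under 1% of the original weights
```

This is why QLoRA fits in modest VRAM: the base weights stay frozen in 4‑bit, and gradients flow only through the tiny adapters.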

Tips for limited VRAM:

Set per_device_train_batch_size to 1.

Reduce max_seq_length (e.g., from 2048 to 1024).

Keep use_gradient_checkpointing="unsloth" enabled – it dramatically cuts memory usage while allowing longer contexts.

Even a single 16 GB T4 can fine‑tune the 9B model in 4‑bit mode.

Visual Fine‑Tuning

Qwen3.5 is a multimodal model. Unsloth also supports vision fine‑tuning via FastVisionModel:

from unsloth import FastVisionModel
# Load the vision-capable checkpoint before attaching LoRA adapters
model, tokenizer = FastVisionModel.from_pretrained(
    "Qwen/Qwen3.5-9B",
    load_in_4bit=True,
)
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
    r=16,
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    random_state=3407,
    target_modules="all-linear",
)

You can choose to fine‑tune only vision layers, only language layers, or both.

Exporting After Fine‑Tuning

GGUF for llama.cpp / Ollama / LM Studio

# Export as Q4_K_M GGUF
model.save_pretrained_gguf("my_model", tokenizer, quantization_method="q4_k_m")
# Or export as Q8_0 GGUF
model.save_pretrained_gguf("my_model", tokenizer, quantization_method="q8_0")
# Push to HuggingFace (optional)
model.push_to_hub_gguf("your-username/my_model", tokenizer, quantization_method="q4_k_m")

16‑bit merged checkpoint for vLLM

model.save_pretrained_merged("finetuned_model", tokenizer, save_method="merged_16bit")
model.push_to_hub_merged("your-username/model", tokenizer, save_method="merged_16bit", token="")

Save only the LoRA adapter (tiny size)

model.save_pretrained("finetuned_lora")
tokenizer.save_pretrained("finetuned_lora")

The full workflow is: free Colab fine‑tuning → export GGUF → run locally with llama.cpp. No cost.

Key Fine‑Tuning Pitfalls

To retain reasoning ability, ensure that at least 75 % of your training samples contain a <think>…</think> block.

If the exported model’s quality drops, the most common cause is a mismatch between the chat template/EOS token used at inference and the one used during training; Unsloth warns about this automatically.
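A quick sanity check for the template/EOS mismatch described above: after export, confirm that raw generations actually terminate with the EOS string your training template used (the token below is an assumption; read the real value from your tokenizer_config.json):

```python
EXPECTED_EOS = "<|im_end|>"  # assumption: check tokenizer_config.json for the real token

def ends_with_eos(generation: str, eos: str = EXPECTED_EOS) -> bool:
    """True if the raw generation (before the server strips it) stops at the expected EOS."""
    return generation.rstrip().endswith(eos)

print(ends_with_eos("Quicksort works by...<|im_end|>"))   # True: templates agree
print(ends_with_eos("Quicksort works by... and then"))    # False: likely a template mismatch
```

Generations that run on past the expected stop token are the classic symptom of this mismatch.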

vLLM version note: as of this writing, vLLM 0.16.0 does not support Qwen3.5; support arrives in 0.17.0 or the nightly builds.

Advanced: Using the Model with Claude Code or OpenAI Codex

After launching llama-server, point the client to the local endpoint:

export OPENAI_BASE_URL=http://localhost:8080/v1
# Then configure Claude Code or OpenAI Codex to use that base URL.

A 9B model can power a local coding assistant without any API fees.

Advanced: Extending Context Length to One Million Tokens

Qwen3.5‑9B natively supports 262 k tokens. To process longer texts (e.g., whole books), enable YaRN in vLLM:

VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve Qwen/Qwen3.5-9B \
    --hf-overrides '{"text_config": {"rope_parameters": {"mrope_interleaved": true, "mrope_section": [11, 11, 10], "rope_type": "yarn", "rope_theta": 10000000, "partial_rotary_factor": 0.25, "factor": 4.0, "original_max_position_embeddings": 262144}}}' \
    --max-model-len 1010000

A 9B model handling a million‑token context is remarkable for a single‑GPU setup.

Model‑Selection Cheat Sheet

Raspberry Pi / IoT : 0.8B + Q4_K_M, ~5 GB.

Phone / Light laptop : 2B + Q4_K_M, ~5 GB.

MacBook Air 8 GB : 4B + UD‑Q4_K_XL, ~7 GB.

MacBook Pro 16 GB / 12 GB GPU : 9B + UD‑Q4_K_XL, ~9 GB.

Extreme lightweight : 0.8B + UD‑Q2_K_XL, ~3 GB.

The author’s personal favorite is the 9B model quantized to Q4, which scores 81.7 on the GPQA Diamond benchmark and fits into a regular notebook.
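The cheat sheet can be folded into a tiny helper that maps a memory budget to a suggested size/quant pair (thresholds follow the table above; treat them as starting points, not hard limits):

```python
def suggest_model(mem_gb: float) -> str:
    """Suggest a Qwen3.5 size/quant for a combined RAM+VRAM budget, per the table above."""
    if mem_gb >= 9:
        return "9B + UD-Q4_K_XL"
    if mem_gb >= 7:
        return "4B + UD-Q4_K_XL"
    if mem_gb >= 5:
        return "2B + Q4_K_M"
    return "0.8B + UD-Q2_K_XL"

print(suggest_model(16))  # 9B + UD-Q4_K_XL
print(suggest_model(8))   # 4B + UD-Q4_K_XL
```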

Conclusion

Low barrier: 3 GB of RAM runs the 0.8B model; 9 GB runs the 9B.

Reliable accuracy: Dynamic 2.0 Q4 quantization is virtually lossless.

Complete toolchain: llama.cpp, vLLM, SGLang, and Unsloth’s fine‑tuning suite.

Rich scenarios: chat, agents, code generation, million‑token documents.

Free fine‑tuning: Google Colab’s T4 GPU.

Closed‑loop export: fine‑tuned model → GGUF → local inference.

Relevant links:

Unsloth deployment guide: https://unsloth.ai/docs/models/qwen3.5

Unsloth fine‑tuning guide: https://unsloth.ai/docs/models/qwen3.5/fine-tune

GGUF collection: https://huggingface.co/collections/unsloth/qwen35

Qwen3.5‑9B model card: https://huggingface.co/Qwen/Qwen3.5-9B

llama.cpp repository: https://github.com/ggml-org/llama.cpp

Tags: Fine-tuning, Local Deployment, AI Models, llama.cpp, GGUF, Qwen3.5, Unsloth
Written by

Old Zhang's AI Learning

AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
