From Zero to Deployment: A Complete Qwen3.5 Fine‑Tuning Guide
This guide shows how to fine‑tune Qwen3.5 models—from 0.8B to 122B—using Unsloth Studio or pure code, covering text SFT, vision fine‑tuning, MoE models, reinforcement‑learning (GRPO), extensive GGUF quantization benchmarks, hardware requirements, export formats, and deployment tips.
Qwen3.5 Fine‑Tuning Overview
Core advantages
Training is ~1.5× faster than standard FlashAttention 2 (FA2).
VRAM usage reduced by ~50%.
Supports the full model series: 0.8B, 2B, 4B, 9B, 27B, 35B‑A3B (MoE), 122B‑A10B (MoE).
Three fine‑tuning routes: text SFT, vision, reinforcement learning (GRPO).
Export formats: GGUF (Ollama/llama.cpp), 16‑bit for vLLM, LoRA adapters.
Multilingual fine‑tuning for 201 languages.
BF16 LoRA VRAM requirements
0.8B – 3 GB
2B – 5 GB
4B – 10 GB
9B – 22 GB
27B – 56 GB
35B‑A3B – 74 GB
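For quick capacity planning, the table above can be wrapped in a small lookup helper (a sketch; the figures are the BF16 LoRA numbers listed here, and the helper name is made up):
VRAM_GB = {"0.8B": 3, "2B": 5, "4B": 10, "9B": 22, "27B": 56, "35B-A3B": 74}  # BF16 LoRA, from the table above
def largest_model_for(vram_gb: float) -> str:
    """Return the largest listed Qwen3.5 variant whose BF16 LoRA run fits in vram_gb."""
    fits = [(gb, name) for name, gb in VRAM_GB.items() if gb <= vram_gb]
    if not fits:
        raise ValueError(f"No listed model fits in {vram_gb} GB")
    return max(fits)[1]
print(largest_model_for(24))  # -> "9B"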
Important reminders
Use transformers v5; earlier versions are incompatible (a version guard is sketched after this list).
QLoRA (4‑bit) is not recommended for Qwen3.5 because of high quantization error.
For MoE models (35B‑A3B, 122B‑A10B) use BF16 LoRA and avoid QLoRA.
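A quick runtime guard catches the version requirement early. A minimal sketch; packaging is already a transformers dependency:
import transformers
from packaging.version import Version
# Qwen3.5 requires transformers v5; fail fast on older installs.
if Version(transformers.__version__) < Version("5.0.0"):
    raise RuntimeError(f"transformers {transformers.__version__} is too old; upgrade to v5+")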
Method 1: Unsloth Studio (no‑code)
Installation (macOS/Linux/WSL):
curl -fsSL https://unsloth.ai/install.sh | sh
Windows PowerShell:
irm https://unsloth.ai/install.ps1 | iex
Start the UI:
unsloth studio -H 0.0.0.0 -p 8888
Open http://localhost:8888, set a password, select the Qwen3.5 model, choose a dataset, adjust parameters, and launch training with a few clicks. The UI displays real‑time loss curves and allows direct export to GGUF or safetensors formats.
Method 2: Code‑Based SFT (text fine‑tuning)
Minimal runnable script (requires unsloth and datasets libraries):
from unsloth import FastLanguageModel
import torch
from datasets import load_dataset
from trl import SFTTrainer, SFTConfig
max_seq_length = 2048 # start small
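# Small public instruction dataset in JSONL format; substitute your own data files here.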
url = "https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl"
dataset = load_dataset("json", data_files={"train": url}, split="train")
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "Qwen/Qwen3.5-27B",
max_seq_length = max_seq_length,
load_in_4bit = False, # skip QLoRA; quantization error is too high on Qwen3.5
load_in_16bit = True, # bf16 LoRA
full_finetuning = False,
)
model = FastLanguageModel.get_peft_model(
model,
r = 16,
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
lora_alpha = 16,
lora_dropout = 0,
bias = "none",
use_gradient_checkpointing = "unsloth",
random_state = 3407,
max_seq_length = max_seq_length,
)
trainer = SFTTrainer(
model = model,
train_dataset = dataset,
tokenizer = tokenizer,
args = SFTConfig(
max_seq_length = max_seq_length,
per_device_train_batch_size = 1,
gradient_accumulation_steps = 4,
warmup_steps = 10,
max_steps = 100,
logging_steps = 1,
output_dir = "outputs_qwen35",
optim = "adamw_8bit",
seed = 3407,
dataset_num_proc = 1,
),
)
trainer.train()
Key parameters
load_in_16bit = True – enables BF16 LoRA for stability.
use_gradient_checkpointing = "unsloth" – reduces memory consumption.
r = 16 – LoRA rank; larger values increase capacity but may overfit.
lora_alpha = 16 – recommended α ≥ r.
If out‑of‑memory errors occur, keep per_device_train_batch_size at 1 (as in the script above) and reduce max_seq_length.
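After training, a quick generation pass is a useful sanity check. A minimal sketch using Unsloth's standard inference switch (prompt and decoding settings are placeholders):
FastLanguageModel.for_inference(model)  # switch the model into Unsloth's faster inference mode
inputs = tokenizer("What is the capital of France?", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))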
MoE Model Fine‑Tuning (35B / 122B)
from unsloth import FastModel
model, tokenizer = FastModel.from_pretrained(
model_name = "unsloth/Qwen3.5-35B-A3B",
max_seq_length = 2048,
load_in_4bit = False,
load_in_16bit = True,
full_finetuning = False,
)
Unsloth’s MoE kernel is reported to be 12× faster than the standard implementation, to cut VRAM by 35%, and to extend usable context length 6×. For the 122B‑A10B model, BF16 LoRA requires ~256 GB of VRAM; use device_map = "balanced" for multi‑GPU setups.
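For the 122B‑A10B model on multiple GPUs, the same loading call can take a device_map. A sketch, assuming the repo name mirrors the 35B example above and that device_map is forwarded to the underlying loader:
from unsloth import FastModel
model, tokenizer = FastModel.from_pretrained(
    model_name = "unsloth/Qwen3.5-122B-A10B",  # assumed repo name, mirroring the 35B example
    max_seq_length = 2048,
    load_in_16bit = True,  # BF16 LoRA, as recommended for MoE
    full_finetuning = False,
    device_map = "balanced",  # spread layers evenly across all visible GPUs
)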
Vision Fine‑Tuning (VLM)
from unsloth import FastVisionModel
model, tokenizer = FastVisionModel.from_pretrained(
"unsloth/Qwen3.5-4B",
load_in_4bit = False,
use_gradient_checkpointing = "unsloth",
)
model = FastVisionModel.get_peft_model(
model,
finetune_vision_layers = True,
finetune_language_layers = True,
finetune_attention_modules = True,
finetune_mlp_modules = True,
r = 16,
lora_alpha = 16,
lora_dropout = 0,
bias = "none",
random_state = 3407,
target_modules = "all-linear",
modules_to_save = ["lm_head", "embed_tokens"],
)
The script gives fine‑grained control over which components (vision, language, attention, MLP) are adapted.
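Vision SFT expects multimodal chat examples. A sketch of one training record in the common Hugging Face messages convention (the field names are the usual ones, not confirmed for this exact model):
sample = {
    "messages": [
        {"role": "user", "content": [
            {"type": "image", "image": "receipt.jpg"},
            {"type": "text", "text": "Extract the total amount."},
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": "The total is $42.17."},
        ]},
    ]
}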
Free T4 GPU users can run the official Colab notebook at https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Qwen3_5_(4B)_Vision.ipynb
Reinforcement Learning (GRPO)
Because vLLM does not yet support Qwen3.5, disable fast inference to train with GRPO:
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
model_name = "unsloth/Qwen3.5-4B",
fast_inference = False, # disable vLLM fast inference
)
Keep at least 75% of training examples in inference‑style format to preserve downstream reasoning ability.
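GRPO needs one or more reward functions that score sampled completions. A minimal sketch with TRL's GRPOTrainer and a toy length-based reward (the reward, dataset, and hyperparameters are placeholders; use a task-specific verifier in practice):
from trl import GRPOConfig, GRPOTrainer
def reward_concise(completions, **kwargs):
    # Toy reward: prefer answers under 200 characters.
    return [1.0 if len(c) < 200 else 0.0 for c in completions]
trainer = GRPOTrainer(
    model = model,
    reward_funcs = [reward_concise],
    args = GRPOConfig(output_dir = "outputs_grpo", max_steps = 50),
    train_dataset = dataset,  # needs a "prompt" column
)
trainer.train()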
GGUF Quantization Benchmark – Recommendations
Unsloth performed >150 KL‑divergence (KLD) benchmarks across ≈9 TB of GGUF files; a minimal KLD computation is sketched after this list. Key findings:
Avoid MXFP4; it performs poorly across most tensors. Q4_K outperforms it in almost all scenarios.
Do not quantize the ssm_out (Mamba) layer – KLD spikes with minimal storage gain.
3‑bit quantization is the sweet spot: ffn_up_exps and ffn_gate_exps can be quantized to ~3‑bit (iq3_xxs). 2‑bit leads to noticeable degradation.
Imatrix quantization reduces KLD and perplexity at the cost of 5‑10% slower inference; benefits low‑bit quantization.
Attention layers are highly sensitive; for MoE models they should remain high‑precision.
Perplexity (PPL) and KL can be misleading: Unsloth Dynamic IQ2_XXS outperforms AesSedai IQ3_S on real‑world benchmarks (LiveCodeBench v6, MMLU Pro) despite a smaller file size, while the latter shows better PPL/KL.
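For reference, the KL divergence these benchmarks measure compares the full‑precision model's next‑token distribution p with the quantized model's q: KL(p‖q) = Σᵢ pᵢ log(pᵢ/qᵢ). A minimal sketch:
import numpy as np
def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) between two next-token probability distributions.
    p = np.asarray(p, dtype=np.float64) + eps
    q = np.asarray(q, dtype=np.float64) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))
print(kl_divergence([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))  # near 0 = quantization barely changed the distribution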
Export and Deployment
After fine‑tuning, export to the desired format:
GGUF for Ollama / llama.cpp:
model.save_pretrained_gguf("directory", tokenizer, quantization_method="q4_k_m")
model.save_pretrained_gguf("directory", tokenizer, quantization_method="q8_0")
16‑bit for vLLM:
model.save_pretrained_merged("finetuned_model", tokenizer, save_method="merged_16bit")
LoRA adapter only:
model.save_pretrained("finetuned_lora")
tokenizer.save_pretrained("finetuned_lora")
Push to Hugging Face:
model.push_to_hub_gguf("hf_username/model", tokenizer, quantization_method="q4_k_m")
Note: vLLM 0.16.0 does not support Qwen3.5; support arrives in v0.17.0 and nightly builds. If inference quality degrades after export, verify that the chat template and EOS token match the training configuration.
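A quick pre‑deployment check using the standard transformers API:
# Verify the EOS token and chat template before serving the exported model.
print(tokenizer.eos_token)
messages = [{"role": "user", "content": "Hello!"}]
print(tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True))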