Distilling Claude Opus into Qwen3.6-27B – GGUF Lets You Run Locally on Consumer GPUs
Qwopus3.6-27B-v1-preview distills Claude Opus onto Qwen3.6-27B through supervised fine-tuning (SFT) with the Unsloth stack on a curated set of roughly 12K high-quality inference samples. The model was evaluated on agentic reasoning, front-end design, and Canvas/WebGL tasks using an RTX 5090, and can be deployed locally via llama.cpp GGUF quantizations, with detailed memory guidelines below.
Model Release
Qwopus3.6-27B-v1-preview is a distillation of Claude Opus onto the open-source Qwen3.6-27B model, produced by supervised fine-tuning (SFT) with the Unsloth training stack. The dataset consists of roughly 12K high-quality inference samples, drawn primarily from Kassadin88/Claude-Distillation-Dataset and supplemented with outputs from GLM-5.1, Kimi-K2.5, and Qwen3.5. The model is released under the Apache-2.0 license and is labeled a preview version.
Training Objectives
More structured reasoning processes.
Consistent answer style that does not drift in long texts.
Alignment of style across multiple source datasets.
Foundation for larger‑scale future versions.
Data Cleaning Process
The author filtered the raw distillation data using an 8B instruction model as a style filter: samples whose response style drifted from a unified tone were removed, leaving roughly 12K style-consistent entries. This reduce-rather-than-expand approach runs counter to the common practice of favoring ever-larger datasets.
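The post does not publish the filtering code. A minimal sketch of the idea, assuming the 8B judge is served behind a local OpenAI-compatible endpoint and scores each response for style consistency (the endpoint URL, rubric prompt, and 8/10 threshold are illustrative assumptions, not the author's actual setup):

import json
import requests

JUDGE_URL = "http://localhost:8081/v1/chat/completions"  # hypothetical local 8B judge

RUBRIC = (
    "Rate from 1 to 10 how closely the following response matches a calm, "
    "structured, step-by-step answer style. Reply with the number only.\n\n{response}"
)

def style_score(response_text: str) -> int:
    # One judge call per sample; temperature 0 for repeatable scores.
    resp = requests.post(JUDGE_URL, json={
        "messages": [{"role": "user", "content": RUBRIC.format(response=response_text)}],
        "temperature": 0,
        "max_tokens": 4,
    })
    try:
        return int(resp.json()["choices"][0]["message"]["content"].strip())
    except ValueError:
        return 0  # unparseable judgments count as rejects

# Keep only style-consistent records from a JSONL file of {"prompt", "response"} pairs.
with open("raw_distillation.jsonl") as fin, open("style_consistent.jsonl", "w") as fout:
    for line in fin:
        record = json.loads(line)
        if style_score(record["response"]) >= 8:
            fout.write(json.dumps(record, ensure_ascii=False) + "\n")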
Early Evaluation
Collaborator Kyle Hessling evaluated the model on a single RTX 5090 (32 GB) using llama.cpp with the GGUF‑quantized model. Sixteen prompts covering three scenarios—agentic reasoning, production‑grade front‑end design (a strength of Qwen3.6), and creative Canvas/WebGL tasks—were run and compared against the original Qwen3.6‑27B baseline.
Result screenshots show that the distilled model matches or exceeds the baseline on the selected prompts. Full evaluation report: https://huggingface.co/spaces/Jackrong/qwopus36-eval
Installation & Usage
The release provides GGUF files that can be used with llama.cpp or any GGUF‑compatible inference framework such as Ollama, LM Studio, or KoboldCpp.
Quantization options available in the repository:
Q2_K – 10.7 GB, extreme memory saving with noticeable quality loss.
Q3_K_L – memory‑friendly for 24 GB GPUs.
IQ4_XS – 15.2 GB, good quality‑to‑size ratio.
Higher‑level quantizations (Q4, Q5, Q6, Q8) – total repository size 162 GB, suitable for 40 GB+ or dual‑GPU setups.
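The exact file list may change between preview updates; a quick way to check what the repository currently ships is to list it programmatically. A minimal sketch using the huggingface_hub Python client (the repository ID is taken from the download command below):

from huggingface_hub import list_repo_files

# Print every GGUF quantization file currently in the release repo.
repo_id = "Jackrong/Qwopus3.6-27B-v1-preview-GGUF"
for name in list_repo_files(repo_id):
    if name.endswith(".gguf"):
        print(name)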
Example commands for IQ4_XS:
# Download the model file
huggingface-cli download Jackrong/Qwopus3.6-27B-v1-preview-GGUF \
Qwopus3.6-27B-v1-preview-IQ4_XS.gguf --local-dir ./qwopus
# Start the server
./llama-server \
-m ./qwopus/Qwopus3.6-27B-v1-preview-IQ4_XS.gguf \
-c 32768 \
--host 0.0.0.0 --port 8080

Memory guidelines (based on the dense 27B model):
IQ4_XS runs on a single 24 GB GPU (e.g., 4090, 5090, 3090) with moderate context length.
Q2_K fits into 16 GB GPUs, though quality loss is significant for the full 27 B model.
Higher‑quality quantizations (Q6, Q8) require 40 GB+ memory or dual‑GPU configurations.
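Once llama-server is running, it exposes an OpenAI-compatible HTTP API on the configured port. A minimal Python client sketch (the prompt and sampling parameters here are illustrative, not from the release notes):

import requests

# Query the local llama-server via its OpenAI-compatible chat endpoint.
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "Summarize the difference between SFT and RLHF in three sentences."}
        ],
        "temperature": 0.7,
        "max_tokens": 512,
    },
    timeout=300,
)
print(resp.json()["choices"][0]["message"]["content"])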
Ollama users can wrap the GGUF file in a Modelfile and register it locally with ollama create, as sketched below.
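A minimal sketch, reusing the IQ4_XS file downloaded above (the local alias qwopus is arbitrary):

# Write a one-line Modelfile pointing at the downloaded GGUF
cat > Modelfile <<'EOF'
FROM ./qwopus/Qwopus3.6-27B-v1-preview-IQ4_XS.gguf
EOF

# Register the model with Ollama and chat with it
ollama create qwopus -f Modelfile
ollama run qwopus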
Warning: Qwen3.6‑27B includes a vision encoder, but the current GGUF repository contains only the pure language weights; visual support in llama.cpp must be verified independently.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.