Distilling Claude Opus 4.6 into Qwen3.5‑27B: High‑Quality Reasoning on a Single RTX 3090
The article details how Claude Opus 4.6 chain-of-thought data were used to distill the 27-billion-parameter Qwen3.5-27B model with Unsloth and LoRA, achieving full-context inference on a single RTX 3090/4090. It covers performance numbers, hyper-parameter tips, benchmark gains, and the trade-off of losing multimodal abilities.
The post explains how Claude Opus 4.6's high-quality chain-of-thought (CoT) data can be used to distill that reasoning into the open-source Qwen3.5-27B model, producing a model that runs on a single consumer-grade RTX 3090 or 4090 GPU.
1. Jackrong/Qwen3.5‑27B‑Claude‑4.6‑Opus‑Reasoning‑Distilled
Jackrong released an open-source version on Hugging Face that quickly amassed tens of thousands of downloads. The training pipeline is deliberately simple: the Unsloth framework with LoRA (rank = 64), fine-tuned on roughly 3,280 high-quality Claude Opus 4.6 response pairs. Crucially, the train_on_responses_only strategy restricts the loss calculation to the <think> block and the final answer, masking out the task prompt and compelling the model to imitate Claude's deep, structured reasoning (a sketch of this setup follows the results below). A typical distilled reasoning trace opens like this:
<think>
Let me analyze this request carefully:
1. Identify the core objective of the problem.
2. Break the task into clearly defined subcomponents.
3. Evaluate constraints and edge cases.
4. Formulate a step‑by‑step solution plan.
5. Execute the reasoning sequentially and verify consistency...
</think>

Empirical results show that the distilled model occupies only about 16.5 GB of VRAM, fits comfortably on a 24 GB RTX 3090, and generates at 29–35 tokens/second. It retains the full 262K-token context window, unlike earlier fine-tuned variants that limited the window to 8K. The model also fixes a crash that occurred in Jinja chat templates when the developer role was used.
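For readers who want to reproduce the recipe, here is a minimal sketch of such a pipeline, assuming recent Unsloth and TRL APIs. The dataset file, the training arguments other than the LoRA rank, and the ChatML marker strings are illustrative assumptions, not the author's exact script.

```python
# Minimal sketch of the described recipe (not the author's exact script).
# Assumptions: recent Unsloth + TRL APIs, a JSONL file with a pre-rendered
# "text" column in Qwen's ChatML format, and illustrative training args.
from unsloth import FastLanguageModel
from unsloth.chat_templates import train_on_responses_only
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3.5-27B",  # base model named in the post
    max_seq_length=32768,
    load_in_4bit=True,                 # assumption: 4-bit to fit a 24 GB card
)

# LoRA with rank 64, as stated in the model card.
model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    lora_alpha=64,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Hypothetical file of Claude Opus 4.6 CoT response pairs.
dataset = load_dataset("json", data_files="claude_opus_cot.jsonl")["train"]

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=SFTConfig(
        dataset_text_field="text",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=1,
        learning_rate=2e-4,
        output_dir="outputs",
    ),
)

# Mask everything before the assistant turn so the loss covers only the
# <think> block and the final answer, not the task prompt.
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|im_start|>user\n",       # assumption: Qwen ChatML tags
    response_part="<|im_start|>assistant\n",
)
trainer.train()
```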
2. TeichAI/Qwen3.5‑27B‑Claude‑Opus‑4.6‑Distill
TeichAI released a parallel distillation that also starts from unsloth/Qwen3.5-27B but applies its own filtered dataset. The repository includes a "nanny-level" (exhaustively hand-holding) hyper-parameter guide, applied in the sketch after the list:
- General reasoning tasks: temperature = 1.0, top_p = 0.95, min_p = 0.0 to maximize creative reasoning.
- Code / web development (high-precision mode): temperature lowered to 0.6 and presence_penalty set to 0.0 to keep the model's output tightly aligned with logical constraints.
- Output length: up to 32,768 tokens for ordinary dialogue; for challenging programming-contest problems, extend to 81,920 tokens to give the chain of thought ample space.
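To make the guide concrete, below is a minimal sketch that applies both profiles through an OpenAI-compatible endpoint such as a local vLLM or llama.cpp server. The base URL and model name are placeholders, and min_p travels in extra_body because it is not part of the standard OpenAI schema.

```python
from openai import OpenAI

# Hypothetical local OpenAI-compatible endpoint serving the distilled model.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def ask(prompt: str, precise: bool = False) -> str:
    """General-reasoning profile by default; precise=True switches to the code/web profile."""
    resp = client.chat.completions.create(
        model="teichai/qwen3.5-27b-claude-opus-4.6-distill",  # placeholder name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.6 if precise else 1.0,
        top_p=0.95,
        presence_penalty=0.0,
        max_tokens=32768,           # raise toward 81,920 for contest problems
        extra_body={"min_p": 0.0},  # min_p is non-standard; vLLM accepts it here
    )
    return resp.choices[0].message.content

print(ask("Prove that the sum of two even integers is even."))
```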
The model card includes a benchmark comparison chart demonstrating measurable improvements over the vanilla unsloth/Qwen3.5-27B across several metrics.
Distillation Trade‑offs
The approach creates a reproducible pipeline: Claude reasoning data + Qwen base + Unsloth fine‑tuning. The resulting model excels at pure code, mathematics, and heavy logical reasoning, but it loses the multimodal capabilities present in the original Qwen3.5‑27B. Because the release is early, some prompt‑template bugs (e.g., occasional layout glitches) still exist.
Practitioners are encouraged to try the GGUF format locally to see whether the distilled model can serve as a cost‑effective substitute for expensive cloud APIs.
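As a starting point, here is a minimal llama-cpp-python sketch for local GGUF inference; the file name, context size, and GPU-offload settings are assumptions to tune for your hardware.

```python
from llama_cpp import Llama

# Hypothetical GGUF file name; pick a quantization that fits your VRAM.
llm = Llama(
    model_path="Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-Q4_K_M.gguf",
    n_ctx=32768,      # the model supports up to 262K, but memory limits apply
    n_gpu_layers=-1,  # offload all layers to a 24 GB-class GPU
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain LoRA rank in one paragraph."}],
    temperature=0.6,  # high-precision profile from the guide above
    max_tokens=1024,
)
print(out["choices"][0]["message"]["content"])
```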
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.