Claude‑Opus‑4.6 Distilled Qwen3.5 v2: Faster Reasoning with Same Code Accuracy
Version 2 of the Claude‑Opus‑4.6‑distilled Qwen3.5 model keeps code‑generation accuracy while cutting reasoning length by 24% and raising per‑token correctness by 31.6%. That gives a noticeable speed and cost advantage for local LLM deployment, at the price of a 7.2% drop on MMLU‑Pro.
What’s new in v2?
The upgrade targets speed and efficiency rather than raw accuracy. The reasoning chain is 24% shorter and correctness per generated token rises by 31.6%, while HumanEval pass@1 is virtually unchanged at 96.91% (v1: 96.95%). The only notable regression is a 7.2% drop on MMLU‑Pro.
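The 24% and 31.6% figures are linked by simple arithmetic. The sketch below assumes that "per‑token correctness" means benchmark accuracy divided by average reasoning length; the source does not spell out the definition, so treat this as a consistency check rather than the author's formula.

```python
# Assumption: "per-token correctness" = accuracy / average reasoning tokens.
# The source does not define the metric; this only checks that the reported
# figures are mutually consistent under that reading.

def per_token_gain(length_reduction: float, accuracy_ratio: float = 1.0) -> float:
    """Relative change in accuracy-per-token after shortening the chain."""
    return accuracy_ratio / (1.0 - length_reduction) - 1.0

# Accuracy held exactly constant, chain 24% shorter.
gain_flat = per_token_gain(0.24)

# Same, but with the small HumanEval change (96.95% -> 96.91%) included.
gain_humaneval = per_token_gain(0.24, accuracy_ratio=96.91 / 96.95)

print(f"{gain_flat:.1%}")       # 31.6%
print(f"{gain_humaneval:.1%}")  # 31.5%
```

Under that assumption, the per‑token gain follows almost entirely from the shorter chain: 1 / 0.76 is about 1.316.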
How the improvement was achieved
Jackrong fine‑tuned Qwen3.5‑27B using Unsloth + LoRA SFT with a Response‑Only Training regime that supervises only the assistant’s thinking segment. The key ingredient is a curated set of ≈14,000 Claude‑4.6 Opus‑style general‑reasoning samples (math, logic, text) – deliberately excluding code questions.
This design teaches the model a more efficient “thinking scaffold”. The resulting reasoning pattern looks like:
Let me analyze this request carefully:
1. Identify the core objective of the problem.
2. Break the task into clearly defined subcomponents.
3. Evaluate constraints and edge cases.
4. Formulate a step‑by‑step solution plan.
5. Execute the reasoning sequentially and verify consistency.
Compared with v1’s verbose chain‑of‑thought, v2 behaves like an experienced engineer who outlines first and then proceeds, yielding a structured, concise output.
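The Response‑Only Training idea mentioned above can be illustrated with a small label‑masking sketch. This is not Unsloth's actual implementation: the token IDs, the marker IDs, and the helper names `find_subsequence` and `mask_prompt` are hypothetical. The only borrowed convention is the label value -100, which PyTorch's cross‑entropy loss ignores by default.

```python
# Sketch of response-only label masking. Token IDs are made up for
# illustration; a real run would use the model's own tokenizer.
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def find_subsequence(seq, pattern):
    """Return the start index of the first occurrence of pattern in seq, or -1."""
    n, m = len(seq), len(pattern)
    for i in range(n - m + 1):
        if seq[i:i + m] == pattern:
            return i
    return -1

def mask_prompt(input_ids, response_marker):
    """Copy input_ids to labels, hiding everything up to and including the marker."""
    start = find_subsequence(input_ids, response_marker)
    if start == -1:
        # No assistant turn found: nothing contributes to the loss.
        return [IGNORE_INDEX] * len(input_ids)
    cut = start + len(response_marker)
    return [IGNORE_INDEX] * cut + list(input_ids[cut:])

# Hypothetical IDs: 1-3 = system/user prompt, [90, 91] = assistant marker,
# 7-9 = the assistant's reasoning and answer.
input_ids = [1, 2, 3, 90, 91, 7, 8, 9]
labels = mask_prompt(input_ids, response_marker=[90, 91])
print(labels)  # [-100, -100, -100, -100, -100, 7, 8, 9]
```

Only the tokens after the assistant marker produce a training signal, so the model learns how to respond without being trained to reproduce prompts. The sketch handles a single assistant turn; multi‑turn conversations would need every assistant span unmasked.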
Training details
Base model: Qwen3.5‑27B
Framework: Unsloth + LoRA SFT
Method: Response‑Only Training (masking on "<|im_start|>assistant\n<think>")
Data volume: ~14k high‑quality reasoning trajectories
Datasets used:
Opus‑4.6‑Reasoning‑3000x‑filtered
claude‑opus‑4.6‑10000x
claude‑4.5‑opus‑high‑reasoning‑250x
Qwen3.5‑reasoning‑700x
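As a quick sanity check on the "~14k" figure, the nominal sizes of the four datasets can be summed. This assumes the numeric suffix in each dataset name is its sample count, which the source does not confirm; filtering may have changed the real totals.

```python
# Assumption: the number before the "x" in each dataset name is its nominal sample count.
nominal_sizes = {
    "Opus-4.6-Reasoning-3000x-filtered": 3000,
    "claude-opus-4.6-10000x": 10000,
    "claude-4.5-opus-high-reasoning-250x": 250,
    "Qwen3.5-reasoning-700x": 700,
}

total = sum(nominal_sizes.values())
print(total)  # 13950
```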
Base Model (Qwen3.5‑27B)
│
▼
Qwen3.5‑27B fine‑tuned with Unsloth
│
▼
Supervised Fine‑Tuning (SFT) + LoRA
(Response‑Only Training masked on "<|im_start|>assistant\n<think>")
│
▼
Jackrong/Qwen3.5‑27B‑Claude‑4.6‑Opus‑Reasoning‑Distilled‑v2
Trade‑offs
The speed gains come at the cost of broad‑knowledge performance: MMLU‑Pro accuracy falls by 7.2%. The author attributes this to SFT data that targets generic reasoning rather than long‑context or multi‑step tasks.
Running the model locally
Deployment requirements are unchanged: a 4‑bit quantized build of Qwen3.5‑27B runs on a single RTX 4090. The GGUF checkpoint is available on Hugging Face (Jackrong/Qwen3.5‑27B‑Claude‑4.6‑Opus‑Reasoning‑Distilled‑v2‑GGUF) and works with LM Studio, llama.cpp, or Ollama.
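A back‑of‑the‑envelope estimate shows why 4‑bit quantization matters for a 24 GB card. The numbers below are assumptions for illustration: 27 billion parameters (taken from the "27B" in the model name) and 4.5 bits per weight as a stand‑in for a mixed 4‑bit GGUF quant. The estimate covers weights only; KV cache and runtime buffers come on top.

```python
# Rough weight-memory estimate for a quantized model.
# Assumptions: 27e9 parameters (from the "27B" name), 24 GB of VRAM on an
# RTX 4090, and an illustrative 4.5 bits/weight for a mixed 4-bit quant.

def weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in GB (1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 27e9
VRAM_GB = 24.0

for bits in (16, 8, 4.5, 4):
    gb = weight_gb(N_PARAMS, bits)
    fits = "fits" if gb < VRAM_GB else "does not fit"
    print(f"{bits:>4} bits/weight: {gb:5.1f} GB -> {fits}")
```

At 16 or 8 bits the weights alone exceed 24 GB, while a roughly 4‑bit build leaves headroom for context.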
In the author’s tests, v1 reached ~46 tokens/s on a 4090. With a reasoning chain that is 24% shorter, v2 finishes each answer sooner on the same hardware, even if raw tokens‑per‑second is unchanged.
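To put the saving in wall‑clock terms, here is a minimal estimate. It assumes decode speed stays at the ~46 tokens/s reported for v1 and uses a made‑up 1,000‑token v1 reasoning chain; prompt processing time is ignored.

```python
# Assumptions: decode speed stays at the ~46 tokens/s reported for v1, and the
# 1,000-token v1 reasoning chain is a hypothetical example length.

def decode_seconds(n_tokens: float, tokens_per_s: float) -> float:
    """Time to generate n_tokens at a fixed decode speed."""
    return n_tokens / tokens_per_s

TOKENS_PER_S = 46.0
v1_tokens = 1000
v2_tokens = v1_tokens * (1 - 0.24)  # 24% shorter chain

t_v1 = decode_seconds(v1_tokens, TOKENS_PER_S)
t_v2 = decode_seconds(v2_tokens, TOKENS_PER_S)
print(f"v1: {t_v1:.1f} s, v2: {t_v2:.1f} s, saved: {t_v1 - t_v2:.1f} s")
# v1: 21.7 s, v2: 16.5 s, saved: 5.2 s
```

The absolute saving scales with chain length, so long reasoning traces benefit the most.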
Bottom line
For local deployment scenarios where inference speed is the bottleneck, v2 delivers the same coding performance (HumanEval 96.91%) while using fewer tokens, cutting cost and latency, albeit with a modest loss in broad knowledge tasks.
Code accuracy unchanged: HumanEval 96.91%
Reasoning chain shortened 24% → faster generation
Per‑token correctness +31.6%
General knowledge (MMLU‑Pro) down 7.2%
Use the model when you prioritize fast, reliable code or logical reasoning over wide‑ranging conversational ability.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
