Claude‑Opus‑4.6 Distilled Qwen3.5 v2: Faster Reasoning with Same Code Accuracy

The new Claude‑Opus‑4.6 distilled Qwen3.5‑v2 keeps code‑generation accuracy while cutting reasoning length by 24% and boosting per‑token correctness by 31.6%, offering a noticeable speed and cost advantage for local LLM deployment despite a 7.2% drop on MMLU‑Pro.

Old Zhang's AI Learning

What’s new in v2?

The upgrade targets speed and efficiency rather than raw accuracy: the reasoning chain is 24% shorter and correctness per generated token rises by 31.6%, while HumanEval pass@1 is essentially unchanged at 96.91% (v1 scored 96.95%). The only notable regression is a 7.2% drop on MMLU‑Pro.
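
The article does not define "per‑token correctness". One plausible reading, assumed here, is accuracy divided by the number of reasoning tokens. Under that assumption the 31.6% figure follows almost directly from the 24% shorter chain, as this small sketch shows (the function name and the metric definition are illustrative, not the author's):

```python
# Assumed definition (not from the article): per-token correctness is
# accuracy divided by the number of reasoning tokens.
def per_token_gain(accuracy_ratio: float, token_ratio: float) -> float:
    """Relative change in accuracy-per-token, in percent (v2 relative to v1)."""
    return (accuracy_ratio / token_ratio - 1.0) * 100.0


# Case 1: accuracy treated as unchanged, chain 24% shorter.
gain_flat = per_token_gain(accuracy_ratio=1.0, token_ratio=0.76)

# Case 2: use the reported HumanEval scores (96.91 for v2, 96.95 for v1).
gain_humaneval = per_token_gain(accuracy_ratio=96.91 / 96.95, token_ratio=0.76)

print(f"{gain_flat:.1f}%")       # 31.6%
print(f"{gain_humaneval:.1f}%")  # 31.5%
```

If this reading is right, the two headline numbers are not independent: the per‑token gain is mostly a restatement of the shorter chain at near‑constant accuracy.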

How the improvement was achieved

Jackrong fine‑tuned Qwen3.5‑27B with Unsloth and LoRA SFT under a Response‑Only Training regime, in which the loss is computed only on the assistant's output (everything after the "<|im_start|>assistant\n<think>" marker) and the prompt tokens are masked out. The key ingredient is a curated set of ≈14,000 Claude‑4.6 Opus‑style general‑reasoning samples (math, logic, text) that deliberately excludes code questions.

This design teaches the model a more efficient “thinking scaffold”. The resulting reasoning pattern looks like:

Let me analyze this request carefully:

1. Identify the core objective of the problem.
2. Break the task into clearly defined subcomponents.
3. Evaluate constraints and edge cases.
4. Formulate a step‑by‑step solution plan.
5. Execute the reasoning sequentially and verify consistency.

Compared with v1’s verbose chain‑of‑thought, v2 behaves like an experienced engineer who outlines first and then proceeds, yielding a structured, concise output.

Training details

Base model: Qwen3.5‑27B

Framework: Unsloth + LoRA SFT

Method: Response‑Only Training (masking on "<|im_start|>assistant\n<think>")

Data volume: ~14k high‑quality reasoning trajectories

Datasets used:

Opus‑4.6‑Reasoning‑3000x‑filtered

claude‑opus‑4.6‑10000x

claude‑4.5‑opus‑high‑reasoning‑250x

Qwen3.5‑reasoning‑700x
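
As a quick sanity check on the "~14k" figure, the sketch below adds up the four datasets. It assumes that the numeric "Nx" suffix in each dataset name is its sample count, which the article does not state explicitly:

```python
# Assumption: the "Nx" suffix in each dataset name is its sample count.
dataset_sizes = {
    "Opus-4.6-Reasoning-3000x-filtered": 3000,
    "claude-opus-4.6-10000x": 10000,
    "claude-4.5-opus-high-reasoning-250x": 250,
    "Qwen3.5-reasoning-700x": 700,
}

total = sum(dataset_sizes.values())
print(total)  # 13950
```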

Base Model (Qwen3.5‑27B)
  │
  ▼
Qwen3.5‑27B fine‑tuned with Unsloth
  │
  ▼
Supervised Fine‑Tuning (SFT) + LoRA
(Response‑Only Training masked on "<|im_start|>assistant\n<think>")
  │
  ▼
Jackrong/Qwen3.5‑27B‑Claude‑4.6‑Opus‑Reasoning‑Distilled‑v2
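
The Response‑Only Training step in the pipeline above can be illustrated with a minimal, library‑free sketch. It is not the author's training code: the function name and the toy token IDs are hypothetical, and it handles only a single‑turn example. The idea is that every label up to and including the assistant marker is set to -100, the value that PyTorch's cross‑entropy loss ignores by default, so only the response contributes to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss


def mask_prompt_labels(token_ids, marker_ids):
    """Copy token_ids into labels, ignoring everything up to and including
    the last occurrence of marker_ids (the assistant/<think> marker)."""
    labels = list(token_ids)
    n = len(marker_ids)
    # Scan from the end so the last marker (the final assistant turn) wins.
    for start in range(len(token_ids) - n, -1, -1):
        if token_ids[start:start + n] == marker_ids:
            end = start + n
            labels[:end] = [IGNORE_INDEX] * end
            return labels
    # No marker found: nothing in this sample is supervised.
    return [IGNORE_INDEX] * len(token_ids)


# Toy IDs (hypothetical): 3 prompt tokens, a 2-token marker, 3 response tokens.
prompt = [1, 2, 3]
marker = [90, 91]  # stands in for the tokenized "<|im_start|>assistant\n<think>"
response = [10, 11, 12]

labels = mask_prompt_labels(prompt + marker + response, marker)
print(labels)  # [-100, -100, -100, -100, -100, 10, 11, 12]
```

In a real run the marker would be the tokenized form of the string shown in the diagram, and multi‑turn conversations would need each user segment masked separately.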

Trade‑offs

The speed gains come at the cost of general knowledge reasoning: MMLU‑Pro accuracy falls by 7.2%, which the author attributes to the SFT data focusing on generic reasoning rather than long‑context or multi‑step tasks.

Running the model locally

Deployment requirements are unchanged: a 4‑bit quantized Qwen3.5‑27B fits on a single RTX 4090. The GGUF checkpoint is available on Hugging Face (Jackrong/Qwen3.5‑27B‑Claude‑4.6‑Opus‑Reasoning‑Distilled‑v2‑GGUF) and works with LM Studio, llama.cpp, or Ollama.
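
A rough back‑of‑the‑envelope estimate shows why a 4‑bit 27B model fits on a 24 GB card. The sketch assumes a nominal 27 billion parameters and counts weights only; the 4.8 bits‑per‑weight figure is a hypothetical allowance for quantization overhead, and the KV cache and runtime buffers need additional memory on top.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 27e9        # nominal parameter count (assumed from the model name)
RTX_4090_VRAM_GB = 24  # VRAM on an RTX 4090

ideal = weight_memory_gb(N_PARAMS, 4.0)   # exactly 4 bits per weight
padded = weight_memory_gb(N_PARAMS, 4.8)  # hypothetical effective bits per weight

print(f"ideal 4-bit weights: {ideal:.1f} GB")    # 13.5 GB
print(f"with overhead:       {padded:.1f} GB")   # 16.2 GB
print(f"headroom on a 4090:  {RTX_4090_VRAM_GB - padded:.1f} GB")  # 7.8 GB
```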

In the author’s tests, the previous v1 achieved ~46 tokens/s on a 4090. Because v2 emits a reasoning chain about 24% shorter, a response finishes sooner on the same hardware.
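
To make the latency effect concrete, the sketch below converts the shorter chain into wall‑clock time. It assumes v2 decodes at the same ~46 tokens/s reported for v1 (same architecture and quantization), and the 2,000‑token chain length is a hypothetical example, not a measured value.

```python
TOKENS_PER_SECOND = 46.0  # v1 throughput reported on an RTX 4090
CHAIN_REDUCTION = 0.24    # v2 reasoning chain is 24% shorter


def generation_seconds(n_tokens: float, tokens_per_second: float = TOKENS_PER_SECOND) -> float:
    """Time to decode n_tokens at a fixed throughput."""
    return n_tokens / tokens_per_second


v1_tokens = 2000                               # hypothetical v1 reasoning length
v2_tokens = v1_tokens * (1 - CHAIN_REDUCTION)  # 1520 tokens

v1_seconds = generation_seconds(v1_tokens)
v2_seconds = generation_seconds(v2_tokens)

print(f"v1: {v1_seconds:.1f} s")                 # v1: 43.5 s
print(f"v2: {v2_seconds:.1f} s")                 # v2: 33.0 s
print(f"saved: {v1_seconds - v2_seconds:.1f} s") # saved: 10.4 s
```

At a fixed throughput the time saved scales linearly with the number of tokens removed, so the 24% reduction applies to the reasoning portion of the response regardless of the chain length chosen here.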

Bottom line

For local deployment scenarios where inference speed is the bottleneck, v2 delivers the same coding performance (HumanEval 96.91%) while using fewer tokens, cutting cost and latency, albeit with a modest loss in broad knowledge tasks.

Code accuracy unchanged: HumanEval 96.91%

Reasoning chain shortened 24% → faster generation

Per‑token correctness +31.6%

General knowledge (MMLU‑Pro) down 7.2%

Use the model when you prioritize fast, reliable code generation or logical reasoning over broad conversational ability.
