Qwen3.6-35B: 4‑bit Quantization, DFlash Speedup, Claude Opus Distillation
This article reviews three optimization paths for the Qwen3.6‑35B model: 4‑bit quantized variants (AWQ and NVFP4), the DFlash speculative decoding accelerator, and a Claude Opus‑based distillation. It covers their implementation steps, benchmark results, and guidance on selecting the right version for different hardware and performance needs.
First Path: Three 4‑bit Quantization Versions
The community quickly produced three 4‑bit quantized versions of Qwen3.6‑35B (two AWQ, one NVFP4) to reduce VRAM requirements so consumer‑grade GPUs can run the model.
cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit: Uses Activation‑aware Weight Quantization (AWQ) and can be launched directly with vllm 0.19. A user reported that two RTX 4060 GPUs achieve 83 tokens/s.
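A minimal launch for this variant would look like the following; the tensor‑parallel value of 2 mirrors the reported dual‑GPU setup and is an assumption, not a documented configuration:

vllm serve cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit \
--tensor-parallel-size 2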
QuantTrio/Qwen3.6-35B-A3B-AWQ: Also AWQ‑based, model size ~24 GB. Provides detailed vllm launch scripts with Multi‑Token Prediction (MTP) support. Example launch command:
vllm serve QuantTrio/Qwen3.6-35B-A3B-AWQ \
--served-model-name MY_MODEL \
--max-model-len 32768 \
--gpu-memory-utilization 0.9 \
--tensor-parallel-size 4 \
--enable-expert-parallel \
--reasoning-parser qwen3 \
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' \
--trust-remote-code
Note: when TP=8, add --enable-expert-parallel to avoid uneven expert parameter sharding.
RedHatAI/Qwen3.6-35B-A3B-NVFP4: Uses NVFP4 format (weights + activations quantized to FP4) via the llm‑compressor tool; a recipe sketch follows this item. Preliminary evaluation shows a slight accuracy increase:
GSM8K Platinum accuracy: 96.28 % (vs. 95.62 % for the base model)
Recovery rate: 100.69 % (quantized score ÷ base score; above 100 % means the quantized model slightly outscored the base)
The Red Hat team notes these results are initial and more rigorous testing is ongoing.
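For reference, NVFP4 checkpoints of this kind are typically produced with llm‑compressor's one‑shot flow. The sketch below follows that tool's published recipe pattern; the calibration dataset, sample count, and output path are assumptions, not Red Hat's actual script:

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# One-shot NVFP4 recipe: quantize Linear-layer weights and activations
# to FP4, keeping the output head in higher precision.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head"],
)

oneshot(
    model="Qwen/Qwen3.6-35B-A3B",      # base model from the article
    recipe=recipe,
    dataset="open_platypus",           # assumed calibration set
    num_calibration_samples=512,       # assumed sample count
    output_dir="Qwen3.6-35B-A3B-NVFP4",
)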
All three quantized versions are compatible with vllm 0.19+ and can be started with vllm serve.
Second Path: DFlash Inference Acceleration
DFlash is a speculative decoding method based on Block Diffusion. Unlike traditional draft models that predict one token at a time, DFlash generates an entire block of tokens in parallel by injecting contextual features from the target model into each layer’s KV cache.
Benchmarks on Qwen‑8B show substantial speedups, all lossless (output identical to the original model):
GSM8K: 5.20× faster (vs. 2.13× for EAGLE‑3)
MATH‑500: 6.17× faster (vs. 2.18× for EAGLE‑3)
HumanEval: 5.20× faster (vs. 2.48× for EAGLE‑3)
MBPP: 4.75× faster (vs. 2.27× for EAGLE‑3)
Typical “accept length” (the average number of draft tokens the target model accepts per verification step) is higher for DFlash, which is what drives the larger speedups (a rough cost model follows the list):
GSM8K: 6.5
MATH‑500: 7.2
HumanEval: 6.2
MBPP: 5.6
MT‑Bench: 5.0
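As a rough sanity check on how accept length maps to speedup: each verification pass of the target model now emits roughly "accept length" tokens instead of one, minus the cost of drafting the block. A back‑of‑envelope model in Python (the draft‑overhead ratio is an assumed illustrative value, not a measured one):

def estimated_speedup(accept_length: float, draft_overhead: float = 0.25) -> float:
    # Each target-model verification pass yields `accept_length` tokens
    # instead of 1; drafting adds `draft_overhead` of one target forward
    # pass per step (assumed value, chosen so the GSM8K numbers line up).
    return accept_length / (1.0 + draft_overhead)

print(estimated_speedup(6.5))  # GSM8K: 6.5 / 1.25 = 5.2x, matching the reported 5.20x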
Running DFlash with vllm requires a single command, for example:
vllm serve Qwen/Qwen3.6-35B-A3B \
--speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-35B-A3B-DFlash", "num_speculative_tokens": 15}' \
--attention-backend flash_attn \
--max-num-batched-tokens 32768
SGLang also supports DFlash with a similar launch command.
Third Path: Claude Opus 4.6 Distillation
The community adapted Jackrong’s Claude‑Opus distillation pipeline from Qwen 3.5 to Qwen 3.6. The goal is to keep Qwen 3.6’s strong agentic coding base while injecting Claude Opus‑style structured reasoning.
Training uses LoRA applied only to the attention modules, with the following configuration (a peft‑style sketch follows the list):
LoRA rank / alpha: 32 / 32
Gradient accumulation steps: 32
Training epochs: 2
Final training loss: 0.336
Maximum sequence length: 32768
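Expressed with the peft and transformers libraries, that configuration would look roughly like this; the target module names and output path are assumptions, since the original training script is not reproduced in the article:

from peft import LoraConfig
from transformers import TrainingArguments

# Attention-only LoRA matching the listed rank/alpha. The module names
# follow Qwen's standard attention projections (an assumption).
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# Schedule matching the listed accumulation steps and epochs;
# the output path is hypothetical.
training_args = TrainingArguments(
    output_dir="qwen36-opus-distill",
    gradient_accumulation_steps=32,
    num_train_epochs=2,
)
# The 32768-token maximum sequence length is enforced at tokenization/
# packing time rather than via TrainingArguments.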
Training data (~14 000 samples) comes from three public datasets (a loading sketch follows the list):
nohurry/Opus‑4.6‑Reasoning‑3000x‑filtered (3 900 samples of Claude Opus reasoning traces)
Jackrong/Qwen3.5‑reasoning‑700x (700 curated Qwen reasoning samples)
Roman1111111/claude‑opus‑4.6‑10000x (9 633 additional Claude Opus reasoning examples)
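Loading and mixing the three sets with the datasets library would look roughly like this; the split name and shuffle seed are assumptions:

from datasets import load_dataset, concatenate_datasets

# Concatenate the three reasoning sets into a single ~14k-sample mix.
mix = concatenate_datasets([
    load_dataset("nohurry/Opus-4.6-Reasoning-3000x-filtered", split="train"),
    load_dataset("Jackrong/Qwen3.5-reasoning-700x", split="train"),
    load_dataset("Roman1111111/claude-opus-4.6-10000x", split="train"),
]).shuffle(seed=42)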
Preliminary evaluation on the MMLU‑Pro 70‑question subset shows a notable boost:
Base model accuracy: 42.86 % (30/70 correct)
Distilled model accuracy: 75.71 % (53/70 correct; ↑ 32.85 percentage points)
The authors caution that this is a small‑scale smoke test and not a full benchmark.
The distillation was performed with pure text data; no visual or video data were used, so multimodal capabilities remain those of the base model.
Choosing the Right Variant
Based on different user needs, the article recommends:
Limited VRAM, want to run Qwen 3.6: Use an AWQ or NVFP4 quantized version.
Prioritize inference speed and can afford extra VRAM: Use the DFlash accelerated version.
Need stronger reasoning/analysis ability: Use the Claude Opus distilled version.
Want both speed and low VRAM: Consider combining quantization with DFlash; this is theoretically possible but pending verification (a hypothetical command follows the list).
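If the combination does work, the launch would presumably just point the DFlash speculative config at a quantized checkpoint, reusing the flags shown earlier. An untested sketch, not a verified configuration:

vllm serve QuantTrio/Qwen3.6-35B-A3B-AWQ \
--speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-35B-A3B-DFlash", "num_speculative_tokens": 15}' \
--attention-backend flash_attn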
The three routes are complementary: quantization addresses “run‑ability”, DFlash addresses “run‑fast”, and distillation addresses “run‑well”.
Author’s Perspective
Benchmark data are still limited; the distilled version was only tested on 70 MMLU‑Pro questions, the NVFP4 version has a single GSM8K score, and the quantized versions lack independent evaluations.
The DFlash draft model is still in training (≈2000 steps), so current performance numbers may improve.
Qwen 3.6’s base model is newly released; official benchmarks look strong but real‑world performance remains to be validated.
Overall, the open‑source AI community has built a full optimization chain—quantization → acceleration → distillation—around a single model, demonstrating collaborative efficiency that may be more noteworthy than any single model release.