Qwen3.6-35B: 4‑bit Quantization, DFlash Speedup, Claude Opus Distillation

The article reviews three optimization paths for the Qwen3.6‑35B model: four‑bit AWQ quantization variants, the DFlash speculative‑decoding accelerator, and a Claude Opus‑based distillation. For each path it covers implementation steps, benchmark results, and guidance on choosing the right variant for different hardware and performance needs.

Old Zhang's AI Learning

First Path: Three 4‑bit Quantization Versions

The community quickly produced three 4‑bit AWQ quantized versions of Qwen3.6‑35B to reduce VRAM requirements so consumer‑grade GPUs can run the model.
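AWQ itself chooses quantization scales using activation statistics, which is beyond this article's scope, but the storage format these variants share (group‑wise 4‑bit integers with per‑group floating‑point scales) can be sketched in a few lines. This is a generic illustration, not the AWQ algorithm:

```python
def quantize_4bit(weights, group_size=4):
    """Group-wise symmetric 4-bit quantization sketch.

    Each group of weights shares one float scale; values are stored as
    integers in [-8, 7]. Generic illustration only: AWQ additionally
    picks scales from activation statistics, which is omitted here.
    """
    quantized, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / 7 or 1.0  # avoid div-by-zero
        scales.append(scale)
        quantized.append([max(-8, min(7, round(w / scale))) for w in group])
    return quantized, scales

def dequantize_4bit(quantized, scales):
    """Reconstruct approximate weights from int4 values and scales."""
    return [q * s for group, s in zip(quantized, scales) for q in group]

w = [0.12, -0.50, 0.33, 0.07, 1.40, -0.90, 0.05, 0.61]
q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s)
# Per-element reconstruction error is at most half a quantization step.
```

The VRAM saving comes from storing 4‑bit integers plus a small per‑group scale instead of 16‑bit weights, roughly a 4× reduction in weight memory.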

cyankiwi/Qwen3.6-35B-A3B-AWQ-4bit : Uses Activation‑aware Weight Quantization (AWQ) and can be launched directly with vllm 0.19. A user reported that two RTX 4060 GPUs achieve 83 tokens/s.

QuantTrio/Qwen3.6-35B-A3B-AWQ : Also AWQ‑based, model size ~24 GB. Provides detailed vllm launch scripts with Multi‑Token Prediction (MTP) support. Example launch command:

vllm serve QuantTrio/Qwen3.6-35B-A3B-AWQ \
    --served-model-name MY_MODEL \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 4 \
    --enable-expert-parallel \
    --reasoning-parser qwen3 \
    --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' \
    --trust-remote-code

Note: when TP=8, add --enable-expert-parallel to avoid uneven expert parameter sharding.
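The --speculative-config flag takes a raw JSON string, which is easy to mangle with shell quoting. A small helper (the function name is ours, not part of vllm) for building it programmatically:

```python
import json

def speculative_config(method: str, num_speculative_tokens: int, **extra) -> str:
    """Build the JSON string passed to vllm's --speculative-config flag.

    Hypothetical convenience helper; the JSON keys match the launch
    command shown above.
    """
    return json.dumps({"method": method,
                       "num_speculative_tokens": num_speculative_tokens,
                       **extra})

cfg = speculative_config("qwen3_next_mtp", 2)
# Round-trips cleanly through a JSON parser:
assert json.loads(cfg)["num_speculative_tokens"] == 2
```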

RedHatAI/Qwen3.6-35B-A3B-NVFP4 : Uses NVFP4 format (weights + activations quantized to FP4) via the llm‑compressor tool. Preliminary evaluation shows a slight accuracy increase:

GSM8K Platinum accuracy: 96.28 % (vs. 95.62 % for the base model)

Recovery rate: 100.69 %

The Red Hat team notes these results are initial and more rigorous testing is ongoing.
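The reported recovery rate is consistent with reading it as quantized accuracy relative to the base model (our interpretation, not defined in the article):

```python
base_acc = 95.62   # GSM8K Platinum, base model (%)
quant_acc = 96.28  # GSM8K Platinum, NVFP4 model (%)

# Recovery rate read as quantized accuracy / base accuracy, in percent;
# a value above 100% means the quantized model scored slightly higher.
recovery = round(quant_acc / base_acc * 100, 2)
print(recovery)  # 100.69
```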

All three quantized versions are compatible with vllm 0.19+ and can be started with vllm serve.

Second Path: DFlash Inference Acceleration

DFlash is a speculative decoding method based on Block Diffusion. Unlike traditional draft models that predict one token at a time, DFlash generates an entire block of tokens in parallel by injecting contextual features from the target model into each layer’s KV cache.
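The accept/verify loop that makes speculative decoding lossless can be sketched generically. The toy below is not DFlash's block‑diffusion drafter, only the verification logic that guarantees the output matches plain target‑model decoding:

```python
def speculative_step(draft_block, verify):
    """One verification step of speculative decoding (generic sketch).

    `draft_block` is a block of draft tokens proposed in parallel, as a
    block drafter like DFlash would produce. `verify(prefix, token)`
    stands in for the target model: True iff it would emit `token`
    after `prefix`. Tokens are accepted left to right until the first
    mismatch, so the final output equals plain target-model decoding.
    (Real implementations also keep the target's own token at the
    mismatch position; omitted here for brevity.)
    """
    accepted = []
    for tok in draft_block:
        if verify(accepted, tok):
            accepted.append(tok)
        else:
            break
    return accepted

# Toy "target model": always continues the sequence 1, 2, 3, ...
verify = lambda prefix, tok: tok == len(prefix) + 1
print(speculative_step([1, 2, 3, 9, 5], verify))  # [1, 2, 3]
```

The speedup comes from verifying a whole block with one target‑model forward pass instead of one pass per token.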

Benchmarks on Qwen‑8B show substantial speedups (all lossless: the output is identical to the original model's):

GSM8K: 5.20× faster (vs. 2.13× for EAGLE‑3)

MATH‑500: 6.17× faster (vs. 2.18× for EAGLE‑3)

HumanEval: 5.20× faster (vs. 2.48× for EAGLE‑3)

MBPP: 4.75× faster (vs. 2.27× for EAGLE‑3)

Typical “accept length” (the number of draft tokens accepted per verification step) is high for DFlash across benchmarks, which is what drives the larger speed gains:

GSM8K: 6.5

Math500: 7.2

HumanEval: 6.2

MBPP: 5.6

MT‑Bench: 5.0
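Accept length bounds the achievable speedup: if drafting were free, each target‑model pass would yield the accepted draft tokens plus one bonus token. A quick sanity check of the reported numbers against this simplified model (our assumption, not DFlash's cost model):

```python
def speedup_upper_bound(accept_length: float) -> float:
    """Ideal speedup if draft generation cost nothing: each target-model
    forward pass verifies `accept_length` draft tokens and emits one
    bonus token, so at most accept_length + 1 tokens per pass.
    Real speedups are lower because the drafter is not free."""
    return accept_length + 1

# Reported DFlash accept lengths vs. measured speedups from above.
pairs = {"GSM8K": (6.5, 5.20), "MATH-500": (7.2, 6.17),
         "HumanEval": (6.2, 5.20), "MBPP": (5.6, 4.75)}
for name, (accept, measured) in pairs.items():
    assert measured <= speedup_upper_bound(accept)
```

Every measured speedup sits below its accept‑length bound, as expected once drafting overhead is accounted for.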

Running DFlash with vllm requires a single command, for example:

vllm serve Qwen/Qwen3.6-35B-A3B \
  --speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-35B-A3B-DFlash", "num_speculative_tokens": 15}' \
  --attention-backend flash_attn \
  --max-num-batched-tokens 32768

SGLang also supports DFlash with a similar launch command.

Third Path: Claude Opus 4.6 Distillation

The community adapted Jackrong’s Claude‑Opus distillation pipeline from Qwen 3.5 to Qwen 3.6. The goal is to keep Qwen 3.6’s strong agentic coding base while injecting Claude Opus‑style structured reasoning.

Training uses LoRA (only the Attention modules) with the following configuration:

LoRA rank / alpha: 32 / 32

Gradient accumulation steps: 32

Training epochs: 2

Final training loss: 0.336

Maximum sequence length: 32768
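For reference, merging a LoRA adapter applies W' = W + (alpha / rank) * B @ A; with the rank/alpha of 32/32 above, the scale factor is exactly 1.0. A dependency‑free sketch of the merge (illustrative, not the training pipeline itself):

```python
def matmul(A, B):
    """Naive matrix multiply, just to keep the sketch dependency-free."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_effective_weight(W, A, B, rank, alpha):
    """Effective weight after merging a LoRA adapter:
    W' = W + (alpha / rank) * B @ A.
    With the article's rank=32, alpha=32, the scale is exactly 1.0."""
    scale = alpha / rank
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(wr, dr)]
            for wr, dr in zip(W, delta)]

# Tiny example: 2x2 base weight, rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[0.5, -0.5]]   # rank x in_features
B = [[1.0], [2.0]]  # out_features x rank
W_prime = lora_effective_weight(W, A, B, rank=32, alpha=32)
```

Restricting LoRA to the attention modules, as done here, keeps the trainable parameter count small while still steering the model's reasoning style.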

Training data (~14 000 samples) comes from three public datasets:

nohurry/Opus‑4.6‑Reasoning‑3000x‑filtered (3 900 samples of Claude Opus reasoning traces)

Jackrong/Qwen3.5‑reasoning‑700x (700 curated Qwen reasoning samples)

Roman1111111/claude‑opus‑4.6‑10000x (9 633 additional Claude Opus reasoning examples)
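A quick sanity check that the per‑dataset counts add up to the quoted total:

```python
# Sample counts per source dataset, as reported above.
datasets = {
    "nohurry/Opus-4.6-Reasoning-3000x-filtered": 3_900,
    "Jackrong/Qwen3.5-reasoning-700x": 700,
    "Roman1111111/claude-opus-4.6-10000x": 9_633,
}
total = sum(datasets.values())
print(total)  # 14233, i.e. the "~14 000 samples" cited above
```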

Preliminary evaluation on the MMLU‑Pro 70‑question subset shows a notable boost:

Base model accuracy: 42.86 %

Distilled model accuracy: 75.71 % (↑ 32.85 percentage points)

The authors caution that this is a small‑scale smoke test and not a full benchmark.
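The article gives only percentages; on a 70‑question subset they are consistent with 30 and 53 correct answers respectively (our inference, the raw counts are not stated):

```python
QUESTIONS = 70  # size of the MMLU-Pro subset used for the smoke test

# Raw counts inferred from the reported percentages, not stated in the
# article: 30/70 and 53/70 reproduce them exactly to two decimals.
base_correct, distilled_correct = 30, 53

base_acc = round(base_correct / QUESTIONS * 100, 2)            # 42.86
distilled_acc = round(distilled_correct / QUESTIONS * 100, 2)  # 75.71
delta_pp = round(distilled_acc - base_acc, 2)                  # 32.85
```

The small denominator is exactly why the authors call this a smoke test: each question is worth about 1.43 percentage points.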

The distillation was performed with pure text data; no visual or video data were used, so multimodal capabilities remain those of the base model.

Choosing the Right Variant

Based on different user needs, the article recommends:

Limited VRAM, want to run Qwen 3.6: Use an AWQ or NVFP4 quantized version.

Prioritize inference speed and can afford extra VRAM: Use the DFlash accelerated version.

Need stronger reasoning/analysis ability: Use the Claude Opus distilled version.

Want both speed and low VRAM: Consider combining quantization with DFlash (theoretically possible, pending verification).

The three routes are complementary: quantization addresses “run‑ability”, DFlash addresses “run‑fast”, and distillation addresses “run‑well”.

Author’s Perspective

Benchmark data are still limited; the distilled version was only tested on 70 MMLU‑Pro questions, the NVFP4 version has a single GSM8K score, and the quantized versions lack independent evaluations.

The DFlash draft model is still in training (≈2000 steps), so current performance numbers may improve.

Qwen 3.6’s base model is newly released; official benchmarks look strong but real‑world performance remains to be validated.

Overall, the open‑source AI community has built a full optimization chain—quantization → acceleration → distillation—around a single model, demonstrating collaborative efficiency that may be more noteworthy than any single model release.

Tags: AI, Quantization, vLLM, Distillation, Qwen3.6, DFlash
Written by

Old Zhang's AI Learning

AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
