Qwen3.6-27B Open‑Source: How a 27B Dense Model Outperforms the 397B Giant
The newly released Qwen3.6-27B is a dense multimodal model that, at just 27 B parameters, surpasses the 397 B flagship on most coding benchmarks, offers up to 1 M tokens of context, supports FP8 quantization, and can be deployed locally via vLLM, SGLang, or Transformers on modest hardware.
Model Overview
Qwen3.6-27B is a 27 B dense multimodal model that surpasses the previous open‑source flagship Qwen3.5‑397B‑A17B on most coding benchmarks.
SWE‑bench Verified: 77.2 (vs 76.2 for 3.5‑397B)
SWE‑bench Pro: 53.5 (vs 50.9)
Terminal‑Bench 2.0: 59.3 (vs 52.5)
SkillsBench Avg5: 48.2 (vs 30.0)
GPQA Diamond: 87.8
AIME 2026: 94.1
Compared with the closed‑source Claude 4.5 Opus, the gap on coding benchmarks is 1‑5 points and Terminal‑Bench scores are identical (59.3).
Key Advantages
Agentic coding: On real‑world coding tasks, especially front‑end and repository‑level work, it outperforms Claude.
Thinking preservation: Multi‑turn conversations retain the reasoning context, so the model does not repeat its "thinking" during iterative coding.
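Here is a minimal multi‑turn sketch of that workflow against the OpenAI‑compatible endpoint set up later in this article (the localhost URL, model name, and prompts below are illustrative assumptions): the earlier assistant turn is sent back with each request, so the model can build on its previous answer instead of re‑deriving it.
from openai import OpenAI

# Assumes a local vLLM/SGLang server as configured in the deployment section below.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
messages = [{"role": "user", "content": "Write a Python function that parses ISO-8601 dates."}]

first = client.chat.completions.create(model="Qwen/Qwen3.6-27B", messages=messages)
# Feed the assistant's answer back so the next turn keeps the full coding context.
messages.append({"role": "assistant", "content": first.choices[0].message.content})
messages.append({"role": "user", "content": "Now add timezone handling to the same function."})

second = client.chat.completions.create(model="Qwen/Qwen3.6-27B", messages=messages)
print(second.choices[0].message.content)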
Architecture
Parameters: 27 B dense (no MoE)
Layers: 64, hidden dimension 5120
Native context length: 262 144 tokens, extendable to 1 010 000 tokens
Hidden layout:
16 × (3 × (Gated DeltaNet → FFN) → 1 × (Gated Attention → FFN))
Multimodal vision encoder supports images, video, and documents
Supports MTP (Multi‑Token Prediction) for inference speedup
Gated DeltaNet + Gated Attention mix is more memory‑friendly than pure attention for long contexts.
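As a small illustration of the layout above (names and counts are taken only from this section, not from the model's actual code), the 64 layers follow a repeating 3‑plus‑1 pattern:
# Illustrative sketch of the published layer layout, not the real implementation.
pattern = (["Gated DeltaNet + FFN"] * 3 + ["Gated Attention + FFN"]) * 16

assert len(pattern) == 64  # matches the layer count listed above
for i, layer in enumerate(pattern):
    print(f"layer {i:02d}: {layer}")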
FP8 Quantized Version
The FP8 weight file is ~30 GB. Using the Qwen/Qwen3.6-27B-FP8 checkpoint halves memory usage while performance loss is reported as negligible.
Why 27 B Is a Sweet Spot
Easy deployment: Dense architecture works directly with vLLM or SGLang without expert parallelism.
Moderate hardware requirements: BF16 needs ~54 GB of VRAM (e.g., 2 × A100 40 GB, 1 × H100 80 GB, or 4 × RTX 4090); FP8 needs ~27 GB (a single 48 GB L40S or A6000). A rough back‑of‑envelope estimate follows this list.
No capability compromise: Benchmarks show it outperforms the 397 B model.
Fully open weights: Available on Hugging Face and ModelScope for unrestricted commercial use.
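The VRAM figures above come from simple weight‑size arithmetic; a rough sketch (weights only, so the KV cache, activations, and runtime overhead still add to these numbers):
# Back-of-envelope weight memory: 27 B parameters at 2 bytes (BF16) or 1 byte (FP8).
params = 27e9

for name, bytes_per_param in [("BF16", 2), ("FP8", 1)]:
    gb = params * bytes_per_param / 1e9
    print(f"{name}: ~{gb:.0f} GB of weights")  # ~54 GB and ~27 GB, matching the figures above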
Local Deployment Options
Officially supported routes: vLLM, SGLang, and Hugging Face Transformers. KTransformers also supports CPU‑GPU heterogeneous inference.
vLLM Deployment (recommended)
uv pip install vllm --torch-backend=auto
vllm serve Qwen/Qwen3.6-27B \
--port 8000 \
--tensor-parallel-size 8 \
--max-model-len 262144 \
--reasoning-parser qwen3
With tool‑call support (required for coding agents):
vllm serve Qwen/Qwen3.6-27B \
--port 8000 \
--tensor-parallel-size 8 \
--max-model-len 262144 \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_coder
Enable MTP (speculative decoding):
vllm serve Qwen/Qwen3.6-27B \
--port 8000 \
--tensor-parallel-size 8 \
--max-model-len 262144 \
--reasoning-parser qwen3 \
--speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'
Text‑only mode (drops the vision encoder):
vllm serve Qwen/Qwen3.6-27B \
--port 8000 \
--tensor-parallel-size 8 \
--max-model-len 262144 \
--reasoning-parser qwen3 \
--language-model-only
OOM tip: If out‑of‑memory occurs, do not reduce the context length below 128 K; the model's thinking ability degrades sharply.
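Once any of the variants above is running, a quick sanity check is to list the served models through the standard OpenAI‑compatible endpoint (URL and port follow the launch commands above):
from openai import OpenAI

# Point at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
print([m.id for m in client.models.list().data])  # should include "Qwen/Qwen3.6-27B"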
SGLang Deployment
uv pip install sglang[all]
python -m sglang.launch_server \
--model-path Qwen/Qwen3.6-27B \
--port 8000 \
--tp-size 8 \
--mem-fraction-static 0.8 \
--context-length 262144 \
--reasoning-parser qwen3
With tool use:
python -m sglang.launch_server \
--model-path Qwen/Qwen3.6-27B \
--port 8000 \
--tp-size 8 \
--mem-fraction-static 0.8 \
--context-length 262144 \
--reasoning-parser qwen3 \
--tool-call-parser qwen3_coder
Enable speculative MTP:
python -m sglang.launch_server \
--model-path Qwen/Qwen3.6-27B \
--port 8000 \
--tp-size 8 \
--mem-fraction-static 0.8 \
--context-length 262144 \
--reasoning-parser qwen3 \
--speculative-algo NEXTN \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4
Transformers Lightweight Deployment (testing only)
pip install "transformers[serving]"
transformers serve Qwen/Qwen3.6-27B --port 8000 --continuous-batching
This option is suitable for experiments; production should use vLLM or SGLang.
FP8 Quantized Model
Replace the model name with Qwen/Qwen3.6-27B-FP8 and keep the same launch parameters. Example with reduced tensor parallel size:
vllm serve Qwen/Qwen3.6-27B-FP8 \
--port 8000 \
--tensor-parallel-size 2 \
--max-model-len 131072 \
--reasoning-parser qwen3
Sampling Parameters (official recommendations)
General thinking mode: temperature=1.0, top_p=0.95, top_k=20, presence_penalty=0.0
Precise coding (e.g., WebDev): temperature=0.6, top_p=0.95, top_k=20
Non‑thinking mode: temperature=0.7, top_p=0.80, top_k=20, presence_penalty=1.5
OpenAI‑Compatible API Usage
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
messages = [{"role": "user", "content": "Type \"I love Qwen3.6\" backwards"}]
resp = client.chat.completions.create(
model="Qwen/Qwen3.6-27B",
messages=messages,
max_tokens=81920,
temperature=1.0,
top_p=0.95,
presence_penalty=0.0,
extra_body={"top_k": 20},
)
print(resp)
When thinking mode is enabled, responses include <think>...</think> blocks; switch to the non‑thinking sampling parameters to suppress them.
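If you keep thinking mode on but only want the final answer downstream, one option is to strip the block client‑side; a minimal sketch reusing resp from the example above (the regex assumes the reasoning sits in a single <think>...</think> pair):
import re

def strip_thinking(text: str) -> str:
    # Remove the <think>...</think> block, if present, and return only the final answer.
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()

print(strip_thinking(resp.choices[0].message.content))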
Multimodal Request Example
messages = [{
"role": "user",
"content": [
{"type": "image_url", "image_url": {"url": "https://your-image-url.jpg"}},
{"type": "text", "text": "这张图里有几个圆?"}
]
}]
resp = client.chat.completions.create(
model="Qwen/Qwen3.6-27B",
messages=messages,
max_tokens=81920,
temperature=1.0,
top_p=0.95,
extra_body={"top_k": 20},
)
For video input, replace the type field with video_url.
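A sketch of such a video request, assuming the payload mirrors the image example with video_url substituted (the URL is a placeholder):
messages = [{
    "role": "user",
    "content": [
        {"type": "video_url", "video_url": {"url": "https://your-video-url.mp4"}},
        {"type": "text", "text": "Summarize what happens in this clip."}
    ]
}]
resp = client.chat.completions.create(
    model="Qwen/Qwen3.6-27B",
    messages=messages,
    max_tokens=81920,
    temperature=1.0,
    top_p=0.95,
    extra_body={"top_k": 20},
)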
Pros and Cons
Pros:
27 B dense size enables friendly deployment.
Agentic coding ability surpasses the 397 B MoE model.
Native 262 K context, extendable to 1 M tokens.
Multimodal + text capabilities in a single model.
FP8 quantized version halves memory requirements.
Full‑stack support: vLLM, SGLang, Transformers, KTransformers.
Cons:
Very hard reasoning tasks (e.g., HLE) still favor the 397 B model or Claude 4.5 Opus.
Default thinking mode adds latency; latency‑sensitive production may need to disable it.
Context length should not be reduced below 128 K, otherwise thinking degrades.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.