Qwen3.5 Small Models Unveiled: From 0.8B to 9B with Full Capabilities
The article introduces the newly released Qwen3.5 small model series (0.8B, 2B, 4B, 9B), explains their shared Gated Delta Networks architecture, early multimodal token fusion, 201‑language support and up to 1 million‑token context, and presents benchmark data that show the 9B model rivaling much larger LLMs, followed by practical guidance on model selection and deployment.
Release of Qwen3.5 Small Model Series
Today the Qwen team announced the full line of Qwen3.5 small models—0.8B, 2B, 4B, and 9B—along with matching Base versions. All models support visual input (Image‑Text‑to‑Text), provide Instruct and Base variants, and are completely open‑source.
Model lineup
Qwen3.5‑0.8B: 0.9B parameters, dense architecture, targeted at ultra‑lightweight edge devices.
Qwen3.5‑2B: 2B parameters, dense, suited for mobile, IoT, and other on‑device scenarios.
Qwen3.5‑4B: 5B parameters, dense, designed for lightweight multimodal applications and micro‑agents.
Qwen3.5‑9B: 10B parameters, dense, positioned as a high‑performance yet cost‑effective model.
Qwen3.5‑27B: 28B parameters, dense, the recommended “stable” choice for local deployment.
Qwen3.5‑35B‑A3B and larger MoE variants (122B‑A10B, 397B‑A17B) are also listed for reference.
Architectural highlights (not a scaled‑down version)
The four new small models inherit the same architecture as the larger Qwen3.5 family:
Gated Delta Networks: For the 9B model the layer stack follows the pattern 8 × (3 × (Gated DeltaNet → FFN) → 1 × (Gated Attention → FFN)): every block of four layers contains three linear‑attention (Gated DeltaNet) layers and one full‑attention layer, dramatically reducing memory and compute for long‑context inference (see the sketch after this list).
Native multimodal early fusion: Visual tokens are fused during pre‑training rather than added later, giving multimodal training efficiency close to pure‑text training.
201‑language support: The models cover 201 languages, facilitating global deployment.
Extended context length: The 9B model natively handles 262,144 tokens and can be stretched to over 1 million tokens.
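To make the 3:1 interleave concrete, here is a minimal Python sketch of the resulting 32‑layer schedule for the 9B model; the layer‑type names are hypothetical labels for illustration, not Qwen's actual module names.

```python
# Sketch of the 9B model's hybrid layer schedule, derived from the
# "8 x (3 x (Gated DeltaNet -> FFN) -> 1 x (Gated Attention -> FFN))" pattern
# described above. Layer-type names are placeholders, not the real implementation.

NUM_BLOCKS = 8          # repeating macro-blocks in the 9B stack
LINEAR_PER_BLOCK = 3    # Gated DeltaNet (linear attention) layers per block
FULL_PER_BLOCK = 1      # full (Gated) attention layers per block

def build_layer_schedule():
    """Return the attention type used by each of the 32 transformer layers."""
    schedule = []
    for _ in range(NUM_BLOCKS):
        schedule.extend(["gated_deltanet"] * LINEAR_PER_BLOCK)
        schedule.extend(["gated_attention"] * FULL_PER_BLOCK)
    return schedule

schedule = build_layer_schedule()
assert len(schedule) == 32
assert schedule.count("gated_deltanet") == 24  # 3 of every 4 layers are linear
assert schedule.count("gated_attention") == 8  # 1 of every 4 layers is full attention
```

The practical consequence is that only 8 of the 32 layers keep a key‑value cache that grows with sequence length; the linear‑attention layers maintain constant‑size state, which is where the long‑context memory savings come from.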
Benchmark results: How strong are the 9B and 4B models?
Official benchmarks highlight several surprising numbers:
GPQA Diamond: Qwen3.5‑9B scores 81.7, surpassing the 120B GPT‑OSS model (80.1).
Long‑context ability: AA‑LCR 63.0 and LongBench v2 55.2, leading all compared models.
HMMT math competition: the 9B model achieves 83.2, behind only GPT‑OSS‑120B (90.0) and well above Qwen3‑Next‑80B‑A3B‑Thinking (73.7).
Agent performance: BFCL‑V4 66.1 and TAU2‑Bench 79.1, far exceeding the 80B‑scale baseline.
4B surprise: TAU2‑Bench 79.9, marginally higher than the 9B score of 79.1, showing strong agent capability for its size.
Choosing the right small model
Edge / IoT devices → Qwen3.5‑0.8B (sub‑1B parameters, extremely lightweight).
On‑device agents / simple QA → Qwen3.5‑2B (2B parameters, fast inference).
Lightweight multimodal or micro‑agent apps → Qwen3.5‑4B (5B parameters, native vision).
Cost‑effective local deployment → Qwen3.5‑9B (comparable overall ability to 27B).
Stable, high‑capacity local deployment → Qwen3.5‑27B (recommended if GPU memory permits).
9B vs 27B: Is the larger model worth it?
Whether to adopt 27B depends on the scenario:
Inference & math: 9B already matches 27B on most benchmarks (e.g., HMMT 83.2).
Long context: 9B’s linear attention gives AA‑LCR 63.0, potentially outperforming 27B.
Agent tasks: 9B’s BFCL‑V4 66.1 is excellent.
Code generation: 27B is expected to be more stable due to its size.
Output stability: the dense 27B should produce more consistent long‑form text.
If a 4090‑class GPU is available, 27B remains the safest choice; otherwise, especially for Mac users or machines with limited VRAM, the 9B model is highly recommended.
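To ground these recommendations, here is a back‑of‑the‑envelope weights‑only memory estimate using the total parameter counts quoted above; real usage adds KV cache, activations, and runtime overhead, so treat the numbers as lower bounds.

```python
# Weights-only VRAM estimate per model and precision. This ignores KV cache,
# activations, and framework overhead, so the numbers are lower bounds.
PARAMS_BILLIONS = {"0.8B": 0.9, "2B": 2.0, "4B": 5.0, "9B": 10.0, "27B": 28.0}
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}

for name, billions in PARAMS_BILLIONS.items():
    estimates = ", ".join(
        f"{dtype}: {billions * bpp:5.1f} GB" for dtype, bpp in BYTES_PER_PARAM.items()
    )
    print(f"Qwen3.5-{name:>4} -> {estimates}")

# Qwen3.5-9B weighs in around 20 GB in bf16, so it fits a 24 GB 4090 with room
# for cache; Qwen3.5-27B (~56 GB bf16) only fits the same card at int4 (~14 GB).
```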
Quick start
All Qwen3.5 small models enable Thinking mode by default and use the same chat interface as the larger models.
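As a minimal sketch, assuming Qwen3.5 keeps the chat interface of earlier Qwen releases (the AutoModelForCausalLM / apply_chat_template pattern; the prompt and token budget here are illustrative):

```python
# Minimal text-generation sketch. It assumes Qwen3.5 keeps the chat interface
# of earlier Qwen releases (AutoModelForCausalLM + apply_chat_template); the
# repo id comes from the list below.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3.5-9B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Briefly explain linear attention."}]
# Thinking mode is reported to be on by default, so no extra flag is set here.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```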
HuggingFace repositories:
Qwen3.5‑9B: https://huggingface.co/Qwen/Qwen3.5-9B
Qwen3.5‑9B‑Base: https://huggingface.co/Qwen/Qwen3.5-9B-Base
Qwen3.5‑4B: https://huggingface.co/Qwen/Qwen3.5-4B
Qwen3.5‑4B‑Base: https://huggingface.co/Qwen/Qwen3.5-4B-Base
Qwen3.5‑2B: https://huggingface.co/Qwen/Qwen3.5-2B
Qwen3.5‑2B‑Base: https://huggingface.co/Qwen/Qwen3.5-2B-Base
Qwen3.5‑0.8B: https://huggingface.co/Qwen/Qwen3.5-0.8B
Qwen3.5‑0.8B‑Base: https://huggingface.co/Qwen/Qwen3.5-0.8B-Base
The full collection is available at https://huggingface.co/collections/Qwen/qwen35.
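Since every model in the lineup accepts image‑text input, a multimodal call is a small variation on the text sketch above. The snippet below is an assumption‑laden sketch: the AutoModelForImageTextToText class and the image message format are carried over from earlier Qwen vision‑language releases, and the image URL is a placeholder.

```python
# Image-Text-to-Text sketch. The model class and the image message format are
# assumptions based on earlier Qwen vision-language releases; the image URL
# is a placeholder, not a real asset.
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Qwen/Qwen3.5-4B"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/chart.png"},  # placeholder
        {"type": "text", "text": "Describe what this chart shows."},
    ],
}]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(
    output_ids[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
))
```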
Conclusion
The release of the Qwen3.5 small model series proves that compact LLMs no longer have to be “crippled” versions of their larger counterparts. The 0.8B/2B models target edge and on‑device deployment, the 4B model delivers unexpectedly strong multimodal and agent abilities, and the 9B model outperforms many 80B+ models across a range of benchmarks. For most users with sufficient GPU memory, the 9B model offers the best balance of performance and cost; the 4B model is the top choice for extreme lightweight scenarios, while the 27B model remains the go‑to option for stable, high‑capacity local deployment.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.