Exploring Qwen 3.5: Small‑Scale MoE Models, Architecture, and Deployment Guides

This article reviews the three open‑source Qwen 3.5 models (a 35B MoE, a 122B MoE, and a 27B dense variant), detailing their parameter layouts, core attention designs, context length, inference performance, and hardware requirements, and providing step‑by‑step code examples for loading them with Hugging Face Transformers and vLLM.

Baobao Algorithm Notes

1. Model lineup and parameter scales

The Qwen 3.5 series offers three variants to suit different deployment constraints:

Qwen3.5‑35B‑A3B: MoE architecture with 35 B total parameters but only ~3 B active per token, balancing knowledge capacity and speed for resource‑limited environments.

Qwen3.5‑122B‑A10B: Larger MoE model with 122 B total parameters and ~10 B active, targeting complex agent scheduling and multi‑step reasoning tasks.

Qwen3.5‑27B: Fully dense model with all 27 B parameters active on every token, offering stable throughput for high‑concurrency scenarios.

2. Core architectural design

Hybrid attention: Alternates between Gated DeltaNet (linear attention) and Gated Attention, reducing memory usage for long contexts.

Fine‑grained expert system: Both MoE models use a “routing expert + shared expert” strategy to maintain baseline language ability while enabling specialized knowledge retrieval.

Multi‑step training & multimodal support: Incorporates Multi‑Token Prediction (MTP) for consistent text generation and early‑stage vision encoder integration for cross‑modal capabilities.
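The “routing expert + shared expert” idea can be made concrete with a small sketch. This is not Qwen's actual implementation; the expert count, gating math, and toy experts below are all illustrative assumptions. Per token, a router scores all experts, only the top‑k are evaluated, and a shared expert runs unconditionally:

```python
# Minimal sketch of top-k MoE routing with a shared expert.
# Expert sizes, gating details, and names here are illustrative only.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(token, router_logits, experts, shared_expert, top_k=2):
    """Route a token to its top-k experts and always apply the shared expert.

    token:         input activation (a single float, for simplicity)
    router_logits: one score per routing expert
    experts:       list of callables, one per routing expert
    shared_expert: callable applied to every token (baseline language ability)
    """
    probs = softmax(router_logits)
    # Pick the k highest-scoring experts for this token.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalize gate weights over the selected experts only.
    gate_sum = sum(probs[i] for i in top)
    out = sum(probs[i] / gate_sum * experts[i](token) for i in top)
    # The shared expert is always active, regardless of routing.
    return out + shared_expert(token)

# Toy experts: each just scales its input by a fixed factor.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
shared = lambda x: 0.5 * x
y = moe_forward(1.0, router_logits=[0.1, 0.2, 2.0, 1.5],
                experts=experts, shared_expert=shared, top_k=2)
```

Only the two highest-scoring experts are evaluated, which is why per‑token compute tracks the active (not total) parameter count, while the shared expert keeps a baseline path through the network.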

3. Context length and inference behavior

Context length: All three models natively support 256 K tokens.

Thinking mode: An optional mode that improves accuracy on math benchmarks (e.g., MATH, GSM8K) and complex logic problems.

Inference efficiency: By limiting active parameters (~3 B or ~10 B), the MoE models achieve faster prefill and long‑text generation than dense models of similar total size.

4. Comparative specifications

Key differences are summarized below:

Architecture: 35B‑A3B and 122B‑A10B use sparse MoE; 27B is dense.

Active parameters: ~3 B (35B‑A3B), ~10 B (122B‑A10B), full 27 B (27B).

Attention: The MoE models employ the hybrid Gated DeltaNet + Gated Attention stack; the dense model uses standard attention.

Strengths: The MoE models cut per‑token compute and boost speed; the 122B‑A10B excels at complex reasoning; the dense 27B offers high throughput without routing overhead.
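The comparison above boils down to one ratio: what fraction of the weights each variant touches per token. A quick back-of-the-envelope check using the figures from the table:

```python
# Fraction of parameters active per token, from the comparison above.
specs = {
    "Qwen3.5-35B-A3B": (35e9, 3e9),     # (total, active)
    "Qwen3.5-122B-A10B": (122e9, 10e9),
    "Qwen3.5-27B": (27e9, 27e9),        # dense: everything is active
}
active_fraction = {name: active / total for name, (total, active) in specs.items()}
# The MoE variants compute with under 10% of their weights per token,
# while the dense model always uses 100%.
```

This is why the MoE models can outrun a dense model of similar total size at generation time, even though all experts must still be resident in memory.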

5. Deployment examples

The models are compatible with popular open‑source ecosystems. Below are concise code snippets for two typical scenarios.

5.1 Loading with Hugging Face Transformers (local testing)

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Example: Qwen3.5‑35B‑A3B
model_name = "Qwen/Qwen3.5-35B-A3B"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

prompt = "请简述量子计算的基本原理。"
messages = [{"role": "system", "content": "你是一个有用的助手。"}, {"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(**model_inputs, max_new_tokens=512)
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

5.2 Using vLLM for production‑grade, high‑throughput serving

from vllm import LLM, SamplingParams

# Example: Qwen3.5‑27B (for larger models set tensor_parallel_size accordingly)
model_name = "Qwen/Qwen3.5-27B"
llm = LLM(
    model=model_name,
    trust_remote_code=True,
    tensor_parallel_size=1,  # increase for multi‑GPU setups
    dtype="bfloat16",
)

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=512)
prompts = [
    "用 Python 写一个快速排序算法。",
    "解释一下什么是大语言模型的混合专家架构(MoE)。",
]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt}\nGenerated: {output.outputs[0].text}\n")

5.3 Hardware recommendations (unquantized bfloat16 weights, excluding KV cache)

Qwen3.5‑27B (Dense): ~54 GB VRAM – use two 32 GB GPUs or a single 80 GB GPU (A100/H100).

Qwen3.5‑35B‑A3B (MoE): ~70 GB VRAM – one 80 GB GPU or four 24 GB consumer GPUs (RTX 4090/3090).

Qwen3.5‑122B‑A10B (MoE): ~244 GB VRAM – at least four 80 GB GPUs with tensor‑parallel deployment.
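The VRAM figures above follow directly from bfloat16 using 2 bytes per parameter; note that an MoE model must keep all experts in memory even though few are active per token:

```python
# Rough VRAM estimate for unquantized bfloat16 weights (2 bytes/parameter),
# excluding KV cache and activations; matches the figures quoted above.
BYTES_PER_PARAM = 2  # bfloat16

def weight_vram_gb(total_params: float) -> float:
    """Weight memory in GB for a given total parameter count."""
    return total_params * BYTES_PER_PARAM / 1e9

print(weight_vram_gb(27e9))   # dense 27B
print(weight_vram_gb(35e9))   # 35B-A3B: all experts must fit, not just ~3B
print(weight_vram_gb(122e9))  # 122B-A10B
```

Add headroom on top of these numbers for the KV cache (which grows with context length and batch size) and activations.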

[Figure: Qwen 3.5 model overview]
Written by Baobao Algorithm Notes, author of the BaiMian large model, offering technology and industry insights.