Exploring Qwen 3.5: Small‑Scale MoE Models, Architecture, and Deployment Guides
This article reviews the three open-source Qwen 3.5 models (a 35B MoE, a 122B MoE, and a 27B dense variant), detailing their parameter layouts, core attention design, context length, inference behavior, and hardware requirements, and provides step-by-step code examples for loading them with Hugging Face Transformers and vLLM.
1. Model lineup and parameter scales
The Qwen 3.5 series offers three variants to suit different deployment constraints:
Qwen3.5-35B-A3B: MoE architecture with 35 B total parameters but only ~3 B active per token, balancing knowledge capacity and speed for resource-limited environments.
Qwen3.5-122B-A10B: Larger MoE model with 122 B total parameters and ~10 B active, targeting complex agent scheduling and multi-step reasoning tasks.
Qwen3.5-27B: Fully dense model with all 27 B parameters activated, offering stable throughput for high-concurrency scenarios (a rough compute-per-token sketch follows this list).
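As a rough illustration of why the active-parameter count matters, decoding compute per generated token scales with the active parameters rather than the total. The snippet below is a back-of-the-envelope estimate using the common ~2 FLOPs-per-parameter rule of thumb; the numbers are illustrative, not measured.

# Rough rule of thumb: decoding FLOPs per token ~= 2 * active parameters.
# Illustrative estimates only; real throughput also depends on attention, hardware, and batching.
active_params = {
    "Qwen3.5-35B-A3B": 3e9,     # ~3 B active (35 B total)
    "Qwen3.5-122B-A10B": 10e9,  # ~10 B active (122 B total)
    "Qwen3.5-27B": 27e9,        # dense: all 27 B parameters active
}
for name, n_active in active_params.items():
    print(f"{name}: ~{2 * n_active / 1e9:.0f} GFLOPs per generated token")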
2. Core architectural design
Hybrid attention: Alternates between Gated DeltaNet (linear attention) and Gated Attention layers, reducing memory usage for long contexts.
Fine-grained expert system: Both MoE models use a "routing expert + shared expert" strategy to maintain baseline language ability while enabling specialized knowledge retrieval (a minimal routing sketch follows this list).
Multi-step training & multimodal support: Incorporates Multi-Token Prediction (MTP) for consistent text generation and early-stage vision-encoder integration for cross-modal capabilities.
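The sketch below illustrates the "routing expert + shared expert" idea in plain PyTorch. It is a toy implementation, not Qwen 3.5's actual layer: the expert count, hidden sizes, and naive per-token dispatch loop are placeholders chosen for readability.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SketchMoELayer(nn.Module):
    """Toy 'routing experts + shared expert' layer (illustrative sizes only)."""
    def __init__(self, d_model=256, d_ff=512, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        # The shared expert runs for every token and preserves baseline language ability.
        self.shared_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model)
        )
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)      # routing probabilities
        weights, idx = scores.topk(self.top_k, dim=-1)  # top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        routed = []
        for t in range(x.size(0)):                      # naive per-token dispatch
            tok = x[t]
            routed.append(sum(weights[t, k] * self.experts[int(idx[t, k])](tok)
                              for k in range(self.top_k)))
        return self.shared_expert(x) + torch.stack(routed)

x = torch.randn(4, 256)
print(SketchMoELayer()(x).shape)  # torch.Size([4, 256])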
3. Context length and inference behavior
Context length: All three models natively support 256 K tokens.
Thinking mode: An optional mode that improves accuracy on math (e.g., MATH, GSM8K) and complex logic problems; a snippet for enabling it follows this list.
Inference efficiency: By limiting active parameters (~3 B or ~10 B), the MoE models achieve faster prefill and long-text generation than dense models of comparable total size.
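A minimal sketch of switching thinking mode on, assuming Qwen 3.5 keeps the enable_thinking flag that the Qwen3 chat template exposes; this flag is an assumption here rather than a confirmed Qwen 3.5 interface, so check the model card before relying on it.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-35B-A3B", trust_remote_code=True)
messages = [{"role": "user", "content": "If 3x + 7 = 22, what is x?"}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # assumed flag (carried over from Qwen3): emit reasoning before the final answer
)
# `text` can then be tokenized and passed to model.generate() as in section 5.1.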
4. Comparative specifications
Key differences are summarized below:
Architecture: 35B-A3B and 122B-A10B use sparse MoE; 27B is dense.
Active parameters: ~3 B (35B-A3B), ~10 B (122B-A10B), full 27 B (27B).
Attention: The MoE models employ Gated DeltaNet + Gated Attention; the dense model uses standard attention.
Strengths: The MoE models reduce memory footprint and boost speed; the 122B-A10B excels at complex reasoning; the dense 27B offers high throughput without routing overhead.
5. Deployment examples
The models are compatible with popular open‑source ecosystems. Below are concise code snippets for two typical scenarios.
5.1 Loading with Hugging Face Transformers (local testing)
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# Example: Qwen3.5‑35B‑A3B
model_name = "Qwen/Qwen3.5-35B-A3B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
prompt = "Briefly explain the basic principles of quantum computing."
messages = [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(**model_inputs, max_new_tokens=512)
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)

5.2 Using vLLM for production-grade, high-throughput inference
from vllm import LLM, SamplingParams
# Example: Qwen3.5‑27B (for larger models set tensor_parallel_size accordingly)
model_name = "Qwen/Qwen3.5-27B"
llm = LLM(
    model=model_name,
    trust_remote_code=True,
    tensor_parallel_size=1,  # increase for multi-GPU setups
    dtype="bfloat16",
)
sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=512)
prompts = [
    "Write a quicksort algorithm in Python.",
    "Explain the Mixture-of-Experts (MoE) architecture used in large language models.",
]
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(f"Prompt: {output.prompt}")
    print(f"Generated: {output.outputs[0].text}")
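For serving rather than offline batch generation, vLLM also ships an OpenAI-compatible HTTP server (started with, for example, vllm serve Qwen/Qwen3.5-27B). The client sketch below assumes such a server is running locally on the default port 8000; the endpoint and served model name are assumptions to adapt to your deployment.

# Minimal client sketch for a vLLM OpenAI-compatible server, assumed to be started with:
#   vllm serve Qwen/Qwen3.5-27B --dtype bfloat16
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM does not check the key
response = client.chat.completions.create(
    model="Qwen/Qwen3.5-27B",
    messages=[{"role": "user", "content": "Summarize the trade-offs between MoE and dense models."}],
    temperature=0.7,
    max_tokens=512,
)
print(response.choices[0].message.content)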
")5.3 Hardware recommendations (bfloat16, no KV‑cache, unquantized)
Qwen3.5‑27B (Dense): ~54 GB VRAM – use two 32 GB GPUs or a single 80 GB GPU (A100/H100).
Qwen3.5‑35B‑A3B (MoE): ~70 GB VRAM – one 80 GB GPU or four 24 GB consumer GPUs (RTX 4090/3090).
Qwen3.5-122B-A10B (MoE): ~244 GB VRAM – at least four 80 GB GPUs with tensor-parallel deployment. (A quick arithmetic check of these figures follows this list.)
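These figures are simply the total parameter count times two bytes per bfloat16 weight; KV cache, activations, and framework overhead come on top. A quick sanity check:

# bf16 weights take 2 bytes per parameter; KV cache and activations are extra.
for name, total_params in [("Qwen3.5-27B", 27e9),
                           ("Qwen3.5-35B-A3B", 35e9),
                           ("Qwen3.5-122B-A10B", 122e9)]:
    print(f"{name}: ~{total_params * 2 / 1e9:.0f} GB of weight memory")
# Prints ~54 GB, ~70 GB, and ~244 GB, matching the recommendations above.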