Qwen3.5-397B: 397B‑Parameter Multimodal LLM Boosts Decoding Throughput 8.6‑19×
Alibaba’s Qwen3.5-397B-A17B is a 397‑billion‑parameter open‑source multimodal LLM that combines mixed linear attention with a sparse MoE architecture to achieve 8.6‑19× higher decoding throughput than Qwen3‑Max. It supports 201 languages and can be deployed via vLLM, Docker, Transformers, or SGLang with various optimization presets.
On February 16, the Alibaba Tongyi Qwen team released the first open‑source model of the Qwen3.5 series, Qwen3.5‑397B‑A17B, a 397‑billion‑parameter LLM that uses a mixed linear‑attention and sparse MoE architecture. The design delivers decoding throughput that is 8.6 to 19 times higher than the earlier Qwen3‑Max while retaining multimodal capabilities.
The model natively supports vision‑language tasks through early‑fusion training. On reasoning, coding, agent, and visual‑understanding benchmarks it outperforms the previous Qwen3‑VL series, and some users have reported programming performance that surpasses Gemini 3 Pro.
Architecturally, Qwen3.5‑397B incorporates a gated Delta network together with a sparse‑expert mixture, and it is trained with large‑scale reinforcement‑learning environment extensions. The team claims that multimodal training efficiency is close to that of pure‑text training, and an asynchronous RL framework supports large‑scale agent scaffolding and environment orchestration.
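To make the gated‑delta idea concrete, here is a toy numpy sketch of one published gated delta‑rule recurrence (as described in the Gated DeltaNet literature). Everything here is illustrative: Qwen3.5's exact gate parameterization, dimensions, and mixing with softmax attention are not disclosed in this article. The key property is that the per‑token state is a fixed‑size matrix rather than a growing KV cache, which is what makes linear‑attention decoding memory‑constant.

```python
import numpy as np

# Toy gated delta rule: S_t = alpha * (S - beta * (S k) k^T) + beta * v k^T
# State S is a (d_v, d_k) matrix updated per token. alpha (decay gate) and
# beta (write strength) are fixed scalars here; real models learn them per token.
rng = np.random.default_rng(1)
d_k, d_v, T = 8, 8, 4

S = np.zeros((d_v, d_k))
for t in range(T):
    k = rng.normal(size=d_k)
    k /= np.linalg.norm(k)                 # normalized key
    v = rng.normal(size=d_v)               # value to write
    alpha, beta = 0.95, 0.5                # assumed gate values
    # Decay the old state, erase the component along k, then write the new pair.
    S = alpha * (S - beta * np.outer(S @ k, k)) + beta * np.outer(v, k)
    q = rng.normal(size=d_k)
    out = S @ q                            # constant-time readout per token
print(S.shape, out.shape)
```

Because the readout touches only the fixed‑size state, decode cost per token does not grow with context length, which is the mechanism behind the throughput claims above.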
The model covers 201 languages and dialects, is released under the Apache 2.0 license, and its weights are available on Hugging Face and ModelScope. It can be run locally via Transformers, llama.cpp, MLX, or accessed through the official Qwen Chat and Alibaba Cloud Model Studio APIs.
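Whichever serving path is chosen, the local servers below expose an OpenAI‑compatible chat endpoint. A minimal sketch of building a request body follows; the localhost URL, port, and sampling parameters are assumptions matching vLLM's defaults, not anything specified in this article.

```python
import json

# Assumed local endpoint for a vLLM OpenAI-compatible server (see below).
BASE_URL = "http://localhost:8000/v1/chat/completions"

def build_request(prompt: str, model: str = "Qwen/Qwen3.5-397B-A17B") -> str:
    """Serialize an OpenAI-style chat completion request body."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,     # assumed sampling settings
        "max_tokens": 512,
    }
    return json.dumps(payload)

body = build_request("Summarize the Qwen3.5 architecture in one sentence.")
print(body)
```

POSTing this body to `BASE_URL` with any HTTP client (curl, requests, or the official `openai` SDK pointed at the local base URL) should work against the deployments described next.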
Quick deployment guide
Environment setup
# Create virtual environment
uv venv
source .venv/bin/activate
# Install vLLM
uv pip install -U vllm \
--torch-backend=auto \
--extra-index-url https://wheels.vllm.ai/nightly

Docker deployment
docker run --gpus all \
-p 8000:8000 \
--ipc=host \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:qwen3_5 Qwen/Qwen3.5-397B-A17B \
--tensor-parallel-size 8 \
--reasoning-parser qwen3 \
--enable-prefix-caching

Optimized configurations for different scenarios
Pure‑text high‑throughput – skip the visual encoder to save memory:
vllm serve Qwen/Qwen3.5-397B-A17B \
--tensor-parallel-size 8 \
--language-model-only \
--reasoning-parser qwen3 \
--enable-prefix-caching

Multimodal workloads – enable image‑text mixed processing with data‑parallel optimization:
vllm serve Qwen/Qwen3.5-397B-A17B \
--tensor-parallel-size 8 \
--mm-encoder-tp-mode data \
--mm-processor-cache-type shm \
--reasoning-parser qwen3 \
--enable-prefix-caching

Low‑latency scenarios – activate MTP‑1 speculative decoding for real‑time interaction:
vllm serve Qwen/Qwen3.5-397B-A17B \
--tensor-parallel-size 8 \
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
--reasoning-parser qwen3

Multi‑node deployment – configuration for high‑end hardware such as GB200:
Master node:
vllm serve Qwen/Qwen3.5-397B-A17B \
--tensor-parallel-size 8 \
--reasoning-parser qwen3 \
--enable-prefix-caching \
--attention-backend FLASH_ATTN \
--nnodes 2 \
--node-rank 0 \
--master-addr <head_node_ip>

Worker node:
vllm serve Qwen/Qwen3.5-397B-A17B \
--tensor-parallel-size 8 \
--reasoning-parser qwen3 \
--enable-prefix-caching \
--attention-backend FLASH_ATTN \
--nnodes 2 \
--node-rank 1 \
--master-addr <head_node_ip> \
--headless

Other framework support
Transformers – direct usage:
# Start service
transformers serve --port 8000 --continuous-batching
# Command‑line interaction
transformers chat Qwen/Qwen3.5-397B-A17B

SGLang deployment:
python -m sglang.launch_server \
--model-path Qwen/Qwen3.5-397B-A17B \
--port 8000 \
--tp-size 8 \
--context-length 262144 \
--reasoning-parser qwen3

The vLLM team highlights several inference advantages: the gated Delta network combined with the sparse MoE design yields high throughput, low latency, and lower cost; native support for 201 languages; and a single model that handles both text and vision without a separate vision‑language pipeline.
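The cost argument rests on sparsity: of the 397B total parameters, only a small set of experts runs per token (roughly 17B active, per the figures below). A toy top‑k MoE routing sketch illustrates the mechanism; the expert count, k, and dimensions are invented for illustration and are not Qwen3.5's actual configuration.

```python
import numpy as np

# Toy top-k sparse MoE layer: a router scores experts per token and only
# the k highest-scoring experts compute, so per-token FLOPs scale with k,
# not with the total number of experts. Sizes below are illustrative.
rng = np.random.default_rng(0)
n_experts, k, d = 8, 2, 16

router_w = rng.normal(size=(d, n_experts))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route a token vector to its top-k experts and mix their outputs."""
    logits = x @ router_w
    topk = np.argsort(logits)[-k:]       # indices of the k best experts
    weights = np.exp(logits[topk] - logits[topk].max())
    weights /= weights.sum()             # softmax over the selected k only
    return sum(w * (x @ experts[i]) for w, i in zip(weights, topk))

y = moe_forward(rng.normal(size=d))
print(y.shape)
```

With 8 experts and k=2, only a quarter of the expert parameters participate in any given token, which is the same total‑versus‑active distinction behind the 397B/17B figures.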
Developers have asked about a smaller 2B‑parameter version, and some users observed that the number of active parameters dropped from 22B to 17B, though hardware requirements have not been disclosed. Unsloth AI released a GGUF‑quantized version to facilitate local execution.
In benchmark tests, Qwen3.5‑397B competes with mainstream models such as GPT‑5.2 and Claude Opus 4.5 on instruction‑following and graduate‑level reasoning tasks. Analysts note that breakthroughs in multilingual and coding performance by open‑source models could shift the current dominance of closed‑source offerings.
Related links
GitHub repository: https://github.com/QwenLM/Qwen3.5
Online demo: https://chat.qwen.ai
Technical blog: https://qwen.ai/blog?id=qwen3.5
vLLM deployment guide: https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3.5.html