Qwen3.5-397B: 397B‑Parameter Multimodal LLM Boosts Inference Speed 8.6‑19×

Alibaba’s Qwen3.5-397B-A17B, a 397‑billion‑parameter open‑source multimodal LLM, combines mixed linear attention with a sparse MoE architecture to achieve 8.6‑19× higher decoding throughput than Qwen3‑Max, supports 201 languages, and can be deployed via vLLM, Docker, Transformers, or SGLang with various optimization presets.


On February 16, the Alibaba Tongyi Qwen team released the first open‑source model of the Qwen3.5 series, Qwen3.5‑397B‑A17B, a 397‑billion‑parameter LLM that uses a mixed linear‑attention and sparse MoE architecture. The design delivers decoding throughput that is 8.6 to 19 times higher than the earlier Qwen3‑Max while retaining multimodal capabilities.

The model natively supports vision‑language tasks through early‑fusion training. On reasoning, coding, agent, and visual‑understanding benchmarks it outperforms the previous Qwen3‑VL series, and some users have reported coding performance that surpasses Gemini 3 Pro.

Architecturally, Qwen3.5‑397B combines a gated DeltaNet linear‑attention mechanism with a sparse mixture of experts, and training makes heavy use of large‑scale reinforcement‑learning environments. The team claims that multimodal training efficiency is close to that of pure‑text training, and that an asynchronous RL framework supports large‑scale agent scaffolding and environment orchestration.

The model covers 201 languages and dialects, is released under the Apache 2.0 license, and its weights are available on Hugging Face and ModelScope. It can be run locally via Transformers, llama.cpp, or MLX, or accessed through the official Qwen Chat and Alibaba Cloud Model Studio APIs.

Quick deployment guide

Environment setup

# Create virtual environment
uv venv
source .venv/bin/activate

# Install vLLM
uv pip install -U vllm \
    --torch-backend=auto \
    --extra-index-url https://wheels.vllm.ai/nightly
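
After installation, a quick sanity check confirms that vLLM imports cleanly and reports the expected nightly version:

# Verify the installation
python -c "import vllm; print(vllm.__version__)"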

Docker deployment

docker run --gpus all \
  -p 8000:8000 \
  --ipc=host \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:qwen3_5 Qwen/Qwen3.5-397B-A17B \
    --tensor-parallel-size 8 \
    --reasoning-parser qwen3 \
    --enable-prefix-caching
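
Once the container is running, the server speaks the OpenAI‑compatible API on port 8000. A minimal smoke test with curl (the prompt here is purely illustrative):

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.5-397B-A17B",
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 64
  }'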

Optimized configurations for different scenarios

Pure‑text high‑throughput – skip the visual encoder to save memory:

vllm serve Qwen/Qwen3.5-397B-A17B \
  --tensor-parallel-size 8 \
  --language-model-only \
  --reasoning-parser qwen3 \
  --enable-prefix-caching

Multimodal workloads – enable image‑text mixed processing with data‑parallel optimization:

vllm serve Qwen/Qwen3.5-397B-A17B \
  --tensor-parallel-size 8 \
  --mm-encoder-tp-mode data \
  --mm-processor-cache-type shm \
  --reasoning-parser qwen3 \
  --enable-prefix-caching
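
With this preset running, images can be sent in the standard OpenAI vision message format, which vLLM accepts for multimodal models; the image URL below is a placeholder:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3.5-397B-A17B",
    "messages": [{
      "role": "user",
      "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}},
        {"type": "text", "text": "Describe this image."}
      ]
    }]
  }'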

Low‑latency scenarios – activate MTP‑1 speculative decoding for real‑time interaction:

vllm serve Qwen/Qwen3.5-397B-A17B \
  --tensor-parallel-size 8 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' \
  --reasoning-parser qwen3
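
A rough way to observe the effect of MTP speculative decoding is to time the same short request against servers started with and without the flag; this is a sanity check, not a rigorous benchmark:

# Wall-clock a single short completion
time curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3.5-397B-A17B", "messages": [{"role": "user", "content": "Say hi"}], "max_tokens": 32}' \
  > /dev/null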

Multi‑node deployment – configuration for high‑end hardware such as GB200:

Master node:

vllm serve Qwen/Qwen3.5-397B-A17B \
  --tensor-parallel-size 8 \
  --reasoning-parser qwen3 \
  --enable-prefix-caching \
  --attention-backend FLASH_ATTN \
  --nnodes 2 \
  --node-rank 0 \
  --master-addr <head_node_ip>

Worker node:

vllm serve Qwen/Qwen3.5-397B-A17B \
  --tensor-parallel-size 8 \
  --reasoning-parser qwen3 \
  --enable-prefix-caching \
  --attention-backend FLASH_ATTN \
  --nnodes 2 \
  --node-rank 1 \
  --master-addr <head_node_ip> \
  --headless
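
Client traffic should go to the master node only. vLLM's /health endpoint is a convenient way to confirm the cluster is up before sending requests (assuming the default port 8000):

curl http://<head_node_ip>:8000/health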

Other framework support

Transformers – direct usage:

# Start service
transformers serve --port 8000 --continuous-batching

# Command‑line interaction
transformers chat Qwen/Qwen3.5-397B-A17B

SGLang deployment:

python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-397B-A17B \
  --port 8000 \
  --tp-size 8 \
  --context-length 262144 \
  --reasoning-parser qwen3
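
Besides its OpenAI‑compatible /v1 routes, SGLang exposes a native /generate endpoint; a minimal request might look like the following, with illustrative sampling parameters:

curl http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello", "sampling_params": {"max_new_tokens": 32, "temperature": 0.7}}'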

The vLLM team highlights several inference advantages: the gated DeltaNet plus sparse MoE design yields high throughput, low latency, and lower cost; the model natively supports 201 languages; and a single model handles both text and vision without a separate vision‑language pipeline.

Developers have asked about a smaller 2B‑parameter version, and some users noticed that the number of activated parameters dropped from 22B to 17B, though hardware requirements have not been disclosed. Unsloth AI has released a GGUF‑quantized version to make local execution easier.

In benchmark tests, Qwen3.5‑397B competes with mainstream models such as GPT‑5.2 and Claude Opus 4.5 on instruction‑following and graduate‑level reasoning tasks. Analysts note that breakthroughs in multilingual and coding performance by open‑source models could shift the current dominance of closed‑source offerings.

Related links

GitHub repository: https://github.com/QwenLM/Qwen3.5

Online demo: https://chat.qwen.ai

Technical blog: https://qwen.ai/blog?id=qwen3.5

vLLM deployment guide: https://docs.vllm.ai/projects/recipes/en/latest/Qwen/Qwen3.5.html

Written by

AI Engineering

Focused on cutting‑edge product and technology information and practical experience sharing in the AI field (large models, MLOps/LLMOps, AI application development, AI infrastructure).
