Qwen3.5 Deep Dive: Multimodal Architecture, Benchmarks, and Deployment Guide

This article provides a detailed analysis of Qwen3.5, covering its multimodal MoE design, massive inference speedups, extensive benchmark results against GPT‑5.2, Claude 4.5 Opus and Gemini‑3 Pro, RL scaling strategies, training infrastructure innovations, and practical usage via API and local deployment.


Model Overview

Qwen3.5‑397B‑A17B is a 397 B‑parameter mixture‑of‑experts (MoE) model that activates only 17 B parameters per forward pass, achieving low inference cost. The Plus variant (qwen3.5‑plus) adds a 1 M‑token context window, built‑in tools and adaptive tool usage.

Core Upgrades

Unified vision‑language base: early‑fusion training on trillions of multimodal tokens keeps text capability on par with Qwen‑3 while surpassing Qwen‑3‑VL on vision, enabling a single model for reasoning, coding, agents and visual understanding.

Efficient sparse MoE: Gated Delta Networks combined with sparse MoE raise decoding throughput 8.6× at 32 K context and 19× at 256 K, a 3.5–7.2× improvement over Qwen‑3‑Max‑235B‑A22B.

Scalable RL generalization: post‑training expands to millions of RL environments with gradually increasing difficulty, improving robustness on real‑world agent tasks.

201‑language support: the vocabulary grows from 150 K to 250 K tokens, yielding 10–60 % gains in encoding and decoding efficiency (a quick way to measure this is sketched at the end of this list).

Next‑gen training infrastructure: multimodal training efficiency approaches 100 % of pure‑text training, and an asynchronous RL framework accelerates large‑scale agent tool calls 3–5×.
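
To make the encoding‑efficiency claim above concrete, here is a minimal sketch that counts tokens per character for the same sentence in several languages. It assumes the Qwen/Qwen3.5-397B-A17B repo ships its tokenizer, as prior Qwen releases do; fewer tokens for equivalent text means cheaper, faster processing.

from transformers import AutoTokenizer

# Assumption: the released checkpoint includes its tokenizer files;
# from_pretrained downloads only those, not the 397 B weights.
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-397B-A17B")

samples = {
    "English": "The quick brown fox jumps over the lazy dog.",
    "Chinese": "敏捷的棕色狐狸跳过了懒惰的狗。",
    "Arabic": "الثعلب البني السريع يقفز فوق الكلب الكسول.",
}

for lang, text in samples.items():
    n = len(tok.encode(text))
    # Fewer tokens per character means better encoding efficiency for that language.
    print(f"{lang}: {n} tokens for {len(text)} chars ({n / len(text):.2f} tokens/char)")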

Architecture Innovations

Linear attention via Gated Delta Networks reduces compute for long sequences.

Higher‑sparsity MoE activates only 4.3 % of experts (17 B of 397 B parameters) per token.

These changes produce a massive inference efficiency boost (see inference‑efficiency comparison image).
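
To make the two mechanisms concrete, here is a minimal NumPy sketch: a gated delta‑rule recurrence (one common formulation of Gated DeltaNet, not the actual Qwen3.5 kernels) whose state stays fixed‑size regardless of sequence length, plus a top‑k router that activates only a few experts per token. All shapes, gate values and names are illustrative.

import numpy as np

rng = np.random.default_rng(0)

# --- Gated delta-rule recurrence (linear attention) ---
# The state S is a fixed (d_k x d_v) matrix, so per-token cost is constant
# in sequence length, unlike softmax attention's ever-growing KV cache.
d_k, d_v, T = 8, 8, 16
S = np.zeros((d_k, d_v))
for t in range(T):
    q, k, v = rng.normal(size=d_k), rng.normal(size=d_k), rng.normal(size=d_v)
    k /= np.linalg.norm(k)
    alpha, beta = 0.95, 0.5  # decay gate and write strength; learned per token in practice
    # S_t = alpha * (I - beta * k k^T) S_{t-1} + beta * k v^T
    S = alpha * (S - beta * np.outer(k, k @ S)) + beta * np.outer(k, v)
    o = S.T @ q  # read-out for the current token
print("state shape is", S.shape, "regardless of sequence length", T)

# --- Top-k sparse MoE routing ---
# Only top_k of E experts run per token; Qwen3.5 activates ~4.3% of its
# parameters (17 B of 397 B) on each forward pass.
E, top_k, d = 64, 4, 8
W_router = rng.normal(size=(d, E))
experts = rng.normal(size=(E, d, d)) * 0.1  # stand-in expert weight matrices
x = rng.normal(size=d)                      # one token's hidden state
logits = x @ W_router
chosen = np.argsort(logits)[-top_k:]        # indices of the k highest-scoring experts
w = np.exp(logits[chosen] - logits[chosen].max())
gates = w / w.sum()                         # softmax over the selected experts only
y = sum(g * (experts[i] @ x) for g, i in zip(gates, chosen))
print(f"active experts per token: {top_k}/{E} = {top_k / E:.1%}")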

[Figure: Inference efficiency comparison]

Performance Evaluation

Language benchmarks (higher is better)

Benchmark               Qwen3.5        GPT‑5.2   Claude 4.5 Opus   Gemini‑3 Pro
MMLU‑Pro                87.8           87.4      89.5              89.8
MMLU‑Redux              94.9           95.0      95.6              95.9
IFBench                 76.5 🥇        75.4      n/a               70.4
MultiChallenge          67.6 🥇        57.9      n/a               64.2
LiveCodeBench v6        83.6           87.7      n/a               90.7
AIME26 (math)           91.3           96.7      n/a               90.6
BrowseComp              69.0/78.6 🥇   n/a       67.8              59.2
BrowseComp‑zh           70.3 🥇        n/a       62.4              66.8
NOVA‑63 (multilingual)  59.1 🥇        54.6      56.7              n/a
MAXIFE (multilingual)   88.2 🥇        88.4      79.2              n/a

(n/a = score not reported in the source comparison)

Vision benchmarks (higher is better)

Benchmark                              Qwen3.5   Gemini‑3 Pro
MathVision                             88.6 🥇   86.6
MathVista                              90.3 🥇   87.9
We‑Math                                87.9 🥇   86.9
ZEROBench                              12 🥇     n/a
MMStar (VQA)                           83.8 🥇   83.1
OmniDocBench (document understanding)  90.8 🥇   88.5
OCRBench                               93.1 🥇   90.4
CountBench (spatial intelligence)      97.2      97.3
V* (spatial)                           95.8 🥇   88.0

On ZEROBench the next best score among peers is 10.

Strengths include instruction following, search‑agent performance, and broad multilingual coverage. Gaps remain in pure reasoning: LiveCodeBench and AIME26 scores lag GPT‑5.2, the HLE score of 28.7 is the lowest among peers, and SWE‑bench Verified at 76.4 trails GPT‑5.2 (80.0) and Claude 4.5 Opus (80.9).

RL Scaling

Post‑training expands RL environments at scale without benchmark‑specific tuning, progressively increasing task difficulty. The agent capability curve (see figure below) shows consistent improvement, eventually surpassing Qwen‑3‑Max‑Thinking on the composite benchmarks BFCL‑V4, VITA‑Bench, DeepPlanning, Tool‑Decathlon and MCP‑Mark.

[Figure: Agent capability improvement]

Infrastructure Highlights

Near‑100 % multimodal training efficiency: heterogeneous infrastructure decouples visual and textual parallelism, eliminating the efficiency loss that multimodal batches typically incur.

Native FP8 training: low‑precision activations, MoE routing and GEMM cut activation memory by roughly 50 % and raise training speed by more than 10 % (a toy illustration of FP8 quantization follows this list).

Asynchronous RL framework: a fully decoupled training‑inference pipeline supports million‑scale agent scaffolds and environment orchestration, delivering 3–5× end‑to‑end acceleration.
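
As a rough illustration of why FP8 halves activation memory relative to BF16, the sketch below simulates per‑tensor scaling into the e4m3 range (largest finite value 448). It is a toy model of the idea, not Qwen3.5's actual recipe, which relies on native FP8 hardware kernels.

import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 e4m3

def fp8_quantize(x):
    # Per-tensor scaling: map the tensor's max |value| onto the e4m3 range.
    scale = np.abs(x).max() / E4M3_MAX
    # Real FP8 also rounds to a 4-bit-exponent / 3-bit-mantissa grid;
    # here we only simulate the scaling and clipping.
    q = np.clip(x / scale, -E4M3_MAX, E4M3_MAX)
    return q, scale

def fp8_dequantize(q, scale):
    return q * scale

acts = np.random.randn(4096, 4096).astype(np.float32)
q, s = fp8_quantize(acts)
# FP8 stores 1 byte per element instead of BF16's 2: the ~50% activation saving.
print(f"BF16: {acts.size * 2 / 2**20:.0f} MiB -> FP8: {acts.size * 1 / 2**20:.0f} MiB")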

Usage

API access

Available via Alibaba Cloud DashScope (compatible with OpenAI and Anthropic APIs). Example:

from openai import OpenAI
import os

# DashScope exposes an OpenAI-compatible endpoint; only the base_url changes.
client = OpenAI(
    api_key=os.getenv("DASHSCOPE_API_KEY"),
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

completion = client.chat.completions.create(
    model="qwen3.5-plus",
    messages=[{"role": "user", "content": "Introduce Qwen3.5."}],
    # DashScope-specific flags: toggle thinking mode and built-in web search.
    extra_body={"enable_thinking": True, "enable_search": False},
    stream=True,
)

# Consume the stream as tokens arrive.
for chunk in completion:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Local deployment

Weights are released under Apache 2.0. Example commands:

# Hugging Face Transformers service
transformers serve --port 8000 --continuous-batching

# Command‑line chat
transformers chat Qwen/Qwen3.5-397B-A17B

# SGLang deployment
python -m sglang.launch_server --model-path Qwen/Qwen3.5-397B-A17B \
    --port 8000 --tp-size 8 --context-length 262144 --reasoning-parser qwen3

# vLLM deployment
vllm serve Qwen/Qwen3.5-397B-A17B --port 8000 \
    --tensor-parallel-size 8 --max-model-len 262144 --reasoning-parser qwen3
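
Both SGLang and vLLM expose an OpenAI‑compatible endpoint, so the same client code used for DashScope works against a local server. A minimal sketch, assuming the vLLM command above is running on localhost:8000:

from openai import OpenAI

# Local servers do not validate the key, but the client requires one to be set.
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

resp = client.chat.completions.create(
    model="Qwen/Qwen3.5-397B-A17B",  # must match the served model path
    messages=[{"role": "user", "content": "Summarize the MoE design of Qwen3.5."}],
)
print(resp.choices[0].message.content)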

Apple Silicon users can use mlx‑lm (text‑only) or mlx‑vlm (vision + text); llama.cpp supports the GGUF version. Fine‑tuning is compatible with UnSloth, Swift, Llama‑Factory, etc.

Demo showcase

Web development: describe a UI requirement and the model generates front‑end code.

GUI agent: performs autonomous operations such as form filling and cross‑app coordination.

Visual programming: processes up to 1 M‑token inputs and 2‑hour videos, turning sketches into code or reverse‑engineering game logic.

Spatial intelligence: pixel‑level spatial relationship modeling for autonomous driving perception and robot navigation.

Conclusion

Native multimodal training—integrating vision and language from the pre‑training stage—delivers substantial gains in inference speed, multilingual coverage and agent capabilities, positioning Qwen3.5 competitively against GPT‑5.2, Claude 4.5 Opus and Gemini‑3 Pro.

Tags: multimodal AI, Large Language Model, benchmark, reinforcement learning, FP8 training, Qwen3.5
Written by Old Zhang's AI Learning, an AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
