Inside Qwen3.5: The World’s Strongest Open‑Source Multimodal Model and Its Core Features
Qwen3.5-397B-A17B, the newly open-sourced multimodal model, pairs a roughly 400-billion-parameter sparse MoE architecture with native FP8 pipelines and an asynchronous RL framework. It delivers GPT-5.2-level capability with 60% lower deployment memory, up to 19× higher inference throughput, and broad image, video, and agent support. This article walks through its core features, deployment requirements, and API pricing.
On Chinese New Year’s Eve, Alibaba’s Qwen team released Qwen3.5‑397B‑A17B, an open‑source multimodal large model that supports image and video inputs and matches the average performance of GPT‑5.2 and Gemini 3.0 Pro across dialogue, reasoning, programming, and agent construction.
1. Core Features
1.1 Pre‑training upgrades
Capability – Trained on a larger visual-text corpus with higher proportions of Chinese, multilingual, STEM, and reasoning data under stricter filtering; base performance matches the 1-trillion-parameter Qwen3-Max-Base.
Efficiency – Built on the Qwen3-Next architecture, it introduces higher-sparsity MoE, Gated DeltaNet + Gated Attention hybrid attention, and multi-token prediction, achieving 8.6×/19.0× decoding throughput over Qwen3-Max at 32K/256K context lengths and 3.5×/7.2× over Qwen3-235B-A22B; a minimal sketch of the top-k MoE routing idea follows this list.
Versatility – Early text‑visual fusion and expanded visual/STEM/video data give native multimodal abilities; language support grows from 119 to 201 languages/dialects, and the vocabulary expands from 150 k to 250 k tokens, yielding 10%‑60% encoding/decoding efficiency gains.
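The high-sparsity MoE mentioned under Efficiency comes down to routing each token to only a handful of experts, so compute grows with the number of activated experts rather than the total parameter count. Below is a minimal, self-contained PyTorch sketch of top-k expert routing; the layer sizes, expert count, and top-k value are illustrative assumptions, not Qwen3.5's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Minimal top-k mixture-of-experts layer (illustrative sizes only)."""

    def __init__(self, d_model=1024, d_ff=4096, num_experts=64, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                      # x: (tokens, d_model)
        logits = self.router(x)                # (tokens, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # normalize over the selected experts only
        out = torch.zeros_like(x)
        # Only the selected experts run, so compute scales with top_k, not num_experts.
        for slot in range(self.top_k):
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                w = weights[mask, slot].unsqueeze(-1)
                out[mask] += w * self.experts[int(e)](x[mask])
        return out

tokens = torch.randn(8, 1024)
print(SparseMoELayer()(tokens).shape)  # torch.Size([8, 1024])
```

With 64 experts and top-2 routing, only about 1/32 of the expert parameters participate in any one forward pass; this is the same mechanism that lets a ~400B-parameter model activate only ~17B parameters per token.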
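The Versatility point above credits the 150 k to 250 k vocabulary expansion with 10%-60% encoding efficiency gains; a larger vocabulary simply needs fewer tokens for the same text, especially outside English. The snippet below is a hedged way to measure that yourself: the first repo is an existing Qwen3 checkpoint used as a baseline, the second repo name is an assumption, and the measured ratio will vary with the text.

```python
from transformers import AutoTokenizer

# The second repo ID is hypothetical; substitute whichever checkpoints you actually have access to.
OLD_REPO = "Qwen/Qwen3-235B-A22B"      # ~150K-token vocabulary baseline
NEW_REPO = "Qwen/Qwen3.5-397B-A17B"    # ~250K-token vocabulary (assumed repo name)

sample = "多语言混合文本: Grüße, здравствуйте, مرحبا, plus some English and 中文。"

for repo in (OLD_REPO, NEW_REPO):
    tok = AutoTokenizer.from_pretrained(repo)
    n = len(tok(sample)["input_ids"])
    print(f"{repo}: vocab={tok.vocab_size}, tokens for sample={n}")
```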
1.2 Heterogeneous infrastructure
Decoupled parallelism – Visual and language components use separate parallel strategies rather than a single unified scheme, keeping training throughput on mixed text-image-video data close to that of a pure-text baseline.
Native FP8 pipeline – Activations, MoE routing, and GEMMs run in FP8 precision while numerically sensitive layers stay in BF16, cutting activation memory by ~50% and adding a >10% speedup; the design scales to trillions of training tokens (a conceptual quantization sketch follows this list).
Scalable asynchronous RL framework – A train-inference-separated asynchronous RL system covers text, multimodal, and multi-turn interaction. It provides dynamic load balancing, fine-grained fault recovery, and techniques such as FP8 training and inference, rollout routing, speculative sampling, and multi-round rollout locking, delivering 3×–5× end-to-end acceleration and more stable training.
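The native FP8 pipeline above keeps most activations and matrix multiplications in 8-bit floating point while sensitive layers stay in BF16. The following is only a conceptual sketch of per-tensor FP8 quantization using PyTorch's float8 dtypes (PyTorch 2.1+); it simulates the GEMM in BF16 so it runs on any hardware, whereas real FP8 kernels (torch._scaled_mm, Transformer Engine) multiply in FP8 directly. None of this is the team's actual pipeline.

```python
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # ~448 for the e4m3 format

def fp8_quantize(x: torch.Tensor):
    """Per-tensor scaling into float8_e4m3fn; returns the FP8 tensor and its scale."""
    scale = x.abs().max().clamp(min=1e-12) / FP8_MAX
    return (x / scale).to(torch.float8_e4m3fn), scale

def fp8_linear(x_bf16: torch.Tensor, w_bf16: torch.Tensor) -> torch.Tensor:
    """Simulated FP8 GEMM: quantize inputs and weights, matmul, return BF16."""
    xq, sx = fp8_quantize(x_bf16)
    wq, sw = fp8_quantize(w_bf16)
    # Dequantize before the matmul so the example runs anywhere; a real FP8 kernel
    # would perform the multiplication in FP8 and apply the scales afterwards.
    return (xq.to(torch.bfloat16) * sx) @ (wq.to(torch.bfloat16) * sw).T

x = torch.randn(4, 1024, dtype=torch.bfloat16)
w = torch.randn(2048, 1024, dtype=torch.bfloat16)   # "sensitive" layers would skip quantization
print(fp8_linear(x, w).shape)  # torch.Size([4, 2048])
```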
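The train-inference-separated RL design can be pictured as rollout workers that generate trajectories asynchronously while the trainer consumes them from a queue, so neither side blocks on the other. The asyncio sketch below illustrates only that producer/consumer split; every name and the toy rollout are assumptions, and the real framework layers load balancing, fault recovery, and FP8 rollout inference on top.

```python
import asyncio
import random

async def rollout_worker(worker_id: int, queue: asyncio.Queue):
    """Inference side: keeps generating trajectories without waiting for the trainer."""
    while True:
        await asyncio.sleep(random.uniform(0.05, 0.2))   # stand-in for model generation
        await queue.put({"worker": worker_id, "reward": random.random()})

async def trainer(queue: asyncio.Queue, steps: int = 10, batch_size: int = 4):
    """Training side: pulls whatever rollouts are ready and updates the policy."""
    for step in range(steps):
        batch = [await queue.get() for _ in range(batch_size)]
        avg_reward = sum(item["reward"] for item in batch) / batch_size
        print(f"step {step}: avg reward {avg_reward:.3f}")  # policy update would go here

async def main():
    queue: asyncio.Queue = asyncio.Queue(maxsize=64)
    workers = [asyncio.create_task(rollout_worker(i, queue)) for i in range(8)]
    await trainer(queue)
    for w in workers:
        w.cancel()

asyncio.run(main())
```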
2. Architecture and Deployment
Parameter scale – Approximately 400 billion total parameters with an extreme sparse MoE; only 17 billion parameters are activated per inference, dramatically reducing compute cost.
Technical innovations – Introduces an attention gating mechanism (NeurIPS 2025 Best Paper) that lowers cost and boosts efficiency; compared with the commercial Qwen3‑Max‑Thinking (≈1 trillion parameters), Qwen3.5‑397B‑A17B reduces deployment memory by 60% and increases maximum inference throughput by 19×.
Hardware requirements – Full operation needs at least an 8‑GPU A100 (80 GB) cluster, though the 17 B active‑parameter design yields high inference efficiency.
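A minimal way to bring the model up on such a node is to let transformers shard the BF16 weights across all visible GPUs. This is a hedged sketch, assuming the weights are published under the repo name below and load through the standard causal-LM classes; for production serving, a dedicated engine such as vLLM or SGLang would normally replace raw transformers.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen3.5-397B-A17B"   # assumed Hugging Face repo name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,   # BF16 weights; only ~17B parameters activate per token
    device_map="auto",            # shard across the eight 80 GB GPUs automatically
)

messages = [{"role": "user", "content": "Summarize the architecture of this model."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```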
2.1 Inference mode and context
Hybrid reasoning mode – Qwen3.5 defaults to thinking (reasoning) mode and does not support a <no_thinking> prefix; switching to plain chat mode requires replacing the built-in chat template in tokenizer_config.json (a template-override sketch follows this subsection).
Extended context – Default context length is 256 K tokens, configurable up to 1 M tokens, enabling processing of roughly two hours of video in a single pass.
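Since the release described here exposes no runtime switch for thinking mode, the practical route is to replace the chat template that ships in tokenizer_config.json. The snippet below shows only the mechanics of inspecting and overriding that template through transformers; the repo name and the substitute template file are assumptions, and the template contents themselves must come from the model card or your own testing.

```python
from transformers import AutoTokenizer

MODEL_ID = "Qwen/Qwen3.5-397B-A17B"          # assumed repo name
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Inspect the built-in template (stored in tokenizer_config.json as `chat_template`).
print(tokenizer.chat_template[:500])

# Override it at runtime with a non-thinking template of your own
# (hypothetical file; its contents must match the model's expected prompt format).
with open("chat_template_no_thinking.jinja", encoding="utf-8") as f:
    tokenizer.chat_template = f.read()

messages = [{"role": "user", "content": "Reply briefly, without a reasoning trace."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```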
2.2 Performance highlights
All‑round capability – Matches GPT‑5.2, Gemini 3.0 Pro, and Claude Opus 4.5 on dialogue, coding, visual recognition, and agent building; coding performance is within ~10% of the latest GPT‑5.3‑CodeX and Claude Opus 4.6.
Native multimodal advantage – Processes text, image, and video in a unified semantic space, directly handling mixed‑layout PDFs without a separate RAG pipeline; visual reasoning is strong enough to accurately interpret complex agent architecture diagrams and even generate equivalent code.
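The easiest way to exercise the image understanding described above is through an OpenAI-compatible endpoint, sending an image and a question in the same message. The example below is a hedged sketch: the base URL points at Alibaba Cloud's compatible-mode endpoint and the model name is an assumption; both should be replaced with the values from the provider's documentation.

```python
import base64
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",  # assumed endpoint
)

with open("agent_architecture.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="qwen3.5-plus",   # assumed model name
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text",
             "text": "Explain this agent architecture diagram and sketch equivalent code."},
        ],
    }],
)
print(response.choices[0].message.content)
```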
2.3 Open‑source and API
Model download – Fully open‑sourced; weights are available on ModelScope and Hugging Face.
API services – Alibaba Cloud's Bailian (Model Studio) platform offers two variants: the pure open-source model (Qwen3.5) and Qwen3.5-Plus, which adds built-in tools such as web search to form a general-purpose agent.
Pricing – Input costs 0.8 CNY per million tokens and output 4.8 CNY per million tokens, roughly 1/18 the cost of a comparable Gemini 3.0 Pro service.
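To make the quoted rates concrete, the cost of a single request works out as below; the token counts in the example are arbitrary assumptions.

```python
INPUT_CNY_PER_M = 0.8    # CNY per million input tokens
OUTPUT_CNY_PER_M = 4.8   # CNY per million output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in CNY for one request at the quoted Qwen3.5 API rates."""
    return (input_tokens * INPUT_CNY_PER_M + output_tokens * OUTPUT_CNY_PER_M) / 1_000_000

# Example: a 20K-token document plus a 2K-token answer costs about 0.026 CNY.
print(f"{request_cost(20_000, 2_000):.4f} CNY")
```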
3. Conclusion
Qwen3.5-397B-A17B opens the spring 2026 round of China's multimodal model race, putting domestic models alongside the leading global systems and laying the groundwork for future multimodal agents.