Why Qwen3.5-Plus Sets a New Standard for Open-Source Multimodal AI

Qwen3.5-Plus, Alibaba’s newly open-sourced multimodal LLM, pairs 397 B total parameters with only 17 B active per token, leveraging native multimodal training, gated attention, sparse MoE, and FP8 precision to outperform GPT-5.2 and Gemini-3-Pro across vision, reasoning, and agent benchmarks.
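That headline parameter split, hundreds of billions total but only a small fraction active per token, is the signature of sparse mixture-of-experts routing: every token is sent to only a handful of expert feed-forward networks, so compute scales with the active count rather than the total. Below is a minimal, hypothetical PyTorch sketch of top-k expert routing; it is not Qwen3.5-Plus’s actual implementation, and every size (d_model, n_experts, top_k) is illustrative.

```python
# Hypothetical top-k MoE routing sketch (sizes illustrative, not Qwen3.5-Plus's).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Each token is processed by only top_k of n_experts feed-forward experts."""

    def __init__(self, d_model=1024, d_ff=4096, n_experts=64, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                            # x: (n_tokens, d_model)
        scores = self.router(x)                      # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, -1)   # choose top_k experts per token
        weights = F.softmax(weights, dim=-1)         # normalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):               # only the selected experts ever run
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[int(e)](x[mask])
        return out

layer = SparseMoELayer()
print(layer(torch.randn(8, 1024)).shape)             # torch.Size([8, 1024])
```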


01 Native Multimodal, Not Simple Stitching

Traditional multimodal models often bolt a separately trained vision encoder onto a pretrained language model through a lightweight adapter: the visual module is trained on image‑text pairs, and its visual tokens are then fed into a frozen language backbone. This approach limits the model’s ability to reason about spatial relationships or temporal dynamics because the language backbone’s knowledge is already fixed before visual information is introduced.
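For concreteness, here is a minimal, hypothetical PyTorch sketch of that stitched pattern: a stand-in vision encoder whose patch tokens are projected by a small adapter into the embedding space of a frozen language model, with the adapter as the only trainable piece. Module names, the toy encoders, and all sizes are assumptions for illustration, not any specific model’s code.

```python
# Hypothetical "stitched" vision-language model: frozen backbones, trainable adapter only.
import torch
import torch.nn as nn

class ToyVisionEncoder(nn.Module):
    """Stand-in for a pretrained vision encoder: image -> patch tokens."""
    def __init__(self, dim=768, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
    def forward(self, x):                                 # x: (B, 3, H, W)
        return self.proj(x).flatten(2).transpose(1, 2)    # (B, n_patches, dim)

class StitchedVLM(nn.Module):
    """Adapter-only training: both backbones stay frozen."""
    def __init__(self, vision_encoder, language_model, vis_dim=768, txt_dim=1024):
        super().__init__()
        self.vision_encoder = vision_encoder
        self.language_model = language_model
        self.adapter = nn.Linear(vis_dim, txt_dim)         # the only trainable piece
        for p in self.vision_encoder.parameters():
            p.requires_grad = False                        # vision module trained separately, then frozen
        for p in self.language_model.parameters():
            p.requires_grad = False                        # language knowledge fixed before pixels arrive

    def forward(self, image, text_embeds):
        vis_tokens = self.adapter(self.vision_encoder(image))   # (B, n_patches, txt_dim)
        fused = torch.cat([vis_tokens, text_embeds], dim=1)     # visual tokens prepended to text
        return self.language_model(fused)

toy_lm = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 1024))
model = StitchedVLM(ToyVisionEncoder(), toy_lm)
out = model(torch.randn(2, 3, 224, 224), torch.randn(2, 10, 1024))
print(out.shape)   # torch.Size([2, 206, 1024]): 196 visual tokens + 10 text tokens
```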

Qwen3.5-Plus breaks this pattern by training from scratch on a massive mixture of text, image, and video tokens that are randomly interleaved. The model learns both linguistic syntax and pixel‑level logical relations simultaneously, enabling it to understand a kitchen photo directly without first converting objects into textual concepts.
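As a rough illustration of what “randomly interleaved” pretraining data can look like, the sketch below packs segments of text, image, and video tokens into a single stream and trains one backbone to predict every next token regardless of modality. The vocabulary layout, segment lengths, and toy backbone are assumptions made for this example, not details from the Qwen3.5-Plus report.

```python
# Hypothetical interleaved multimodal pretraining stream (all details illustrative).
import random
import torch
import torch.nn as nn

VOCAB = 1000
# Illustrative vocabulary layout: disjoint id ranges per modality. In a real
# pipeline, image/video ids would come from a visual tokenizer, not random draws.
RANGES = {"text": (0, 600), "image": (600, 850), "video": (850, 1000)}

def sample_segment(modality, max_len=12):
    lo, hi = RANGES[modality]
    return [random.randrange(lo, hi) for _ in range(random.randint(4, max_len))]

def interleaved_sequence(n_segments=8):
    """Randomly interleave text/image/video segments into one token stream."""
    ids = []
    for _ in range(n_segments):
        ids += sample_segment(random.choice(list(RANGES)))
    return ids

backbone = nn.Sequential(nn.Embedding(VOCAB, 256), nn.Linear(256, VOCAB))  # toy stand-in
seq = torch.tensor(interleaved_sequence()).unsqueeze(0)       # (1, T) mixed-modality ids
logits = backbone(seq[:, :-1])                                # next-token prediction over the whole stream
loss = nn.functional.cross_entropy(logits.reshape(-1, VOCAB), seq[:, 1:].reshape(-1))
print(seq.shape, loss.item())
```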

Figure: Qwen3.5-Plus performance on MMLU‑Pro, BFCL‑V4, and other benchmarks.