How Alibaba’s New Qwen3.5‑Omni, Wan2.7‑Image, and Qwen3.6‑Plus Redefine Multimodal AI

Alibaba unveiled three cutting‑edge models—Qwen3.5‑Omni with native multimodal interaction, Wan2.7‑Image for high‑precision image generation and editing, and Qwen3.6‑Plus for stronger coding agents—each delivering state‑of‑the‑art (SOTA) benchmark results, massive context windows, and novel capabilities such as Audio‑Visual Vibe Coding and transparent layer separation.

SuanNi

Overview

In a three‑day rollout, Alibaba released three heavyweight AI models: Qwen3.5‑Omni, Wan2.7‑Image, and Qwen3.6‑Plus. The announcement highlights their architectural upgrades, multimodal capabilities, and benchmark‑level performance across text, image, audio, and video tasks.

Qwen3.5‑Omni: Native Multimodal Interaction

Qwen3.5‑Omni features a comprehensive architecture upgrade that enables seamless understanding of text, images, audio, and video, as well as generation of timestamped subtitles. The research team discovered an emergent ability called Audio‑Visual Vibe Coding, allowing the model to generate Python code or front‑end prototypes directly from visual logic and spoken commands.

The model separates the roles of a “thinker” (understanding) and an “expressor” (generation). Both components are implemented as mixture‑of‑experts (MoE) specialists for audio, video, and text, ensuring each modality operates without interference and retains the strength of single‑modal experts.
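The thinker/expressor split described above can be pictured as a router that hands each modality to its own understanding expert before a shared generation stage produces the response. The sketch below is purely illustrative: the class and function names are assumptions for this article, not Qwen internals.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Toy sketch of the thinker/expressor separation: each modality gets its
# own "thinker" (understanding expert), and a single "expressor" handles
# generation. Names and structure are assumptions, not the real model.

@dataclass
class OmniPipeline:
    thinkers: Dict[str, Callable[[str], str]]   # per-modality understanding experts
    expressor: Callable[[str], str]             # shared generation stage

    def run(self, modality: str, payload: str) -> str:
        if modality not in self.thinkers:
            raise ValueError(f"unsupported modality: {modality}")
        understood = self.thinkers[modality](payload)  # route to the matching expert
        return self.expressor(understood)              # generate from the shared head

pipeline = OmniPipeline(
    thinkers={
        "text": lambda x: f"[text-understood]{x}",
        "audio": lambda x: f"[audio-understood]{x}",
        "video": lambda x: f"[video-understood]{x}",
    },
    expressor=lambda x: f"response({x})",
)

print(pipeline.run("audio", "hello"))  # response([audio-understood]hello)
```

Because each modality only ever touches its own thinker, the experts cannot interfere with one another, which is the property the architecture is said to preserve.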

Key specifications include a 256K‑token context window, support for 113 languages, semantic interruption, voice cloning, voice control, native web search, and complex function calling. Across audio and video analysis, reasoning, dialogue, and translation, Qwen3.5‑Omni achieved 215 industry‑leading (SOTA) results, surpassing Gemini‑3.1 Pro in overall multimodal understanding while matching Qwen3.5’s text performance.
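To make the function‑calling capability concrete, the sketch below builds a request in the OpenAI‑compatible chat format. The model identifier and the `get_weather` tool are illustrative assumptions for this article, not a documented endpoint.

```python
import json

# Illustrative function-calling request in the widely used
# OpenAI-compatible chat-completions shape. The model name and the
# get_weather tool definition are assumptions, not a published API.

request = {
    "model": "qwen3.5-omni",  # hypothetical model identifier
    "messages": [
        {"role": "user", "content": "What's the weather in Hangzhou?"}
    ],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",
                "description": "Look up current weather for a city",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
}

print(json.dumps(request, indent=2))
```

A model with reliable function calling would respond to such a request with a structured tool call (here, `get_weather` with `city="Hangzhou"`) rather than free‑form text.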

Wan2.7‑Image: Precision Image Generation and Editing

Wan2.7‑Image pushes image generation precision to a new level, moving beyond the “standard AI face” to enable fully customized portraits. The model supports up to nine reference images per generation, ensuring consistent character features across complex scenes.

It can produce up to twelve storyboard images in a unified style in a single request, and offers fine‑grained color control plus stable, legible typography even when rendering up to 4,000 characters of text within an image.
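The limits stated above (nine reference images, twelve storyboard frames) can be captured in a small request builder. This is a sketch only: the field names (`reference_images`, `storyboard_frames`) and the model identifier are assumptions, not a published Wan API.

```python
# Illustrative request shape for multi-reference image generation, based
# on the limits described above. All field names are assumptions made
# for this sketch, not a documented API.

MAX_REFERENCE_IMAGES = 9    # per-generation reference-image limit
MAX_STORYBOARD_FRAMES = 12  # unified-style frames per request

def build_request(prompt: str, reference_images: list, frames: int = 1) -> dict:
    """Validate the stated limits and assemble a request payload."""
    if len(reference_images) > MAX_REFERENCE_IMAGES:
        raise ValueError("at most 9 reference images per generation")
    if not 1 <= frames <= MAX_STORYBOARD_FRAMES:
        raise ValueError("frames must be between 1 and 12")
    return {
        "model": "wan2.7-image",  # hypothetical identifier
        "prompt": prompt,
        "reference_images": reference_images,
        "storyboard_frames": frames,
    }

req = build_request("a consistent hero across three scenes",
                    ["ref1.png", "ref2.png"], frames=3)
print(req["storyboard_frames"])  # 3
```

Validating limits client‑side like this keeps malformed requests from ever reaching the generation service.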

Local editing is possible through a “point‑and‑click” interface, and a transparent‑channel smart layer separation feature simplifies downstream image processing. An example showcases the generation of the first 40 chapters of the Dao De Jing as stylized calligraphy.

Qwen3.6‑Plus: Next‑Level Coding Agent

Qwen3.6‑Plus raises the performance of coding agents, achieving open‑source SOTA results in front‑end page generation, code repair, and terminal automation benchmarks. The model’s end‑to‑end success rate on code tasks has noticeably improved, and its tool‑calling and code‑generation reliability are higher than previous versions.

It offers a massive context window of up to one million tokens and stronger multimodal perception and visual understanding, enabling seamless transitions from natural speech to code, precise image output, and handling of extensive programming contexts.

Open‑Source Outlook

Although the three flagship models are not open‑source, Alibaba announced plans to release smaller‑scale versions of each model to the community.

References

https://qwen.ai/blog?id=qwen3.5-omni

https://www.alibabacloud.com/en/press-room/alibaba-unveils-wan2-7-redefining-personalized-and?_p_lc=1

https://qwen.ai/blog?id=qwen3.6

AI · Large Language Model · multimodal · coding agent
Written by

SuanNi

A community for AI developers that aggregates large-model development services, models, and compute power.
