Qwen3.7-Plus: Deep Reasoning, Visual Understanding, and End‑to‑End Multimodal Execution

Qwen3.7-Plus is a multimodal large model that unifies vision and language. It ranks in the global top 5 on Vision Arena, performs strongly across pure‑text, visual‑reasoning, and video benchmarks, and powers autonomous agents that perceive screens, generate code, and complete complex GUI and CLI workflows end‑to‑end.

Alibaba Cloud Developer

Jun 3, 2026

Overview

Qwen3.7-Plus is the latest multimodal model from Alibaba, extending the strong textual abilities of Qwen3.7 with a comprehensive upgrade of visual‑language capabilities while retaining full agent functions for coding, tool use, and productivity workflows.

Core Multimodal Agent Capabilities

Multimodal Agent: Handles images, video, screen captures, web pages, and text, and can act in GUI, CLI, or tool environments.

Visual Agent: Combines visual understanding, a code interpreter, and search‑enhancement to solve visual puzzles, real‑world Q&A, and complex reasoning tasks.

Visual Coding: Generates SVG, web pages, and interactive front‑end code directly from visual references.

GUI Agent: Understands mobile and desktop interfaces, performs widget localization, task planning, and multi‑step operations.

Real‑world Perception & Reasoning: Covers documents, charts, OCR, video, and driving‑scene understanding.

Benchmark Performance

Pure‑text benchmarks – Qwen3.7‑Plus approaches Max‑level models on standard text tests and shows strong results on Terminal Bench 2.0, the SWE‑bench suite, and SciCode, indicating robust software‑engineering and scientific‑programming abilities.

General Agent benchmarks – It achieves stable tool‑use and planning performance on MCP‑Mark, Deep‑Planning, and Kernel Bench L3, especially excelling in multi‑step planning and GPU‑kernel optimization.

Reasoning benchmarks – GPQA Diamond, HMMT, and IMOAnswerBench place the model among the leading Plus‑level systems on high‑difficulty STEM tests.

Instruction‑following & multilingual tasks – IFBench, WMT24++, and PolyMATH show consistent high‑quality results across many languages and domains.

Multimodal benchmarks – On Vision Arena the model ranks in the global top‑5 and first in China. It also leads on visual reasoning (BabyVision, MathVision, HiPhO, ERQA, VisFactor), screen understanding (ScreenSpot Pro, OSWorld‑Verified, AndroidWorld), visual‑to‑code (QwenVision2Code), video (VideoMMMU, MLVU, TVBench, LVBench), and driving scenes (LingoQA, Ego3D‑Bench, SURDS, VLADBench).

Key Capability Improvements

In Multimodal Reasoning, Qwen3.7‑Plus surpasses Qwen3.6‑Plus on BabyVision, demonstrating stronger early‑visual‑cognition and spatial‑reasoning generalization.

In Visual Agent & Coding, it markedly improves on ScreenSpot Pro, OSWorld‑Verified, and AndroidWorld, enabling accurate UI element localization, intent understanding, and multi‑step interaction.
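The article does not say how UI element localization is scored. A common convention for grounding tasks is to count a prediction as correct when the predicted click point falls inside the target widget's bounding box. The sketch below illustrates that convention only; the `Box` type, the `grounding_accuracy` helper, and the sample coordinates are all hypothetical and are not taken from any benchmark's official harness.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # Edges are treated as inside the box.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def grounding_accuracy(predictions, targets):
    """Fraction of predicted click points that land inside their target box."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(box.contains(x, y) for (x, y), box in zip(predictions, targets))
    return hits / len(targets)


# Made-up data: three predicted clicks scored against three target widgets.
targets = [Box(10, 10, 110, 40), Box(200, 300, 260, 330), Box(0, 0, 24, 24)]
predictions = [(60, 25), (150, 310), (12, 12)]
print(grounding_accuracy(predictions, targets))
```

In this made-up sample the second click misses its widget horizontally, so two of the three predictions count as hits.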

In Multimodal Search & Knowledge QA, performance gains on SimpleVQA, WorldVQA, MMSearchPlus, BC‑VL, and MMBC show effective fusion of visual input with external knowledge retrieval.

In General Visual Understanding, strong results on RealWorldQA, CountQA, OmniDocBench, CharXiv, and OCR‑Bench‑V2 confirm stable handling of real‑world images, documents, charts, and OCR tasks.

Case Studies

Hybrid‑Agent full‑stack app development – Using Qwen3.7‑Plus, a hybrid agent autonomously built an English‑vocabulary learning app from requirement analysis to version iteration. The agent ran continuously for 11+ hours, generated >10,000 lines of code, invoked the model >1,000 times, and covered the entire software‑development lifecycle (spec generation, code writing, deployment, testing, UI automation, multi‑scenario testing, documentation updates, and version upgrades).

Desktop Stocks app replication – The agent recreated the native macOS Stocks application end‑to‑end: it perceived the UI, generated SwiftUI source code, integrated a real‑time market‑data API, compiled and launched the app, and passed ten functional verification tests (price loading, stock switching, multi‑period view, search/filter, detailed panels, etc.).

Browser intelligent assistant – Integrated into the Qwen for Chrome extension, the browser agent perceives current web pages, plans tasks, and executes clicks, inputs, navigation, and verification. In a demonstration, it purchased the cheapest ECS server from a cloud console, handled price changes and stock limits, then performed instance scaling and maintenance operations without user intervention.
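The perceive, plan, execute, and verify cycle described above can be sketched as a simple control loop. This is a toy illustration under stated assumptions, not the extension's implementation: `ToyPage`, `Action`, `plan_next`, and `run_agent` are hypothetical names, the planner is a hard-coded rule standing in for a model call, and the form fields and values are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    kind: str          # e.g. "click" or "type"
    target: str
    value: str = ""


@dataclass
class ToyPage:
    """Stand-in for a browser page; a real agent would read a screenshot or DOM."""
    fields: dict = field(default_factory=dict)
    submitted: bool = False

    def observe(self) -> dict:
        return {"fields": dict(self.fields), "submitted": self.submitted}

    def apply(self, action: Action) -> None:
        if action.kind == "type":
            self.fields[action.target] = action.value
        elif action.kind == "click" and action.target == "submit":
            self.submitted = True


def plan_next(observation: dict, goal: dict):
    """Hypothetical planner: in a real system the model would choose this step."""
    for name, value in goal.items():
        if observation["fields"].get(name) != value:
            return Action("type", name, value)
    if not observation["submitted"]:
        return Action("click", "submit")
    return None  # nothing left to do


def run_agent(page: ToyPage, goal: dict, max_steps: int = 10) -> list:
    """Perceive, plan, act; verification happens by re-observing each iteration."""
    history = []
    for _ in range(max_steps):
        observation = page.observe()           # perceive
        action = plan_next(observation, goal)  # plan
        if action is None:                     # verify: goal state reached
            break
        page.apply(action)                     # act
        history.append(action)
    return history


page = ToyPage()
steps = run_agent(page, {"region": "eu-west", "instance": "small"})
print([f"{a.kind}:{a.target}" for a in steps], page.submitted)
```

Re-observing the page before every step, rather than executing a fixed script, is what would let a loop like this react to changes such as the price and stock updates mentioned in the demonstration. The `max_steps` cap is a simple guard against a planner that never converges.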

Visual programming demos – The model solved “find‑the‑differences”, jigsaw‑puzzle, and maze tasks by converting visual problems into executable code, performed search‑enhanced visual QA, generated SVG from images/videos, and created full interactive web pages from design references.
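To make "converting visual problems into executable code" concrete, here is a minimal sketch of the kind of program a maze task could reduce to once the image has been read into a grid. It assumes the perception step has already produced a text grid in which `#` marks a wall; the `solve_maze` function and the sample maze are illustrative and do not come from the demos themselves. The search is a standard breadth-first search, which returns a shortest path on an unweighted grid.

```python
from collections import deque


def solve_maze(grid, start, goal):
    """Return a shortest path from start to goal as a list of (row, col), or None.

    grid is a list of equal-length strings; '#' is a wall, anything else is open.
    """
    rows, cols = len(grid), len(grid[0])
    parents = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parents[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in parents:
                parents[(nr, nc)] = cell
                queue.append((nr, nc))
    return None  # goal is unreachable


# Made-up maze: S is the start, G is the goal.
maze = [
    "S.#.",
    ".##.",
    "...G",
]
path = solve_maze(maze, (0, 0), (2, 3))
print(path)
```

Handing the search to code like this means the answer can be checked mechanically, since every step in the returned path is an open cell adjacent to the previous one.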

Conclusion

Qwen3.7‑Plus is the most capable multimodal agent model from Alibaba, unifying visual perception, language reasoning, and autonomous execution across GUI and CLI environments. It generalizes across agent frameworks (Claude Code, OpenClaw, Qwen Code, etc.), which is intended to keep performance stable regardless of deployment stack, and community contributions are invited to further expand multimodal applications.


