Qwen3.7-Max Sets New Agent Benchmarks – China’s New Model King

Alibaba’s Qwen3.7‑Max ranks first among Chinese models on multiple Arena leaderboards and posts SOTA scores on programming, reasoning, and multilingual benchmarks. In a 35‑hour autonomous coding task on a custom AI chip it delivered a 10× inference speedup for an attention kernel, and it demonstrated end‑to‑end desktop app creation and web‑search agents, illustrating Alibaba’s rapid monthly model‑iteration strategy.

Machine Heart

May 20, 2026

Benchmark Highlights

Qwen3.7‑Max and Qwen3.7‑Plus Preview were released as part of the Qwen3.7 Preview series. Arena’s latest blind‑test leaderboard placed Qwen3.7‑Max first among Chinese models in both text and vision domains, surpassing competitors such as Kimi‑K2.6, DeepSeek‑v4 Pro, and GLM‑5.1, and approaching the performance of GPT, Claude, and Gemini.

In programming agent evaluations, Qwen3.7‑Max achieved state‑of‑the‑art results on SWE‑Pro and SWE‑Multilingual and scored a record 69.7 on Terminal Bench 2.0‑Terminus, beating DeepSeek‑v4‑pro‑Max and Claude‑Opus 4.6. For general‑purpose agents, it excelled on MCP‑Atlas, MCP‑Mark, and Skillbench, outperforming GLM‑5.1 and Kimi‑K2.6, and demonstrated strong GPU kernel optimization on Kernel Bench L3. On reasoning benchmarks (GPQA Diamond, HLE, HMMT 2026 Feb, IMOAnswerBench), it outperformed Claude‑Opus 4.6 and all other Chinese models. Multilingual capability was highlighted by an IFBench score of 79.1 and leading results on WMT24++ and MAXIFE.

Long‑Running Agent Demonstration

At the 2026 Alibaba Cloud summit, Qwen3.7‑Max was deployed on the custom “Zhangwu M890” AI chip. In a 35‑hour autonomous coding session, the model optimized a production‑grade attention kernel, achieving a 10× inference speedup over the SGLang Triton reference implementation. During this run the model performed 432 kernel evaluations and 1,158 tool calls, handling the entire workflow—from code generation to compilation, performance analysis, and iterative improvement—without human intervention.
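
The generate, evaluate, and iterate workflow described above can be pictured as a simple keep-the-best loop. The sketch below is only an illustration under stated assumptions: `benchmark` is a toy cost model standing in for compiling and profiling a kernel, `propose` stands in for the model suggesting a new variant, and `block_size` is a hypothetical tuning parameter. None of these names come from the article or from any real toolchain.

```python
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    """A hypothetical kernel variant described by a single tunable parameter."""
    block_size: int
    latency_ms: float


def benchmark(block_size: int) -> float:
    """Stand-in for compile + profile (assumption): toy cost, lowest at 128."""
    return 1.0 + abs(block_size - 128) / 64.0


def propose(best: Candidate, rng: random.Random) -> int:
    """Stand-in for the model proposing a variant near the current best."""
    factor = rng.choice([0.5, 2.0])
    # Clamp the proposal to a plausible range of block sizes.
    return max(16, min(1024, int(best.block_size * factor)))


def optimize(start_block: int, max_evals: int, seed: int = 0) -> tuple[Candidate, int]:
    """Evaluate up to max_evals variants, keeping the fastest one seen so far."""
    rng = random.Random(seed)
    best = Candidate(start_block, benchmark(start_block))
    evals = 1  # the starting point counts as one evaluation
    while evals < max_evals:
        block = propose(best, rng)
        latency = benchmark(block)
        evals += 1
        if latency < best.latency_ms:  # accept only strict improvements
            best = Candidate(block, latency)
    return best, evals


if __name__ == "__main__":
    best, evals = optimize(start_block=16, max_evals=50)
    print(f"best block_size={best.block_size} after {evals} evaluations")
```

The evaluation budget (`max_evals`) plays the role of the fixed number of kernel evaluations reported for the run; a real agent would replace the toy cost model with actual compilation and profiling tool calls.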

End‑to‑End Desktop Application Creation

Using its native agent reasoning together with the Claude Code execution tool, Qwen3.7‑Max transformed a simple natural‑language request (“make a desktop Pomodoro timer”) into a complete Python + PyQt application. It automatically installed missing dependencies, resolved path‑related errors by generating alternative commands, and produced a functional .exe package after a single packaging command. The model also adapted the UI theme on demand with a single instruction.
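
Recovering from a failed command by trying an alternative, as described above, can be sketched as a first-success fallback over a list of candidate commands. This is a minimal illustration, not the model's actual procedure: `run_first_success` is a hypothetical helper, and the candidate commands in the usage section are placeholders rather than the commands used in the demo.

```python
import subprocess
import sys
from typing import Optional, Sequence


def run_first_success(commands: Sequence[Sequence[str]]) -> Optional[Sequence[str]]:
    """Try each command in order and return the first one that exits with 0.

    Returns None when every candidate fails.
    """
    for cmd in commands:
        try:
            result = subprocess.run(cmd, capture_output=True, text=True)
        except OSError:
            # The executable could not be started (for example, a bad path),
            # so move on to the next alternative.
            continue
        if result.returncode == 0:
            return cmd
    return None


if __name__ == "__main__":
    candidates = [
        ["definitely-not-a-real-binary-xyz"],                # fails: not found
        [sys.executable, "-c", "raise SystemExit(1)"],       # fails: exit code 1
        [sys.executable, "-c", "print('fallback worked')"],  # succeeds
    ]
    print(run_first_success(candidates))
```

Using `sys.executable` keeps the example independent of whatever `python` resolves to on the PATH, which is one common source of the path-related errors the demo mentions.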

CLI Tool Integration and Skill Scheduling

The model invoked the opencli utility to retrieve images of popular Cantonese dishes in Beijing from Xiaohongshu. It parsed the tool’s documentation, generated correct API calls, and dynamically adjusted network‑timeout settings when the request failed, ultimately downloading a full set of images and converting the results into a PPT‑style report.
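
Adjusting the network timeout after a failed request is a familiar retry pattern. The sketch below shows one way to express it, assuming a hypothetical `fetch` callable that accepts a timeout in seconds and raises `TimeoutError` when it expires. The article does not describe opencli's actual options, so nothing here reflects that tool's real interface.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_growing_timeout(
    fetch: Callable[[float], T],
    initial_timeout: float = 5.0,
    factor: float = 2.0,
    max_attempts: int = 4,
) -> T:
    """Call fetch(timeout), multiplying the timeout by factor after each timeout."""
    timeout = initial_timeout
    last_error: Optional[TimeoutError] = None
    for _ in range(max_attempts):
        try:
            return fetch(timeout)
        except TimeoutError as err:
            last_error = err
            timeout *= factor  # allow more time on the next attempt
    raise TimeoutError(f"gave up after {max_attempts} attempts") from last_error


def fake_fetch(timeout: float) -> str:
    """Hypothetical slow service: responds only when given at least 20 seconds."""
    if timeout < 20.0:
        raise TimeoutError(f"no response within {timeout}s")
    return "ok"


if __name__ == "__main__":
    # Attempts use timeouts of 5s, 10s, then 20s; the third attempt succeeds.
    print(call_with_growing_timeout(fake_fetch))
```

Capping the number of attempts keeps a persistently unreachable service from stalling the whole task.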

When asked to rewrite a travelogue containing repetitive phrasing, Qwen3.7‑Max identified the core editing intent, applied a built‑in Skill, and output a markdown table that quantified improvements in directness and factuality.

Product Strategy and Ecosystem

Alibaba’s rapid monthly release cadence (Qwen3.5‑Max Preview → Qwen3.6‑Max Preview → Qwen3.7‑Max) is powered by the vertically integrated ATH organization, which combines custom chips, elastic cloud compute, and model‑iteration pipelines. Open‑source variants such as Qwen3.6‑27B and Qwen3.6‑35B‑A3B have topped HuggingFace rankings, delivering performance that exceeds larger dense models while keeping deployment costs low.

Qwen3.7‑Max also demonstrated cross‑framework generalization, working with Claude Code, OpenClaw, and Hermes Agent without additional training, echoing the standard‑interface model of historic operating‑system ecosystems. Growing adoption of the Qwen family, evidenced by OpenRouter’s record daily usage of 1.4 trillion tokens for Qwen3.6‑Plus, underscores its strategic role as the foundational layer for the emerging Agent era.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
