Artificial Intelligence 11 min read

Why Qwen3.7-Max Is Sending Overseas Developers Into a Frenzy

Qwen3.7-Max demonstrates product‑level long‑task autonomy with 35 hours of uninterrupted operation, 1,158 tool calls, and kernel‑level optimizations, while outperforming Gemini 3.5‑Flash, Claude Opus, and GPT‑5.5 across a wide range of benchmarks, cost‑effectiveness, and real‑world agent scenarios.

SuanNi

May 22, 2026

Why Qwen3.7-Max Is Sending Overseas Developers Into a Frenzy

Product‑Level Long‑Task Autonomy

Qwen3.7‑Max ran for 35 hours continuously, performed 1,158 tool calls, and completed kernel optimizations without disconnection.

Community Benchmarks and Cost Comparison

Developers compared Qwen3.7‑Max with Gemini 3.5‑Flash and reported disappointment with Google. Independent experiments showed Qwen3.7‑Max outperforming Claude Opus 4.7 and GPT‑5.5, costing nine times less than Claude and two times less than GPT.

Comprehensive Benchmark Results

Programming Agent: Terminal Bench 2.0 score 69.7 (vs. DS‑V4‑Pro Max 67.9); SWE‑Pro 60.6 (top of the field); SWE‑Multilingual 78.3; SciCode 53.5; QwenSVG 1608.

General Agent: MCP‑Mark 60.8 (vs. GLM‑5.1 57.5); MCP‑Atlas 76.4 (vs. Opus‑4.6 75.8); Skillsbench 59.2 (vs. K2.6 56.2); Kernel Bench L3 1.98× median speedup with 96 % scenario success.

Reasoning: GPQA Diamond 92.4 (vs. Opus‑4.6 91.3); HLE 41.4 (lead); HMMT Feb 2026 97.1; IMOAnswerBench 90.0; Apex 44.5 (top scores).

Multilingual & General Ability: IFBench 79.1 (lead); WMT24++ 85.8; MAXIFE 89.2; PolyMATH 86.5; MRCR‑v2 128k 90.4 (vs. Qwen3.6‑Plus 85.9).

These results span multiple agent frameworks such as Claude Code, OpenClaw, and Qwen Code.

Environment Scaling and Generalization

Building on Qwen3.5’s environment scaling, Qwen3.7‑Max expands training environments. The rollout infrastructure decomposes each training instance into orthogonal Task, Harness, and Verifier components, enabling free recombination and forcing the model to learn universal problem‑solving strategies. Performance on QwenClawBench and CoWorkBench remains stable across different harnesses, and the average ranking steadily rises toward Claude‑4.6‑Opus‑Max.

35‑Hour Autonomous Kernel Optimization

Experiment: optimize the Extend Attention kernel from SGLang on an unseen T‑Head ZW‑M890 PPU ECS instance. Starting from a task description, the original Triton implementation, and an evaluation script, the model executed 1,158 tool calls, evaluated 432 kernel versions, wrote code, compiled, ran, analyzed bottlenecks, refactored, and fixed bugs without human intervention. After ~30 hours it continued to find improvements, achieving a geometric‑mean speedup of 10× over the original Triton code. Competing models: GLM 5.1 (7.3×), Kimi K2.6 (5.0×), DeepSeek V4 Pro (3.3×), Qwen3.6‑Plus (1.1×).

On NVIDIA GPUs, KernelBench L3 produced accelerated kernels in 96 % of scenarios (Opus‑4.6 98 %, GLM 5.1 78 %, K2.6 80 %, DS‑V4‑Pro 54 %).

Long‑Chain Planning and Self‑Evolution

RL monitoring experiment for software‑engineering tasks ran >80 hours. Qwen3.7‑Max autonomously retrieved and replayed training trajectories, executed >10,000 tool calls, identified cheating patterns (e.g., fetching answers from GitHub), added 13 heuristic rules, and flagged 1,618 cheating cases, stabilizing reward signals.

YC‑Bench entrepreneurship simulation: the model generated $2.08 M revenue (double Qwen3.6‑Plus, ~6× Qwen3.5‑Plus) across 237 tasks, autonomously scouting customers, blocking malicious actors, maintaining profit under rising labor costs, and preserving consistent strategy without context drift.

Real‑World Applications

Integration of Model Context Protocol (MCP) enables autonomous academic paper formatting (layout, headings, fonts, margins, tables of contents, references).

Front‑end development: a single prompt can generate Three.js 3D scenes, Canvas animations, full page layouts, and dynamic SVGs (e.g., hand‑gesture‑controlled particle systems).

Game development: one sentence can produce a complete 3D racing game.

Robot control: using Qwen‑RobotClaw framework and Qwen‑RobotNav navigation model, the agent performs perception, planning, memory, and decision‑making for robot dogs in physical environments.

Deployment and API Compatibility

Qwen3.7‑Max is available on Alibaba Cloud Model Studio, supporting OpenAI and Anthropic API protocols, and can be integrated with Claude Code, OpenClaw, and Qwen Code. It provides a preserve_thinking capability that retains prior reasoning across multi‑turn agent tasks.

Reference: https://qwen.ai/blog?id=qwen3.7

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

AI agent large language model benchmark kernel optimization Qwen3.7-Max environment scaling

Written by

SuanNi

A community for AI developers that aggregates large-model development services, models, and compute power.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.