Step 3.7 Flash: High‑Efficiency Pro‑Level Agent Model with 400 TPS and Low Cost
Step 3.7 Flash is a multimodal agent model with 196B total parameters, of which 11B are active. It delivers inference at 400 tokens per second (TPS), stronger code generation and cross‑framework stability, a cost‑effective Advisor Mode, and strong vision and search performance, with broad benchmark gains over its predecessor, Step 3.5, and competitive results against other models.
Model competition is shifting from raw intelligence to practical agents that are cost‑effective and reliable: users care whether a model can operate independently, stay affordable, and recover from its own failures.
Code generation and cross‑framework stability
Step 3.7 Flash focuses on coding ability, achieving 56.3% on SWE‑Bench Pro (+5 pts over Step 3.5) and 59.6% on Terminal‑Bench 2.1 (+6.1 pts). It surpasses DeepSeek V4 Flash on both benchmarks. In cross‑framework tests across six harnesses (Claude Code, KiloCode, Hermes Agent, OpenClaw, OpenCode, RooCode) the model attains an average pass rate of 67.08% versus 56.50% for Step 3.5, with notable gains of 20 pts on OpenClaw and 21.5 pts on RooCode, demonstrating balanced performance across diverse toolchains.
Advisor Mode for cost‑effective execution
Inspired by Anthropic’s advisor strategy, Advisor Mode lets the small model act as the executor while a larger “advisor” model intervenes only at critical planning or failure‑recovery points. With Advisor Mode, Step 3.7 Flash reaches 97% of Claude Opus 4.6’s coding performance at a per‑task cost of $0.19, versus $1.76 for Claude Opus 4.6, roughly a nine‑fold saving. On SWE‑Bench Verified, Step 3.7 Flash alone scores 73.7% at $0.12 per task; enabling Advisor Mode raises the score to 76.3%, against 78.7% for Claude Opus 4.6.
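The executor/advisor split described above can be sketched as a simple control loop. This is a minimal illustration under stated assumptions, not Stepfun's or Anthropic's implementation: `run_with_advisor`, `RunLog`, and the prompt strings are hypothetical names, and each "model" is represented as a plain function from prompt to text.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stand-in: a "model" is any function mapping a prompt to text.
Model = Callable[[str], str]


@dataclass
class RunLog:
    executor_calls: int = 0
    advisor_calls: int = 0
    succeeded: bool = False
    outputs: List[str] = field(default_factory=list)


def run_with_advisor(
    task: str,
    executor: Model,
    advisor: Model,
    check: Callable[[str], bool],
    max_attempts: int = 3,
) -> RunLog:
    """Let the small executor do the work; consult the larger advisor only
    for the initial plan and after a failed attempt."""
    log = RunLog()

    # Critical point 1: planning. One advisor call before any execution.
    guidance = advisor(f"Plan this task: {task}")
    log.advisor_calls += 1

    for attempt in range(max_attempts):
        output = executor(f"Task: {task}\nGuidance: {guidance}")
        log.executor_calls += 1
        log.outputs.append(output)

        if check(output):
            log.succeeded = True
            return log

        # Critical point 2: failure recovery. Skip the call when no
        # attempts remain, since the advice could not be used.
        if attempt < max_attempts - 1:
            guidance = advisor(f"Attempt failed: {output}\nSuggest a fix for: {task}")
            log.advisor_calls += 1

    return log


if __name__ == "__main__":
    state = {"n": 0}

    def toy_executor(prompt: str) -> str:
        # Fails on the first attempt, succeeds on the second.
        state["n"] += 1
        return "ok" if state["n"] >= 2 else "error"

    log = run_with_advisor(
        "fix failing test", toy_executor, lambda p: "advice", lambda out: out == "ok"
    )
    print(log.executor_calls, log.advisor_calls, log.succeeded)
```

In this sketch the expensive model is called once per plan and once per recoverable failure, so its share of the per-task cost grows with the number of failures rather than with the number of executor steps.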
Native search and knowledge integration
Step 3.7 Flash treats search as an intrinsic part of reasoning, enhancing planning, evidence filtering, and information synthesis. Benchmarks show 47.2% on HLE with Tools (vs 35.7% for Step 3.5), 75.8% on BrowseComp (close to Claude Opus 4.7’s 79.3%), an F1 of 92.8% on DeepSearchQA (matching Kimi K2.6), and 71.68% on ResearchRubrics (above GPT 5.5’s 61.5%).
Vision capabilities
With a native 1.8B‑parameter ViT, the model processes visual inputs. It scores 79.16% on SimpleVQA (parity with much larger models), 58.10% on WorldVQA (ahead of Kimi K2.6, GLM 5V Turbo, GPT 5.5), 58.96% on BC‑VL, 95.29% on V* benchmarks, 89.13%/86.34% on HR‑Bench 4K/8K, and 65.05% on VisualProbe. Visual Search bridges knowledge gaps for long‑tail entities, enabling performance comparable to models five times larger.
Enterprise scenario: reliable, end‑to‑end execution
Step 3.7 Flash excels in autonomous task execution and vertical domain knowledge. It integrates intent understanding, multimodal perception, and agent execution without interruption. Tool orchestration scores rise to 49.5% on Toolathlon (+16 pts) and 67.1% on ClawEval‑1.1 (+23 pts). Long‑running stability is maintained across terminals, browsers, Office tools, and search engines. Vertical knowledge integration yields 45.8% on GDPval (vs 27.8% for Step 3.5) and 63.9% on AA‑LCR (vs 45.5%).
Deployment and open‑source release
The model is available on Stepfun’s global and China platforms, OpenRouter, and NVIDIA NIM, and will also appear on DeepInfra, Fireworks AI, and Modal. It runs on cloud, data‑center, and on‑prem hardware (Mac Studio/Pro with ≥128 GB RAM, NVIDIA DGX Station, AMD Ryzen AI Max+ 395). Inference is compatible with vLLM, SGLang, Hugging Face Transformers, and llama.cpp; development integrates with NVIDIA NeMo, Megatron Core, and Megatron Bridge, and the model can be served via NVIDIA NIM microservices. Weights and code are open‑sourced on GitHub, Hugging Face, and ModelScope.
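The ≥128 GB RAM figure can be related to the 196B parameter count with a rough weights-only estimate. The sketch below assumes all 196B parameters must be resident in memory even though only 11B are active, and it ignores the KV cache, activations, and runtime overhead. The bit-widths are illustrative: the source does not state which precision or quantization the released weights use, and `weight_memory_gb` and `fits` are hypothetical helper names.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Weights-only footprint in decimal GB: parameters * bits / 8 bytes."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float, ram_gb: float) -> bool:
    """True if the weights alone fit in ram_gb (no headroom for KV cache etc.)."""
    return weight_memory_gb(params_billion, bits_per_weight) <= ram_gb


if __name__ == "__main__":
    for bits in (16, 8, 4):
        gb = weight_memory_gb(196, bits)
        print(f"{bits}-bit weights: {gb:.0f} GB, fits in 128 GB: {fits(196, bits, 128)}")
```

Under these assumptions, only the 4-bit case (about 98 GB) leaves room inside 128 GB, which suggests that a local deployment on such a machine would rely on a quantized build. Treat that as an inference from the arithmetic, not a statement from the release notes.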
Step 3.7 Flash delivers Pro‑level agent capabilities at a low cost (11B active parameters, $0.19 per task with Advisor Mode), pushing the cost‑performance frontier for agent models.
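The cost and score ratios quoted in the Advisor Mode section can be checked directly from the published figures. The short sketch below assumes that the "97%" claim refers to SWE‑Bench Verified and that all three per-task costs were measured on the same benchmark; `RESULTS` and `compare` are hypothetical names used only for this illustration.

```python
from typing import Tuple

# Figures quoted in the article: (per-task cost in USD, SWE-Bench Verified score in %).
RESULTS = {
    "Step 3.7 Flash": (0.12, 73.7),
    "Step 3.7 Flash + Advisor Mode": (0.19, 76.3),
    "Claude Opus 4.6": (1.76, 78.7),
}


def compare(name: str, baseline: str = "Claude Opus 4.6") -> Tuple[float, float]:
    """Return (baseline cost / model cost, model score / baseline score)."""
    cost, score = RESULTS[name]
    base_cost, base_score = RESULTS[baseline]
    return base_cost / cost, score / base_score


if __name__ == "__main__":
    for name in RESULTS:
        cost_ratio, relative_score = compare(name)
        print(f"{name}: {cost_ratio:.1f}x cheaper, {relative_score:.1%} of baseline score")
```

With these inputs, Advisor Mode comes out a little over nine times cheaper than the baseline at about 97% of its score, consistent with the article's claims; the model without an advisor is closer to fifteen times cheaper at a lower score.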
SuanNi
A community for AI developers that aggregates large-model development services, models, and compute power.