Claude Opus 4.7 Boosts Programming Performance by 11% – Why Its ‘No’ Makes It More Reliable

Claude Opus 4.7 raises SWE‑bench Pro accuracy from 53.4% to 64.3% (a +11 pp jump), triples visual resolution, can refuse or verify dubious instructions, and keeps pricing unchanged while increasing token consumption, positioning it as a more reliable AI colleague despite a slight dip in long‑document search.

ZhiKe AI

Programming performance

Opus 4.7 achieves 64.3% on SWE‑bench Pro, up from 53.4% in Opus 4.6 (+10.9 pp). On SWE‑bench Verified it scores 87.6% versus 80.8% (+6.8 pp). CursorBench improves from 58% to 70%.

Compared with GPT‑5.4 (57.7% on SWE‑bench Pro) and Gemini 3.1 Pro (54.2%), Opus 4.7 leads.

Key insight: the larger gain on the harder Pro benchmark indicates stronger handling of complex tasks.

Enterprise case studies

Cursor: on 93 programming tasks, the success rate rose by 13%, and it solved four tasks that 4.6 could not.

Rakuten: production‑grade bug fixes increased three‑fold.

XBOW: visual accuracy improved from 54.5% to 98.5% (+44 pp).

Notion: agent‑task success rose 14%, and the tool‑error rate fell to one‑third.

Unattended Rust TTS engine

Opus 4.7 built a complete Rust text‑to‑speech engine from scratch without human intervention, demonstrating capability on long‑running, complex pipelines.

Visual capability

Maximum long‑side resolution increased from ~860 px to ~2576 px (≈3×), and total pixel count grew from ~1.2 MP to 3.75 MP (≈3×). The XBOW visual test mirrors the resolution gain (54.5% → 98.5%).

Result: dense screenshots become readable for computer‑use agents; extraction of scientific charts, financial statements, and technical drawings gains precision.

Note: high‑resolution images consume more tokens; pre‑downsampling is advisable when ultra‑high precision is unnecessary.
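As a rough guide, the documented limits (~2576 px long side, ~3.75 MP total) can be applied client‑side before upload. The helper below is an illustrative sketch, not part of any official SDK:

```python
def fit_to_budget(width, height, max_long_side=2576, max_pixels=3_750_000):
    """Return (width, height) scaled down to fit both the long-side
    limit and the total-pixel limit cited in the article.
    Images already within budget are returned unchanged."""
    scale = min(
        1.0,                                    # never upscale
        max_long_side / max(width, height),     # long-side constraint
        (max_pixels / (width * height)) ** 0.5, # total-pixel constraint
    )
    return round(width * scale), round(height * scale)
```

Pair the computed dimensions with any image library's resize call (e.g. Pillow's `Image.resize`) before attaching the image to a request.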

Reliability – willingness to say “no”

When a user's plan contains errors, Opus 4.6 would execute it blindly; Opus 4.7 points out the problem.

When information is missing, Opus 4.6 attempts to fill the gap (possibly incorrectly); Opus 4.7 reports the error directly.

During output generation, Opus 4.6 may fabricate answers; Opus 4.7 validates internally before returning.

“Claude pushes back on me in technical discussions and helps me make better decisions.” – Replit
“Opus 4.7 reports an error outright when it hits missing data, whereas 4.6 often tried to fill in potentially wrong values.” – Hex
“It works through a mathematical proof on its own before writing system‑level code, ensuring the logic is rigorous.” – Vercel

BrowseComp regression

Score dropped from 83.7% to 79.3% because the model now reports missing information instead of fabricating it, allowing GPT‑5.4 (89.3%) to overtake.

New features

xhigh effort level – a new setting between high and max for a finer trade‑off between reasoning depth and speed; the default for Claude Code.

Auto Mode extension – model issues command → classifier evaluates safety; safe commands auto‑approved, risky commands fall back to the traditional permission prompt. Available to Max, Teams, Enterprise.

Model issues a command → classifier evaluates its safety
 ├── Safe → auto‑approved, no user intervention
 └── Risky → traditional permission prompt retained
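The flow above can be sketched as a simple gate. The classifier here is hypothetical, standing in for whatever safety model Auto Mode actually uses:

```python
def gate(command, classify):
    """Auto Mode sketch: a classifier labels each model-issued
    command; safe commands are auto-approved, risky ones fall
    back to the traditional user permission prompt."""
    verdict = classify(command)  # hypothetical classifier: 'safe' or 'risky'
    if verdict == "safe":
        return "auto-approved"   # no user intervention
    return "prompt-user"         # traditional permission prompt
```

Any heuristic or model can stand in for `classify`; the point is that only the risky branch ever interrupts the user.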

/ultrareview command – opens a dedicated session for code review, scanning changes and flagging bugs or design issues.

Task Budgets (API beta) – helps developers plan token spend for long‑running tasks.
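The article gives no API details for Task Budgets, but the client‑side idea of planning token spend for a long‑running task can be sketched as follows; all names are illustrative, not the actual beta API:

```python
class TokenBudget:
    """Hypothetical client-side token budget for a long-running task:
    charge usage as it accrues and fail fast when the cap is hit."""

    def __init__(self, limit):
        self.limit = limit
        self.used = 0

    def charge(self, tokens):
        """Record token usage; refuse the charge if it would exceed the cap."""
        if self.used + tokens > self.limit:
            raise RuntimeError("token budget exceeded")
        self.used += tokens

    @property
    def remaining(self):
        return self.limit - self.used
```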

Pricing

Input $5 / million tokens, output $25 / million tokens – unchanged from the previous version.
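At the listed rates, per‑request cost is straightforward to estimate:

```python
def request_cost_usd(input_tokens, output_tokens,
                     input_rate=5.0, output_rate=25.0):
    """Cost at the listed rates: $5 per million input tokens,
    $25 per million output tokens."""
    return (input_tokens * input_rate
            + output_tokens * output_rate) / 1_000_000
```

For example, a request with 200k input and 100k output tokens costs $3.50.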

Migration considerations

The new tokenizer multiplies token counts by 1.0–1.35× for the same content.

High‑compute mode (“thinking”) raises output token count.

Stricter instruction compliance may require prompt adjustments.
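Combining the listed rates with the stated 1.0–1.35× tokenizer inflation brackets the post‑migration cost of an existing workload. This is a back‑of‑envelope sketch, not an official calculator:

```python
def migrated_cost_range(old_input_tokens, old_output_tokens,
                        inflation=(1.0, 1.35),
                        input_rate=5.0, output_rate=25.0):
    """Return (best-case, worst-case) USD cost per request after
    migration, scaling old token counts by the 1.0-1.35x inflation
    range cited in the article."""
    def cost(factor):
        return (old_input_tokens * factor * input_rate
                + old_output_tokens * factor * output_rate) / 1_000_000
    return cost(inflation[0]), cost(inflation[1])
```

A workload that used 1M input tokens per request would cost between $5.00 and roughly $6.75 after migration, before accounting for extra "thinking" output.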

Availability

All Claude product lines (model ID claude-opus-4-7 on Anthropic API).

Amazon Bedrock.

Google Cloud Vertex AI.

Microsoft Foundry.

Conclusion

Opus 4.7 leads in programming (+11 pp on SWE‑bench Pro), tool invocation, and domain‑specific tasks, while its stricter refusal policy causes a modest drop in long‑document search performance. Token consumption may rise up to 35%.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: reliability, pricing, AI benchmarking, Claude Opus, programming performance, visual capability
Written by

ZhiKe AI

We dissect AI-era technologies, tools, and trends with a hardcore perspective. Focused on large models, agents, MCP, function calling, and hands‑on AI development. No fluff, no hype—only actionable insights, source code, and practical ideas. Get a daily dose of intelligence to simplify tech and make efficiency tangible.
