Claude Opus 4.7: Programming Power Peaks but Faces ‘Dumbing‑Down’ Criticism
Anthropic’s Claude Opus 4.7 launches with record‑breaking programming benchmarks, a new xhigh effort mode, and a 1 M‑token context window at no extra cost. Yet an audit by an AMD senior director reveals a steep drop in real‑world engineering accuracy, a shortened cache TTL, and a shift to usage‑based pricing that has sparked community backlash.
Release and headline performance
On 16 April 2026 Anthropic announced its flagship large model Claude Opus 4.7. The same day AMD senior director Stella Laurenzo published an audit report that highlighted a noticeable performance decline since February, sparking wide discussion in the developer community.
Programming benchmarks
Claude Opus 4.7 was evaluated on ten benchmarks, surpassing GPT‑5.4 on seven of them. Key results include:
SWE‑bench Verified: 87.6 % (Opus 4.7) vs 80.6 % (Gemini 3.1 Pro) – real GitHub issue fixes.
SWE‑bench Pro: 64.3 % (Opus 4.7) vs 57.7 % (GPT‑5.4) – open‑issue programming.
SWE‑bench Multilingual: 80.5 % (Opus 4.7) – multilingual code ability.
Terminal‑Bench 2.0: 69.4 % (Opus 4.7) – terminal engineering execution.
OSWorld‑Verified: 78.0 % (Opus 4.7) vs 75.0 % – computer GUI operations.
MCP‑Atlas: leading by +9.2 points – agent tool calling.
GPQA Diamond: 94.2 % – graduate‑level scientific QA.
Vision Multimodal: 98.5 % – visual image understanding.
BrowseComp: 79.3 % (Opus 4.7) vs 89.3 % (GPT‑5.4) – web search is the only lagging area.
Conclusion: Programming ability is clearly ahead, while web search remains a weakness.
Notable improvements
The model introduces several new features:
xhigh effort level: a fourth tier (low / medium / high / xhigh) that triggers more self‑verification and multi‑step reasoning. It is suited for large‑scale code refactoring (≈100 k lines), complex algorithm design, and cross‑module system architecture analysis. Activate by sending extra_body={"reasoning_effort": "xhigh"} in the API request.
Vision capability: benchmark score of 98.5 % with higher‑resolution image input, improving recognition of charts, flow diagrams, and UI screenshots.
Context window: a 1 M‑token window at the standard price, roughly equivalent to the full text of "War and Peace" (≈580 k words), a 100 k‑line code file, or an entire medium‑sized project's Git history.
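The xhigh activation described above can be sketched as a request payload. This is a minimal illustration assuming an OpenAI‑compatible client interface; the model identifier `claude-opus-4-7` and the helper `build_request` are hypothetical, and only the `extra_body={"reasoning_effort": "xhigh"}` field comes from the article itself.

```python
# Sketch of a request payload enabling the xhigh effort tier.
# The model name and helper function are illustrative assumptions;
# the extra_body field follows the article's wording.
def build_request(prompt: str, effort: str = "xhigh") -> dict:
    allowed = {"low", "medium", "high", "xhigh"}  # the four effort tiers
    if effort not in allowed:
        raise ValueError(f"effort must be one of {sorted(allowed)}")
    return {
        "model": "claude-opus-4-7",
        "messages": [{"role": "user", "content": prompt}],
        "extra_body": {"reasoning_effort": effort},
    }

req = build_request("Refactor this 100k-line module for testability.")
print(req["extra_body"])  # {'reasoning_effort': 'xhigh'}
```

In a real SDK the same dictionary would be unpacked into the chat‑completions call; the point here is only where the effort flag sits in the request.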
Pricing
Opus 4.7 keeps the same pricing as Opus 4.6:
Input: $5 per million tokens.
Output: $25 per million tokens.
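At these listed rates, per‑request cost is simple arithmetic. A quick sketch (the example token counts are made up for illustration):

```python
# Cost estimate at the article's listed Opus 4.7 rates:
# $5 per million input tokens, $25 per million output tokens.
INPUT_PER_M = 5.00
OUTPUT_PER_M = 25.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1_000_000) * INPUT_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_PER_M

# e.g. one large refactoring call: 200k tokens in, 20k out
print(round(request_cost(200_000, 20_000), 2))  # 1.5
```

So a single call that fills a fifth of the 1 M‑token window already costs about $1.50, which is why the cache behaviour discussed below matters to heavy users.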
Performance concerns
Stella Laurenzo, an AMD senior director and heavy Claude Code user, audited 6,852 conversation logs, 17,871 reasoning blocks, and over 230 k tool calls. Her systematic re‑testing on BridgeBench (a cross‑architecture benchmark) showed:
Accuracy: 83.3 % → 68.3 % (‑15 percentage points).
Ranking: 2nd → 10th.
The 15‑point drop means that, on average, roughly 1.5 additional failures appear for every 10 complex coding tasks.
Three “intelligence reduction” measures
Hidden reasoning: Since February, the UI no longer shows step‑by‑step reasoning, only the final answer.
Cache TTL cut: Context cache lifetime reduced from 1 hour to 5 minutes, causing token consumption to rise sharply when users pause.
Pricing model shift: Subscription changed from a fixed $200/month to a $20/month base with pay‑as‑you‑go overage, potentially raising heavy‑user spend to $400‑$600/month.
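The TTL cut above can be modelled to show why token consumption rises when users pause. This is an illustrative cost model, not Anthropic's actual billing logic: the 10 % cached‑read rate and the function itself are assumptions for the sketch.

```python
# Illustrative (not official) model of why a shorter cache TTL raises cost:
# if the pause between turns exceeds the TTL, the cached context expires
# and the full prefix is re-billed as fresh input tokens.
def billed_input_tokens(prefix_tokens: int, new_tokens: int,
                        pause_seconds: float, ttl_seconds: float,
                        cached_rate: float = 0.1) -> float:
    if pause_seconds <= ttl_seconds:
        # Cache hit: prefix billed at a reduced cached-read rate (assumed 10%).
        return prefix_tokens * cached_rate + new_tokens
    # Cache miss: the whole prefix is re-billed at the full input rate.
    return prefix_tokens + new_tokens

# 500k-token context, then a 10-minute pause before the next turn:
print(billed_input_tokens(500_000, 2_000, 600, 3600))  # 1-hour TTL: 52000.0
print(billed_input_tokens(500_000, 2_000, 600, 300))   # 5-minute TTL: 502000
```

Under these assumptions, the same 10‑minute pause is nearly ten times more expensive with a 5‑minute TTL than with the old 1‑hour TTL, which is the effect users are complaining about.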
Why the changes?
Anthropic cites exploding inference costs as the primary driver—high‑frequency users (programmers, data scientists) can consume thousands of tokens per session. Competitive pressure from OpenAI’s Codex ($100/month tier) also influences the new strategy.
Selection advice
Complex code refactoring / agent development: Use Claude Opus 4.7 with xhigh mode.
Simple daily chat / light assistance: Choose Claude Sonnet 4.6 for better cost‑performance.
Strong web‑search requirement: Opt for GPT‑5.4, which leads BrowseComp by 10 points.
Price‑sensitive heavy users: Wait for an Opus 4.7 price reduction or switch to another provider.
Final thoughts
From a technical standpoint, Claude Opus 4.7 is a milestone—its programming prowess, multimodal vision, and 1 M‑token context are impressive. However, the disclosed “intelligence reduction” measures cast a shadow over its commercial credibility, and the market will ultimately decide its fate.
Sources: Anthropic official blog, BridgeBench benchmark report, Quantum Bit, tenfy’s blog, APIYI technical review.