Claude Opus 4.7 Arrives with a Massive Leap in Programming Power
Claude Opus 4.7 dramatically outperforms Opus 4.6 and rivals GPT‑5.4 and Gemini 3.1 Pro across benchmarks. It boosts programming task success by up to 13%, triples bug‑fixing performance on SWE‑bench, triples the maximum visual resolution, adds a finer‑grained xhigh effort level, tightens security controls, and keeps pricing unchanged.
Benchmark Overview
Anthropic’s cross‑domain benchmark chart shows Opus 4.7 surpassing Opus 4.6 on most tasks and edging out GPT‑5.4 and Gemini 3.1 Pro.
What Makes Opus 4.7 Stronger Than 4.6?
Anthropic states that Opus 4.7 delivers a “significant improvement” in advanced software engineering, especially on the hardest tasks.
Key data points:
Cursor: 13% higher task‑completion rate on a 93‑task programming benchmark, solving four tasks that neither Opus 4.6 nor Sonnet 4.6 could.
Rakuten (SWE‑bench): real‑world bug‑fixing capability is three times that of Opus 4.6.
XBOW (self‑penetration testing): visual accuracy jumps from 54.5% to 98.5%.
Notion: tool‑calling accuracy and planning improve by over 10%; it is the first model to pass implicit‑need tests.
Visual Capability: Resolution Triples
Opus 4.7 can now accept images with a longest side of 2,576 pixels (≈3.75 MP), more than three times the previous limit.
This enables computer‑use agents to read dense screenshots without missing small text, extract data from complex charts with far higher precision, and handle pixel‑level tasks in scientific or legal documents.
Solve Intelligence (life‑science patent workflow) confirmed the jump in understanding from chemical structures to intricate technical drawings.
Note: higher‑resolution images consume more tokens; down‑sampling is advisable when ultra‑high precision is unnecessary.
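One practical way to follow that advice is to downsample client‑side before upload. The helper below is a sketch: the 2,576‑pixel cap is the limit stated above, while the resize itself can be done with any imaging library (the Pillow call in the comment is one common choice, shown as an example rather than a requirement).

```python
def fit_longest_side(width: int, height: int, max_side: int = 2576) -> tuple[int, int]:
    """Return dimensions scaled so the longest side is at most max_side,
    preserving aspect ratio. 2,576 px is Opus 4.7's stated input limit."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already within the limit; no resampling needed
    scale = max_side / longest
    return round(width * scale), round(height * scale)

# Apply the computed size with any imaging library, e.g.:
#   from PIL import Image
#   img = Image.open("screenshot.png")
#   img = img.resize(fit_longest_side(*img.size))
```

Downsampling only when the image exceeds the cap keeps small screenshots pixel‑exact while trimming token spend on oversized ones.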
Instruction Following: Much More Literal
Opus 4.7 shows a large gain in instruction compliance, but Anthropic warns that prompts tuned for older models may produce unexpected results because the new model follows instructions “to the letter.” Users should review and tighten prompts before upgrading.
New Feature: xhigh Effort Level
Opus 4.7 introduces an extra‑high (xhigh) effort tier between the existing high and max levels, giving finer control over the trade‑off between reasoning quality and response latency.
Claude Code now defaults to xhigh. For programming and agent scenarios, Anthropic recommends starting with high or xhigh.
[Chart: token usage and task scores compared across effort levels]
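The guidance above (start with high or xhigh for programming and agent work) can be encoded as a small dispatch helper. The effort tier names come from the article; how a given SDK actually accepts them is an assumption, so treat the returned string as input to whatever parameter your client exposes.

```python
# Effort tiers named in the article, cheapest to most expensive.
EFFORT_LEVELS = ("low", "medium", "high", "xhigh", "max")

def pick_effort(task_kind: str, latency_sensitive: bool = False) -> str:
    """Heuristic starting point per the article's guidance; tune from here.
    Coding and agent tasks default to xhigh (Claude Code's new default);
    latency-sensitive work steps one tier down."""
    base = "xhigh" if task_kind in {"coding", "agent"} else "high"
    if latency_sensitive:
        # Move one tier toward lower latency / lower token spend.
        base = EFFORT_LEVELS[EFFORT_LEVELS.index(base) - 1]
    return base
```

For example, `pick_effort("coding")` yields "xhigh", while a latency‑sensitive chat task falls back to "medium".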
Security: Cautious First Step
Anthropic’s Project Glasswing tackles AI security risks and opportunities. Opus 4.7 is the first model deployed under Glasswing, deliberately limited in security capability compared with Claude Mythos Preview.
It includes automatic detection and blocking of high‑risk security requests.
Legitimate security researchers and penetration‑testing engineers can join a whitelist via the Cyber Verification Program.
Safety Evaluation
In alignment safety, Opus 4.7 is comparable to Opus 4.6, with low rates of deception, flattery, and misuse. Some dimensions (honesty, resistance to prompt‑injection) improve, while a few (over‑detailed harmful‑substance replies) regress slightly. Overall: “generally aligned and trustworthy, but not yet ideal.” Mythos Preview remains Anthropic’s best‑aligned model.
Pricing & Availability
Pricing stays the same as Opus 4.6:
Input: $5 / million tokens
Output: $25 / million tokens
Available on the full Claude product line, the Claude API (claude‑opus‑4‑7), Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.
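At these rates, per‑request cost is simple arithmetic. The sketch below hard‑codes the list prices quoted above; the example call volume is illustrative.

```python
# Published Opus 4.7 rates (unchanged from 4.6), in USD per million tokens.
INPUT_PER_MTOK = 5.0
OUTPUT_PER_MTOK = 25.0

def request_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one API call at Opus 4.7 list prices."""
    return (input_tokens * INPUT_PER_MTOK + output_tokens * OUTPUT_PER_MTOK) / 1_000_000

# Example: a 20k-token prompt with a 4k-token reply.
# request_cost_usd(20_000, 4_000) -> 0.2
```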
Additional New Features
/ultrareview command (Claude Code): one‑click deep code review, with up to three free trials for Pro and Max users.
Task Budgets (public beta, API): gives developers a mechanism to guide Claude's token allocation over long tasks, preventing front‑heavy or back‑heavy spending.
Auto Mode expansion : Max users can now enable Auto Mode, allowing Claude to autonomously request permissions during long tasks, reducing interruptions.
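Task Budgets itself is an API‑side mechanism, but the idea it implements, pacing token spend so a long task cannot exhaust its budget early, can be illustrated client‑side. Everything below is an illustrative sketch of the concept, not the actual API.

```python
class TokenBudget:
    """Client-side illustration of budget pacing: split a total token budget
    evenly across the planned steps so early steps can't starve later ones."""

    def __init__(self, total_tokens: int, planned_steps: int):
        self.remaining = total_tokens
        self.steps_left = planned_steps

    def allowance(self) -> int:
        """Tokens this step may spend: an even share of what's left."""
        return self.remaining // max(self.steps_left, 1)

    def record(self, tokens_used: int) -> None:
        """Book actual spend after a step completes."""
        self.remaining -= tokens_used
        self.steps_left = max(self.steps_left - 1, 0)
```

Because the allowance is recomputed from the remaining pool, a frugal early step automatically frees budget for later, harder steps, which is the front‑heavy/back‑heavy balance the feature targets.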
Upgrade Considerations
When moving from Opus 4.6 to 4.7 in production, watch for two changes:
New tokenizer increases token count by 1.0–1.35× for the same input, depending on content type.
Higher effort levels generate more tokens, especially in multi‑turn agent conversations.
Anthropic provides a migration guide; test on real traffic before full rollout.
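The tokenizer change means the same traffic costs more tokens after migration, so a quick sanity check is to project current volume through the stated 1.0–1.35× range. The multipliers come from the article; the sketch below is just that arithmetic.

```python
def projected_tokens(current_tokens: int, low: float = 1.0, high: float = 1.35) -> tuple[int, int]:
    """Project a 4.6-era token count through the 4.7 tokenizer's stated
    1.0-1.35x range. Use the high end for capacity planning and billing alerts."""
    return round(current_tokens * low), round(current_tokens * high)

# Example: 2M input tokens/day of Opus 4.6 traffic.
# projected_tokens(2_000_000) -> (2000000, 2700000)
```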
Conclusion
Core takeaways for Opus 4.7 are stronger programming, clearer vision, more precise instruction following, and tighter security. Recommended actions:
Claude Code users: start using the default xhigh level and try /ultrareview.
API developers: revisit prompts, monitor token usage, and read the migration guide.
Security professionals: apply for the Cyber Verification Program if you have legitimate needs.
The author’s favorite aspect is the model’s ability to act as a “better colleague” that can challenge you and help you make superior decisions, rather than merely echoing your thoughts.
Old Zhang's AI Learning
An AI practitioner focused on large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, publishing original technical articles daily.
