Claude Opus 4.7 Arrives with a Massive Leap in Programming Power

Claude Opus 4.7 dramatically outperforms Opus 4.6 and rivals GPT‑5.4 and Gemini 3.1 Pro across benchmarks. It lifts programming task success by up to 13%, triples real‑world bug‑fixing on SWE‑bench, triples the maximum visual resolution, adds a finer‑grained xhigh effort level, tightens security controls, and keeps pricing unchanged.


Benchmark Overview

Anthropic’s cross‑domain benchmark chart shows Opus 4.7 surpassing Opus 4.6 on most tasks and edging out GPT‑5.4 and Gemini 3.1 Pro.

Claude Opus 4.7 cross-domain benchmark comparison

What Makes Opus 4.7 Stronger Than 4.6?

Anthropic states that Opus 4.7 delivers a “significant improvement” in advanced software engineering, especially on the hardest tasks.

Key data points:

Cursor: 13% higher task‑completion rate on a 93‑task programming benchmark, including four tasks that neither Opus 4.6 nor Sonnet 4.6 could solve.

Rakuten (SWE‑bench): Real‑world bug‑fixing capability is three times that of Opus 4.6.

XBOW (self‑penetration testing): Visual accuracy jumps from 54.5% to 98.5%.

Notion: Tool‑calling accuracy and planning improve by over 10%; it is the first model to pass Notion's implicit‑need tests.

Visual Capability: Resolution Triples

Opus 4.7 can now accept images with a longest side of 2,576 pixels (≈3.75 MP), more than three times the previous limit.

This enables computer‑use agents to read dense screenshots without missing small text, extract data from complex charts with far higher precision, and handle pixel‑level tasks in scientific or legal documents.

Solve Intelligence (life‑science patent workflow) confirmed the jump in understanding from chemical structures to intricate technical drawings.

Note: higher‑resolution images consume more tokens; down‑sampling is advisable when ultra‑high precision is unnecessary.
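As a rough guide, here is a minimal down‑sampling sketch using Pillow. The 2,576 constant is simply the limit quoted above; pick a smaller budget when fine detail doesn't matter:

```python
# Minimal sketch: shrink an image so its longest side fits a pixel budget,
# trading resolution for a smaller token footprint.
from PIL import Image

MAX_SIDE = 2576  # Opus 4.7's stated longest-side limit; lower it to save tokens

def downsample(src: str, dst: str, max_side: int = MAX_SIDE) -> None:
    img = Image.open(src)
    scale = max_side / max(img.size)
    if scale < 1.0:  # only ever shrink, never upscale
        new_size = (round(img.width * scale), round(img.height * scale))
        img = img.resize(new_size, Image.LANCZOS)
    img.save(dst)

downsample("dense_screenshot.png", "dense_screenshot_small.png")
```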

Instruction Following: Much More Literal

Opus 4.7 shows a large gain in instruction compliance, but Anthropic warns that prompts tuned for older models may produce unexpected results because the new model follows instructions “to the letter.” Users should review and tighten prompts before upgrading.
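An illustrative, made‑up before/after shows the kind of tightening this warning implies:

```python
# Illustrative only: a loosely worded instruction that older models would
# interpret charitably...
loose_prompt = "Summarize the report. Keep it short and skip the boring parts."

# ...versus a literal-friendly rewrite that states format, length, and scope
# explicitly, leaving nothing to inference:
tight_prompt = (
    "Summarize the report in exactly three bullet points of at most 20 words "
    "each. Cover only revenue, risks, and next steps. "
    "Output the bullets alone, with no preamble or closing remarks."
)
```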

New Feature: xhigh Effort Level

Opus 4.7 introduces an extra‑high (xhigh) effort tier between the existing high and max levels, giving finer control over the trade‑off between reasoning quality and response latency.

Claude Code now defaults to xhigh. For programming and agent scenarios, Anthropic recommends starting with high or xhigh.
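For API users, a speculative sketch of probing the tiers on a single prompt. The effort field name and placement are assumptions inferred from this description, not a documented parameter, so it is passed via extra_body; check the API reference for the real schema:

```python
# Speculative sketch: compare effort tiers on one prompt and report output
# token usage. The "effort" field is an ASSUMED parameter, not documented.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
prompt = "Find and fix the race condition in the scheduler module below. ..."

for effort in ("high", "xhigh", "max"):
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=8_000,
        extra_body={"effort": effort},  # assumed field name and placement
        messages=[{"role": "user", "content": prompt}],
    )
    print(effort, resp.usage.output_tokens)
```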

The following chart compares token usage and task scores across effort levels:

Different effort levels token usage vs task score comparison

Security: Cautious First Step

Anthropic’s Project Glasswing tackles AI security risks and opportunities. Opus 4.7 is the first model deployed under Glasswing, deliberately limited in security capability compared with Claude Mythos Preview.

It includes automatic detection and blocking of high‑risk security requests.

Legitimate security researchers and penetration‑testing engineers can join a whitelist via the Cyber Verification Program.

Safety Evaluation

Claude Opus 4.7 behavior audit score

In alignment safety, Opus 4.7 is comparable to Opus 4.6, with low rates of deception, flattery, and misuse. Some dimensions (honesty, resistance to prompt‑injection) improve, while a few (over‑detailed harmful‑substance replies) regress slightly. Overall: “generally aligned and trustworthy, but not yet ideal.” Mythos Preview remains Anthropic’s best‑aligned model.

Pricing & Availability

Pricing stays the same as Opus 4.6:

Input: $5 / million tokens

Output: $25 / million tokens
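At those rates, per‑request cost is simple arithmetic; a quick sanity check with illustrative, made‑up token counts:

```python
# Back-of-the-envelope cost at list prices: $5/M input, $25/M output tokens.
# The token counts here are made up for illustration.
input_tokens = 120_000
output_tokens = 8_000

cost = input_tokens / 1e6 * 5.00 + output_tokens / 1e6 * 25.00
print(f"${cost:.2f}")  # 0.60 input + 0.20 output = $0.80
```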

Available on the full Claude product line, the Claude API (claude-opus-4-7), Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Foundry.

Additional New Features

/ultrareview command (Claude Code): One‑click deep code review, with up to three free trials for Pro and Max users.

Task Budgets (public beta, API): Gives developers a mechanism to guide Claude's token allocation over long tasks, preventing front‑heavy or back‑heavy spending (see the sketch after this list).

Auto Mode expansion: Max users can now enable Auto Mode, allowing Claude to autonomously request permissions during long tasks, reducing interruptions.
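Since Task Budgets is in beta and no schema is given here, the following is purely speculative; the task_budget field and its shape are invented for illustration and are not the documented API:

```python
# Purely speculative: what a Task Budgets request MIGHT look like. The
# "task_budget" field and its shape are assumptions, NOT the documented
# beta API; consult Anthropic's docs for the real schema.
import anthropic

client = anthropic.Anthropic()

resp = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=16_000,
    # Hypothetical field: cap total spend and request even pacing so the
    # model doesn't burn most of its tokens on early exploration.
    extra_body={"task_budget": {"total_tokens": 60_000, "pacing": "even"}},
    messages=[{"role": "user", "content": "Migrate this module to async I/O."}],
)
```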

Upgrade Considerations

When moving from Opus 4.6 to 4.7 in production, watch for two changes:

New tokenizer increases token count by 1.0–1.35× for the same input, depending on content type.

Higher effort levels generate more tokens, especially in multi‑turn agent conversations.

Anthropic provides a migration guide; test on real traffic before full rollout.
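The token‑counting endpoint makes it straightforward to measure the tokenizer change on your own traffic before rollout. A minimal sketch (the claude-opus-4-6 model ID for the older model is an assumption):

```python
# Count tokens for the same request under both models to gauge the
# 1.0-1.35x inflation on your own traffic before switching.
import anthropic

client = anthropic.Anthropic()
messages = [{"role": "user", "content": open("sample_request.txt").read()}]

counts = {}
for model in ("claude-opus-4-6", "claude-opus-4-7"):  # 4.6 ID is assumed
    counts[model] = client.messages.count_tokens(
        model=model, messages=messages
    ).input_tokens

print(counts, "ratio:", counts["claude-opus-4-7"] / counts["claude-opus-4-6"])
```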

Conclusion

Core takeaways for Opus 4.7 are stronger programming, clearer vision, more precise instruction following, and tighter security. Recommended actions:

Claude Code users: start using the default xhigh level and try /ultrareview.

API developers: revisit prompts, monitor token usage, and read the migration guide.

Security professionals: apply for the Cyber Verification Program if you have legitimate needs.

The author’s favorite aspect is the model’s ability to act as a “better colleague” that can challenge you and help you make superior decisions, rather than merely echoing your thoughts.

Tags: programming, security, benchmark, AI model, Claude, effort level, Opus 4.7, visual resolution
Written by Old Zhang's AI Learning, an AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, publishing daily original technical articles.
