Claude 4 and Claude Code Released – Anthropic API Adds Four Powerful New Features

Anthropic has unveiled Claude Opus 4 and Claude Sonnet 4, its strongest coding models to date, along with detailed benchmark results, new memory and tool-use capabilities, Claude Code IDE extensions, and four new API features that together expand what developers can build with AI agents.

ShiZhen AI

Claude 4 Models

Claude Opus 4 leads SWE‑bench (72.5% accuracy) and Terminal‑bench (43.2%). It sustains performance for multi‑hour, multi‑step tasks, outperforming Sonnet variants.

External evaluations: Cursor reports improved code‑base understanding; Replit reports higher multi‑file change accuracy; Block’s “goose” agent notes enhanced debugging reliability; Rakuten validated a 7‑hour open‑source refactoring workload; Cognition highlighted handling of operations missed by prior models.

Claude Sonnet 4 builds on Sonnet 3.7, achieves 72.7% on SWE‑bench, balances performance and efficiency, and is available to free‑tier users.

Both models are available via the Anthropic API, Amazon Bedrock, and Google Cloud Vertex AI. Pricing is unchanged: Opus 4 costs $15 per million input tokens and $75 per million output tokens; Sonnet 4 costs $3 per million input tokens and $15 per million output tokens.
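Using the per-million-token prices above, a request cost is simple arithmetic. A minimal sketch (the helper function and token counts are illustrative, not part of any SDK):

```python
# Per-million-token prices from the announcement (USD).
PRICES = {
    "opus-4": {"input": 15.00, "output": 75.00},
    "sonnet-4": {"input": 3.00, "output": 15.00},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate request cost in USD from token counts and list prices."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: a request with 10K input tokens and 2K output tokens.
opus_cost = estimate_cost("opus-4", 10_000, 2_000)      # $0.15 in + $0.15 out = $0.30
sonnet_cost = estimate_cost("sonnet-4", 10_000, 2_000)  # $0.03 in + $0.03 out = $0.06
```

At these prices Sonnet 4 is a fifth of the cost of Opus 4 for the same traffic, which is why the announcement frames it as the performance/efficiency balance point.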

Model Improvements

Extended reasoning and tool use (beta): Models can alternate between reasoning and tool execution (e.g., web search) during extended-thinking runs.

Parallel tool usage and enhanced memory: When granted local-file access, models can run tools in parallel, follow instructions more precisely, and persist key facts in a "memory file" for long-term task awareness (e.g., Opus 4 created a navigation guide while playing Pokémon).
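The two behaviors described here — concurrent tool calls and a persistent memory file — can be illustrated with a minimal agent-side sketch. The tool functions and the JSON memory-file format below are hypothetical stand-ins, not Anthropic's implementation:

```python
import asyncio
import json
from pathlib import Path

# Hypothetical local tools an agent might be granted; names are illustrative.
async def search_docs(query: str) -> str:
    await asyncio.sleep(0.01)  # simulate I/O latency
    return f"results for {query!r}"

async def run_lint(path: str) -> str:
    await asyncio.sleep(0.01)
    return f"lint ok: {path}"

async def dispatch_parallel(calls):
    """Run independent tool calls concurrently instead of one at a time."""
    return await asyncio.gather(*(fn(arg) for fn, arg in calls))

def persist_memory(memory_file: Path, key: str, value: str) -> None:
    """Record a key fact in a JSON 'memory file' so later steps can recall it."""
    memory = json.loads(memory_file.read_text()) if memory_file.exists() else {}
    memory[key] = value
    memory_file.write_text(json.dumps(memory, indent=2))

# Two independent tool calls, executed in parallel.
results = asyncio.run(
    dispatch_parallel([(search_docs, "SWE-bench"), (run_lint, "agent.py")])
)
```

The point of the pattern is that independent calls complete in one round trip's worth of wall-clock time, while the memory file survives across turns of a long-running task.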

Thought summarization: A smaller model compresses lengthy reasoning chains into summaries; summarization is invoked in roughly 5% of cases.

Reduced shortcut behavior: On agentic tasks, the models are 65% less likely to take shortcuts or game the task than Claude Sonnet 3.7.

Claude Code

Claude Code integrates Claude’s core intelligence into terminal, IDE, and background SDK workflows.

Beta extensions for VS Code and JetBrains embed edit suggestions directly in files. The Claude Code SDK enables custom agents; the “Claude Code on GitHub” demo shows PR‑based interactions where tagging the app triggers code‑review feedback, CI error fixes, or code modifications via the /install‑github‑app command.

Anthropic API New Features

Code execution tool: Executes Python code, generates visualizations, and analyzes data within API calls.
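A Messages API request enables this as a server-side tool. A hedged sketch of the request body as a plain dict (the tool type string and model alias are recalled from Anthropic's beta docs and should be treated as assumptions; confirm the exact identifiers in the current API reference):

```python
# Sketch: enabling the server-side code execution tool on a Messages request.
# "code_execution_20250522" and "claude-opus-4-0" are assumptions from memory
# of the beta documentation, not verified identifiers.
request_body = {
    "model": "claude-opus-4-0",
    "max_tokens": 2048,
    "tools": [{"type": "code_execution_20250522", "name": "code_execution"}],
    "messages": [
        {"role": "user", "content": "Plot a histogram of these values: 3, 1, 4, 1, 5"}
    ],
}
```

The model then runs Python in a sandbox on Anthropic's side and returns the results (and any generated charts) as tool-result content blocks.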

MCP connector: Connects Claude to any remote MCP server without custom client code; the API request includes only the server URL.
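As described above, a remote MCP server is attached by URL alone. A sketch of what such a request body might look like, again as a plain dict (the `mcp_servers` field names and model alias are assumptions based on the announcement; verify against the current API reference):

```python
# Sketch: attaching a remote MCP server to a Messages request by URL only.
# Field names ("mcp_servers", "type", "url", "name") are assumed, not verified.
request_body = {
    "model": "claude-sonnet-4-0",
    "max_tokens": 1024,
    "messages": [{"role": "user", "content": "Summarize my open tickets."}],
    "mcp_servers": [
        {"type": "url", "url": "https://example.com/mcp", "name": "ticket-server"}
    ],
}
```

The appeal is operational: no MCP client code to write or host — the API acts as the client and exposes the server's tools to the model directly.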

Files API: Supports uploading documents once and referencing them repeatedly across conversations.

Extended prompt caching: Cache time-to-live increased from 5 minutes to 1 hour, reducing cost by up to 90% and latency by up to 85% for long prompts.
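To make those percentages concrete, a best-case savings calculation using only the figures stated above (the $1.00 / 20-second baseline workload is illustrative):

```python
def cached_cost(base_cost: float, cost_reduction: float = 0.90) -> float:
    """Best-case cost of a long prompt served from cache (up to 90% cheaper)."""
    return base_cost * (1 - cost_reduction)

def cached_latency(base_latency_s: float, latency_reduction: float = 0.85) -> float:
    """Best-case latency of a cached long prompt (up to 85% faster)."""
    return base_latency_s * (1 - latency_reduction)

# A $1.00, 20-second long-prompt request, fully served from cache:
cost_after = cached_cost(1.00)        # about $0.10
latency_after = cached_latency(20.0)  # about 3 s
```

The longer TTL matters because agentic sessions often pause for more than 5 minutes between turns; a 1-hour window keeps the cached prefix warm across those gaps.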

Benchmark Methodology

Opus 4 and Sonnet 4 are hybrid reasoning models evaluated both with and without extended reasoning (up to 64K tokens). Benchmarks include TAU-bench, SWE-bench Verified, Terminal-bench, GPQA Diamond, MMMLU, MMMU, and AIME.

For high‑compute benchmarks, a scaffold with two tools (Bash and a string‑replacement file editor) was used; the third “planning tool” from Sonnet 3.7 was omitted. All 500 SWE‑bench questions were scored; OpenAI scores are based on a 477‑question subset.

Multiple attempts were sampled in parallel; patches causing regressions were discarded, following the rejection-sampling approach of Agentless (Xia et al., 2024). An internal scoring model then selected the best candidates. Resulting scores: Opus 4 79.4% and Sonnet 4 80.2% on the high-compute suite.
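The candidate-selection step can be sketched as follows. The patch records and score values are hypothetical stand-ins; the real pipeline uses regression tests and an internal scoring model:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Candidate:
    patch: str
    causes_regression: bool  # did applying the patch break existing tests?
    score: float             # stand-in for the internal scoring model's estimate

def select_best(candidates: list) -> Optional[Candidate]:
    """Rejection sampling: discard regression-causing patches,
    then keep the highest-scored survivor (None if nothing survives)."""
    survivors = [c for c in candidates if not c.causes_regression]
    return max(survivors, key=lambda c: c.score, default=None)

cands = [
    Candidate("fix-a", causes_regression=True, score=0.9),
    Candidate("fix-b", causes_regression=False, score=0.7),
    Candidate("fix-c", causes_regression=False, score=0.4),
]
best = select_best(cands)  # "fix-b": highest-scoring non-regressing patch
```

Note that the highest-raw-score candidate ("fix-a") is rejected outright because it regresses existing tests; filtering happens before ranking.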

Without extended reasoning, GPQA Diamond scores: Opus 4 74.9 %, Sonnet 4 70.0 %; MMMLU: Opus 4 87.4 %, Sonnet 4 85.4 %; MMMU: Opus 4 73.7 %, Sonnet 4 72.6 %; AIME: Opus 4 33.9 %, Sonnet 4 33.1 %.

TAU-bench runs used additional prompts enabling extended reasoning and tool use; the maximum step count was raised from 30 to 100 to accommodate longer reasoning trajectories.

Performance Sources

OpenAI: o3 blog post, system card, GPT‑4.1 blog post, hosted evaluation.

Gemini: Gemini 2.5 Pro preview model card.

Claude: Claude 3.7 Sonnet blog post.

[Image: Claude 4 banner]
[Image: Claude 4 model comparison]
[Image: Claude 4 performance]
[Image: Memory capability illustration]
[Image: Benchmark results chart]
Tags: AI agents, large language models, API, benchmarking, Anthropic, Claude 4
Written by

ShiZhen AI

Tech blogger with over 10 years of experience at leading tech firms; AI efficiency and delivery expert focused on AI productivity. Covers tech gadgets, AI-driven efficiency, and leisure, and runs an AI leisure community. 🛰 szzdzhp001
