Why 80% of Anthropic’s Code Is Merged by Claude and How “Close the Loop” Redefines Agent Testing

Anthropic reports that Claude now merges over 80% of its internal code and that model failure rates have fallen roughly threefold. It also outlines how planning, error recovery, and long-context abilities enable a “Close the Loop” approach that developers should adopt when building future-ready AI agents.

Machine Heart

Claude’s Growing Role in Anthropic’s Engineering

Anthropic’s product manager Theo Chu noted that more developers now see tangible efficiency gains from Claude, with some claiming a tenfold boost. Internally, Claude merges over 80% of the code, indicating a shift from a question-answer tool to an autonomous agent that can validate, correct, and iterate on its own outputs.

Model Failure Rate Decline

Using SWE-bench Verified, a benchmark of GitHub issues that require understanding code, modifying it, and verifying the change against tests, Theo showed that the year-old Sonnet 3.7 scored around 60%, while the newer Opus 4.8 reached 88%. The failure rate therefore fell from roughly 40% to 12%, about a threefold reduction. His point was that the improvement is better read as a sharp drop in how often the model fails than as a modest rise in how many problems it solves.
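The arithmetic behind the "threefold" figure can be checked with a short sketch. It uses only the two scores quoted above (about 60% and 88%); the function name `failure_reduction` is illustrative and not from the talk.

```python
def failure_reduction(old_pass_rate: float, new_pass_rate: float) -> float:
    """Return how many times smaller the new failure rate is than the old one."""
    old_fail = 1.0 - old_pass_rate
    new_fail = 1.0 - new_pass_rate
    if new_fail <= 0:
        raise ValueError("new pass rate must be below 1.0")
    return old_fail / new_fail


# Scores quoted in the article: about 60% and 88% on SWE-bench Verified.
fail_ratio = failure_reduction(0.60, 0.88)
pass_ratio = 0.88 / 0.60

print(f"failures shrink by {fail_ratio:.1f}x")  # 0.40 / 0.12
print(f"passes grow by {pass_ratio:.2f}x")      # 0.88 / 0.60
```

Viewed as pass rate, the change looks like a gain of under 1.5x. Viewed as failure rate, it is a reduction of more than 3x, which is the framing the article attributes to Theo.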

He warned that evaluating today’s models on last year’s tasks will underestimate their true capabilities, because some of those tests are approaching saturation.

Three Core Advances

Plan before acting: In a task to rebuild the Claude.ai website, older models jumped straight into code generation and produced incomplete, non-functional results. Opus 4.8 first reasoned about the specification, caught errors during planning, and produced a concise, correct implementation, which illustrates the benefit of letting the model think first.

Error recovery and self-correction: Earlier agents suffered from “doom looping,” retrying the same flawed approach again and again after receiving feedback. Newer models read the feedback, work out why the attempt failed, and explore alternative paths. This kind of error recovery is essential for long-running agents.
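One way to picture the difference between doom looping and recovery is a harness that refuses to re-run an identical attempt and feeds that fact back as feedback. The following is a minimal sketch under stated assumptions, not Anthropic's implementation: `propose` stands in for a model call, `check` for a test run, and every name here is hypothetical.

```python
from typing import Callable, Optional


def run_with_recovery(
    propose: Callable[[list[str]], str],
    check: Callable[[str], Optional[str]],
    max_attempts: int = 5,
) -> tuple[Optional[str], list[str]]:
    """Try candidates until one passes, never re-running an identical attempt.

    propose(feedback) returns the next candidate given all feedback so far.
    check(candidate) returns None on success or an error message on failure.
    """
    seen: set[str] = set()
    feedback: list[str] = []
    for _ in range(max_attempts):
        candidate = propose(feedback)
        if candidate in seen:
            # Doom-loop guard: report the repeat instead of running it again.
            feedback.append(f"already tried {candidate!r}; choose a different approach")
            continue
        seen.add(candidate)
        error = check(candidate)
        if error is None:
            return candidate, feedback
        feedback.append(error)
    return None, feedback


def toy_propose(feedback: list[str]) -> str:
    # Hypothetical stand-in for a model: repeats itself once, then changes approach.
    return "patch-a" if len(feedback) < 2 else "patch-b"


def toy_check(candidate: str) -> Optional[str]:
    # Hypothetical stand-in for a test suite: only patch-b passes.
    return None if candidate == "patch-b" else "test_login failed"


result, history = run_with_recovery(toy_propose, toy_check)
print(result, history)
```

In the toy run the first candidate fails, the repeat is caught without being executed, and the third proposal succeeds. A model that reads its feedback needs the guard less often, but the guard keeps a long-running agent from burning attempts on an approach that has already failed.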

Extended context handling: Previous models lost track of long-running tasks, whereas the latest models stay coherent over contexts of up to a million tokens. Developers can therefore supply entire codebases or full product requirements instead of isolated snippets.

Implications for Developers

Theo advises developers to redesign the model’s environment so it receives actionable feedback, to shrink “scaffolding” (excessive prompt engineering and patchwork around older models), and to adopt forward‑looking evaluation suites that test tasks the model cannot yet solve.

He also recommends giving agents front-end interaction capabilities, such as clicking UI elements and validating page state, so that they can run the full loop autonomously: execute → verify → correct → re-execute.
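The execute → verify → correct → re-execute loop can be sketched in a few lines. This is an illustration under assumptions, not a description of any Anthropic tooling: `FakePage` is a hypothetical stand-in for a real browser, and `toy_correct` stands in for the model revising its plan after seeing what went wrong.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FakePage:
    """Hypothetical stand-in for a browser page; a real agent would drive a browser."""
    state: dict = field(default_factory=lambda: {"cart_count": 0})

    def click(self, element_id: str) -> None:
        # Only the correctly named button changes the page state.
        if element_id == "add-to-cart":
            self.state["cart_count"] += 1


def close_the_loop(
    page: FakePage,
    plan: list,
    expected: dict,
    correct: Callable[[list, dict], list],
    max_rounds: int = 3,
):
    """Execute the plan, verify page state, and re-plan on any mismatch."""
    log = []
    for round_no in range(1, max_rounds + 1):
        for element_id in plan:  # execute
            page.click(element_id)
        mismatches = {           # verify: observed vs. expected state
            key: (page.state.get(key), want)
            for key, want in expected.items()
            if page.state.get(key) != want
        }
        log.append((round_no, list(plan), mismatches))
        if not mismatches:
            return True, log
        plan = correct(plan, mismatches)  # correct, then re-execute next round
    return False, log


def toy_correct(plan: list, mismatches: dict) -> list:
    # Hypothetical correction step: fix the mistyped element id.
    return ["add-to-cart"]


page = FakePage()
ok, log = close_the_loop(page, ["add-to-crt"], {"cart_count": 1}, toy_correct)
print(ok, len(log), page.state)
```

The key design choice is that verification reads the observed page state rather than trusting that the action succeeded. That check is what gives the agent actionable feedback to correct against.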

Future‑Ready Product Strategies

Developers should dynamically refresh their evals, aim for ambitious, unmet tasks, and reduce reliance on legacy prompt constraints. By granting models appropriate autonomy and integrating adaptive thinking mechanisms, products can harness the full potential of Claude’s evolving intelligence.
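As one possible reading of "dynamically refresh," an eval suite can be partitioned by per-task pass rate so that saturated tasks are retired and unsolved tasks are kept as forward-looking targets. The sketch below assumes a 0.95 saturation threshold and uses made-up task names and outcomes. None of these values come from the talk.

```python
def split_eval_tasks(
    results: dict[str, list[bool]],
    saturation_threshold: float = 0.95,  # assumed cutoff, not from the source
) -> dict[str, list[str]]:
    """Group tasks into saturated, informative, and unsolved by pass rate."""
    groups: dict[str, list[str]] = {"saturated": [], "informative": [], "unsolved": []}
    for task, outcomes in results.items():
        if not outcomes:
            continue  # no runs recorded, so nothing to classify
        pass_rate = sum(outcomes) / len(outcomes)
        if pass_rate >= saturation_threshold:
            groups["saturated"].append(task)    # candidate for retirement
        elif pass_rate == 0:
            groups["unsolved"].append(task)     # forward-looking target
        else:
            groups["informative"].append(task)  # still discriminates between models
    return groups


# Illustrative data only: five runs per hypothetical task.
runs = {
    "fix-typo": [True, True, True, True, True],
    "refactor-auth": [True, False, True, False, False],
    "migrate-db": [False, False, False, False, False],
}
groups = split_eval_tasks(runs)
print(groups)
```

Re-running this split whenever a new model is adopted shows which tasks no longer separate models and which ambitious tasks remain out of reach.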

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: AI agents, Claude, Anthropic, SWE-bench, Close the Loop, Eval design, Model planning
Written by

Machine Heart

Professional AI media and industry service platform
