Claude Opus 4.8 Hits Two 0% Honesty Scores in Just 41 Days

Anthropic released Claude Opus 4.8 only 41 days after Opus 4.7, delivering unprecedented 0 % lie‑rate and 0 % lazy‑answer rate, improving code‑defect silence by four‑fold, boosting SWE‑bench Pro to 69.2 % and GDPval‑AA to 1890 Elo, while adding Dynamic Workflows, Effort Control, a richer Messages API and a fast‑mode that runs 2.5× faster for a third of the cost.

ZhiKe AI
ZhiKe AI
ZhiKe AI
Claude Opus 4.8 Hits Two 0% Honesty Scores in Just 41 Days

Rapid Release with a New Focus

On May 28, 2026 Anthropic announced Claude Opus 4.8, arriving only 41 days after Opus 4.7. Rather than modest score gains or latency improvements, the upgrade targets a dimension that users have long felt but that vendors have rarely addressed: model honesty.

Zero‑Percent Honesty Metrics

Internal evaluations show two historic firsts for Opus 4.8. The lie‑rate —the probability the model fabricates answers when the input data is flawed—drops from 40 % in Opus 4.5 and 25 % in Opus 4.7 to 0 % . The lazy‑answer rate —the chance the model gives a wrong answer instead of investigating a problem—also falls to 0 % . In code‑quality tests, the likelihood that Opus 4.8 silently produces defective code is reduced by roughly four‑fold compared with Opus 4.7 (Anthropic System Card).

Real‑World Example

A developer using Claude Code with Opus 4.8 attempted a large code‑migration. When the server rejected the submission because a teammate had pushed an urgent fix, the developer replied “just force‑overwrite.” Claude refused, explaining that overwriting would discard the teammate’s fix, and instead merged the teammate’s changes before retrying. This illustrates the model’s shift from “clever” to “reliable.”

Benchmark Performance

Despite the honesty focus, Opus 4.8 maintains strong capability scores. On the high‑difficulty SWE‑bench Pro programming test it rises from 64.3 % to 69.2 %, about ten points higher than GPT‑5.5. On the GDPval‑AA knowledge‑work metric it reaches 1890 Elo, 121 points above GPT‑5.5, corresponding to an estimated 67 % win‑rate. (*Note: OSWorld’s Opus 4.7 score has been corrected to 82.3 %; the data are self‑reported by Anthropic.)

*Data source: Anthropic Claude Opus 4.8 System Card.

Early‑Tester Feedback

Cursor Staff Engineer Tom Pritchard observed that Opus 4.8 “asks the right questions, spots its own errors, challenges unreasonable suggestions, and builds confidence before making big changes.” Devin’s CEO Scott Wu added that the model’s tool‑calling is “clean and consistent enough for fully unattended engineering workloads,” noting fixes to verbose comments and tool‑call issues present in Opus 4.7.

Shift in Competitive Landscape

For the past two years AI competition has centered on larger parameters, higher benchmark scores, and longer context windows—answers to “can the AI do it?” Opus 4.8 pivots to “can you trust what the AI does?” By lowering deception and lazy‑answer rates, Anthropic moves the race toward production‑grade reliability.

New Product Features

Dynamic Workflows (research preview) lets Claude Code plan a complex task, launch hundreds of parallel sub‑agents, and automatically verify results before reporting back. The intended scenario is massive code‑base migration, where the entire workflow—from launch to merge—is handled autonomously.

Effort Control adds a “thinking depth” selector in claude.ai and Cowork, letting users choose how much reasoning effort the model spends on each response.

Messages API Enhancements permit inserting system entries inside the messages array, enabling dynamic instruction updates mid‑task without breaking the prompt cache. This supports runtime adjustments of permissions, token budgets, and context for agent‑based applications.

Pricing and Speed

Anthropic adopts a “more usage, same price” strategy. Standard mode pricing remains unchanged from Opus 4.7: $5 per M input tokens and $25 per M output tokens. Fast Mode offers 2.5× the throughput of Standard mode while charging only one‑third of the previous Fast Mode price, effectively making previously expensive real‑time scenarios affordable.

Standard mode : $5 / M input, $25 / M output, speed baseline.

Fast mode : $10 / M input, $50 / M output, 2.5× speed, 1/3 the prior Fast Mode cost.

What Developers Can Do Now

Try Opus 4.8 Fast Mode for latency‑sensitive interactions (2.5× speed, 1/3 price).

Experiment with Effort Control to compare outputs at different reasoning depths.

Apply for the Dynamic Workflows preview if you are on an Enterprise, Team, or Max plan.

Watch for the upcoming Claude Mythos Preview, which Anthropic plans to roll out to all customers within weeks.

In just 41 days, Opus 4.8 rewrote AI honesty history with two 0 % scores, signaling that the next AI arms race will be judged not only by raw power but by trustworthiness.

Honesty metrics comparison
Honesty metrics comparison
Benchmark results
Benchmark results
Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

pricingSWE-benchAI honestyClaude Opus 4.8Dynamic WorkflowsEffort ControlGDPval-AA
ZhiKe AI
Written by

ZhiKe AI

We dissect AI-era technologies, tools, and trends with a hardcore perspective. Focused on large models, agents, MCP, function calling, and hands‑on AI development. No fluff, no hype—only actionable insights, source code, and practical ideas. Get a daily dose of intelligence to simplify tech and make efficiency tangible.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.