How 16 Claude Agents Burned $20K to Build a C Compiler in Opus 4.6

Anthropic’s midnight release of Claude Opus 4.6 showcased a $20,000 (≈ ¥144,000) stress test in which 16 Claude agents collaboratively wrote a Linux‑compatible C compiler as a 100,000‑line Rust codebase, while the model also gained deep Excel/PowerPoint integration and lifted finance benchmark scores by up to 23 percentage points.

AI Insight Log

$20K "Brute‑Force" Stress Test

Researcher Nicholas Carlini tasked a team of 16 Claude agents with writing a C compiler from scratch, one capable of compiling the Linux 6.9 kernel, QEMU, and FFmpeg, and of running the classic game Doom.

API‑call cost: $20,000 (≈ ¥144,000).

Workload: 2,000 Claude‑Code sessions.

Output: a 100,000‑line Rust project.

Result: the generated compiler successfully built Linux 6.9, QEMU, FFmpeg and passed the Doom execution test.

Agentic Workflow

Parallel collaboration: 16 Claude instances operated concurrently on the same Git repository.

Resource locking: File‑lock mechanisms prevented edit conflicts when an agent modified a file.

Self‑correction: Upon encountering bugs, agents inspected logs, wrote test cases, and patched the code similarly to human engineers.
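The file‑lock coordination described above can be sketched in Python. Anthropic has not published the actual mechanism, so this is a minimal illustration assuming a simple `.lock`‑file convention; the agent ID and file names are invented for the example:

```python
import os
import tempfile
import time
from pathlib import Path

def acquire_lock(path: Path, agent_id: str, timeout: float = 5.0) -> Path:
    """Atomically create <path>.lock; O_EXCL fails while another agent holds it."""
    lock = path.with_name(path.name + ".lock")
    deadline = time.monotonic() + timeout
    while True:
        try:
            fd = os.open(lock, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.write(fd, agent_id.encode())
            os.close(fd)
            return lock
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"{agent_id}: {path.name} is still locked")
            time.sleep(0.05)

def release_lock(lock: Path) -> None:
    lock.unlink(missing_ok=True)

# Usage: one agent patches a shared source file under the lock.
workdir = Path(tempfile.mkdtemp())
src = workdir / "codegen.rs"
src.write_text("// shared file\n")
lock = acquire_lock(src, "agent-07")
try:
    src.write_text(src.read_text() + "// patch from agent-07\n")
finally:
    release_lock(lock)
```

The `O_CREAT | O_EXCL` pair makes lock creation atomic, so two agents racing for the same file cannot both succeed.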

The compiled code runs, but the binaries it produces are slower than those from an unoptimized GCC build, and the generated Rust does not yet match the quality of code written by expert Rust developers.

Financial‑Domain "Dimensionality Reduction"

Opus 4.6 adds deep integration with Excel and PowerPoint, enabling complex multi‑sheet analysis, pivot‑table editing, chart modification, finance‑grade formatting, and template‑driven slide generation.
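As an illustration of what pivot‑style analysis computes (this is not Claude's actual Excel integration, which operates on real workbooks), here is a minimal stdlib‑only sketch that aggregates toy revenue rows the way a spreadsheet pivot table would:

```python
from collections import defaultdict

def pivot(rows, index, column, value):
    """Sum `value` grouped by (index, column): the core of a pivot table."""
    table = defaultdict(float)
    for row in rows:
        table[(row[index], row[column])] += row[value]
    return dict(table)

# Invented rows standing in for data pulled from multiple worksheets.
rows = [
    {"region": "EMEA", "quarter": "Q1", "revenue": 120.0},
    {"region": "EMEA", "quarter": "Q2", "revenue": 135.0},
    {"region": "APAC", "quarter": "Q1", "revenue": 90.0},
    {"region": "APAC", "quarter": "Q2", "revenue": 110.0},
]

summary = pivot(rows, "region", "quarter", "revenue")
print(summary[("EMEA", "Q2")])  # 135.0
```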

Cowork mode: Granting Claude access to a desktop folder turns the model into an "invisible colleague" that can read, edit, and create files, run multiple analysis tasks in parallel, and draft or revise documents while searching for information.
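The "run multiple analysis tasks in parallel over a folder" pattern can be sketched with Python's standard library; the folder contents and the line‑count "analysis" below are invented stand‑ins for the kinds of tasks the article describes:

```python
import concurrent.futures
import tempfile
from pathlib import Path

def analyze(path: Path) -> tuple[str, int]:
    """Stand-in analysis task: count the lines in one document."""
    return path.name, len(path.read_text().splitlines())

# Invented folder contents standing in for a user's shared workspace.
folder = Path(tempfile.mkdtemp())
(folder / "notes.txt").write_text("line 1\nline 2\n")
(folder / "draft.txt").write_text("only line\n")

# Fan the per-file tasks out to a thread pool and collect the results.
with concurrent.futures.ThreadPoolExecutor() as pool:
    results = dict(pool.map(analyze, sorted(folder.glob("*.txt"))))
print(results)  # {'draft.txt': 1, 'notes.txt': 2}
```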

Benchmark Gains

Internal "Real‑World Finance" evaluation (≈ 50 investment‑bank and private‑equity cases) shows up to a 23‑percentage‑point improvement over Sonnet 4.5.

Third‑party Vals AI TaxEval test: 76.0 % score.

Agentic tool‑use accuracy on the Telecom dataset: 99.3 %.

GPQA Diamond (graduate‑level reasoning) score: 91.3 % (vs. 87.0 % for Opus 4.5 and 91.9 % for Gemini 3 Pro).

Long‑context support: up to 1 million tokens.

SWE‑bench Verified (agentic coding) score: 80.8 % (down 0.1 percentage points from Opus 4.5’s 80.9 %), indicating a shift in emphasis toward planning and multi‑step task stability.
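One clarification on how these deltas are measured: figures like the 23‑point finance gain and the 0.1‑point SWE‑bench dip are differences in percentage points, not relative percent changes. A small sketch of the distinction:

```python
def pp_change(new: float, old: float) -> float:
    """Absolute change in percentage points between two benchmark scores."""
    return round(new - old, 1)

def relative_change(new: float, old: float) -> float:
    """Relative change, as a percent of the old score."""
    return round((new - old) / old * 100, 2)

print(pp_change(80.8, 80.9))        # -0.1 percentage points
print(relative_change(80.8, 80.9))  # -0.12 percent, relative
```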

The $20K C‑compiler prototype remains experimental, but it demonstrates that a modestly budgeted AI team can carry a clean‑room implementation of a highly complex software project to completion on its own.

[Image: Claude Opus 4.6 stress test]
[Image: Financial integration screenshot]
[Image: Opus 4.6 benchmark comparison]

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

AI code generation · large language model · Financial AI · Agentic workflow · Claude Opus
Written by

AI Insight Log

Focused on sharing: AI programming | Agents | Tools
