How 16 Claude Agents Burned $20K to Build a C Compiler with Opus 4.6
Anthropic’s midnight release of Claude Opus 4.6 showcased a $20,000 “stress test” in which 16 Claude agents collaboratively wrote a Linux‑compatible C compiler, producing a 100,000‑line Rust codebase; the model also gained deep Excel/PowerPoint integration and lifted finance benchmark scores by up to 23 percentage points.
$20K "Brute-Force" Stress Test
Researcher Nicholas Carlini assembled a team of 16 Claude agents to write a C compiler from scratch, with the goal of compiling the Linux 6.9 kernel, QEMU, and FFmpeg, and of running the classic game Doom.
API‑call cost: $20,000 (≈ ¥144,000).
Workload: 2,000 Claude‑Code sessions.
Output: a 100,000‑line Rust project.
Result: the generated compiler successfully built Linux 6.9, QEMU, FFmpeg and passed the Doom execution test.
Agentic Workflow
Parallel collaboration: 16 Claude instances operated concurrently on the same Git repository.
Resource locking: File‑lock mechanisms prevented edit conflicts when an agent modified a file.
Self‑correction: Upon encountering bugs, agents inspected logs, wrote test cases, and patched the code similarly to human engineers.
The compiler works, but the binaries it produces run slower than code compiled by GCC with optimizations disabled, and its Rust source does not yet match the quality of code from expert Rust developers.
A Decisive Edge in Finance
Opus 4.6 adds deep integration with Excel and PowerPoint, enabling complex multi‑sheet analysis, pivot‑table editing, chart modification, finance‑grade formatting, and template‑driven slide generation.
Cowork mode: Granting Claude access to a desktop folder turns the model into an "invisible colleague" that can read, edit, and create files, run multiple analysis tasks in parallel, and draft or revise documents while searching for information.
Benchmark Gains
Internal "Real‑World Finance" evaluation (≈ 50 investment‑bank and private‑equity cases) shows a 23‑point improvement over Sonnet 4.5.
Third‑party Vals AI TaxEval test: 76.0 % score.
Agentic tool‑use accuracy on the Telecom dataset: 99.3 %.
GPQA Diamond (graduate‑level reasoning) score: 91.3 % (vs. 87.0 % for Opus 4.5 and 91.9 % for Gemini 3 Pro).
Long‑context support: up to 1 million tokens.
SWE‑bench Verified (agentic coding) score: 80.8 % (down 0.1 percentage points from Opus 4.5’s 80.9 %), indicating a shift of emphasis toward planning and multi‑step task stability.
The $20K C‑compiler remains an experimental prototype, but it demonstrates that an AI team on a modest budget can independently complete a clean‑room implementation of a highly complex piece of software.