How 16 Claude Agents Burned $20K to Build a C Compiler with Opus 4.6
Anthropic’s midnight release of Claude Opus 4.6 showcased a $20,000 “stress test” in which 16 Claude agents collaboratively wrote a Linux‑compatible C compiler, producing a 100,000‑line Rust codebase; the model also gained deep Excel/PowerPoint integration and lifted finance benchmark scores by up to 23 percentage points.
$20K "Brute-Force" Stress Test
Researcher Nicholas Carlini assembled a team of 16 Claude agents to write a C compiler from scratch, with the goal of compiling the Linux 6.9 kernel, QEMU, and FFmpeg, and of running the classic game Doom.
API‑call cost: $20,000 (≈ ¥144,000).
Workload: 2,000 Claude‑Code sessions.
Output: a 100,000‑line Rust project.
Result: the generated compiler successfully built Linux 6.9, QEMU, FFmpeg and passed the Doom execution test.
Agentic Workflow
Parallel collaboration: 16 Claude instances operated concurrently on the same Git repository.
Resource locking: File‑lock mechanisms prevented edit conflicts when an agent modified a file.
Self‑correction: Upon encountering bugs, agents inspected logs, wrote test cases, and patched the code similarly to human engineers.
The compiler works, but the binaries it produces run slower than code compiled by GCC with optimizations disabled, and its Rust source does not yet match the quality of code from expert Rust developers.
A Decisive Edge in Finance
Opus 4.6 adds deep integration with Excel and PowerPoint, enabling complex multi‑sheet analysis, pivot‑table editing, chart modification, finance‑grade formatting, and template‑driven slide generation.
Cowork mode: Granting Claude access to a desktop folder turns the model into an "invisible colleague" that can read, edit, and create files, run multiple analysis tasks in parallel, and draft or revise documents while searching for information.
Benchmark Gains
Internal "Real‑World Finance" evaluation (≈ 50 investment‑bank and private‑equity cases) shows a 23‑point improvement over Sonnet 4.5.
Third‑party Vals AI TaxEval test: 76.0 % score.
Agentic tool‑use accuracy on the Telecom dataset: 99.3 %.
GPQA Diamond (graduate‑level reasoning) score: 91.3 % (vs. 87.0 % for Opus 4.5 and 91.9 % for Gemini 3 Pro).
Long‑context support: up to 1 million tokens.
SWE‑bench Verified (agentic coding) score: 80.8 % (down 0.1 percentage points from Opus 4.5’s 80.9 %), indicating a shift of emphasis toward planning and multi‑step task stability.
The $20K C‑compiler remains an experimental prototype, but it demonstrates that an AI team on a modest budget can independently complete a clean‑room implementation of a highly complex piece of software.