Do Large‑Model Code Generators Really Excel? ARC‑AGI‑2/3 Reveals the Harsh Truth

While recent model releases boast near‑perfect scores on benchmarks like MMLU and HumanEval, the ARC‑AGI‑2 and ARC‑AGI‑3 leaderboards expose a stark gap between headline numbers and genuine programming intelligence once cost, fluid reasoning, and real‑world applicability are taken into account.


Benchmark hype vs. real‑world performance

Model release notes in 2026 repeatedly highlight near‑ceiling MMLU scores, record‑breaking SWE‑bench results, and saturated HumanEval numbers, leading to claims that AI code generation has caught up with or surpassed humans. Yet when the same models are applied to proprietary codebases, obscure private protocols, or legacy business logic, they still fall into traps, fabricate APIs, and miss edge cases.

Why conventional benchmarks fall short

Most public leaderboards evaluate "seen" problems—tasks that resemble material present in the training data. Real development work is highly personalized, non‑standard, and filled with abstract reasoning problems that never appeared in the models' pre‑training corpora.

ARC‑AGI: measuring fluid intelligence

Proposed by François Chollet in 2019, ARC‑AGI shifts evaluation from knowledge recall to fluid intelligence: the ability to abstract from a handful of input‑output examples and generalize to unseen situations. Each ARC item presents a colored‑grid puzzle with a few example pairs; the underlying rule (e.g., recolor neighbors, mirror across an axis) appears only once in the entire dataset, forcing on‑the‑spot abstraction.
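To make the format concrete, here is a minimal sketch of how an ARC‑style item is commonly represented: a few train input/output grid pairs plus a test input, with grids as small arrays of color indices (0–9) and scoring by exact match of the predicted output grid. The specific grids and the mirror rule below are made‑up illustrations, not actual benchmark items.

```python
# Minimal sketch of an ARC-style task. Grids are small 2D arrays of color
# indices (0-9); a task gives a few train input/output pairs plus a test input.
# The task below is a made-up illustration (rule: mirror each row left-right),
# not a real benchmark item.

Grid = list[list[int]]

task = {
    "train": [
        {"input":  [[1, 0, 0],
                    [2, 3, 0]],
         "output": [[0, 0, 1],
                    [0, 3, 2]]},
        {"input":  [[5, 5, 0],
                    [0, 8, 0]],
         "output": [[0, 5, 5],
                    [0, 8, 0]]},
    ],
    "test": [{"input": [[4, 0, 7],
                        [0, 9, 0]]}],
}

def mirror_lr(grid: Grid) -> Grid:
    """Candidate rule: mirror each row left-to-right."""
    return [list(reversed(row)) for row in grid]

# A solver has to confirm its candidate rule on the train pairs, then apply it
# to the test input; ARC scoring is exact match of the predicted output grid.
assert all(mirror_lr(pair["input"]) == pair["output"] for pair in task["train"])
print(mirror_lr(task["test"][0]["input"]))   # [[7, 0, 4], [0, 9, 0]]
```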

ARC‑AGI‑2: tougher puzzles, higher stakes

Released in 2025, ARC‑AGI‑2 adds more devious traps and penalizes "symbolic shortcuts" that rely on brute‑force search. The leaderboard (as of April 2026) shows:

GPT‑5.5 (xHigh) – ~85% (≈ $10 per task)

GPT‑5.4 Pro (xHigh) – ~84%

Gemini 3.1 Pro (Preview) – ~77%

Claude 4.7 (Max) – ~73%

Claude Opus 4.6 (Medium) – ~66%

Claude Sonnet 4.6 (High) – ~60%

GPT‑5.4 (Medium) – ~54%

Grok 4 (Refine) – ~28%

o3 (High) – ~5%

GPT‑4.5 – ~1%

Human adults average around 60% on the same set. The chart also plots cost per task, revealing that the top‑scoring models spend close to $10 per question, whereas cheaper models plateau near 30%.

This suggests that current high scores stem more from massive compute and multi‑step prompting than from a genuine leap in intelligence.

ARC‑AGI‑3: interactive, zero‑shot reasoning

ARC‑AGI‑3 transforms static puzzles into a minimalist video‑game‑like environment where the agent must explore, learn on the fly, remember observations, and set its own sub‑goals. It evaluates four capabilities (a toy sketch of the interaction loop follows the list):

On‑the‑fly Learning

Exploration

Memory

Goal Acquisition
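The real ARC‑AGI‑3 environments are not reproduced here; the following is only a toy sketch of the observe‑act loop such an interactive benchmark implies, with a hypothetical ToyEnv environment, action set, and ExploringAgent class. It illustrates exploration, on‑the‑fly learning, and memory in miniature; genuine goal acquisition (the agent setting its own sub‑goals) is precisely what a hard‑coded loop like this does not capture.

```python
# Toy sketch of an observe-act loop for an interactive benchmark.
# ToyEnv, the action names, and ExploringAgent are hypothetical placeholders,
# not the ARC-AGI-3 API.

import random

class ToyEnv:
    """Stand-in environment: the agent must discover which action makes progress."""
    def __init__(self) -> None:
        self.position, self.goal = 0, 5

    def step(self, action: str) -> tuple[int, bool]:
        self.position += 1 if action == "right" else 0
        return self.position, self.position >= self.goal

class ExploringAgent:
    """Keeps a memory of which actions produced progress and exploits it."""
    def __init__(self, actions: list[str]) -> None:
        self.actions = actions
        self.memory = {a: 0 for a in actions}          # memory of past progress

    def act(self, observation: int) -> str:
        if random.random() < 0.3:                      # exploration
            return random.choice(self.actions)
        return max(self.memory, key=self.memory.get)   # exploit what was learned

    def learn(self, action: str, progress: int) -> None:
        self.memory[action] += progress                # on-the-fly learning

env, agent = ToyEnv(), ExploringAgent(["left", "right", "up", "down"])
obs, done, steps = 0, False, 0
while not done and steps < 100:
    action = agent.act(obs)
    new_obs, done = env.step(action)
    agent.learn(action, new_obs - obs)                 # progress as the signal
    obs, steps = new_obs, steps + 1
print(f"reached goal in {steps} steps" if done else "did not reach the goal")
```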

Leaderboard results are dramatically lower:

Anthropic Opus 4.6 (Max) – ~0.6%

Gemini 3.1 Pro (Preview) – ~0.45%

GPT‑5.4 (High) – ~0.25%

Human players achieve near‑100% completion, leaving a gap of roughly two orders of magnitude between the strongest models and humans on continuous, interactive tasks.

Implications for developers

Most enterprise coding tasks resemble ARC‑style challenges: deciphering undocumented internal protocols from logs, refactoring legacy systems with counter‑intuitive interfaces, and extracting rules for novel product requirements. These problems cannot be solved by copying Stack Overflow answers; they require true abstract reasoning.

Tracking a model’s progress on ARC‑AGI‑2 and ARC‑AGI‑3—and the associated cost per task—offers a more reliable indicator of whether it can genuinely assist with complex, real‑world code than marginal gains on HumanEval or similar benchmarks.

Leaderboard access and usage guidance

ARC Prize official leaderboard: https://arcprize.org/leaderboard

The page provides three tabs:

ARC‑AGI‑1: first generation, near saturation; useful for historical comparison.

ARC‑AGI‑2: current primary battlefield; recommended for focused monitoring.

ARC‑AGI‑3: future challenge (3‑5 years); watch for models that move from ~1% toward 10%.

Two practical observations from the leaderboard:

Score alone is insufficient; consider cost per task. Achieving 30% accuracy for $1 may be more valuable than 60% for $10 in production (see the cost‑per‑solve sketch after this list).

Follow trend lines rather than static rankings. Weekly updates reveal which models are advancing and which are stagnant.
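To make the first observation concrete, here is a back‑of‑the‑envelope comparison. The accuracy and price figures are the illustrative ones from the observation above, not measured values; the metric that matters in production is usually cost per solved task rather than cost per attempt.

```python
# Cost per *solved* task for the two hypothetical models in the observation
# above (figures are illustrative, not leaderboard measurements).

candidates = {
    "cheap model":   {"accuracy": 0.30, "cost_per_task": 1.00},
    "premium model": {"accuracy": 0.60, "cost_per_task": 10.00},
}

for name, model in candidates.items():
    cost_per_solve = model["cost_per_task"] / model["accuracy"]
    print(f"{name}: ${cost_per_solve:.2f} per solved task")

# cheap model:   $3.33 per solved task
# premium model: $16.67 per solved task
```

Whether the premium model is worth roughly five times more per solved task depends on how expensive a failed or retried task is in your pipeline.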

Conclusion

When the AI community debates how many more points a model can gain, ARC‑AGI reminds us that real intelligence is measured by the flash of insight when confronting an unfamiliar world, not by stacking benchmark scores.

Code example
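Below is a rough sketch of how one might track the two numbers this article argues belong together: exact‑match accuracy on ARC‑format tasks and cost per task. The solve placeholder, the local tasks/*.json files, and the flat per‑call price are assumptions for illustration, not a real model integration or real pricing.

```python
# Evaluate a solver over local ARC-format task files and report both accuracy
# and cost per task. solve(), the tasks/*.json layout, and PRICE_PER_CALL are
# placeholders -- swap in your own model call and cost accounting.

import glob
import json

def solve(train_pairs, test_input):
    """Placeholder solver: replace with a call to your model of choice."""
    return test_input  # identity guess; usually wrong

PRICE_PER_CALL = 0.05   # assumed flat cost per solver invocation, in dollars

solved, attempted, spend = 0, 0, 0.0
for path in glob.glob("tasks/*.json"):              # ARC-format task files
    with open(path) as f:
        task = json.load(f)
    for test in task["test"]:
        prediction = solve(task["train"], test["input"])
        spend += PRICE_PER_CALL
        attempted += 1
        solved += int(prediction == test.get("output"))   # exact-match scoring

if attempted:
    print(f"accuracy:      {solved / attempted:.1%}")
    print(f"cost per task: ${spend / attempted:.2f}")
```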
