CATArena: A Competitive Benchmark That Turns Agent Scoring into Evolutionary Learning
CATArena introduces a tournament‑style evaluation framework in which AI agents iteratively code, compete, and improve across classic board games. Three quantitative score dimensions measure strategy programming, global learning, and generalization, revealing how different LLM‑based agents learn and adapt over multiple rounds.
PaperAgent introduces CATArena, a new benchmark from the AGI‑Eval community (Shanghai Jiao Tong University & Meituan) that shifts large‑model agent evaluation from static test‑question scoring to an iterative, tournament‑based learning environment.
Framework Overview
CATArena runs agents in four classic games (Texas Hold'em, Bridge, Chess, Gomoku). Each agent writes code for the game, then engages in multi‑round head‑to‑head matches, observing opponents, replaying logs, and updating its own strategy, thereby achieving self‑evolution and peer learning.
Key resources: https://github.com/AGI-Eval-Official/CATArena and the paper https://arxiv.org/abs/2510.26852. An online replay platform is available at https://catarena.ai/replays.
Three‑Dimensional Scoring (Core Formulas)
Strategy Programming Ability: S_i = avg_{j≠i}(W_{i,j}^1) – the average win rate of agent i's initial (round‑1) code against every opponent j (its "combat power").
Global Learning: L_i = avg_{n≥2}(G_i^n − G_i^1) – the average improvement of agent i's round‑n overall win rate over its round‑1 win rate; positive values mean the agent "learns to play better".
Generalization Ability: U_i = B_i^{1var} − B_i^{1std} – win rate under variant rules minus win rate under standard rules; positive values indicate quick adaptation to new rules.
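The three scores above can be computed directly from win‑rate tables. Below is a minimal sketch, assuming win rates are stored as NumPy arrays; the data layout and function names are assumptions for illustration, not the paper's released code.

```python
# Sketch of CATArena's three scoring dimensions.
# Assumed data layout (not from the paper's codebase):
#   W1[i, j] = agent i's round-1 win rate against agent j
#   G[i, n]  = agent i's overall win rate in round n+1
import numpy as np

def strategy_programming_score(W1: np.ndarray, i: int) -> float:
    """S_i: mean round-1 win rate of agent i against all opponents j != i."""
    mask = np.arange(W1.shape[1]) != i
    return float(W1[i, mask].mean())

def global_learning_score(G: np.ndarray, i: int) -> float:
    """L_i: mean improvement of rounds n >= 2 over round 1."""
    return float((G[i, 1:] - G[i, 0]).mean())

def generalization_score(b_variant: float, b_standard: float) -> float:
    """U_i: win rate under variant rules minus win rate under standard rules."""
    return b_variant - b_standard
```

A positive `global_learning_score` corresponds to the paper's "learning better" reading; a positive `generalization_score` means the agent adapted quickly when the rules changed.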
Experimental Design
Two participant categories were defined:
Minimal Agent : built on the ADK framework with six mainstream LLMs (DeepSeek‑3.1, Qwen3‑Coder‑480B, Doubao‑Seed‑1.6, GPT‑5, Claude‑4‑Sonnet, Gemini‑2.5‑pro).
Commercial Code Agent : includes Claude‑Code, Codex, Gemini‑CLI, Qwen‑Coder, plus the best Minimal agents for a second round.
An additional baseline, LLM‑Player , lets the LLM generate moves directly without coding, to compare "coding" vs. "pure reasoning" abilities.
Each match is repeated four times and averaged; each tournament consists of N=4 iterative rounds.
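The tournament structure just described (N=4 iterative rounds, each pairing replayed four times and averaged) can be sketched as a loop. Everything below is a hypothetical stand‑in for the framework's actual interfaces: `play_match`, the `skill` field, and the seeding scheme are illustrative assumptions.

```python
# Hedged sketch of the tournament loop: N rounds, each pairing repeated
# N_REPEATS times and averaged. Not CATArena's real API.
import itertools
import random
from statistics import mean

N_ROUNDS, N_REPEATS = 4, 4  # one tournament = 4 rounds; each pairing replayed 4x

def play_match(a, b, seed):
    """Hypothetical single match: returns 1.0 if agent `a` wins, else 0.0.
    A real CATArena match would run the two agents' game code head-to-head."""
    rng = random.Random(seed)
    return 1.0 if rng.random() < a["skill"] / (a["skill"] + b["skill"]) else 0.0

def run_tournament(agents):
    """Returns per-round overall win rates G^n for each agent."""
    history = []
    for rnd in range(N_ROUNDS):
        wins = {a["name"]: [] for a in agents}
        for a, b in itertools.permutations(agents, 2):
            # Repeat each pairing N_REPEATS times and average the outcomes.
            wr = mean(play_match(a, b, seed=rnd * 10_000 + k)
                      for k in range(N_REPEATS))
            wins[a["name"]].append(wr)
        history.append({name: mean(ws) for name, ws in wins.items()})
        # In CATArena, each agent would now replay the round's logs and
        # revise its strategy code before the next round begins.
    return history
```

The `history` list plays the role of G_i^n in the scoring formulas: comparing later entries against the first gives the Global Learning score.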
Main Results: Who Is Strongest?
In the Minimal setting, Claude‑4‑Sonnet ranks first while the other models spread across a wide range. In the Commercial setting, performance gaps shrink, indicating that agent‑framework engineering extracts more of each model's potential.
Learning Ability
Claude‑4‑Sonnet (Minimal) shows a clear upward trajectory, demonstrating strong learning capability.
Most other agents exhibit unstable performance with no obvious trend.
Ablation: Agents vs. LLM‑Player
For highly strategic games (Chess, Gomoku), code‑based agents outperform LLM‑Players because explicit code can better exploit game rules.
For psychology/probability‑driven games (Texas Hold'em), LLM‑Players often win, as strategic nuance is harder to encode in code.
These findings confirm that Strategy Coding ≠ Reasoning ; CATArena fills a benchmark gap by measuring coding‑based strategic ability rather than pure inference.
Additional Tracks: ML & Multilingual
ML Track : Agents generate data, design code, and train models on GPUs. Most agents only implement basic models, leading to modest performance differences.
Multilingual Track : The same strategy is implemented in Python, JavaScript, and Go. Qwen3‑Coder shows the smallest variance and best cross‑language consistency; GPT‑5 and Doubao‑Seed display strong Python performance but drop sharply in JS/Go, highlighting challenges in abstract strategy transfer.
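Cross‑language consistency of the kind described above can be summarized as the spread of one agent's win rates across language tracks. The sketch below uses the population standard deviation as that spread; the specific numbers are illustrative assumptions, not results from the paper.

```python
# Hedged sketch: quantifying cross-language consistency as the standard
# deviation of one agent's win rates across the Python/JS/Go tracks.
from statistics import pstdev

def language_consistency(scores_by_lang: dict) -> float:
    """Spread of win rates across language tracks; lower = more consistent."""
    return pstdev(scores_by_lang.values())

# Illustrative numbers only (not from the paper): a consistent agent vs.
# one that is strong in Python but drops sharply in JS/Go.
qwen3_coder = {"python": 0.62, "javascript": 0.60, "go": 0.59}
gpt5 = {"python": 0.71, "javascript": 0.48, "go": 0.45}
```

Under this measure, a smaller value matches the paper's description of Qwen3‑Coder as the most cross‑language consistent agent.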
Future Directions
CATArena plans to incorporate more complex RTS, wargame, and economic simulations, and to introduce human‑in‑the‑loop mechanisms so agents can learn from expert human players.