What Happens When a Code Agent Faces 1,000+ Files? CoDA‑Bench Exposes the Real Bottleneck

CoDA‑Bench, a new benchmark from RUC, places code agents in a sandbox containing more than a thousand heterogeneous data files and requires them to locate the correct dataset, write analysis code, and produce answers. The best current agents reach only about 61% accuracy overall, and they struggle mainly with data discovery rather than code generation.

Machine Heart

RUC’s research team introduced CoDA‑Bench, a benchmark that places a Code Agent in a Linux sandbox with more than 1,000 data files. The agent receives only a natural‑language task description, with no information about file names, paths, or schemas, so it must discover the relevant data before writing analysis code.

The benchmark evaluates two capabilities:

Data Intelligence: the ability to discover, understand, and select the correct data source in a complex environment.

Code Intelligence: the ability to write correct analysis code based on the discovered data and obtain the right result.

To construct realistic data environments, the team analyzed co‑occurrence relationships among datasets in Kaggle notebooks. Datasets that frequently appear together in the same notebook were grouped into semantic “communities.” Each task’s distractor files are drawn from the same community, making them topically and structurally similar to the target data, which prevents agents from relying on simple keyword matching.
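The grouping step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the article does not say which community-detection method was used, so connected components over a thresholded co-occurrence graph stand in for it here. The notebook list, dataset names, and the `min_count` threshold are all invented for the example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_counts(notebooks):
    """Count how often each pair of datasets is used in the same notebook."""
    counts = Counter()
    for datasets in notebooks:
        for a, b in combinations(sorted(set(datasets)), 2):
            counts[(a, b)] += 1
    return counts


def communities(notebooks, min_count=2):
    """Group datasets linked by at least `min_count` shared notebooks.

    Assumption: connected components replace whatever community-detection
    method the benchmark actually uses.
    """
    graph = defaultdict(set)
    for (a, b), n in cooccurrence_counts(notebooks).items():
        if n >= min_count:
            graph[a].add(b)
            graph[b].add(a)

    seen, groups = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:  # depth-first walk over one component
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# Hypothetical notebook -> datasets mapping, invented for illustration.
notebooks = [
    ["housing_prices", "zip_codes"],
    ["housing_prices", "zip_codes", "mortgage_rates"],
    ["mortgage_rates", "housing_prices"],
    ["movie_ratings", "movie_metadata"],
    ["movie_ratings", "movie_metadata"],
    ["movie_ratings", "zip_codes"],
]
groups = communities(notebooks)
print(groups)
```

In this toy input the three housing-related datasets end up in one group and the two movie datasets in another. Distractors for a housing task would then be sampled from the housing group, which is what makes them hard to rule out by topic alone.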

Task generation proceeds in reverse: the researchers extract reproducible analysis results (statistics, rankings, ratios, aggregations) from real Kaggle notebooks, treat these as solution anchors, and then formulate natural‑language questions that lead to those results. This yields tasks that are verifiable, sourced from real analysis workflows, and iteratively refined to avoid obvious cues.

Evaluation covered several state‑of‑the‑art Code Agents and frameworks, including Claude Code, Codex CLI, OpenHands, and Mini‑SWE‑Agent. On the full CoDA‑Bench suite, the highest execution accuracy achieved was 61.1%. On the harder CoDA‑HARD subset, the best accuracy dropped to 49.6%.
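Execution accuracy here can be read as the share of tasks whose executed answer matches the solution anchor. The sketch below shows one plausible way to score that. The matching rules (a relative numeric tolerance, case-insensitive text comparison) and the sample answers are assumptions made for illustration; the article does not describe the benchmark's actual comparison logic.

```python
import math


def answers_match(predicted, expected, rel_tol=1e-4):
    """Compare numerically when both values parse as floats, else as text."""
    try:
        return math.isclose(float(predicted), float(expected), rel_tol=rel_tol)
    except (TypeError, ValueError):
        return str(predicted).strip().lower() == str(expected).strip().lower()


def execution_accuracy(results):
    """Fraction of (predicted, expected) pairs that match."""
    if not results:
        return 0.0
    correct = sum(answers_match(p, e) for p, e in results)
    return correct / len(results)


# Hypothetical (agent answer, solution anchor) pairs.
results = [
    ("42.0", 42),           # numeric match
    ("Toyota", "toyota "),  # text match after normalization
    (0.318, 0.32),          # outside the tolerance, counted wrong
    (None, "7"),            # agent produced no answer
]
accuracy = execution_accuracy(results)
print(f"{accuracy:.1%}")
```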

To isolate the impact of data discovery, the team ran an oracle experiment: the correct data path is given to the agent, so only the code‑generation stage is tested. The gap between the standard and oracle settings was large: for example, Claude Code + Sonnet‑4.6 rose from 45.4% to 73.1% on CoDA‑HARD, and OpenHands + GPT‑5.5 rose from 44.5% to 68.9%.
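The size of the oracle gap follows directly from the reported figures. The short calculation below uses only the accuracies quoted above; the function name `oracle_gap` is introduced here for illustration.

```python
# Reported accuracies in percent: (standard setting, oracle setting).
scores = {
    "Claude Code + Sonnet-4.6": (45.4, 73.1),
    "OpenHands + GPT-5.5": (44.5, 68.9),
}


def oracle_gap(standard, oracle):
    """Percentage points gained when the correct data path is provided."""
    return round(oracle - standard, 1)


gaps = {name: oracle_gap(s, o) for name, (s, o) in scores.items()}
for name, gap in gaps.items():
    standard, oracle = scores[name]
    print(f"{name}: {standard}% -> {oracle}% (+{gap} points)")
```

The gaps come out to 27.7 and 24.4 percentage points, so in both cases roughly a quarter of all tasks are lost before any analysis code is written.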

These results indicate that the primary bottleneck for current Code Agents is **data discovery**, not code synthesis. Benchmarks that hand the correct files to agents may therefore substantially overestimate their real‑world capability.

CoDA‑Bench therefore fills an important evaluation gap by requiring agents to (1) decide which data to use, (2) locate it in a noisy, realistic file system, (3) verify its relevance, and only then (4) write and execute analysis code.
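Steps (2) and (3) can be made concrete with a naive first pass: walk the file tree, read each CSV header, and rank files by token overlap with the task. This is a sketch of a baseline, not how any evaluated agent works, and the directory layout, file contents, and task text are invented. It also shows why such a baseline falls short in this setting.

```python
import csv
import re
import tempfile
from pathlib import Path


def tokens(text):
    """Lowercase alphanumeric tokens; underscores act as separators."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def read_header(path):
    """Return the first CSV row, or an empty list for an empty file."""
    with path.open(newline="", encoding="utf-8", errors="replace") as f:
        return next(csv.reader(f), [])


def rank_candidates(root, task, top_k=3):
    """Rank CSV files under `root` by token overlap with the task text."""
    task_tokens = tokens(task)
    scored = []
    for path in sorted(Path(root).rglob("*.csv")):
        evidence = tokens(path.stem) | tokens(" ".join(read_header(path)))
        rel = path.relative_to(root).as_posix()
        scored.append((len(task_tokens & evidence), rel))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return scored[:top_k]


# Hypothetical sandbox contents, invented for illustration.
files = {
    "sales/orders_2023.csv": "order_id,region,revenue\n1,EU,10\n",
    "sales/orders_2022.csv": "order_id,region,revenue\n1,EU,8\n",
    "hr/staff.csv": "employee_id,salary\n7,100\n",
}
task = "What was the total revenue by region in 2023?"

with tempfile.TemporaryDirectory() as root:
    for rel, text in files.items():
        target = Path(root) / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text, encoding="utf-8")
    ranking = rank_candidates(root, task)

print(ranking)
```

The 2022 file scores almost as high as the 2023 file because it shares the same schema. With distractors drawn from the same community, overlap scores like this only narrow the search, and the agent still has to inspect the contents to verify relevance before analyzing anything.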

All papers, code, and datasets are openly released for the community to experiment with and extend.
