AutoResearch Advances: RUC & Microsoft Open‑Source Arbor Gives Agents Research Memory

Arbor is an open‑source autonomous research framework from RUC’s Gaoling AI Institute and Microsoft Research. It structures the research loop as a growing hypothesis‑tree with insight back‑propagation, so agents retain hypotheses, evidence, and failures across experiments. On six real Autonomous Optimization (AO) tasks it achieves the best held‑out results, surpassing Codex and Claude Code.

Machine Learning Algorithms & Natural Language Processing

Background

In the past few years, large‑language‑model (LLM) agents have progressed from simple chatbots to tool‑use agents, and now to systems such as Codex and Claude Code that can read project code, invoke external tools, modify files, and run experiments for extended periods. However, the ability to "execute a task" is still distinct from the ability to "conduct research".

Problem Statement

Real research requires iterative hypothesis generation, experiment design, failure analysis, and the accumulation of knowledge across many trials. Existing agents excel at isolated execution (changing code, logging results, running evaluations), but they lack a persistent state that can store hypotheses, evidence, and insights. As a result, long‑running research degrades into linear trial‑and‑error.

Arbor Design

Arbor introduces Autonomous Optimization (AO), which formalizes research as a structured exploration of a Hypothesis‑Tree (HTR). The system receives an initial artifact (e.g., model code, a harness, or a data pipeline), a research goal, and an executable evaluator. The agent iteratively refines the artifact while observing only the development set, aiming for genuine improvement on a held‑out test set.
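The AO setting described above can be sketched in a few lines of Python. This is a minimal illustration of the problem setup only, not Arbor's code or its search strategy: the names `AOTask`, `optimize`, and `propose` are hypothetical, the artifact is reduced to a string, and the loop is a plain greedy refinement. The point it shows is the information boundary: the refinement loop calls only the development evaluator, and the held‑out evaluator is used only for the final report.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AOTask:
    """Hypothetical container for one AO task (names are illustrative)."""
    initial_artifact: str                       # stands in for code, a harness, or a pipeline
    goal: str                                   # natural-language research goal
    dev_evaluator: Callable[[str], float]       # visible to the agent during search
    heldout_evaluator: Callable[[str], float]   # reserved for final reporting


def optimize(task: AOTask, propose: Callable[[str, float], str], rounds: int) -> str:
    """Greedy refinement that consults only the development evaluator."""
    best = task.initial_artifact
    best_dev = task.dev_evaluator(best)
    for _ in range(rounds):
        candidate = propose(best, best_dev)
        dev_score = task.dev_evaluator(candidate)
        if dev_score > best_dev:                # keep only strict dev-set improvements
            best, best_dev = candidate, dev_score
    return best


# Toy task: the artifact is an integer encoded as a string.
# The dev optimum (3) and held-out optimum (4) differ on purpose.
task = AOTask(
    initial_artifact="0",
    goal="maximize the score",
    dev_evaluator=lambda a: -abs(int(a) - 3),
    heldout_evaluator=lambda a: -abs(int(a) - 4),
)
final = optimize(task, propose=lambda artifact, dev: str(int(artifact) + 1), rounds=10)
print("final artifact:", final)
print("held-out score:", task.heldout_evaluator(final))
```

In this toy, the search settles on the development optimum and the held‑out score still improves over the starting artifact, although it does not reach the held‑out optimum. That gap is the reason the setting reports held‑out numbers separately.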

The hypothesis‑tree grows by adding nodes that each contain four elements:

Hypothesis: a verifiable research claim (e.g., "changing this hyper‑parameter will improve accuracy").

Artifact version: the concrete code, configuration, or pipeline change associated with the hypothesis.

Experimental evidence: development‑set scores, logs, error messages, and any held‑out validation results.

Distilled insight: a reusable summary explaining why the experiment succeeded or failed, under what conditions it holds, and what should be kept, merged, or pruned.

After each experiment, Arbor back‑propagates the distilled insight up the tree, updating the global understanding of the research problem.
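As a rough illustration of the node structure and the back‑propagation step, the sketch below stores the four elements on each node and copies a node's distilled insight to every ancestor. The class and function names (`HypothesisNode`, `backpropagate`, `inherited_insights`) are assumptions made for this example. How Arbor actually merges or summarizes insights on the way up is not specified here, so a plain copy stands in for it.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class HypothesisNode:
    """One tree node holding the four elements listed above (illustrative names)."""
    hypothesis: str                                    # verifiable research claim
    artifact_ref: str = ""                             # e.g. a commit or config identifier
    evidence: dict = field(default_factory=dict)       # scores, logs, error messages
    insight: str = ""                                  # distilled, reusable summary
    parent: Optional[HypothesisNode] = field(default=None, repr=False)
    children: list = field(default_factory=list)
    inherited_insights: list = field(default_factory=list)  # received from descendants

    def add_child(self, hypothesis: str, artifact_ref: str = "") -> HypothesisNode:
        child = HypothesisNode(hypothesis, artifact_ref, parent=self)
        self.children.append(child)
        return child


def backpropagate(node: HypothesisNode) -> None:
    """Copy the node's distilled insight to every ancestor, up to the root."""
    ancestor = node.parent
    while ancestor is not None:
        ancestor.inherited_insights.append(node.insight)
        ancestor = ancestor.parent


root = HypothesisNode("improve dev accuracy of the baseline")
warmup = root.add_child("learning-rate warmup helps", artifact_ref="variant-a")
leaf = warmup.add_child("warmup of 500 steps is enough", artifact_ref="variant-b")

leaf.evidence = {"dev_accuracy": 0.71}
leaf.insight = "warmup helps only at large batch sizes"
backpropagate(leaf)

print(root.inherited_insights)
```

After the call, both the intermediate node and the root carry the leaf's insight, so a later decision made at the root can take it into account without rereading the leaf's logs.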

System Architecture

Arbor separates long‑term strategy from short‑term execution using a two‑level architecture:

Coordinator: acts as the research manager, maintaining the global hypothesis‑tree, observing the current state, proposing new hypotheses, and deciding which branches to explore, merge, or prune.

Executor: runs a single hypothesis in an isolated worktree, modifies the artifact, executes the evaluator, records artifact references, experimental phenomena, and distilled insight, then returns the structured result to the Coordinator.

This division mirrors the dual capabilities needed in real research: a global view of progress and a local ability to implement and test ideas.
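The split between the two roles can be sketched as follows. This is a simplified illustration under stated assumptions, not Arbor's implementation: the `Coordinator`, `Executor`, and `Result` classes are hypothetical, the artifact is a string, the "isolated worktree" is approximated by never mutating the shared artifact, and the Coordinator keeps a flat history rather than a full tree.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Result:
    """Structured record an Executor hands back (illustrative fields)."""
    hypothesis: str
    artifact: str
    dev_score: float
    insight: str


class Executor:
    """Tests one hypothesis against a private copy of the artifact."""

    def __init__(self, evaluator: Callable[[str], float]):
        self.evaluator = evaluator

    def run(self, base_artifact: str, hypothesis: str) -> Result:
        # Stand-in for editing code in an isolated worktree:
        # build a new artifact and leave the base untouched.
        candidate = base_artifact + "+" + hypothesis
        score = self.evaluator(candidate)
        insight = f"'{hypothesis}' scored {score:.2f} on the development set"
        return Result(hypothesis, candidate, score, insight)


class Coordinator:
    """Holds global state and decides which result to build on."""

    def __init__(self, executor: Executor, initial_artifact: str, baseline_score: float):
        self.executor = executor
        self.best_artifact = initial_artifact
        self.best_score = baseline_score
        self.history = []                       # every Result, including failures

    def step(self, hypotheses: list) -> None:
        results = [self.executor.run(self.best_artifact, h) for h in hypotheses]
        self.history.extend(results)            # failures are kept, not discarded
        top = max(results, key=lambda r: r.dev_score)
        if top.dev_score > self.best_score:
            self.best_artifact, self.best_score = top.artifact, top.dev_score


# Toy evaluator: counts "good" changes and penalizes "bad" ones.
def toy_evaluator(artifact: str) -> float:
    return float(artifact.count("good") - artifact.count("bad"))


coordinator = Coordinator(Executor(toy_evaluator), "base", baseline_score=0.0)
coordinator.step(["good_lr", "bad_init"])
coordinator.step(["good_aug", "bad_clip"])
print(coordinator.best_artifact, coordinator.best_score)
```

The design choice worth noting is that the Executor returns a structured result instead of raw logs, and the Coordinator records every result, including failed ones, before deciding what to build on.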

Experimental Evaluation

To validate Arbor’s claim of supporting general autonomous research, the authors evaluated it on six real AO tasks covering three research categories:

Model Training: optimizer and architecture design under a fixed budget.

Harness Engineering: improving the control logic of other agents (Terminal‑Bench 2.0, BrowseComp).

Data Synthesis: enhancing data‑generation pipelines for search‑agent and math‑reasoning tasks.

Each task provided an initial artifact, a natural‑language goal, a development evaluator, a held‑out test evaluator, and native metrics. Arbor was compared against two strong single‑trajectory coding agents, Codex and Claude Code, which were given similar budgets and can also read files, modify code, and run experiments.

Key results include:

On BrowseComp, Arbor raised held‑out accuracy from 45.33 % (baseline) to 67.67 %, outperforming Codex (50.00 %) and Claude Code (53.33 %).

On Math‑Reasoning Data Synthesis, Arbor’s held‑out pass‑gap improved by 19.79 points, versus 5.21 (Codex) and 7.29 (Claude Code).

On Terminal‑Bench 2.0, Arbor increased the held‑out pass rate from 69.81 % to 77.36 %.

Across all six tasks, Arbor achieved the best held‑out performance, with an average relative gain more than 2.5× that of Codex and Claude Code.

Arbor also attained state‑of‑the‑art results on MLE‑Bench Lite, achieving 86.36 % Any‑Medal with GPT‑5.5, surpassing existing baselines.
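To make the reported numbers easier to compare, the short script below converts held‑out scores into absolute gains (percentage points) and relative gains (percent of the baseline). It uses only figures quoted above for BrowseComp and Terminal‑Bench 2.0. The helper name `gains` is an assumption of this example, and the per‑task values it prints are not the six‑task average cited in the article.

```python
def gains(baseline: float, score: float) -> tuple:
    """Return (absolute gain in points, relative gain in % of baseline)."""
    absolute = score - baseline
    relative = absolute / baseline * 100
    return round(absolute, 2), round(relative, 1)


# BrowseComp held-out accuracy (%), as quoted in the article.
browsecomp_baseline = 45.33
browsecomp = {"Arbor": 67.67, "Codex": 50.00, "Claude Code": 53.33}

for system, score in browsecomp.items():
    absolute, relative = gains(browsecomp_baseline, score)
    print(f"BrowseComp  {system:12s} +{absolute:.2f} points  (+{relative:.1f}% relative)")

# Terminal-Bench 2.0 held-out pass rate (%), Arbor only.
tb_absolute, tb_relative = gains(69.81, 77.36)
print(f"Terminal-Bench 2.0  Arbor  +{tb_absolute:.2f} points  (+{tb_relative:.1f}% relative)")
```

On BrowseComp this gives gains of 22.34, 4.67, and 8.00 points for Arbor, Codex, and Claude Code respectively, and 7.55 points for Arbor on Terminal‑Bench 2.0.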

Ablation Study

Removing the hypothesis‑tree entirely dropped Any‑Medal from 81.82 % to 63.64 %. Keeping the tree but removing insight propagation gave an even lower 54.54 %. The authors read this as evidence that propagating distilled insights matters more than the hierarchical structure alone.

Analysis and Insights

The authors observe that Arbor’s advantage stems not from executing more experiments but from organizing them. The hypothesis‑tree serves at once as search space, long‑term memory, and research record, turning scattered logs into structured knowledge that guides future decisions. Token consumption is comparable to the baselines, yet Arbor achieves higher held‑out gains because compute goes toward maintaining and exploiting structured research state rather than blind trial‑and‑error.

Limitations and Future Work

While Arbor provides a framework for autonomous research, current agents still struggle to generate high‑quality hypotheses, differentiate genuine improvements from over‑fitting, and maintain reliable long‑term memory. Open questions include improving hypothesis generation, better distinguishing signal from noise, extending the approach to longer research cycles, and fostering effective collaboration between human researchers and autonomous agents.

Conclusion

Arbor demonstrates that moving from "task execution" to "autonomous research" requires a mechanism that structures multi‑round exploration into an evolving research state. By externalizing hypotheses, evidence, failures, and insights into a growing hypothesis‑tree, Arbor enables agents to accumulate and reuse knowledge, achieving superior performance on diverse real‑world research tasks.

Figure: Arbor overview (hypothesis tree and held‑out gains)
Figure: Arbor framework diagram