Arbor Boosts Autonomous Research Performance 150% Over Claude Code
Arbor, a collaborative framework from RUC and Microsoft, uses Hypothesis‑Tree Refinement to turn short‑lived experiments into lasting research progress, achieving over 2.5× held‑out gains across six autonomous optimization tasks and setting a new SOTA on MLE‑Bench Lite.
Although Codex and Claude Code can read code, invoke tools, modify files, and run experiments, they struggle to turn repeated trial‑and‑error into genuine research progress; attempts often restart from scratch and improvements lack clear explanation.
Defining Autonomous Optimization
Arbor formalizes the problem as Autonomous Optimization (AO): given an initial artifact (e.g., training code, agent harness, or baseline), a research goal, and an executable evaluator, an agent iteratively improves the artifact on a development set without step‑by‑step human supervision and validates real gains on a held‑out test set.
Core Mechanism – Hypothesis‑Tree Refinement (HTR)
Arbor externalizes the research process into a continuously evolving hypothesis tree. Each node contains:
Hypothesis: the research claim to be tested.
Artifact version: the specific code, configuration, or data pipeline change.
Experimental evidence: development‑set scores, logs, error messages, execution status, and necessary held‑out results.
Distilled insight: reusable knowledge about why the experiment succeeded or failed, under what conditions it holds, and whether the direction should be pursued further.
After validating a hypothesis, Arbor extracts the distilled insight and propagates it upward, updating the global understanding of the research space. This enables the tree to serve simultaneously as a search space, long‑term memory, and research record.
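As a rough illustration of the node structure and upward insight propagation described above, here is a minimal Python sketch. The class name `HypothesisNode`, its field names, and the `record_result` method are hypothetical choices for this example and are not taken from the Arbor codebase; the real system may store and propagate insights quite differently.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class HypothesisNode:
    """One node of a hypothesis tree (all names here are illustrative)."""

    hypothesis: str                      # research claim to be tested
    artifact_ref: Optional[str] = None   # e.g. a Git commit or branch name
    dev_score: Optional[float] = None    # development-set evidence
    insight: Optional[str] = None        # distilled, reusable takeaway
    parent: Optional[HypothesisNode] = field(default=None, repr=False)
    children: list[HypothesisNode] = field(default_factory=list)
    inherited_insights: list[str] = field(default_factory=list)

    def add_child(self, hypothesis: str) -> HypothesisNode:
        """Attach a more specific hypothesis below this node."""
        child = HypothesisNode(hypothesis=hypothesis, parent=self)
        self.children.append(child)
        return child

    def record_result(self, artifact_ref: str, dev_score: float, insight: str) -> None:
        """Store the evidence, then copy the insight to every ancestor."""
        self.artifact_ref = artifact_ref
        self.dev_score = dev_score
        self.insight = insight
        node = self.parent
        while node is not None:
            node.inherited_insights.append(insight)
            node = node.parent


# Toy usage with made-up hypotheses and scores.
root = HypothesisNode("Improve the baseline's held-out score")
optimizer = root.add_child("A different optimizer helps")
warmup = optimizer.add_child("Learning-rate warmup stabilizes training")
warmup.record_result("commit-abc123", 0.71, "Warmup helps only at large batch sizes")
print(root.inherited_insights)
```

Because every ancestor keeps a copy of the insight, a later hypothesis proposed anywhere under `root` can consult it, which is one simple way to make the tree double as long-term memory.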
Two‑Level Architecture
Arbor separates responsibilities into:
Coordinator: a long‑lived component that maintains the global hypothesis tree, generates new hypotheses, selects promising directions, and decides whether to continue, prune, or merge results.
Executor: a short‑lived worker that isolates a single hypothesis in a worktree, modifies code, runs the evaluator, diagnoses failures, and returns structured scores, artifact references, experimental phenomena, and distilled insight.
This separation keeps global strategy clear while allowing each experiment to be traced back to its originating node. Artifact changes are materialized with Git.
Each research cycle follows the same loop: Observe research state → Propose candidate hypotheses → Choose exploration direction → Dispatch experiment → Return structured evidence → Abstract insight → Decide merge, prune, or continue
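The division of labor between a long-lived coordinator and short-lived executors can be sketched in a few lines of Python. This is a simplified, single-process toy: `Evidence`, `run_executor`, `coordinate`, and the toy evaluator are hypothetical names invented for this example, the decision rule (continue if the development score beats the best seen so far, otherwise prune) is an assumption, and worktree isolation and merging are deliberately left out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Evidence:
    """Structured result an executor hands back to the coordinator."""

    hypothesis: str
    dev_score: float
    status: str   # "ok" or "failed"
    insight: str


def run_executor(hypothesis: str, evaluate: Callable[[str], float]) -> Evidence:
    """Short-lived worker: test one hypothesis and report what happened."""
    try:
        score = evaluate(hypothesis)
    except Exception as exc:  # a failed run is still evidence
        return Evidence(hypothesis, float("-inf"), "failed", f"{hypothesis!r} failed: {exc!r}")
    return Evidence(hypothesis, score, "ok", f"{hypothesis!r} scored {score:.2f} on dev")


def coordinate(
    candidates: list[str],
    evaluate: Callable[[str], float],
    baseline_dev: float,
) -> tuple[dict[str, str], list[str], float]:
    """Long-lived loop: dispatch each candidate, keep insights, decide its fate."""
    decisions: dict[str, str] = {}
    insights: list[str] = []
    best = baseline_dev
    for hypothesis in candidates:
        evidence = run_executor(hypothesis, evaluate)
        insights.append(evidence.insight)  # insights are kept even for failures
        if evidence.status == "ok" and evidence.dev_score > best:
            best = evidence.dev_score
            decisions[hypothesis] = "continue"
        else:
            decisions[hypothesis] = "prune"
    return decisions, insights, best


# Toy evaluator with made-up scores; unknown hypotheses raise KeyError.
toy_scores = {"add warmup": 0.62, "swap optimizer": 0.58}


def toy_evaluate(hypothesis: str) -> float:
    return toy_scores[hypothesis]


decisions, insights, best = coordinate(
    ["add warmup", "swap optimizer", "untested idea"], toy_evaluate, baseline_dev=0.60
)
print(decisions, best)
```

Note that the failed run still contributes an insight string, mirroring the idea that failures are recorded rather than discarded.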
A held‑out merge gate prevents overfitting: a candidate is merged into the best artifact only if it outperforms the current optimum on the held‑out evaluator, while development feedback guides exploration.
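The merge gate can be illustrated with a small Python sketch. The `Candidate` type, the `merge_gate` function, and all scores below are hypothetical and exist only for this example; the one thing taken from the description above is the rule itself, namely that a candidate replaces the best artifact only if it strictly beats it on the held-out evaluator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """An artifact version with its scores (illustrative fields)."""

    artifact_ref: str
    dev_score: float      # guides exploration only
    heldout_score: float  # the only score the gate looks at


def merge_gate(best: Candidate, candidate: Candidate) -> Candidate:
    """Return the artifact to keep: merge only on a strict held-out gain."""
    if candidate.heldout_score > best.heldout_score:
        return candidate
    return best


# Made-up numbers: one candidate overfits the dev set, one generalizes.
baseline = Candidate("baseline", dev_score=0.60, heldout_score=0.58)
overfit = Candidate("tuned-on-dev", dev_score=0.75, heldout_score=0.57)
genuine = Candidate("real-gain", dev_score=0.66, heldout_score=0.63)

after_overfit = merge_gate(baseline, overfit)
after_genuine = merge_gate(after_overfit, genuine)
print(after_overfit.artifact_ref, after_genuine.artifact_ref)
```

The overfit candidate has the highest development score but is rejected, because the gate never consults `dev_score`.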
Arbor is released as both a full CLI for long‑running automation and as individual Agent Skills that can be loaded into environments such as Codex or Claude Code.
Experiments – Six Real‑World AO Tasks
Arbor was evaluated on six tasks covering model training (optimizer and architecture design), harness engineering (Terminal‑Bench 2.0 and BrowseComp), and data synthesis (search‑agent and math‑reasoning pipelines). Codex and Claude Code served as the baselines.
Key held‑out results include:
BrowseComp: baseline 45.33 % → Codex 50.00 % → Claude Code 53.33 % → Arbor 67.67 %.
Math‑Reasoning synthesis: Arbor improves held‑out pass‑gap by 19.79 points, versus 5.21 (Codex) and 7.29 (Claude Code).
Terminal‑Bench 2.0: held‑out pass rate rises from 69.81 % to 77.36 % with Arbor.
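To make the size of these gaps concrete, the short Python calculation below works out the gain over the starting artifact and how many times larger Arbor's gain is than each baseline's, using only the figures quoted above. The helper name `gain_ratio` is made up for this example. It covers just these individual tasks, so it is not a reproduction of the aggregate 2.5× figure reported across all six.

```python
def gain_ratio(arbor_gain: float, other_gain: float) -> float:
    """How many times larger Arbor's held-out gain is than another system's."""
    return arbor_gain / other_gain


# BrowseComp: held-out accuracy in percent, gains measured from the baseline.
browsecomp_start = 45.33
browsecomp = {"Codex": 50.00, "Claude Code": 53.33, "Arbor": 67.67}
browsecomp_gain = {name: score - browsecomp_start for name, score in browsecomp.items()}

# Math-reasoning synthesis: the gains are quoted directly, in points.
math_gain = {"Codex": 5.21, "Claude Code": 7.29, "Arbor": 19.79}

# Terminal-Bench 2.0: held-out pass rate in percent.
terminal_bench_gain = 77.36 - 69.81

for task, gains in (("BrowseComp", browsecomp_gain), ("Math synthesis", math_gain)):
    for rival in ("Codex", "Claude Code"):
        ratio = gain_ratio(gains["Arbor"], gains[rival])
        print(f"{task}: Arbor gain is {ratio:.2f}x the {rival} gain")

print(f"Terminal-Bench 2.0: +{terminal_bench_gain:.2f} points")
```

On these two tasks Arbor's gain comes out at roughly 2.7 to 2.8 times Claude Code's and roughly 3.8 to 4.8 times Codex's.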
Beyond the six tasks, Arbor paired with GPT‑5.5 achieved 86.36 % Any‑Medal on MLE‑Bench Lite, surpassing all existing results and establishing a new state‑of‑the‑art.
Analysis – Organization Beats Quantity
Token‑budget logs show Arbor consumes a comparable amount of tokens to Claude Code yet yields far larger held‑out gains. The advantage stems from allocating compute to maintain competing hypotheses, run isolated experiments, compare evidence, and update the search tree, rather than exhaustively pursuing a single trajectory.
Ablation studies on MLE‑Bench Lite reveal the importance of the hypothesis tree and insight propagation. Removing the tree drops Any‑Medal from 81.82 % to 63.64 %; removing insight propagation while keeping the tree further drops performance to 54.54 %, indicating that insight transmission is more critical than the mere hierarchical structure.
These findings suggest that effective research automation relies on structured organization and memory of past experiments, not merely on increasing the number of trials.
Paper title: Toward Generalist Autonomous Research via Hypothesis-Tree Refinement
Paper link: https://arxiv.org/pdf/2606.11926
Code repository: https://github.com/RUC-NLPIR/Arbor
Project homepage: https://ruc-nlpir.github.io/Arbor/
