How Small Teams Can Build Deep Research Agents with the OpenResearcher Open‑Source Pipeline
OpenResearcher presents a fully open, reproducible offline pipeline that synthesizes 97,000 long‑horizon research trajectories, enabling a 30B LLM to achieve 54.8% accuracy on BrowseComp‑Plus and surpass leading closed‑source models while eliminating online API costs.
Training a Deep Research Agent that mimics the human search → browse → reason loop is limited not by model capacity but by the scarcity of high-quality, long-horizon research trajectory data. Existing collection methods either rely on expensive, unstable online search APIs or generate only shallow 2-5-step interactions, far short of the dozens or hundreds of steps genuine deep research requires.
Teams at Texas A&M, the University of Waterloo, and UC San Diego address this bottleneck with OpenResearcher, the first fully open, reproducible offline pipeline able to train models that rival dedicated systems on long-horizon research tasks. The pipeline decouples corpus construction from trajectory generation and proceeds in three stages.
Stage 1 – High‑Difficulty Question Collection
From the MiroVerse‑v0.1 QA set, 10% (≈6,000 Q‑A pairs) are sampled. These questions demand multi‑hop reasoning and heterogeneous evidence, often requiring >100 tool calls even for a strong teacher model.
Stage 2 – Offline Search Engine Construction
For each Q‑A pair, the question and reference answer are concatenated into a query and sent once to the Serper API. After deduplication, ~10,000 “gold” documents containing the answer are collected. These are merged with ~15 million “noise” documents sampled from FineWeb (≈10 trillion tokens) to form a 1.5 × 10⁷‑document offline corpus. All documents are embedded with Qwen3‑Embedding‑8B and indexed via FAISS, guaranteeing that the answer exists in the corpus while preserving realistic web‑scale noise.
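A minimal sketch of this indexing step, assuming the Qwen3-Embedding-8B checkpoint loads through sentence-transformers and a flat inner-product FAISS index (the paper names the embedder and FAISS, but not the index type or batching strategy):

```python
# Sketch only: embed documents with Qwen3-Embedding-8B and index them in
# FAISS. The flat inner-product index and the sentence-transformers loader
# are assumptions; the paper specifies only the embedder and FAISS.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("Qwen/Qwen3-Embedding-8B")

def build_index(documents: list[str]) -> faiss.Index:
    """Embed all documents and add them to an inner-product index."""
    vecs = embedder.encode(documents, normalize_embeddings=True,
                           batch_size=64, show_progress_bar=True)
    index = faiss.IndexFlatIP(vecs.shape[1])  # cosine similarity via normalized IP
    index.add(np.asarray(vecs, dtype=np.float32))
    return index

def retrieve(index: faiss.Index, query: str, k: int = 10) -> list[tuple[int, float]]:
    """Return (doc_id, score) pairs for the top-k matches to a query."""
    q = embedder.encode([query], normalize_embeddings=True)
    scores, ids = index.search(np.asarray(q, dtype=np.float32), k)
    return list(zip(ids[0].tolist(), scores[0].tolist()))
```

At 15 million documents a flat index is memory-hungry; a sharded or IVF variant would be the practical choice, but the flat version keeps the sketch readable.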
Stage 3 – Browsing Modeling and Trajectory Synthesis
Three primitive tools abstract the agent's browsing behavior (a minimal interface sketch follows the list):
Search: issue a natural-language query to the offline engine and retrieve the top-K results (title, URL, snippet).
Open: fetch the full text of a selected URL.
Find: perform exact string matching within the opened document to locate evidence.
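A hedged sketch of these primitives over the offline corpus; the corpus layout (URL → title/text), the retriever hook, and the snippet length are assumptions, since the paper defines only the tools' semantics:

```python
# Sketch of the three browsing primitives over the offline corpus.
# Corpus layout and retriever hook are assumed, not from the paper.
from dataclasses import dataclass

@dataclass
class SearchHit:
    title: str
    url: str
    snippet: str

class OfflineBrowser:
    def __init__(self, retriever, corpus: dict[str, dict]):
        self.retriever = retriever  # e.g., the FAISS retriever from Stage 2
        self.corpus = corpus        # url -> {"title": ..., "text": ...}

    def search(self, query: str, k: int = 10) -> list[SearchHit]:
        """Search: top-K (title, URL, snippet) results for a query."""
        return [SearchHit(h["title"], h["url"], h["text"][:200])
                for h in self.retriever(query, k)]

    def open(self, url: str) -> str:
        """Open: the full text of a previously retrieved URL."""
        return self.corpus[url]["text"]

    def find(self, url: str, needle: str, window: int = 200) -> str | None:
        """Find: exact string match inside an opened document, returning
        the match with surrounding context, or None if absent."""
        text = self.corpus[url]["text"]
        pos = text.find(needle)
        if pos == -1:
            return None
        return text[max(0, pos - window): pos + len(needle) + window]
```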
With GPT-OSS-120B as the teacher model, 16 diverse trajectories are generated per question and lightly filtered, yielding >97,000 trajectories with 10-100+ tool calls each.
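One plausible shape for the synthesis loop is a ReAct-style rollout in which the teacher alternates tool calls with reasoning until it commits to an answer or exhausts its step budget. The chat interface, the JSON tool-call format, and the 128-step cap below are assumptions, not the paper's actual protocol:

```python
# ReAct-style rollout sketch for Stage 3. Chat API, tool-call format,
# and step budget are assumptions; the teacher in the paper is GPT-OSS-120B.
import json

def parse_tool_call(reply: str) -> dict | None:
    """Treat a JSON object naming search/open/find as a tool call;
    anything else is a final answer (format assumed for this sketch)."""
    try:
        call = json.loads(reply)
    except json.JSONDecodeError:
        return None
    return call if call.get("name") in {"search", "open", "find"} else None

def rollout(question: str, browser, llm, max_steps: int = 128) -> list[dict]:
    messages = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        reply = llm.chat(messages, temperature=1.0)  # assumed chat API
        messages.append({"role": "assistant", "content": reply})
        call = parse_tool_call(reply)
        if call is None:                             # final answer reached
            break
        tool = {"search": browser.search,
                "open": browser.open,
                "find": browser.find}[call["name"]]
        result = tool(**call.get("args", {}))
        messages.append({"role": "tool", "content": json.dumps(result, default=str)})
    return messages
```

Sampling this loop 16 times per question at a nonzero temperature provides the diversity the pipeline relies on; the light filter then discards malformed rollouts.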
Model Fine‑Tuning and Results
Approximately 55,000 correct-answer trajectories are selected to fine-tune a 30B Nemotron-3-Nano-A3B model (3.2B active parameters) on 8× NVIDIA H100 GPUs for ~8 hours, a compute budget attainable by small teams. On the offline benchmark BrowseComp-Plus, the fine-tuned model reaches 54.8% accuracy, a 34.0-point absolute gain over the base model, surpassing GPT-4.1 (36.4%), Claude-4-Opus (36.8%), Gemini-2.5-Pro (29.5%), DeepSeek-R1 (16.4%), and the proprietary DeepResearch system (44.5%).
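The selection of those ~55,000 training examples reduces to keeping rollouts whose final answer matches the reference. A sketch, with normalized exact match standing in for whatever matching rule the authors actually used:

```python
# SFT data selection sketch: keep trajectories whose final answer matches
# the reference (~55k of 97k in the paper). Normalized exact match is an
# assumed stand-in for the authors' matching rule.
def select_correct(trajectories: list[dict], references: dict[str, str]) -> list[dict]:
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    return [t for t in trajectories
            if norm(t["final_answer"]) == norm(references[t["question_id"]])]
```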
On three online‑search‑dependent benchmarks (BrowseComp, GAIA, xbench‑DeepSearch), OpenResearcher achieves 26.3%, 64.1% and 65.0% respectively, outperforming open‑source competitors such as ASearcher‑QwQ‑32B and WebDancer‑QwQ‑32B. All gains stem solely from offline‑synthesized trajectories; the model never sees online data during training.
Ablation and Insight Studies
Failure analysis shows that unsuccessful trajectories average 71.7 tool calls, nearly double the 38.4 of successful ones, and waste most of the surplus on search operations, indicating that query-construction strategy, not exploration depth, drives failure.
Training on only correct trajectories, only incorrect ones, or the full set yields 54.81%, 55.06%, and 54.46% accuracy respectively, suggesting that process signals (search patterns, tool usage) are as valuable as answer correctness.
Removing the gold‑document collection step drops accuracy from 54.81% to 6.35%, confirming its critical role.
Increasing the maximum exploration budget improves accuracy and gold‑document hit rate up to ~100 steps, after which gains plateau, revealing diminishing returns.
Tool ablation shows that adding Open to a Search-only agent raises accuracy from 43.86% to 56.39%, and adding Find raises it further to 62.17%, while also reducing total calls and token consumption.
Any trajectory that opens at least one gold document attains >85% final accuracy; trajectories that never open a gold document fall to 7.9%, underscoring the necessity of “seeing” relevant evidence.
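These findings amount to simple bookkeeping over the synthesized trajectories. A sketch of that analysis, with illustrative field names (the paper does not publish its analysis code):

```python
# Trajectory bookkeeping sketch behind the failure and gold-hit findings.
# Assumes each trajectory records its tool calls, opened URLs, and a
# correctness flag; all field names here are illustrative.
from statistics import mean

def trajectory_stats(trajectories: list[dict], gold_urls: set[str]) -> dict:
    ok = [t for t in trajectories if t["correct"]]
    bad = [t for t in trajectories if not t["correct"]]
    hit = [t for t in trajectories if set(t["opened_urls"]) & gold_urls]
    miss = [t for t in trajectories if not set(t["opened_urls"]) & gold_urls]
    return {
        "avg_calls_success": mean(len(t["tool_calls"]) for t in ok),   # 38.4 in the paper
        "avg_calls_failure": mean(len(t["tool_calls"]) for t in bad),  # 71.7 in the paper
        "acc_if_gold_opened": mean(t["correct"] for t in hit),         # >85% in the paper
        "acc_if_gold_missed": mean(t["correct"] for t in miss),        # 7.9% in the paper
    }
```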
Cost and Practical Benefits
Synthesizing the 97k trajectories required ~5.76M search requests. Using online APIs would cost $5,760 (Serper) or $28,800 (SerpAPI). OpenResearcher's offline engine reduces this to $0, eliminates rate limits, ensures deterministic reproducibility, and removes external dependencies, making large-scale trajectory generation affordable for modest research groups.
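The arithmetic behind those figures, with per-1,000-request prices inferred from the quoted totals:

```python
# Cost check for ~5.76M search requests. The per-1,000-request prices
# are inferred from the totals quoted in the text, not vendor quotes.
REQUESTS = 5_760_000
print(f"Serper:  ${REQUESTS / 1000 * 1.00:,.0f}")   # -> $5,760
print(f"SerpAPI: ${REQUESTS / 1000 * 5.00:,.0f}")   # -> $28,800
```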
In summary, OpenResearcher offers a pragmatic, cost‑free solution to the data bottleneck for deep‑research agents, demonstrates that a 30B LLM can rival much larger closed‑source models when trained on high‑quality offline trajectories, and provides a controllable experimental platform for future optimization of long‑horizon research pipelines.