Building a Multi‑Agent Research Pipeline with OpenClaw: Lessons from Karpathy’s Autoresearch
This article analyzes Karpathy's Autoresearch project, explains how its 5-minute experiment constraint enables hundreds of automated runs, and describes the author's three-layer OpenClaw pipeline for orchestrating search and analysis agents, including its design decisions, pitfalls, and practical takeaways for engineers.
Karpathy open-sources Autoresearch
Autoresearch (https://github.com/karpathy/autoresearch) has attracted 37,900 stars, indicating strong interest from the AI engineering community.
ML research bottleneck
Typical language‑model research proceeds as: change hyper‑parameters or architecture → run training (often several hours) → inspect results → repeat. This limits a researcher to 3–5 experiments per day, with most time spent waiting.
Project structure and single constraint
The codebase is only 630 lines of Python, fitting comfortably inside an LLM context window. It consists of three core files:
program.md – edited by a human; contains the research direction, constraints, and optimization objective that guide the agent.
train.py – edited only by the agent; implements the GPT-style model and training loop.
prepare.py – frozen; provides data processing, the tokenizer, and evaluation logic that the agent cannot modify.
The only hard constraint is that each experiment runs for exactly 5 minutes, regardless of the GPU used. After 5 minutes the run stops, and every run is scored on the same metric, validation bits-per-byte (val_bpb), so results can be compared fairly.
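To make the metric concrete, here is a minimal sketch of how bits-per-byte is commonly computed from a summed cross-entropy loss. It uses the standard definition (total loss in nats, divided by ln 2 times the number of bytes of evaluated text); the function name and the example numbers are illustrative assumptions, and the actual evaluation code in prepare.py may differ in detail.

```python
import math


def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Convert a summed cross-entropy loss (in nats) into bits per byte.

    Dividing by ln(2) converts nats to bits; dividing by the byte count
    normalizes by raw text length rather than by token count.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return total_nll_nats / (math.log(2) * total_bytes)


# Hypothetical numbers: 1,000 tokens covering 4,200 bytes of text,
# with a mean loss of 2.9 nats per token.
example = bits_per_byte(total_nll_nats=2.9 * 1000, total_bytes=4200)
print(f"val_bpb = {example:.4f}")
```

Because the denominator counts bytes rather than tokens, the score does not depend on how the tokenizer splits the text, which is one reason it works as a common yardstick across experiments.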
Automated 5‑minute experiment loop
① Read program.md (human‑written research direction)
② Agent proposes a modification to train.py
③ Run training for exactly 5 minutes
④ Compare val_bpb: if it decreases, keep the commit; if it increases, revert
⑤ Return to step ① for the next round
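The steps above amount to a greedy keep-or-revert search. The sketch below is an illustration under stated assumptions, not the project's actual code: `propose` and `train_and_eval` are hypothetical callables standing in for the agent's edit to train.py and the 5-minute training run, and a tie in val_bpb is treated as a revert, which the description does not specify.

```python
from typing import Callable, List, Tuple


def run_loop(
    propose: Callable[[float], str],
    train_and_eval: Callable[[str], float],
    baseline_bpb: float,
    rounds: int,
) -> Tuple[float, List[Tuple[str, float, bool]]]:
    """Greedy search: keep a change only if it lowers val_bpb."""
    best = baseline_bpb
    history: List[Tuple[str, float, bool]] = []
    for _ in range(rounds):
        change = propose(best)        # step 2: agent proposes an edit
        bpb = train_and_eval(change)  # step 3: fixed 5-minute run
        kept = bpb < best             # step 4: lower val_bpb is better
        if kept:
            best = bpb                # keep the commit
        # otherwise the change is reverted and `best` is unchanged
        history.append((change, bpb, kept))
    return best, history


# Usage with stubbed results in place of real training runs.
scores = iter([0.99, 1.01, 0.98])
best, history = run_loop(
    propose=lambda current: "hypothetical edit",
    train_and_eval=lambda change: next(scores),
    baseline_bpb=0.9979,
    rounds=3,
)
```

Each round only compares against the current best, so the loop needs no memory beyond the last kept commit.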
≈12 rounds per hour, ≈100 rounds per night
Results from a full H100 session
In a session on an H100 GPU, 89 experiments reduced val_bpb from 0.9979 to 0.9773.
Halving the batch size from 524 K to 262 K increased the number of gradient‑update steps within the 5‑minute budget and yielded the largest single improvement.
Depth 9 + width 512 outperformed wider networks.
Reducing the context window to one‑eighth improved performance.
Increasing the RoPE base from 10 K to 200 K gave additional gains.
Adding label smoothing caused val_bpb to jump to 0.34; the agent immediately reverted the change.
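The batch-size result follows from simple arithmetic under a fixed time budget. The sketch below assumes that "524 K" and "262 K" mean 524,288 and 262,144 tokens per batch, and that token throughput is a constant, hypothetical 1,000,000 tokens per second regardless of batch size; both are simplifying assumptions, not figures from the source.

```python
BUDGET_SECONDS = 5 * 60  # the fixed 5-minute experiment budget


def optimizer_steps(tokens_per_second: float, batch_tokens: int) -> int:
    """Number of full gradient updates that fit in the time budget."""
    return int(tokens_per_second * BUDGET_SECONDS // batch_tokens)


throughput = 1_000_000  # hypothetical tokens processed per second

steps_large = optimizer_steps(throughput, 524_288)  # assumed "524 K"
steps_small = optimizer_steps(throughput, 262_144)  # assumed "262 K"

print(steps_large, steps_small)
```

With throughput held constant, halving the batch roughly doubles the number of updates in the same 5 minutes, at the price of noisier gradients per step.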
Distributed version
A distributed run on the Hyperspace network used 35 agents to execute 333 experiments in one night.
Extending to a multi‑agent research pipeline
Building on the Autoresearch loop, a three‑layer architecture was created on top of OpenClaw:
Feishu message → OpenClaw Gateway
↓
main agent (depth‑0, generalist) – parses request → sessions_spawn → review lead
↓
review lead (depth‑1, research orchestrator) – phases:
Phase 1: sessions_spawn → search expert (GitHub API, Tavily, Hacker News, Papers with Code) → candidates.csv (engineer‑score 1‑5)
Phase 2: filter score≥4 → analysis expert → notes/[id].md (source URL) → matrix.csv (comparison scores)
Phase 3: coverage check → back to Phase 1 if gaps remain
Phase 4: read review‑writer/SKILL.md → write review.md
Phase 5: read research‑vault/SKILL.md → persist to Obsidian
Phase 6: report to user → Feishu reply + TL;DR
Design decisions
Three‑layer separation keeps the main agent generic; the review lead encapsulates all research‑orchestration logic in AGENTS.md, avoiding a monolithic context.
SKILL.md stores procedural handbooks for each worker (search or analysis). Workers receive a task argument such as “read /clawd/skills/paper‑scout/SKILL.md”. Updating SKILL.md instantly changes the behavior of all subsequent calls without modifying the spawn commands.
The scoring system is engineering-oriented (1-5): it weights whether a project can actually be used in practice above citation counts.
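A minimal sketch of how these decisions fit together in Phases 1 to 3 of the review lead. Everything here is illustrative: `search` is a hypothetical callable standing in for a spawned search worker (it is not the OpenClaw sessions_spawn API), and the CSV column names `id`, `topic`, and `engineer_score` are assumed, since the source does not give the schema of candidates.csv.

```python
import csv
import io
from typing import Callable, Dict, List, Set

MIN_SCORE = 4  # Phase 2 keeps candidates with score >= 4


def worker_task(skill_path: str, request: str) -> str:
    """Build a task that points at a handbook instead of inlining it."""
    return f"Read {skill_path} and follow it. Request: {request}"


def select_candidates(candidates_csv: str) -> List[Dict[str, str]]:
    """Return rows whose engineer score is at least MIN_SCORE."""
    rows = csv.DictReader(io.StringIO(candidates_csv))
    return [row for row in rows if int(row["engineer_score"]) >= MIN_SCORE]


def run_review(
    request: str,
    search: Callable[[str], str],
    required_topics: Set[str],
    max_rounds: int = 3,
) -> List[Dict[str, str]]:
    """Repeat search and filtering until every required topic is covered."""
    selected: List[Dict[str, str]] = []
    gaps = set(required_topics)
    for _ in range(max_rounds):
        if not gaps:
            break
        # Phase 1: ask the search worker to fill the remaining gaps.
        task = worker_task(
            "/clawd/skills/paper-scout/SKILL.md",
            f"{request} (missing: {sorted(gaps)})",
        )
        # Phase 2: keep only high-scoring candidates.
        selected.extend(select_candidates(search(task)))
        # Phase 3: recompute which topics are still uncovered.
        gaps = required_topics - {row["topic"] for row in selected}
    return selected
```

Because the task string carries only a path, editing the handbook changes what the next worker does without touching this code. De-duplication across rounds is omitted to keep the sketch short.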
Pitfalls and fixes
Swapping the controller model for a domestic LLM stopped the pipeline from spawning sessions: the model treated the sessions_spawn block in AGENTS.md as plain text rather than as a command to execute. The fix was to run the controller on a model with strong instruction-following ability (Claude Opus 4.6) and reserve cheaper models for the worker agents.
Final report format
## TL;DR ← 3‑5 sentence summary
## Technical Overview
## Comparison Table
## Detailed Analysis per Option
## Technology Trends
## Implementation Advice (Java‑engineer perspective)
## References
Observed benefits and remaining issues
Higher information density; no need to manually browse multiple platforms.
Effective noise filtering.
Discovery of high‑quality projects that would otherwise be missed.
Occasional link‑verification failures due to network problems.
Chinese‑language comprehension slightly weaker than English.
