
Building a Multi‑Agent Research Pipeline with OpenClaw: Lessons from Karpathy’s Autoresearch

This article analyzes Karpathy’s Autoresearch project and explains how its 5‑minute experiment constraint enables hundreds of automated runs. It then details the author’s three‑layer OpenClaw pipeline, which orchestrates search and analysis agents, and covers design decisions, pitfalls, and practical takeaways for engineers.

inShocking

Mar 16, 2026


Karpathy’s open‑source Autoresearch

Autoresearch (https://github.com/karpathy/autoresearch) has attracted 37,900 stars, indicating strong interest from the AI engineering community.

ML research bottleneck

Typical language‑model research proceeds as: change hyper‑parameters or architecture → run training (often several hours) → inspect results → repeat. This limits a researcher to 3–5 experiments per day, with most time spent waiting.

Project structure and single constraint

The codebase is only 630 lines of Python, small enough to fit comfortably inside an LLM context window. It consists of three core files:

program.md – edited by a human; contains the research direction, constraints, and optimization objective that guide the agent.

train.py – edited only by the agent; implements the GPT‑style model and training loop.

prepare.py – frozen; provides data processing, the tokenizer, and evaluation logic that the agent cannot modify.

The only hard constraint is that each experiment trains for exactly 5 minutes, regardless of the GPU it runs on. After 5 minutes the run stops, and every run is scored on the same metric, validation bits‑per‑byte (val_bpb, lower is better), so results can be compared fairly.
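A fixed wall-clock budget can be sketched as a loop that keeps taking training steps until the time runs out. This is a minimal illustration rather than Autoresearch's actual code: `train_for_budget`, `step_fn`, and the injectable `clock` are hypothetical names introduced here.

```python
import time
from typing import Callable

BUDGET_SECONDS = 5 * 60  # the 5-minute constraint described in the article


def train_for_budget(
    step_fn: Callable[[], None],
    budget_s: float = BUDGET_SECONDS,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Call step_fn repeatedly until budget_s seconds have elapsed.

    Returns the number of completed steps. The final step may finish slightly
    after the deadline, because the clock is only checked between steps.
    """
    start = clock()
    steps = 0
    while clock() - start < budget_s:
        step_fn()  # hypothetical: one optimizer update
        steps += 1
    return steps
```

Because the budget is measured in time rather than in steps, any change that makes a single step cheaper buys more updates per run.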

Automated 5‑minute experiment loop

① Read program.md (human‑written research direction)
② Agent proposes a modification to train.py
③ Run training for exactly 5 minutes
④ Compare val_bpb: if it decreases, keep the commit; if it increases, revert
⑤ Return to step ① for the next round

≈12 rounds per hour, ≈100 rounds per night
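The keep-or-revert rule in step ④ amounts to greedy hill climbing on val_bpb. The sketch below shows only that decision logic; `run_loop`, `LoopResult`, and the `evaluate` callback (which stands in for a 5-minute training run) are hypothetical names, and treating a tie as a revert is an assumption, since the steps above only cover a decrease or an increase.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List

MINUTES_PER_RUN = 5
ROUNDS_PER_HOUR = 60 // MINUTES_PER_RUN  # 12, matching the estimate above


@dataclass
class LoopResult:
    best_bpb: float
    kept: List[str] = field(default_factory=list)
    reverted: List[str] = field(default_factory=list)


def run_loop(
    baseline_bpb: float,
    proposals: Iterable[str],
    evaluate: Callable[[str], float],
) -> LoopResult:
    """Keep a proposed change only if it lowers val_bpb; otherwise revert it."""
    result = LoopResult(best_bpb=baseline_bpb)
    for change in proposals:
        bpb = evaluate(change)  # stands in for one 5-minute training run
        if bpb < result.best_bpb:
            result.best_bpb = bpb  # improvement: keep the commit
            result.kept.append(change)
        else:
            result.reverted.append(change)  # no improvement: revert
    return result
```

At 12 rounds per hour, an eight-hour night (an assumed duration) gives 96 rounds, which is consistent with the ≈100 figure.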

Results from a full H100 session

In a session on an H100 GPU, 89 experiments reduced val_bpb from 0.9979 to 0.9773.

Halving the batch size from 524K to 262K increased the number of gradient‑update steps within the 5‑minute budget and yielded the largest single improvement.

Depth 9 + width 512 outperformed wider networks.

Reducing the context window to one‑eighth of its original size improved performance.

Increasing the RoPE base from 10K to 200K gave additional gains.

Adding label smoothing caused val_bpb to jump to 0.34; the agent immediately reverted the change.
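The batch-size result follows from simple arithmetic: under a fixed time budget, a smaller batch means more optimizer steps. The sketch below makes two assumptions that are not in the source: that 524K and 262K mean 524,288 and 262,144 tokens per batch, and that throughput stays constant at a made-up 1,000,000 tokens per second regardless of batch size.

```python
BUDGET_SECONDS = 300  # the 5-minute budget


def steps_in_budget(tokens_per_second: int, batch_tokens: int,
                    budget_s: int = BUDGET_SECONDS) -> int:
    """Whole optimizer steps that fit in the budget at a constant throughput."""
    return (tokens_per_second * budget_s) // batch_tokens


THROUGHPUT = 1_000_000  # hypothetical tokens/second, for illustration only

steps_large_batch = steps_in_budget(THROUGHPUT, 524_288)
steps_small_batch = steps_in_budget(THROUGHPUT, 262_144)

# Halving the batch doubles the number of updates in the same wall-clock time.
print(steps_large_batch, steps_small_batch)
```

In practice a smaller batch may lower hardware utilization, so the real gain in steps can be less than 2×; the experiment result suggests the extra updates still outweighed that cost here.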

Distributed version

A distributed run on the Hyperspace network used 35 agents to execute 333 experiments in one night.

Extending to a multi‑agent research pipeline

Building on the Autoresearch loop, a three‑layer architecture was created on top of OpenClaw:

Feishu message → OpenClaw Gateway
    ↓
main agent (depth‑0, generalist) – parses request → sessions_spawn → review lead
    ↓
review lead (depth‑1, research orchestrator) – phases:
  Phase 1: sessions_spawn → search expert (GitHub API, Tavily, Hacker News, Papers with Code) → candidates.csv (engineer‑score 1‑5)
  Phase 2: filter score≥4 → analysis expert → notes/[id].md (source URL) → matrix.csv (comparison scores)
  Phase 3: coverage check → back to Phase 1 if gaps remain
  Phase 4: read review‑writer/SKILL.md → write review.md
  Phase 5: read research‑vault/SKILL.md → persist to Obsidian
  Phase 6: report to user → Feishu reply + TL;DR
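Phases 1 to 3 form a loop: search, keep candidates scoring 4 or higher, then search again for whatever is still uncovered. The sketch below illustrates that control flow only. The CSV columns (`id`, `topic`, `score`), the `search` callback, and the `max_rounds` cap are assumptions made for this example and are not taken from the pipeline's real files.

```python
import csv
import io
from typing import Callable, Dict, List, Set, Tuple

MIN_SCORE = 4  # Phase 2 threshold from the diagram


def shortlist(candidates_csv: str, min_score: int = MIN_SCORE) -> List[Dict[str, str]]:
    """Phase 2 filter: keep rows whose engineer score is at least min_score."""
    rows = csv.DictReader(io.StringIO(candidates_csv))
    return [row for row in rows if int(row["score"]) >= min_score]


def research_rounds(
    search: Callable[[Set[str]], str],
    topics: Set[str],
    max_rounds: int = 3,  # assumed cap so the loop always terminates
) -> Tuple[List[Dict[str, str]], Set[str]]:
    """Phases 1-3: search, shortlist, and repeat while topics remain uncovered."""
    kept: List[Dict[str, str]] = []
    missing = set(topics)
    for _ in range(max_rounds):
        if not missing:
            break  # Phase 3 coverage check passed
        for row in shortlist(search(missing)):  # Phase 1 then Phase 2
            kept.append(row)
            missing.discard(row["topic"])
    return kept, missing
```

Returning the still-missing topics alongside the shortlist lets the caller report coverage gaps instead of looping indefinitely.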

Design decisions

Three‑layer separation keeps the main agent generic; the review lead encapsulates all research‑orchestration logic in AGENTS.md, avoiding a monolithic context.

SKILL.md stores procedural handbooks for each worker (search or analysis). Workers receive a task argument such as “read /clawd/skills/paper‑scout/SKILL.md”. Updating SKILL.md instantly changes the behavior of all subsequent calls without modifying the spawn commands.
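The effect of this indirection can be shown with a small sketch: the task argument carries only a path, and the handbook is read from disk each time a worker runs. `build_task` and `load_handbook` are hypothetical helpers written for this illustration and are not part of OpenClaw.

```python
from pathlib import Path


def build_task(skills_root: Path, skill: str) -> str:
    """Task argument handed to a worker: it names the handbook, not its content."""
    return f"read {(skills_root / skill / 'SKILL.md').as_posix()}"


def load_handbook(skills_root: Path, skill: str) -> str:
    """What a worker does on every call: read the current file from disk."""
    return (skills_root / skill / "SKILL.md").read_text(encoding="utf-8")
```

Since the spawn command stores a reference instead of a copy, editing SKILL.md changes what the next worker reads while `build_task` keeps returning the same string.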

The scoring system is engineering‑oriented (1‑5): it weights whether a project can actually be used more heavily than its citation count.

Pitfalls and fixes

Replacing the controller model with a Chinese domestic LLM caused the pipeline to stop spawning sessions, because the model treated the sessions_spawn block in AGENTS.md as plain text rather than as a command to execute. The fix was to use a model with strong instruction‑following ability (Claude Opus 4.6) for the controller; cheaper models can still run the worker agents.
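The split between a strong controller and cheaper workers can be expressed as a simple role-to-model lookup. This is a generic sketch and not OpenClaw configuration: the role names, the `pick_model` helper, and the assumption that every role that must emit sessions_spawn counts as a controller are all made up for illustration.

```python
# Roles that must reliably emit sessions_spawn calls (assumed set).
CONTROLLER_ROLES = {"main", "review-lead"}


def pick_model(role: str, strong_model: str, cheap_model: str) -> str:
    """Route orchestrating roles to the strong model and workers to the cheap one."""
    return strong_model if role in CONTROLLER_ROLES else cheap_model


# Example: the model identifiers below are placeholders, not real API names.
assignments = {
    role: pick_model(role, "strong-controller-model", "cheap-worker-model")
    for role in ("main", "review-lead", "search-expert", "analysis-expert")
}
```

Keeping the routing in one place makes it easy to swap worker models for cost without touching the role that drives orchestration.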

Final report format

## TL;DR          ← 3‑5 sentence summary
## Technical Overview
## Comparison Table
## Detailed Analysis per Option
## Technology Trends
## Implementation Advice (Java‑engineer perspective)
## References
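A fixed report layout is easy to check mechanically before replying to the user. The sketch below lists which of the sections above are absent from a draft; `missing_sections` is a hypothetical helper, and matching on the start of each heading is an assumption made so that suffixes such as "(Java‑engineer perspective)" do not break the check.

```python
import re
from typing import List

REQUIRED_SECTIONS = [
    "TL;DR",
    "Technical Overview",
    "Comparison Table",
    "Detailed Analysis per Option",
    "Technology Trends",
    "Implementation Advice",
    "References",
]


def missing_sections(markdown: str) -> List[str]:
    """Return required sections with no matching level-2 heading, in order."""
    headings = [
        match.group(1).strip()
        for match in re.finditer(r"^##\s+(.+)$", markdown, flags=re.MULTILINE)
    ]
    return [
        section
        for section in REQUIRED_SECTIONS
        if not any(heading.startswith(section) for heading in headings)
    ]
```

An empty result means the draft has every required heading; the check says nothing about the quality of what is written under them.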

Observed benefits and remaining issues

Benefits:

Higher information density; no need to manually browse multiple platforms.

Effective noise filtering.

Discovery of high‑quality projects that would otherwise be missed.

Remaining issues:

Occasional link‑verification failures due to network problems.

Chinese‑language comprehension is slightly weaker than English.
