Fully Automated Code and Paper Generation: Claude, Codex, and Autoresearch Variants
The article examines Karpathy's Autoresearch project and its community forks (Codex Autoresearch, Claude Autoresearch, and AutoResearchClaw), detailing their designs, experiment loops, core rules, and installation steps, and comparing their capabilities, target audiences, and limitations for autonomous AI-driven research and development.
Karpathy Autoresearch
Project URL: https://github.com/karpathy/autoresearch
Three files define the system:
prepare.py – data preparation, evaluation function, dataloader (never modified by the agent).
train.py – model architecture, optimizer, training loop (modified by the AI agent).
program.md – human‑written Markdown that encodes the agent's behavior (edited by the user).
The autonomous loop runs forever (or for a fixed number of iterations):
Forever loop:
1. Check current git status
2. Modify `train.py` with a new idea
3. git commit
4. Run a 5‑minute experiment
5. Read result: does `val_bpb` improve?
6. If improved → keep and advance branch
7. If not → git reset (rollback)
8. Record metrics to `results.tsv`
9. Continue to the next experiment
Each experiment is fixed at five minutes, yielding roughly twelve experiments per hour; an eight‑hour sleep produces about one hundred experiments, with all metrics, memory usage, and git state logged in results.tsv.
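In code, the control flow is tiny. Below is a minimal Python sketch of the loop, where propose_change_to_train_py and read_val_bpb are hypothetical placeholders for the agent's edit step and the metric log; the real system drives this through the coding agent itself, not a script:

import subprocess

def sh(cmd: str) -> subprocess.CompletedProcess:
    """Run a shell command, capturing its output."""
    return subprocess.run(cmd, shell=True, capture_output=True, text=True)

def propose_change_to_train_py() -> None:
    """Placeholder: here the coding agent rewrites train.py with a new idea."""

def read_val_bpb() -> float:
    """Read the latest val_bpb from results.tsv (column layout assumed)."""
    with open("results.tsv") as f:
        return float(f.readlines()[-1].split("\t")[1])

best = float("inf")
while True:
    sh("git status --short")                        # 1. inspect current state
    propose_change_to_train_py()                    # 2-3. new idea, then commit
    sh("git add -A && git commit -m 'experiment'")
    sh("timeout 300 uv run train.py")               # 4. fixed five-minute budget
    val_bpb = read_val_bpb()                        # 5. read the single metric
    if val_bpb < best:                              # 6. improved: keep the commit
        best = val_bpb
    else:                                           # 7. regressed: roll back
        sh("git reset --hard HEAD~1")
    # 8-9. train.py has already appended metrics to results.tsv; repeat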
Design philosophy includes:
Fixed‑time budgeting (all experiments run for the same duration).
Simplicity‑first rule (prefer shorter code when performance is equal).
Single evaluation metric: val_bpb (bits per byte on the validation set, lower is better; see the conversion sketch after this list).
Git as experiment memory (commit on success, reset on failure).
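For concreteness, bits per byte is just cross‑entropy rescaled: the summed validation loss in nats, divided by ln 2 times the number of validation bytes. A small illustrative conversion (the exact bookkeeping inside train.py may differ):

import math

def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert a summed cross-entropy loss (in nats) over the validation
    set into bits per byte; lower means better compression of the data."""
    return total_loss_nats / (math.log(2) * total_bytes)

# e.g. a summed loss of 1.2e6 nats over 1.5e6 validation bytes:
# bits_per_byte(1.2e6, 1_500_000) ≈ 1.154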
Installation example (using the uv package manager):
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install dependencies
uv sync
# Prepare data and tokenizer
uv run prepare.py
# Verify environment with a manual run
uv run train.py
After setup, a single prompt to Claude Code or Codex such as “Hi, have a look at program.md and let’s kick off a new experiment!” starts the loop.
Codex Autoresearch
Project URL: https://github.com/leo-lilinxiao/codex-autoresearch
Codex Autoresearch generalizes the autoresearch paradigm to any software‑engineering task that has a measurable numeric metric. The user describes a goal in one sentence; Codex parses the repository, identifies the metric, and enters an autonomous iteration loop identical to Karpathy’s.
Example goals and corresponding actions:
“Increase test coverage” → scans the repo, defines a coverage metric, writes tests until the target is met.
“Fix 12 failing tests” → iteratively detects and repairs each failure.
“Why does the API return 503?” → performs scientific root‑cause analysis and proposes falsifiable hypotheses.
“Is this code safe?” → runs STRIDE + OWASP audits with code‑level evidence.
The system supports foreground (interactive) and background (unattended) execution modes.
Core loop (shared with Karpathy):
Shared loop (forever or N times):
1. Review current git state, history, and result log
2. Choose a hypothesis
3. Make an atomic code change
4. git commit
5. Run verification + safety guard
6. If improvement → keep; if regression → rollback; if crash → fix or skip
7. Record result
8. Health check
9. After 3 consecutive discards → adjust strategy; after 5 → pivot; after 2 pivots → network search
10. Repeat
Cross‑run learning extracts “lessons” from each success or failure and injects them into the decision process of the next iteration.
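The discard‑escalation rule in step 9 is easy to picture as a small state machine. A hypothetical Python sketch of those thresholds (the actual skill encodes this in its prompt, not in code):

class EscalationPolicy:
    """Track consecutive discarded experiments and escalate the strategy."""

    def __init__(self) -> None:
        self.discards = 0   # rollbacks since the last kept change
        self.pivots = 0     # hypothesis families abandoned so far

    def next_action(self, kept: bool) -> str:
        if kept:
            self.discards = 0
            return "continue"
        self.discards += 1
        if self.discards >= 5:
            if self.pivots >= 2:
                return "network_search"   # two pivots spent: search the web
            self.discards = 0
            self.pivots += 1
            return "pivot"                # abandon the current hypothesis family
        if self.discards >= 3:
            return "adjust_strategy"      # vary the approach, keep the hypothesis
        return "continue"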
Installation:
git clone https://github.com/leo-lilinxiao/codex-autoresearch.git
cp -r codex-autoresearch your-project/.agents/skills/codex-autoresearch
Typical prompt:
$codex-autoresearch
I want to get rid of all the `any` types in my TypeScript code
Claude Autoresearch
Project URL: https://github.com/uditgoenka/autoresearch
Provides nine ready‑to‑use commands that implement the same autonomous loop with richer configuration and safety checks. Example command list:
/autoresearch – core autonomous iteration.
/autoresearch:plan – interactive configuration wizard.
/autoresearch:security – STRIDE + OWASP security audit.
/autoresearch:ship – pre‑release checklist.
/autoresearch:debug – scientific bug diagnosis.
/autoresearch:fix – automatic error fixing.
/autoresearch:scenario – scenario‑driven test generation.
/autoresearch:predict – multi‑role pre‑analysis.
/autoresearch:learn – automatic documentation generation.
Eight core rules (mirroring the original paradigm):
Loop to completion (infinite or N‑step).
Read before write (understand context first).
One change at a time (atomic modifications).
Mechanical verification (use metrics, not intuition).
Automatic rollback on failure.
Simplicity wins (prefer less code for equal effect).
Git as memory (all experiments committed).
When stuck, think deeper (re‑examine, combine near‑successful experiments, try aggressive changes).
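Rules 3 through 6 compose into one mechanical comparison. A hypothetical helper showing how metric-based verification and the simplicity tie‑break might combine (the plugin expresses these rules in prompts, not code):

from dataclasses import dataclass

@dataclass
class Experiment:
    metric: float   # the single numeric score
    loc: int        # lines of code after the change

def pick_winner(old: Experiment, new: Experiment,
                higher_is_better: bool = False) -> Experiment:
    """The metric decides; on an exact tie, the shorter code wins."""
    if new.metric != old.metric:
        improved = (new.metric > old.metric) if higher_is_better \
                   else (new.metric < old.metric)
        return new if improved else old
    return new if new.loc < old.loc else old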
Installation via Claude Code plugin marketplace or manual clone:
# Plugin marketplace
/plugin marketplace add uditgoenka/autoresearch
/plugin install autoresearch@autoresearch
# Manual clone
git clone https://github.com/uditgoenka/autoresearch.git
cp -r autoresearch/claude-plugin/skills/autoresearch .claude/skills/autoresearch
cp -r autoresearch/claude-plugin/commands/autoresearch .claude/commands/autoresearch
Example usage:
/autoresearch
Goal: Increase test coverage from 72% to 90%
Scope: src/**/*.test.ts, src/**/*.ts
Metric: coverage % (higher is better)
Verify: npm test -- --coverage | grep "All files"
Guard: npm test
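The Verify and Guard fields above map naturally onto two shell invocations: the guard must pass outright, and the first percentage the verify command prints becomes the metric. A minimal hypothetical sketch of that contract (not the plugin's actual implementation):

import re
import subprocess

def run_step(verify_cmd: str, guard_cmd: str) -> float | None:
    """Guard must pass outright; then the first percentage printed by the
    verify command is taken as the metric. None signals a failed step."""
    if subprocess.run(guard_cmd, shell=True).returncode != 0:
        return None                       # guard failed: treat as a regression
    out = subprocess.run(verify_cmd, shell=True,
                         capture_output=True, text=True).stdout
    match = re.search(r"(\d+(?:\.\d+)?)\s*%", out)
    return float(match.group(1)) if match else None

# run_step('npm test -- --coverage | grep "All files"', "npm test")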
AutoResearchClaw
Project URL: https://github.com/aiming-lab/AutoResearchClaw
Implements a fully autonomous research pipeline that generates complete conference‑style papers. The pipeline consists of 23 stages across eight phases:
Phase A: Research scope definition
1. Topic initialization
2. Problem decomposition
Phase B: Literature discovery
3. Search strategy
4. Real‑API literature collection
5. Literature filtering (human gate)
6. Knowledge extraction
Phase C: Knowledge synthesis
7. Synthesis
8. Multi‑agent hypothesis generation
Phase D: Experiment design
9. Experiment design (human gate)
10. Code generation
11. Resource planning
Phase E: Experiment execution
12. Run experiments
13. Automatic repair of failed runs
Phase F: Analysis & decision
14. Multi‑agent result analysis
15. Research decision (pivot/refine)
Phase G: Paper writing
16. Outline
17. Draft
18. Peer review (evidence check)
19. Revision
Phase H: Finalization
20. Quality gate (checks)
21. Knowledge archiving
22. Export to LaTeX
23. Reference verification
Key artifacts produced:
paper_draft.md – full manuscript (introduction, related work, method, experiments, conclusion).
paper.tex – compilable LaTeX using NeurIPS/ICML/ICLR templates.
references.bib – real BibTeX entries fetched from OpenAlex, Semantic Scholar, and arXiv, validated through four verification layers.
experiment runs/ – generated experiment code and sandboxed results.
charts/ – automatically generated comparison figures.
reviews.md – multi‑agent peer‑review report.
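As one illustration of what a verification layer might look like, the sketch below checks that a cited title resolves to a real work in OpenAlex before it is admitted into references.bib. This is a hypothetical example, not the project's actual code, and it assumes OpenAlex's public search query parameter:

import json
import urllib.parse
import urllib.request

def title_resolves_in_openalex(title: str) -> bool:
    """Reject a citation unless its title matches a real work in OpenAlex."""
    url = ("https://api.openalex.org/works?per-page=1&search="
           + urllib.parse.quote(title))
    with urllib.request.urlopen(url) as resp:
        results = json.load(resp).get("results", [])
    if not results:
        return False
    found = (results[0].get("title") or "").strip().lower()
    return found == title.strip().lower()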
Design highlights:
Authentic citations (no fabricated references).
Self‑healing mechanisms (automatic diagnosis and repair of failed experiments; pivoting when hypotheses fail).
Multi‑agent debate for hypothesis generation and result analysis.
Cross‑platform support via the Agent Client Protocol (ACP), enabling Claude Code, Codex CLI, Copilot CLI, Gemini CLI, Kimi CLI, etc.
Sentinel monitoring for NaN/Inf detection, evidence consistency, citation relevance scoring, and anti‑fabrication guards.
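The NaN/Inf sentinel, at its core, is a finiteness check over every logged metric. A minimal sketch (hypothetical; the real monitors also score evidence consistency and citation relevance):

import math

def sentinel_check(metrics: dict[str, float]) -> list[str]:
    """Return the names of any logged metrics that are NaN or infinite."""
    return [name for name, value in metrics.items()
            if not math.isfinite(value)]

# sentinel_check({"loss": float("nan"), "acc": 0.91})  ->  ["loss"]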
Quick start example:
# Clone and install
git clone https://github.com/aiming-lab/AutoResearchClaw.git
cd AutoResearchClaw
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
# Interactive setup
researchclaw setup
researchclaw init
# Run a full pipeline (replace OPENAI_API_KEY)
export OPENAI_API_KEY="sk-..."
researchclaw run --config config.arc.yaml --topic "Your research idea" --auto-approve
The system has demonstrated fully autonomous paper generation in eight domains (mathematics, statistics, biology, computation, NLP, RL, vision, robustness).
Comparative Overview
Core scenario: Karpathy – ML model training; Codex – generic code quality; Claude – generic code quality; AutoResearchClaw – autonomous paper writing.
Agent platform: Karpathy – any; Codex – OpenAI Codex; Claude – Claude Code; AutoResearchClaw – multi‑platform ACP.
Degree of autonomy: Karpathy – high (never stops); Codex – high (background mode); Claude – high (infinite loop); AutoResearchClaw – extremely high (23‑stage pipeline).
Evaluation metric: Karpathy uses val_bpb; Codex and Claude accept custom numeric metrics; AutoResearchClaw employs multi‑dimensional quality review.
GPU requirement: Karpathy needs an NVIDIA GPU (tested on H100); Codex and Claude run on CPU; AutoResearchClaw's requirement varies by task.
Target audience: Karpathy – ML researchers; Codex/Claude – engineers and developers; AutoResearchClaw – research workers.
Insights and Limitations
The recent surge of autoresearch tools reflects three converging trends:
Agent capabilities have matured to reliably modify code, run tests, and read results.
The paradigm is simple—one metric, one constraint, one loop—making it adaptable to many domains.
Git provides an elegant, built‑in memory for experiment state.
Current limitations:
Only objectives that can be quantified with a numeric metric are supported; subjective goals such as “more elegant code” are not handled.
Long‑running token‑heavy loops incur non‑trivial API costs.
Agents excel at exhaustive search within a defined space but do not replace human breakthrough creativity.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.