Fully Automated Code and Paper Generation: Claude, Codex, and Autoresearch Variants

The article examines Karpathy's Autoresearch project and its community forks—Codex Autoresearch, Claude Autoresearch, and AutoResearchClaw—detailing their design, experiment loops, core rules, installation steps, and a comparative analysis of capabilities, targets, and limitations for autonomous AI-driven research and development.

Old Zhang's AI Learning

Karpathy Autoresearch

Project URL: https://github.com/karpathy/autoresearch

Three files define the system:

prepare.py – data preparation, evaluation function, dataloader (never modified by the agent).

train.py – model architecture, optimizer, training loop (the only file the AI agent modifies).

program.md – human‑written Markdown that encodes the agent's behavior (edited by the user).

The autonomous loop runs forever (or for a fixed number of iterations):

Forever loop:
  1. Check current git status
  2. Modify train.py with a new idea
  3. git commit
  4. Run a 5‑minute experiment
  5. Read result: does val_bpb improve?
  6. If improved → keep and advance branch
  7. If not → git reset (rollback)
  8. Record metrics to results.tsv
  9. Continue to next experiment

Each experiment is fixed at five minutes, yielding roughly twelve experiments per hour; an eight‑hour sleep produces about one hundred experiments, with all metrics, memory usage, and git state logged in results.tsv.
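The loop above can be sketched in Python. This is a minimal sketch, not code from the repository: `propose_change`, `run_experiment`, and the git wrappers are hypothetical stand-ins for the agent's actions.

```python
def autoresearch_loop(propose_change, run_experiment, git_commit, git_reset,
                      n_iters: int = 100):
    """Sketch of the commit-on-success / reset-on-failure loop.

    propose_change(i): the agent edits train.py with a new idea.
    run_experiment():  runs train.py under the fixed 5-minute budget and
                       returns val_bpb (bits per byte, lower is better).
    git_commit/git_reset: thin wrappers around `git commit` / `git reset`.
    Returns the rows that would be appended to results.tsv.
    """
    results = []
    best_bpb = run_experiment()             # baseline measurement
    for i in range(n_iters):
        propose_change(i)                   # step 2: modify train.py
        git_commit(f"experiment {i}")       # step 3: commit the attempt
        bpb = run_experiment()              # step 4: fixed 5-minute run
        improved = bpb < best_bpb           # step 5: single metric check
        if improved:
            best_bpb = bpb                  # step 6: keep, advance branch
        else:
            git_reset()                     # step 7: rollback
        results.append((i, bpb, improved))  # step 8: record metrics
    return results
```

Because every run has the same time budget, the only decision the loop ever makes is a single comparison against the best metric seen so far.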

Design philosophy includes:

Fixed‑time budgeting (all experiments run for the same duration).

Simplicity‑first rule (prefer shorter code when performance is equal).

Single evaluation metric: val_bpb (bits per byte on the validation set, lower is better).

Git as experiment memory (commit on success, reset on failure).
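The single metric, val_bpb, is conventionally obtained by converting summed cross-entropy loss from nats to bits and normalizing by raw byte count. The helper below is a sketch of that standard definition; the function name and its inputs are ours, not the repository's.

```python
import math

def val_bpb(total_loss_nats: float, total_bytes: int) -> float:
    """Bits per byte on a validation set (lower is better).

    total_loss_nats: summed cross-entropy loss over the set, in nats.
    total_bytes:     number of raw bytes the evaluated text occupies.
    Dividing by ln(2) converts nats to bits; normalizing by byte count
    makes the score comparable across different tokenizers.
    """
    return total_loss_nats / (math.log(2) * total_bytes)
```

Normalizing by bytes rather than tokens is what keeps the metric honest when the agent is free to change the model but not the data.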

Installation example (using the uv package manager):

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install dependencies
uv sync
# Prepare data and tokenizer
uv run prepare.py
# Verify environment with a manual run
uv run train.py

After setup, a single prompt to Claude Code or Codex such as “Hi, have a look at program.md and let’s kick off a new experiment!” starts the loop.

Codex Autoresearch

Project URL: https://github.com/leo-lilinxiao/codex-autoresearch

Codex Autoresearch generalizes the autoresearch paradigm to any software‑engineering task that has a measurable numeric metric. The user describes a goal in one sentence; Codex parses the repository, identifies the metric, and enters an autonomous iteration loop identical to Karpathy’s.

Example goals and corresponding actions:

“Increase test coverage” → scans the repo, defines a coverage metric, writes tests until the target is met.

“Fix 12 failing tests” → iteratively detects and repairs each failure.

“Why does the API return 503?” → performs scientific root‑cause analysis and proposes falsifiable hypotheses.

“Is this code safe?” → runs STRIDE + OWASP audits with code‑level evidence.

The system supports foreground (interactive) and background (unattended) execution modes.

Core loop (shared with Karpathy):

Shared loop (forever or N times):
  1. Review current git state, history, and result log
  2. Choose a hypothesis
  3. Make an atomic code change
  4. git commit
  5. Run verification + safety guard
  6. If improvement → keep; if regression → rollback; if crash → fix or skip
  7. Record result
  8. Health check
  9. After 3 consecutive discards → adjust strategy; after 5 → pivot; after 2 pivots → network search
 10. Repeat
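The escalation thresholds in step 9 can be expressed as a small policy function. The thresholds are the ones stated above; the function itself is a sketch, not the project's code.

```python
def next_action(consecutive_discards: int, pivots: int) -> str:
    """Escalation policy from step 9 of the shared loop.

    - 3 consecutive discarded changes -> adjust strategy
    - 5 consecutive discards          -> pivot to a new approach
    - after 2 pivots                  -> search the network for ideas
    """
    if pivots >= 2:
        return "network_search"
    if consecutive_discards >= 5:
        return "pivot"
    if consecutive_discards >= 3:
        return "adjust_strategy"
    return "continue"
```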

Cross‑run learning extracts “lessons” from each success or failure and injects them into the decision process of the next iteration.
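One plausible shape for this cross-run learning is a structured lesson log that gets rendered into the next iteration's prompt. The data layout and function names below are our assumption, not the project's actual implementation.

```python
def add_lesson(lessons: list, hypothesis: str, outcome: str, note: str) -> list:
    """Record one structured lesson after a success or failure."""
    lessons.append({"hypothesis": hypothesis, "outcome": outcome, "note": note})
    return lessons

def lessons_prompt(lessons: list, limit: int = 5) -> str:
    """Render the most recent lessons for injection into the next
    iteration's decision prompt."""
    recent = lessons[-limit:]
    return "\n".join(f"- [{e['outcome']}] {e['hypothesis']}: {e['note']}"
                     for e in recent)
```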

Installation:

git clone https://github.com/leo-lilinxiao/codex-autoresearch.git
cp -r codex-autoresearch your-project/.agents/skills/codex-autoresearch

Typical prompt:

$codex-autoresearch
I want to get rid of all the `any` types in my TypeScript code

Claude Autoresearch

Project URL: https://github.com/uditgoenka/autoresearch

Provides nine ready‑to‑use commands that implement the same autonomous loop with richer configuration and safety checks:

/autoresearch – core autonomous iteration.

/autoresearch:plan – interactive configuration wizard.

/autoresearch:security – STRIDE + OWASP security audit.

/autoresearch:ship – pre‑release checklist.

/autoresearch:debug – scientific bug diagnosis.

/autoresearch:fix – automatic error fixing.

/autoresearch:scenario – scenario‑driven test generation.

/autoresearch:predict – multi‑role pre‑analysis.

/autoresearch:learn – automatic documentation generation.

Eight core rules (mirroring the original paradigm):

Loop to completion (infinite or N‑step).

Read before write (understand context first).

One change at a time (atomic modifications).

Mechanical verification (use metrics, not intuition).

Automatic rollback on failure.

Simplicity wins (prefer less code for equal effect).

Git as memory (all experiments committed).

When stuck, think deeper (re‑examine, combine near‑successful experiments, try aggressive changes).

Installation via Claude Code plugin marketplace or manual clone:

# Plugin marketplace
/plugin marketplace add uditgoenka/autoresearch
/plugin install autoresearch@autoresearch
# Manual clone
git clone https://github.com/uditgoenka/autoresearch.git
cp -r autoresearch/claude-plugin/skills/autoresearch .claude/skills/autoresearch
cp -r autoresearch/claude-plugin/commands/autoresearch .claude/commands/autoresearch

Example usage:

/autoresearch
Goal: Increase test coverage from 72% to 90%
Scope: src/**/*.test.ts, src/**/*.ts
Metric: coverage % (higher is better)
Verify: npm test -- --coverage | grep "All files"
Guard: npm test
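The Verify command above greps the "All files" summary line out of the coverage report; the numeric metric then has to be extracted from that line. The parser below is a sketch assuming an Istanbul-style table layout, which is what `npm test -- --coverage` typically prints.

```python
import re

def parse_coverage(verify_output: str) -> float:
    """Pull the statement-coverage percentage from the 'All files' row
    of an Istanbul-style coverage table (the exact column layout is an
    assumption about the tool's output)."""
    for line in verify_output.splitlines():
        if "All files" in line:
            match = re.search(r"All files\s*\|\s*([\d.]+)", line)
            if match:
                return float(match.group(1))
    raise ValueError("coverage summary line not found")
```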

AutoResearchClaw

Project URL: https://github.com/aiming-lab/AutoResearchClaw

Implements a fully autonomous research pipeline that generates complete conference‑style papers. The pipeline consists of 23 stages across eight phases:

Phase A: Research scope definition
  1. Topic initialization
  2. Problem decomposition
Phase B: Literature discovery
  3. Search strategy
  4. Real‑API literature collection
  5. Literature filtering (human gate)
  6. Knowledge extraction
Phase C: Knowledge synthesis
  7. Synthesis
  8. Multi‑agent hypothesis generation
Phase D: Experiment design
  9. Experiment design (human gate)
 10. Code generation
 11. Resource planning
Phase E: Experiment execution
 12. Run experiments
 13. Automatic repair of failed runs
Phase F: Analysis & decision
 14. Multi‑agent result analysis
 15. Research decision (pivot/refine)
Phase G: Paper writing
 16. Outline
 17. Draft
 18. Peer review (evidence check)
 19. Revision
Phase H: Finalization
 20. Quality gate (checks)
 21. Knowledge archiving
 22. Export to LaTeX
 23. Reference verification
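The stage list above includes two human gates (stages 5 and 9). One way such a gated pipeline might be wired is sketched below; the `Stage` type, stage names, and the `--auto-approve` bypass are our illustration, not the repository's code.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Stage:
    name: str
    run: Callable[[Dict], Dict]   # transforms the shared pipeline state
    human_gate: bool = False      # pause for approval before running

def run_pipeline(stages: List[Stage], state: Dict,
                 approve: Callable[[str], bool],
                 auto_approve: bool = False) -> Dict:
    """Execute stages in order; human-gated stages pause for approval
    unless auto-approve was requested (cf. the --auto-approve flag)."""
    for stage in stages:
        if stage.human_gate and not auto_approve and not approve(stage.name):
            raise RuntimeError(f"pipeline halted at gate: {stage.name}")
        state = stage.run(state)
    return state
```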

Key artifacts produced:

paper_draft.md – full manuscript (introduction, related work, method, experiments, conclusion).

paper.tex – compilable LaTeX using NeurIPS/ICML/ICLR templates.

references.bib – real BibTeX entries fetched from OpenAlex, Semantic Scholar, and arXiv, validated through four verification layers.

experiment runs/ – generated experiment code and sandboxed results.

charts/ – automatically generated comparison figures.

reviews.md – multi‑agent peer‑review report.

Design highlights:

Authentic citations (no fabricated references).

Self‑healing mechanisms (automatic diagnosis and repair of failed experiments; pivoting when hypotheses fail).

Multi‑agent debate for hypothesis generation and result analysis.

Cross‑platform support via the Agent Client Protocol (ACP), enabling Claude Code, Codex CLI, Copilot CLI, Gemini CLI, Kimi CLI, etc.

Sentinel monitoring for NaN/Inf detection, evidence consistency, citation relevance scoring, and anti‑fabrication guards.
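The NaN/Inf portion of the sentinel can be illustrated with a simple finite-value check over reported metrics. This is a sketch of that one guard only; the real sentinel also covers evidence consistency and citation relevance, and its actual interface is unknown to us.

```python
import math

def sentinel_check(metrics: dict) -> list:
    """Flag non-finite metric values, as a NaN/Inf sentinel might."""
    alerts = []
    for name, value in metrics.items():
        if isinstance(value, float) and (math.isnan(value) or math.isinf(value)):
            alerts.append(f"{name} is non-finite ({value}); "
                          f"halt run and trigger automatic repair")
    return alerts
```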

Quick start example:

# Clone and install
git clone https://github.com/aiming-lab/AutoResearchClaw.git
cd AutoResearchClaw
python3 -m venv .venv && source .venv/bin/activate
pip install -e .
# Interactive setup
researchclaw setup
researchclaw init
# Run a full pipeline (replace OPENAI_API_KEY)
export OPENAI_API_KEY="sk-..."
researchclaw run --config config.arc.yaml --topic "Your research idea" --auto-approve

The system has demonstrated fully autonomous paper generation in eight domains (mathematics, statistics, biology, computation, NLP, RL, vision, robustness).

Comparative Overview

Core scenario: Karpathy – ML model training; Codex – generic code quality; Claude – generic code quality; AutoResearchClaw – autonomous paper writing.

Agent platform: Karpathy – any; Codex – OpenAI Codex; Claude – Claude Code; AutoResearchClaw – multi‑platform ACP.

Degree of autonomy: Karpathy – high (never stop); Codex – high (background mode); Claude – high (infinite loop); AutoResearchClaw – extremely high (23‑stage pipeline).

Evaluation metric: Karpathy uses val_bpb; Codex and Claude accept custom numeric metrics; AutoResearchClaw employs multi‑dimensional quality review.

GPU requirement: Karpathy needs an NVIDIA GPU (tested on H100); Codex and Claude run on CPU; AutoResearchClaw's requirement varies by task.

Target audience: Karpathy – ML researchers; Codex/Claude – engineers/developers; AutoResearchClaw – research workers.

Insights and Limitations

The recent surge of autoresearch tools reflects three converging trends:

Agent capabilities have matured to reliably modify code, run tests, and read results.

The paradigm is simple—one metric, one constraint, one loop—making it adaptable to many domains.

Git provides an elegant, built‑in memory for experiment state.

Current limitations:

Only objectives that can be quantified with a numeric metric are supported; subjective goals such as “more elegant code” are not handled.

Long‑running token‑heavy loops incur non‑trivial API costs.

Agents excel at exhaustive search within a defined space but do not replace human breakthrough creativity.

Written by Old Zhang's AI Learning

AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.