Hands‑On Guide to Karpathy’s Autoresearch: From Setup to Custom Research Loops
This article walks through Karpathy’s open‑source Autoresearch system, explaining its core design principles, file layout, and workflow, and then demonstrates practical AI‑agent applications for code optimization, bug fixing, and article writing, complete with setup commands, code snippets, and example experiment logs.
Introduction
Many AI‑automation projects get stuck in the same failure loop: experiments are not comparable, evaluation criteria keep changing, and there is no stable record or rollback, so every failed attempt is expensive. Autoresearch is designed to break that loop.
What Is Autoresearch?
Autoresearch is an open‑source project released by Andrej Karpathy in March 2026. Its core idea is to let an AI Agent autonomously conduct AI‑research experiments.
Workflow
Give the AI Agent a real LLM training environment
↓
Agent modifies the training code
↓
Train for 5 minutes, evaluate the result
↓
Keep improvements, discard failures
↓
Repeat experiments, continuously optimise
↓
Wake up in the morning with a better model
Core File Structure
autoresearch/
├── prepare.py # data preparation (fixed)
├── train.py # training code (Agent modifies this)
├── program.md # human‑written instructions
├── pyproject.toml # dependencies
├── analysis.ipynb # analysis notebook
└── README.md # documentation
Core Design Principles
1. Single‑File Modification
The Agent is only allowed to edit train.py. Advantages:
✅ Controlled scope – the rest of the project stays intact.
✅ Clear diffs – each change is easy to review.
✅ Lower risk – core infrastructure is never broken.
Example program.md rule:
# Your task
You are an AI research assistant improving model performance.
## Rules
1. Only modify `train.py`
2. Each experiment runs for 5 minutes
3. Goal is to lower `val_bpb`
2. Fixed Time Budget
Each experiment runs for a fixed wall‑clock time of 5 minutes.
✅ Directly comparable experiments regardless of model size or batch size.
✅ Predictable output – about 12 experiments per hour, 100 experiments per night.
✅ Automatic optimisation – the system finds the model that best fits the hardware.
Python example for a 5‑minute run (a sketch; execute_and_test stands in for the actual training‑and‑evaluation call):
import time

def run_experiment(agent_proposal):
    start = time.time()
    timeout = 5 * 60  # 5-minute wall-clock budget
    result = execute_and_test(agent_proposal)  # run training + evaluation
    if time.time() - start > timeout:
        return {"status": "timeout", "partial_result": result}
    return {"status": "completed", "result": result}
3. Single Metric
The system evaluates only one metric: val_bpb (validation bits‑per‑byte). Advantages:
✅ Clear objective – lower is better.
✅ Architecture‑agnostic – fair comparison across model changes.
✅ Fully automated – no human judgement needed.
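For context, bits‑per‑byte can be derived from the model's mean cross‑entropy loss; a minimal sketch of the conversion (the repo's exact normalisation is an assumption):
import math

def bits_per_byte(mean_loss_nats, total_tokens, total_bytes):
    # cross-entropy in nats/token -> bits/token, then normalise per byte
    bits_per_token = mean_loss_nats / math.log(2)
    return bits_per_token * total_tokens / total_bytes
Because the byte count of the validation set is fixed, bpb stays comparable even when the Agent swaps tokenizers or model sizes.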
Example evaluation for code quality (single score 0‑100):
def evaluate_code_quality(code):
    metrics = {
        'tests_passed': run_tests(code),
        'complexity': calculate_complexity(code),
        'performance': benchmark(code),
        'security': security_scan(code),
    }
    score = (
        metrics['tests_passed'] * 0.4 +
        (100 - metrics['complexity']) * 0.2 +
        metrics['performance'] * 0.2 +
        metrics['security'] * 0.2
    )
    return score
Example evaluation for article quality (single score 0‑10):
def evaluate_article_quality(article):
    metrics = {
        'accuracy': check_facts(article),
        'readability': calculate_readability(article),
        'engagement': predict_engagement(article),
        'seo': seo_score(article),
        'style': style_consistency(article),
    }
    score = (
        metrics['accuracy'] * 0.3 +
        metrics['readability'] * 0.25 +
        metrics['engagement'] * 0.2 +
        metrics['seo'] * 0.15 +
        metrics['style'] * 0.1
    )
    return score
4. Self‑Contained
The project depends only on PyTorch and a handful of other Python packages.
✅ Easy to understand – transparent code, no black‑box.
✅ Easy to debug – problems are easy to locate.
✅ Easy to extend – clear entry points.
Project principles (excerpt from README):
# Project principles
1. **Minimal dependencies** – only necessary libraries
2. **Single file** – core logic lives in one file
3. **No config** – sensible defaults are hard‑coded
4. **Self‑contained** – run `python main.py` to start
Technical Architecture
Core Components
prepare.py (fixed): data download, BPE tokenizer training, DataLoader, evaluation functions. Human‑written, never modified by the Agent.
train.py (Agent modifies): defines the GPT model, optimizer (Muon + AdamW), training loop, and evaluation logic. The Agent experiments by editing this file.
program.md (human writes): contains the task description and rules for the Agent.
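For orientation, here is a hypothetical skeleton of the shape train.py takes; every name below is an illustrative stand‑in, not the repository's actual code, and plain AdamW is shown where the real file reportedly combines Muon with AdamW:
import torch
import torch.nn as nn

class GPT(nn.Module):
    # minimal stand-in for the repo's GPT definition
    def __init__(self, vocab_size=50304, n_embd=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx):
        return self.lm_head(self.embed(idx))

def get_batch(batch_size=8, block_size=64, vocab_size=50304):
    # placeholder random batch; the real data pipeline lives in prepare.py
    x = torch.randint(vocab_size, (batch_size, block_size))
    y = torch.randint(vocab_size, (batch_size, block_size))
    return x, y

model = GPT()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(100):  # the training loop the Agent is free to rewrite
    x, y = get_batch()
    logits = model(x)
    loss = nn.functional.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()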
Work Loop (pseudocode)
best_bpb = float('inf')
for experiment in range(100):                  # 100 experiments
    proposal = agent.propose_change()          # 1. Agent proposes a change
    apply_change(proposal)                     # 2. Apply the change
    bpb = train_and_evaluate()                 # 3. Train & evaluate (5 min)
    if bpb < best_bpb:                         # 4. Improvement?
        best_bpb = bpb
        save_change(proposal)
        print(f"✅ Improvement! new bpb: {bpb}")
    else:
        revert_change(proposal)
        print(f"❌ No improvement, bpb: {bpb}")
    log_experiment(experiment, proposal, bpb)  # 5. Log the result
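The save_change / revert_change steps are easiest to picture as git operations; a minimal sketch under that assumption (the proposal's summary field is hypothetical, and the repo's actual mechanism may differ):
import subprocess

def save_change(proposal):
    # keep the improvement: commit the modified train.py
    subprocess.run(['git', 'add', 'train.py'], check=True)
    subprocess.run(['git', 'commit', '-m', f"experiment: {proposal['summary']}"], check=True)

def revert_change(proposal):
    # discard the failed experiment: restore the last committed train.py
    subprocess.run(['git', 'checkout', '--', 'train.py'], check=True)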
AI Programming Practical Applications
Scenario 1: Automatic Code Optimisation
Goal: let the AI autonomously improve code performance.
Project structure:
auto-code-optimizer/
├── program.md # Agent instructions
├── src/
│ └── main.py # Agent modifies this file
├── tests/
│ └── test_main.py # Fixed tests
├── benchmark.py # Fixed performance benchmark
└── evaluate.py # Fixed evaluation script
Sample program.md template (rules and current state):
# AI Code Optimiser
## Your role
You are a code optimisation expert improving execution speed.
## Task
Optimise `src/main.py`.
## Rules
1. **Only modify** `src/main.py`
2. **Do not modify** `tests/`, `benchmark.py`, `evaluate.py`
3. **Evaluation** – performance (60%), test pass rate (30%), readability (10%)
4. **Time limit** – each experiment 5 minutes
## Current state
- Baseline performance: 1000 ms
- Test pass rate: 100%
- Lines of code: 500
## Optimisation directions
1. Algorithm optimisation (time‑complexity)
2. Data‑structure choice
3. Cache optimisation
4. Parallelisation
5. Reduce memory allocation
## Experiment flow
1. Read current code
2. Propose optimisation
3. Modify code
4. Run `pytest tests/`
5. Run `python benchmark.py`
6. Evaluate results
## Success criteria
- Performance improvement >10%
- Test pass rate = 100%
- Readability does not drop
## Submission format
```python
# Modification notes
- Optimisation point 1: ...
- Optimisation point 2: ...
# Expected improvement
- Performance: +X%
- Tests: 100% pass
```
Key parts of evaluate.py (benchmark, test runner, scoring):
import subprocess
import time

# Run tests and return the pass rate (counting pytest's dot/F progress marks)
result = subprocess.run(['pytest', 'tests/', '-q'], capture_output=True)
passed = result.stdout.count(b'.')
total = passed + result.stdout.count(b'F')
pass_rate = passed / total if total > 0 else 0

# Run the benchmark 5 times and take the best time (ms)
times = []
for _ in range(5):
    start = time.time()
    subprocess.run(['python', 'src/main.py'], check=True)
    times.append(time.time() - start)
benchmark_time = min(times) * 1000

# Simple code-quality placeholder
code_quality = 80
performance_score = max(0, 100 - (benchmark_time - 1000) * 0.1)
total_score = performance_score * 0.6 + pass_rate * 100 * 0.3 + code_quality * 0.1
print(f"Total score: {total_score:.1f}/100")
Typical run flow:
# 1. Initialise project
git clone https://github.com/karpathy/autoresearch.git
cd autoresearch
# 2. Install dependencies
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
uv run python --version
# 3. Prepare data
uv run prepare.py
# 4. Run baseline experiment
uv run train.py # → val_bpb: 1.234
# 5. Start AI Agent (Claude, Codex, OpenClaw, …)
# Prompt example: "Read program.md and propose an improvement to train.py."
# 6. Observe experiment log
tail -f results.tsv
# Example log line: 2 | 2026‑03‑07 10:05 | 1.220 | ✅ +1.1%
Scenario 2: Automatic Bug Fixing
Goal: let the AI discover and fix bugs autonomously.
Project structure:
auto-bug-fixer/
├── program.md # Agent instructions
├── src/
│ └── app.py # Code with bugs
├── tests/
│ └── test_app.py # Failing tests
├── bug_report.md # Bug description
└── verify.py # Verification script
Sample program.md (rules and flow):
# AI Bug Fixer
## Your task
Analyse the failing test and fix the bug in `src/app.py`.
## Input
- Source code: `src/app.py`
- Failing test: `tests/test_app.py`
- Bug report: `bug_report.md`
## Output
- Fixed code
- Explanation of the fix
## Rules
1. Only modify `src/app.py`
2. All tests must pass after the change
3. No new bugs may be introduced
4. Keep code style consistent
## Flow
1. Read bug report
2. Run tests to confirm failure
3. Analyse code and locate issue
4. Propose a fix
5. Modify code
6. Run tests again
7. Submit the fix
## Success criteria
- All tests pass
- No regression bugs
- Code quality does not drop
Verification script (verify.py) runs the test suite and reports success or failure.
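The source does not show the script itself; a minimal sketch of what it could look like, assuming a plain pytest invocation:
# verify.py - run the test suite and report pass/fail
import subprocess
import sys

result = subprocess.run(['pytest', 'tests/', '-q'], capture_output=True, text=True)
print(result.stdout)
if result.returncode == 0:
    print("✅ All tests pass - fix verified")
    sys.exit(0)
print("❌ Tests still failing - fix rejected")
sys.exit(1)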
AI Writing Practical Applications
Scenario 1: Automatic Article Optimisation
Goal: let the AI improve article quality autonomously.
Project structure:
auto-article-optimizer/
├── program.md # Agent instructions
├── articles/
│ └── draft.md # Draft to be optimised
├── style_guide.md # Style guide
├── evaluate.py # Evaluation script
└── output/
└── optimized.md # Optimised article
Key parts of program.md (rules, current state, optimisation directions, flow, success criteria):
# AI Article Optimiser
## Your role
You are a professional editor improving article quality.
## Task
Optimise `articles/draft.md` to raise the overall quality score.
## Evaluation standards (weights)
- **Accuracy** (30%) – factual correctness
- **Readability** (25%) – fluency and logic
- **Engagement** (20%) – hook and conclusion
- **SEO** (15%) – keyword density, structure
- **Style** (10%) – adherence to style guide
## Rules
1. Keep core ideas unchanged
2. You may restructure paragraphs
3. You may add or remove examples and arguments
4. Must follow `style_guide.md`
5. Run evaluation after each change
## Current state
- Draft length: 2000 words
- Current score: 6.5/10
- Target score: 9.0/10
## Optimisation directions
1. Title optimisation – more compelling, include keywords
2. Opening improvement – capture reader in first 100 words
3. Structure re‑organisation – clearer logic
4. Add concrete examples
5. Strengthen conclusion – clear call‑to‑action
6. Language polishing – remove redundancy
## Flow
1. Read draft and style guide
2. Run evaluation to get baseline
3. Analyse weak points, devise a plan
4. Modify the article
5. Re‑run evaluation; if score improves, keep the change
6. Repeat steps 3‑5 until target is reached
## Submission format
```markdown
# Modification notes
1. Title: "..." → "..."
2. Opening: rewrote to add a story
3. Structure: moved section X before Y
4. Added 2 concrete case studies
5. Conclusion: added actionable summary
# Expected improvement
- Accuracy: 6.5 → 7.0
- Readability: 6.0 → 8.0
- Engagement: 6.5 → 8.5
- Overall: 6.5 → 7.8
```
Evaluation script (evaluate.py) computes the five sub‑metrics, aggregates them with the weights shown above, and prints a detailed breakdown.
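A sketch of that aggregation, reusing the weights above (the sub‑metric scorers are assumed to exist elsewhere in evaluate.py and to return 0‑10 values):
WEIGHTS = {'accuracy': 0.30, 'readability': 0.25,
           'engagement': 0.20, 'seo': 0.15, 'style': 0.10}

def report(article):
    scores = {
        'accuracy': check_facts(article),
        'readability': calculate_readability(article),
        'engagement': predict_engagement(article),
        'seo': seo_score(article),
        'style': style_consistency(article),
    }
    total = sum(scores[name] * weight for name, weight in WEIGHTS.items())
    for name, value in scores.items():
        print(f"{name:<12} {value:.1f}/10 (weight {WEIGHTS[name]:.0%})")
    print(f"overall      {total:.1f}/10")
    return total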
Typical run logs show progressive improvements, e.g.:
# 1. Optimise title + opening, score: 7.2 ✅ keep
# 2. Re‑structure, score: 7.8 ✅ keep
# 3. Add examples, score: 8.3 ✅ keep
# 4. Polish language, score: 8.7 ✅ keep
# 5. Strengthen conclusion, score: 9.1 ✅ target reached
Scenario 2: Automatic Content Generation
Goal: let the AI generate high‑quality articles from a topic list.
Project structure:
auto-content-generator/
├── program.md # Agent instructions
├── topics/
│ └── topic_list.md # List of topics
├── templates/
│ └── article.md # Article template
├── style_guide.md # Style guide
├── evaluate.py # Evaluation script
└── output/articles/ # Generated articles
Key points of program.md:
Generate an article for each topic. Quality bar: overall score ≥ 9.0/10, length 2,000–3,000 words, originality > 90%, factual accuracy 100%. Process: select a topic → research → generate from the template → evaluate → if the score is below 9.0, optimise and re‑evaluate → save once the target is met (see the driver‑loop sketch below).
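That process maps onto a small driver loop; a sketch, assuming hypothetical generate_from_template / optimise helpers and the evaluate_article_quality scorer shown earlier:
from pathlib import Path

TARGET = 9.0

def produce_article(topic, max_rounds=10):
    article = generate_from_template(topic)    # hypothetical generation helper
    score = evaluate_article_quality(article)
    for _ in range(max_rounds):
        if score >= TARGET:
            break
        article = optimise(article, score)     # hypothetical revision helper
        score = evaluate_article_quality(article)
    Path(f"output/articles/{topic}.md").write_text(article)
    return score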
Quick‑Start Guide
Step 1: Clone the original project
# Clone Karpathy's autoresearch
git clone https://github.com/karpathy/autoresearch.git
cd autoresearch
# List project files
ls -la
# You will see prepare.py, train.py, program.md, …
Step 2: Install dependencies
# Install uv (Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install project dependencies
uv sync
# Verify installation
uv run python --version
Step 3: Prepare data
# Download training data and train the tokenizer
uv run prepare.py
# Output: data preparation complete
Step 4: Run a baseline experiment
# Manually run one training pass
uv run train.py
# Output example: val_bpb: 1.234
Step 5: Start the AI Agent
# Use your favourite AI assistant (Claude, Codex, OpenClaw, …)
# Prompt example:
"Read program.md and start a new experiment. First read the current train.py, then propose an improvement."Step 6: Observe experiment logs
# Watch results in real time
tail -f results.tsv
# Format: experiment_id | timestamp | val_bpb | improvement?
# Example lines:
# 1 | 2026‑03‑07 10:00 | 1.234 | baseline
# 2 | 2026‑03‑07 10:05 | 1.220 | ✅ +1.1%
# 3 | 2026‑03‑07 10:10 | 1.235 | ❌ -0.1%
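Since results.tsv is tab‑separated, analysis.ipynb can load it directly; a minimal sketch, with column names assumed from the example lines above:
import pandas as pd

df = pd.read_csv('results.tsv', sep='\t',
                 names=['experiment_id', 'timestamp', 'val_bpb', 'improvement'])
print(df['val_bpb'].min())               # best bpb reached so far
print(df.sort_values('val_bpb').head())  # top experiments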
Conclusion
Core Design Recap
Single‑file modification – controls risk and makes diffs easy.
Fixed time budget – ensures experiments are comparable and optimisable.
Single metric – provides a clear, architecture‑agnostic objective.
Self‑contained – keeps the system easy to understand, debug and extend.
Key Takeaways
✅ AI programming : automatic code optimisation and bug fixing.
✅ AI writing : automatic article optimisation and content generation.
✅ Rapid start : clone the repo, install dependencies, run the baseline, and launch an AI agent.
✅ Continuous iteration : multiple agents and human feedback drive ongoing improvement.