Hands‑On Guide to Karpathy’s Autoresearch: From Setup to Custom Research Loops
This article walks through Karpathy’s open‑source Autoresearch system, explaining its core design principles, file layout, and workflow, and then demonstrates practical AI‑agent applications for code optimization, bug fixing, and article writing, complete with setup commands, code snippets, and example experiment logs.
Introduction
Many AI‑automation projects get stuck in the same failure loop: experiments are not comparable, evaluation criteria keep changing, and there is no stable record or rollback, so every failed attempt is expensive. Autoresearch is designed to break that loop.
What Is Autoresearch?
Autoresearch is an open‑source project released by Andrej Karpathy in March 2026. Its core idea is to let an AI Agent autonomously conduct AI‑research experiments.
Workflow
Give the AI Agent a real LLM training environment
↓
Agent modifies the training code
↓
Train for 5 minutes, evaluate the result
↓
Keep improvements, discard failures
↓
Repeat experiments, continuously optimise
↓
Wake up in the morning with a better model
Core File Structure
autoresearch/
├── prepare.py # data preparation (fixed)
├── train.py # training code (Agent modifies this)
├── program.md # human‑written instructions
├── pyproject.toml # dependencies
├── analysis.ipynb # analysis notebook
└── README.md # documentation
Core Design Principles
1. Single‑File Modification
The Agent is only allowed to edit train.py. Advantages:
✅ Controlled scope – the rest of the project stays intact.
✅ Clear diffs – each change is easy to review.
✅ Lower risk – core infrastructure is never broken.
Example program.md rule:
# Your task
You are an AI research assistant improving model performance.
## Rules
1. Only modify `train.py`
2. Each experiment runs for 5 minutes
3. Goal is to lower `val_bpb`
2. Fixed Time Budget
Each experiment runs for a fixed wall‑clock time of 5 minutes.
✅ Directly comparable experiments regardless of model size or batch size.
✅ Predictable output – about 12 experiments per hour, 100 experiments per night.
✅ Automatic optimisation – the system finds the model that best fits the hardware.
Python example for a 5‑minute run (a sketch; execute_and_test stands in for the actual training‑and‑evaluation call):
import time

def run_experiment(agent_proposal):
    start = time.time()
    timeout = 5 * 60  # 5-minute wall-clock budget
    result = execute_and_test(agent_proposal)  # run training + evaluation
    if time.time() - start > timeout:
        return {"status": "timeout", "partial_result": result}
    return {"status": "completed", "result": result}
3. Single Metric
The system evaluates only one metric: val_bpb (validation bits‑per‑byte). Advantages:
✅ Clear objective – lower is better.
✅ Architecture‑agnostic – fair comparison across model changes.
✅ Fully automated – no human judgement needed.
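For context, bits‑per‑byte can be derived from the model's mean cross‑entropy loss; a minimal sketch of the conversion (the repo's exact normalisation is an assumption):
import math

def bits_per_byte(mean_loss_nats, total_tokens, total_bytes):
    # cross-entropy in nats/token -> bits/token, then normalise per byte
    bits_per_token = mean_loss_nats / math.log(2)
    return bits_per_token * total_tokens / total_bytes
Because the byte count of the validation set is fixed, bpb stays comparable even when the Agent swaps tokenizers or model sizes.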
Example evaluation for code quality (single score 0‑100):
def evaluate_code_quality(code):
    metrics = {
        'tests_passed': run_tests(code),
        'complexity': calculate_complexity(code),
        'performance': benchmark(code),
        'security': security_scan(code),
    }
    score = (
        metrics['tests_passed'] * 0.4 +
        (100 - metrics['complexity']) * 0.2 +
        metrics['performance'] * 0.2 +
        metrics['security'] * 0.2
    )
    return score
Example evaluation for article quality (single score 0‑10):
def evaluate_article_quality(article):
    metrics = {
        'accuracy': check_facts(article),
        'readability': calculate_readability(article),
        'engagement': predict_engagement(article),
        'seo': seo_score(article),
        'style': style_consistency(article),
    }
    score = (
        metrics['accuracy'] * 0.3 +
        metrics['readability'] * 0.25 +
        metrics['engagement'] * 0.2 +
        metrics['seo'] * 0.15 +
        metrics['style'] * 0.1
    )
    return score
4. Self‑Contained
The project depends only on PyTorch and a handful of other Python packages.
✅ Easy to understand – transparent code, no black‑box.
✅ Easy to debug – problems are easy to locate.
✅ Easy to extend – clear entry points.
Project principles (excerpt from README):
# Project principles
1. **Minimal dependencies** – only necessary libraries
2. **Single file** – core logic lives in one file
3. **No config** – sensible defaults are hard‑coded
4. **Self‑contained** – run `python main.py` to start
Technical Architecture
Core Components
prepare.py (fixed): data download, BPE tokenizer training, DataLoader, evaluation functions. Human‑written, never modified by the Agent.
train.py (Agent modifies): defines the GPT model, optimizer (Muon + AdamW), training loop, and evaluation logic. The Agent experiments by editing this file.
program.md (human writes): contains the task description and rules for the Agent.
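For orientation, here is a hypothetical skeleton of the shape train.py takes; every name below is an illustrative stand‑in, not the repository's actual code, and plain AdamW is shown where the real file reportedly combines Muon with AdamW:
import torch
import torch.nn as nn

class GPT(nn.Module):
    # minimal stand-in for the repo's GPT definition
    def __init__(self, vocab_size=50304, n_embd=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, n_embd)
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx):
        return self.lm_head(self.embed(idx))

def get_batch(batch_size=8, block_size=64, vocab_size=50304):
    # placeholder random batch; the real data pipeline lives in prepare.py
    x = torch.randint(vocab_size, (batch_size, block_size))
    y = torch.randint(vocab_size, (batch_size, block_size))
    return x, y

model = GPT()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

for step in range(100):  # the training loop the Agent is free to rewrite
    x, y = get_batch()
    logits = model(x)
    loss = nn.functional.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()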
Work Loop (pseudocode)
best_bpb = float('inf')
for experiment in range(100):                  # 100 experiments
    proposal = agent.propose_change()          # 1. Agent proposes a change
    apply_change(proposal)                     # 2. Apply the change
    bpb = train_and_evaluate()                 # 3. Train & evaluate (5 min)
    if bpb < best_bpb:                         # 4. Improvement?
        best_bpb = bpb
        save_change(proposal)
        print(f"✅ Improvement! new bpb: {bpb}")
    else:
        revert_change(proposal)
        print(f"❌ No improvement, bpb: {bpb}")
    log_experiment(experiment, proposal, bpb)  # 5. Log the result
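The save_change / revert_change steps are easiest to picture as git operations; a minimal sketch under that assumption (the proposal's summary field is hypothetical, and the repo's actual mechanism may differ):
import subprocess

def save_change(proposal):
    # keep the improvement: commit the modified train.py
    subprocess.run(['git', 'add', 'train.py'], check=True)
    subprocess.run(['git', 'commit', '-m', f"experiment: {proposal['summary']}"], check=True)

def revert_change(proposal):
    # discard the failed experiment: restore the last committed train.py
    subprocess.run(['git', 'checkout', '--', 'train.py'], check=True)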
AI Programming Practical Applications
Scenario 1: Automatic Code Optimisation
Goal: let the AI autonomously improve code performance.
Project structure:
auto-code-optimizer/
├── program.md # Agent instructions
├── src/
│ └── main.py # Agent modifies this file
├── tests/
│ └── test_main.py # Fixed tests
├── benchmark.py # Fixed performance benchmark
└── evaluate.py # Fixed evaluation script
Sample program.md template (rules and current state):
# AI Code Optimiser
## Your role
You are a code optimisation expert improving execution speed.
## Task
Optimise `src/main.py`.
## Rules
1. **Only modify** `src/main.py`
2. **Do not modify** `tests/`, `benchmark.py`, `evaluate.py`
3. **Evaluation** – performance (60%), test pass rate (30%), readability (10%)
4. **Time limit** – each experiment 5 minutes
## Current state
- Baseline performance: 1000 ms
- Test pass rate: 100%
- Lines of code: 500
## Optimisation directions
1. Algorithm optimisation (time‑complexity)
2. Data‑structure choice
3. Cache optimisation
4. Parallelisation
5. Reduce memory allocation
## Experiment flow
1. Read current code
2. Propose optimisation
3. Modify code
4. Run `pytest tests/`
5. Run `python benchmark.py`
6. Evaluate results
## Success criteria
- Performance improvement >10%
- Test pass rate = 100%
- Readability does not drop
## Submission format
```python
# Modification notes
- Optimisation point 1: ...
- Optimisation point 2: ...
# Expected improvement
- Performance: +X%
- Tests: 100% pass
```
Key parts of evaluate.py (benchmark, test runner, scoring):
import subprocess
import time

# Run tests and return the pass rate (counting pytest's dot/F progress marks)
result = subprocess.run(['pytest', 'tests/', '-q'], capture_output=True)
passed = result.stdout.count(b'.')
total = passed + result.stdout.count(b'F')
pass_rate = passed / total if total > 0 else 0

# Run the benchmark 5 times and take the best time (ms)
times = []
for _ in range(5):
    start = time.time()
    subprocess.run(['python', 'src/main.py'], check=True)
    times.append(time.time() - start)
benchmark_time = min(times) * 1000

# Simple code-quality placeholder
code_quality = 80
performance_score = max(0, 100 - (benchmark_time - 1000) * 0.1)
total_score = performance_score * 0.6 + pass_rate * 100 * 0.3 + code_quality * 0.1
print(f"Total score: {total_score:.1f}/100")
Typical run flow:
# 1. Initialise project
git clone https://github.com/karpathy/autoresearch.git
cd autoresearch
# 2. Install dependencies
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
uv run python --version
# 3. Prepare data
uv run prepare.py
# 4. Run baseline experiment
uv run train.py # → val_bpb: 1.234
# 5. Start AI Agent (Claude, Codex, OpenClaw, …)
# Prompt example: "Read program.md and propose an improvement to train.py."
# 6. Observe experiment log
tail -f results.tsv
# Example log line: 2 | 2026‑03‑07 10:05 | 1.220 | ✅ +1.1%
Scenario 2: Automatic Bug Fixing
Goal: let the AI discover and fix bugs autonomously.
Project structure:
auto-bug-fixer/
├── program.md # Agent instructions
├── src/
│ └── app.py # Code with bugs
├── tests/
│ └── test_app.py # Failing tests
├── bug_report.md # Bug description
└── verify.py # Verification script
Sample program.md (rules and flow):
# AI Bug Fixer
## Your task
Analyse the failing test and fix the bug in `src/app.py`.
## Input
- Source code: `src/app.py`
- Failing test: `tests/test_app.py`
- Bug report: `bug_report.md`
## Output
- Fixed code
- Explanation of the fix
## Rules
1. Only modify `src/app.py`
2. All tests must pass after the change
3. No new bugs may be introduced
4. Keep code style consistent
## Flow
1. Read bug report
2. Run tests to confirm failure
3. Analyse code and locate issue
4. Propose a fix
5. Modify code
6. Run tests again
7. Submit the fix
## Success criteria
- All tests pass
- No regression bugs
- Code quality does not drop
Verification script (verify.py) runs the test suite and reports success or failure.
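The source does not show the script itself; a minimal sketch of what it could look like, assuming a plain pytest invocation:
# verify.py - run the test suite and report pass/fail
import subprocess
import sys

result = subprocess.run(['pytest', 'tests/', '-q'], capture_output=True, text=True)
print(result.stdout)
if result.returncode == 0:
    print("✅ All tests pass - fix verified")
    sys.exit(0)
print("❌ Tests still failing - fix rejected")
sys.exit(1)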
AI Writing Practical Applications
Scenario 1: Automatic Article Optimisation
Goal: let the AI improve article quality autonomously.
Project structure:
auto-article-optimizer/
├── program.md # Agent instructions
├── articles/
│ └── draft.md # Draft to be optimised
├── style_guide.md # Style guide
├── evaluate.py # Evaluation script
└── output/
└── optimized.md # Optimised article
Key parts of program.md (rules, current state, optimisation directions, flow, success criteria):
# AI Article Optimiser
## Your role
You are a professional editor improving article quality.
## Task
Optimise `articles/draft.md` to raise the overall quality score.
## Evaluation standards (weights)
- **Accuracy** (30%) – factual correctness
- **Readability** (25%) – fluency and logic
- **Engagement** (20%) – hook and conclusion
- **SEO** (15%) – keyword density, structure
- **Style** (10%) – adherence to style guide
## Rules
1. Keep core ideas unchanged
2. You may restructure paragraphs
3. You may add or remove examples and arguments
4. Must follow `style_guide.md`
5. Run evaluation after each change
## Current state
- Draft length: 2000 words
- Current score: 6.5/10
- Target score: 9.0/10
## Optimisation directions
1. Title optimisation – more compelling, include keywords
2. Opening improvement – capture reader in first 100 words
3. Structure re‑organisation – clearer logic
4. Add concrete examples
5. Strengthen conclusion – clear call‑to‑action
6. Language polishing – remove redundancy
## Flow
1. Read draft and style guide
2. Run evaluation to get baseline
3. Analyse weak points, devise a plan
4. Modify the article
5. Re‑run evaluation; if score improves, keep the change
6. Repeat steps 3‑5 until target is reached
## Submission format
```markdown
# Modification notes
1. Title: "..." → "..."
2. Opening: rewrote to add a story
3. Structure: moved section X before Y
4. Added 2 concrete case studies
5. Conclusion: added actionable summary
# Expected improvement
- Accuracy: 6.5 → 7.0
- Readability: 6.0 → 8.0
- Engagement: 6.5 → 8.5
- Overall: 6.5 → 7.8
```
Evaluation script (evaluate.py) computes the five sub‑metrics, aggregates them with the weights shown above, and prints a detailed breakdown.
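A sketch of that aggregation, reusing the weights above (the sub‑metric scorers are assumed to exist elsewhere in evaluate.py and to return 0‑10 values):
WEIGHTS = {'accuracy': 0.30, 'readability': 0.25,
           'engagement': 0.20, 'seo': 0.15, 'style': 0.10}

def report(article):
    scores = {
        'accuracy': check_facts(article),
        'readability': calculate_readability(article),
        'engagement': predict_engagement(article),
        'seo': seo_score(article),
        'style': style_consistency(article),
    }
    total = sum(scores[name] * weight for name, weight in WEIGHTS.items())
    for name, value in scores.items():
        print(f"{name:<12} {value:.1f}/10 (weight {WEIGHTS[name]:.0%})")
    print(f"overall      {total:.1f}/10")
    return total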
Typical run logs show progressive improvements, e.g.:
# 1. Optimise title + opening, score: 7.2 ✅ keep
# 2. Re‑structure, score: 7.8 ✅ keep
# 3. Add examples, score: 8.3 ✅ keep
# 4. Polish language, score: 8.7 ✅ keep
# 5. Strengthen conclusion, score: 9.1 ✅ target reached
Scenario 2: Automatic Content Generation
Goal: let the AI generate high‑quality articles from a topic list.
Project structure:
auto-content-generator/
├── program.md # Agent instructions
├── topics/
│ └── topic_list.md # List of topics
├── templates/
│ └── article.md # Article template
├── style_guide.md # Style guide
├── evaluate.py # Evaluation script
└── output/articles/ # Generated articles
Key points of program.md:
Generate an article for each topic. Quality bar: overall score ≥ 9.0/10, length 2,000–3,000 words, originality > 90%, factual accuracy 100%. Process: select a topic → research → generate from the template → evaluate → if the score is below 9.0, optimise and re‑evaluate → save once the target is met (see the driver‑loop sketch below).
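That process maps onto a small driver loop; a sketch, assuming hypothetical generate_from_template / optimise helpers and the evaluate_article_quality scorer shown earlier:
from pathlib import Path

TARGET = 9.0

def produce_article(topic, max_rounds=10):
    article = generate_from_template(topic)    # hypothetical generation helper
    score = evaluate_article_quality(article)
    for _ in range(max_rounds):
        if score >= TARGET:
            break
        article = optimise(article, score)     # hypothetical revision helper
        score = evaluate_article_quality(article)
    Path(f"output/articles/{topic}.md").write_text(article)
    return score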
Quick‑Start Guide
Step 1: Clone the original project
# Clone Karpathy's autoresearch
git clone https://github.com/karpathy/autoresearch.git
cd autoresearch
# List project files
ls -la
# You will see prepare.py, train.py, program.md, …
Step 2: Install dependencies
# Install uv (Python package manager)
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install project dependencies
uv sync
# Verify installation
uv run python --version
Step 3: Prepare data
# Download training data and train the tokenizer
uv run prepare.py
# Output: data preparation complete
Step 4: Run a baseline experiment
# Manually run one training pass
uv run train.py
# Output example: val_bpb: 1.234
Step 5: Start the AI Agent
# Use your favourite AI assistant (Claude, Codex, OpenClaw, …)
# Prompt example:
"Read program.md and start a new experiment. First read the current train.py, then propose an improvement."Step 6: Observe experiment logs
# Watch results in real time
tail -f results.tsv
# Format: experiment_id | timestamp | val_bpb | improvement?
# Example lines:
# 1 | 2026‑03‑07 10:00 | 1.234 | baseline
# 2 | 2026‑03‑07 10:05 | 1.220 | ✅ +1.1%
# 3 | 2026‑03‑07 10:10 | 1.235 | ❌ -0.1%
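Since results.tsv is tab‑separated, analysis.ipynb can load it directly; a minimal sketch, with column names assumed from the example lines above:
import pandas as pd

df = pd.read_csv('results.tsv', sep='\t',
                 names=['experiment_id', 'timestamp', 'val_bpb', 'improvement'])
print(df['val_bpb'].min())               # best bpb reached so far
print(df.sort_values('val_bpb').head())  # top experiments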
Conclusion
Core Design Recap
Single‑file modification – controls risk and makes diffs easy.
Fixed time budget – ensures experiments are comparable and optimisable.
Single metric – provides a clear, architecture‑agnostic objective.
Self‑contained – keeps the system easy to understand, debug and extend.
Key Takeaways
✅ AI programming : automatic code optimisation and bug fixing.
✅ AI writing : automatic article optimisation and content generation.
✅ Rapid start : clone the repo, install dependencies, run the baseline, and launch an AI agent.
✅ Continuous iteration : multiple agents and human feedback drive ongoing improvement.