AutoResearch: 630‑Line AI Agent That Self‑Evolves in 72 Hours and Earns 12.7k Stars

AutoResearch is a 630‑line Python project that lets an AI agent autonomously run machine‑learning experiments on a single GPU: each run edits the code automatically, trains under a fixed five‑minute budget, is scored by a single val_bpb metric, and is kept or reverted via git. The project showcases a minimal yet complete training framework built around the novel MuonAdamW optimizer.


Overview

AutoResearch is a 630‑line Python project that enables an LLM‑driven AI agent to conduct autonomous machine‑learning research on a single GPU. Each experiment runs within a fixed five‑minute training budget and is evaluated with the validation bits‑per‑byte metric (val_bpb); the agent then decides whether to keep or revert its code change.

Design Principles

Fixed‑time budget

All experiments use TIME_BUDGET = 300 seconds, providing:

Comparability – identical compute budget for every run.

Predictability – about 12 rounds per hour (≈100 rounds in an eight‑hour day).

Fairness – avoids the illusion that longer training automatically yields better results.
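In practice, a fixed wall‑clock budget is just a check against a monotonic clock inside the training loop. A minimal sketch (the 300‑second constant mirrors the article; the helper itself is illustrative, not the repo's code):

```python
import time

TIME_BUDGET = 300  # seconds, identical for every experiment


def run_budgeted(train_step, budget=TIME_BUDGET):
    """Run training steps until the wall-clock budget is exhausted."""
    start = time.monotonic()
    steps = 0
    while time.monotonic() - start < budget:
        train_step()
        steps += 1
    return steps
```

Using `time.monotonic()` rather than `time.time()` keeps the budget immune to system clock adjustments mid‑run.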

Single‑file modification scope

The agent may only edit train.py. prepare.py and the evaluation harness remain read‑only, keeping git diff output clear and limiting the search space.
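A hypothetical guard for this rule (not taken from the repo): compare the files touched by a diff against the allowed set and reject anything outside train.py. In a real loop the file list would come from `git diff --name-only HEAD`:

```python
ALLOWED_FILES = {"train.py"}  # prepare.py and the eval harness stay read-only


def diff_in_scope(changed_files):
    """Return True if every changed file is within the allowed edit scope."""
    return all(path in ALLOWED_FILES for path in changed_files)
```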

Minimal yet complete training stack

Despite its small size, AutoResearch implements a full LLM training pipeline:

Decoder‑only transformer architecture.

RMSNorm, RoPE, Grouped Query Attention (GQA).

Novel components: ResFormer Value Residual, Sliding Window Attention, ReLU² activation, Softcap Logits.

Optimizer: MuonAdamW with custom learning‑rate groups and orthogonalization.

MuonAdamW optimizer

Parameters are split into two groups, an Adam‑style group (token embeddings, LM head, scalars) and a Muon group (2‑D matrices), each with its own hyper‑parameters:

embedding_lr = 0.6   # token embeddings
unembedding_lr = 0.004   # LM head
scalar_lr = 0.5   # scalar parameters
adam_betas = (0.8, 0.95)

matrix_lr = 0.04   # 2‑D matrix parameters (Q, K, V, O, MLP weights)
momentum = 0.95   # Nesterov momentum
ns_steps = 5   # Newton‑Schulz orthogonalization iterations
beta2 = 0.95   # variance‑reduction beta2
weight_decay = 0.2   # cautious decay

The optimizer combines:

Nesterov momentum for accelerated convergence.

"Polar Express" orthogonalization – a fast Newton‑Schulz iteration that approximates matrix orthogonalization.

NorMuon variance‑reduction, a second‑moment estimator similar to Adam but more stable for matrix parameters.
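The orthogonalization step can be illustrated with the classic cubic Newton‑Schulz iteration. The project's "Polar Express" variant uses tuned higher‑order coefficients, so this simplified form only shows the idea:

```python
import numpy as np


def newton_schulz_orthogonalize(G, ns_steps=5):
    """Approximate the orthogonal polar factor of G without an SVD."""
    X = G / np.linalg.norm(G)  # Frobenius scaling: singular values fall in (0, 1]
    for _ in range(ns_steps):
        # Cubic update: each singular value s maps to 1.5*s - 0.5*s^3,
        # which pushes all singular values toward 1.
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X
```

For a symmetric positive‑definite input, the polar factor is the identity, which makes convergence easy to check.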

Evaluation Metric val_bpb

Bits‑per‑byte is computed as:

BPB = total_nats / (log(2) * total_bytes)

It is independent of vocabulary size and more stable than perplexity. Special tokens with zero byte length are excluded from the calculation.
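A direct transcription of the formula, taking per‑token negative log‑likelihoods in nats and per‑token byte lengths, with zero‑byte special tokens skipped (an illustrative helper, not the repo's code):

```python
import math


def val_bpb(token_nats, token_bytes):
    """Bits-per-byte over a validation set.

    token_nats: summed NLL (in nats) for each token
    token_bytes: UTF-8 byte length of each token; 0 marks special tokens
    """
    total_nats = sum(n for n, b in zip(token_nats, token_bytes) if b > 0)
    total_bytes = sum(b for b in token_bytes if b > 0)
    return total_nats / (math.log(2) * total_bytes)
```

A sanity check: a model assigning exactly log(2) nats per one‑byte token scores exactly 1.0 bits per byte.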

Autonomous Experiment Loop

1. AI reads program.md and current code
2. Modifies train.py (typically 1–2 edits)
3. git commit
4. Runs a 5‑minute training
5. Evaluates val_bpb
6. Appends result to results.tsv
7. Decision:
   - if val_bpb improves → keep commit
   - else → git reset
8. Return to step 1 for the next round

Pseudocode for the decision:

if val_bpb_improved:
    keep_commit()   # retain change
else:
    git_reset()    # discard change

Simplicity criteria are applied: a 0.001 val_bpb improvement with 20 lines of hacky code is rejected, whereas the same improvement with fewer lines is accepted. This prevents the agent from accumulating overly complex, low‑impact changes.
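The keep/revert rule with the simplicity criterion might look like the following. The per‑line threshold is a hypothetical value chosen to match the article's example, not a documented constant:

```python
MIN_GAIN_PER_LINE = 1e-4  # hypothetical: required val_bpb gain per changed line


def keep_change(old_bpb, new_bpb, lines_changed):
    """Keep a commit only if the improvement justifies its complexity."""
    gain = old_bpb - new_bpb  # positive means val_bpb improved
    if gain <= 0:
        return False  # no improvement: git reset
    # Same improvement with fewer lines clears the bar; a sprawling
    # low-impact change does not.
    return gain / max(lines_changed, 1) >= MIN_GAIN_PER_LINE
```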

Practical Setup

Typical commands (requires uv and a GPU with ≥24 GB VRAM):

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone repository
git clone https://github.com/karpathy/autoresearch.git
cd autoresearch

# Install dependencies
uv sync

# Prepare data and tokenizer (≈2 min)
uv run prepare.py

# Run a single training round (≈5 min)
uv run train.py

Hardware: minimum RTX 3090 (24 GB VRAM); recommended H100 (80 GB VRAM). If memory is insufficient, reduce DEPTH, DEVICE_BATCH_SIZE, or MAX_SEQ_LEN in train.py and prepare.py.

program.md authoring tips

Good example (clear direction, constraints, simplicity):

Focus on reducing training loss by exploring:
1. Different attention mechanisms (Linear, Flash)
2. Alternative activation functions (SwiGLU, GeGLU)
3. Novel normalization techniques

Constraints:
- Do NOT change random seed
- Do NOT increase model size beyond 100 M parameters
- Simplicity is preferred: a 0.001 improvement with 20 lines is not worth it

Bad example (vague):

Try to improve the model.

Comparison with Existing Tools

Compared with hyper‑parameter optimizers (Optuna, Ray Tune), AutoML platforms (AutoGluon, H2O) and generic agent frameworks (LangChain, AutoGPT):

Search space: AutoResearch modifies code (architecture, optimizer, training loop), whereas traditional tools search predefined parameter grids.

Search strategy: LLM‑guided intelligent search vs. Bayesian or grid search.

Human effort: a single instruction in program.md vs. extensive configuration of search spaces and objectives.

Innovation capability: enables architectural changes; traditional tools are limited to parameter tuning.

Result interpretability: every change is tracked in git, providing full auditability; many AutoML solutions act as black boxes.

Resources

GitHub repository: https://github.com/karpathy/autoresearch

Karpathy tweet: https://x.com/karpathy/status/2029701092347630069

Hacker News discussion: https://news.ycombinator.com/item?id=47291123

Dataset (climbmix‑400b‑shuffle): https://huggingface.co/datasets/karpathy/climbmix-400b-shuffle

Figure: AutoResearch workflow diagram
Figure: AutoResearch self‑evolution loop
Figure: MuonAdamW optimizer architecture
Figure: ResFormer Value Residual mechanism
Figure: AutoResearch vs traditional hyper‑parameter tuning
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

AI Agent, LLM research, Machine Learning Automation, Self‑evolving, AutoResearch, MuonAdamW, val_bpb
Written by

Shuge Unlimited

Formerly "Ops with Skill", now officially upgraded. Fully dedicated to AI, we share both the why (fundamental insights) and the how (practical implementation). From technical operations to breakthrough thinking, we help you understand AI's transformation and master the core abilities needed to shape the future. ShugeX: boundless exploration, skillful execution.
