630‑Line Autoresearch Generates 81 Agents, 2,300 Experiments and Ten Pre‑training Insights

A 630‑line Python Autoresearch project sparked a community‑run distributed system that created over 80 autonomous AI agents, executed more than 2,300 experiments in four days, self‑organized roles and peer‑review, and uncovered ten concrete pre‑training findings.

Machine Learning Algorithms & Natural Language Processing

Karpathy’s Autoresearch codebase comprises 630 lines of Python. In a two‑day autonomous run it executed 276 experiments, identified 29 effective improvements, and increased language‑model training efficiency by roughly 11% without human intervention.

Inspired by SETI@home, the community added a distributed collaboration layer called autoresearch@home. Within four days the system grew from 13 to over 80 agents running on separate GPUs and completed more than 2,000 experiments. Agents self‑organized into roles (experimenter, verifier, statistician, and meta‑analyst) without explicit task assignment.

Quantitative highlights: a single agent performed 188 experiments in one day, while another generated 5,895 research hypotheses that were never executed, illustrating the breadth of a real‑time shared‑knowledge laboratory.

Key empirical findings

More training steps outweigh larger batch sizes: halving batch_size from 2^19 to 2^18 and doubling the number of steps improved bits per byte (BPB) by 0.007.
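As a rough sketch of why this comparison is fair: halving the batch size (measured in tokens) while doubling the step count leaves the total token budget unchanged, so any gain is attributable to the extra optimizer updates. The step counts below are illustrative, not values from the run.

```python
# Hypothetical sketch: halving batch size while doubling steps keeps the
# total token budget fixed, isolating the effect of the step count.
def total_tokens(batch_size: int, num_steps: int) -> int:
    """Tokens consumed by a run (batch size measured in tokens per step)."""
    return batch_size * num_steps

baseline = total_tokens(batch_size=2**19, num_steps=5_000)   # illustrative step count
halved   = total_tokens(batch_size=2**18, num_steps=10_000)  # half batch, double steps

assert baseline == halved  # same data budget, twice the optimizer updates
```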

Simple attention patterns dominate: agents converged on a windowed attention architecture (SSSL) consisting of three short‑context layers followed by one long‑context layer, repeated.
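A minimal sketch of the SSSL schedule, assuming illustrative window sizes (the actual short and long context lengths are not given here):

```python
# Sketch of the SSSL window schedule: three short-context layers followed
# by one long-context layer, repeated. Window sizes are illustrative
# assumptions, not values from the run.
def sssl_windows(num_layers: int, short: int = 256, long: int = 4096) -> list[int]:
    """Return a per-layer attention window following the S,S,S,L pattern."""
    pattern = [short, short, short, long]
    return [pattern[i % 4] for i in range(num_layers)]

print(sssl_windows(8))  # [256, 256, 256, 4096, 256, 256, 256, 4096]
```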

Initialization tweaks matter more than optimizer changes: normal initialization for value embeddings, QKV scaling, and a learnable residual‑connection weight yielded an additional ~0.004 BPB gain.

Learnable parameters beat fixed constants: replacing static values (e.g., skip‑connection weight, lambda mixing coefficient, gated value embedding) with learnable parameters consistently improved performance, even in short‑duration training runs.
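A dependency‑free sketch of the pattern: the residual mixing weight below would be a trainable parameter in a real framework, updated by the optimizer alongside the other weights. Only the forward computation is shown, and the class is an illustration rather than the project's code.

```python
# Illustrative sketch: a residual connection whose mixing weight is a
# trainable scalar rather than a constant fixed at 1.0. In a real
# framework the weight would be registered as a learnable parameter.
class LearnableResidual:
    def __init__(self, init_weight: float = 1.0):
        self.weight = init_weight  # updated by gradient descent in practice

    def forward(self, x: float, sublayer_out: float) -> float:
        # out = x + w * f(x), with w learnable instead of hard-coded
        return x + self.weight * sublayer_out

res = LearnableResidual(init_weight=0.5)
print(res.forward(2.0, 4.0))  # 2.0 + 0.5 * 4.0 = 4.0
```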

Optimal architecture is compact: the best configuration discovered was 12 layers, hidden dimension 512, aspect ratio 40. Extending to 16 layers added 84% more parameters, reduced training steps by 23%, and worsened BPB.

Many claimed improvements fall within random‑seed variance: a dedicated agent ran 100 random‑seed experiments and measured a variance of ~0.002 BPB, indicating that many reported gains may be noise.
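One way to operationalize this finding, sketched with made‑up seed scores around the reported ~0.002 BPB spread: accept a claimed gain only if it exceeds a multiple of the seed‑to‑seed standard deviation.

```python
# Hypothetical sketch: compare a claimed improvement against the standard
# deviation of BPB across random seeds before treating it as real.
import statistics

def is_significant(claimed_gain: float, seed_bpbs: list[float], k: float = 2.0) -> bool:
    """Accept a gain only if it exceeds k standard deviations of seed noise."""
    noise = statistics.stdev(seed_bpbs)
    return claimed_gain > k * noise

# Illustrative numbers with roughly the seed-to-seed spread the agent measured.
seeds = [0.972, 0.974, 0.970, 0.973, 0.971]
print(is_significant(0.007, seeds))  # the step-count gain clears the noise floor
print(is_significant(0.002, seeds))  # a gain this small may just be seed noise
```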

Some well‑known tricks backfire: weight tying inflated BPB to 3.216, label smoothing raised it to 1.32, and a PaLM‑style z‑loss consistently degraded results.

Negative results are recorded in shared memory: failed experiments are stored, preventing later agents from repeating them.

Data‑pipeline strategies hold untapped potential: agents proposed over 1,000 hypotheses (curriculum learning, data ordering, domain‑specific batching) that remain untested.

Collective memory accelerates discovery: agents start from the best known configuration, avoiding redundant exploration and achieving faster convergence.
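The last two findings can be sketched together as a shared results store: every outcome, including failures, is recorded, and new agents query it before running anything. The class and its API are assumptions for illustration, not the project's actual code.

```python
# Hypothetical sketch of the shared-memory idea: agents record every result
# (including failures) in a common store, skip configurations already tried,
# and resume from the best known configuration.
class SharedLab:
    def __init__(self):
        self.results: dict[frozenset, float] = {}  # config -> measured BPB

    def already_tried(self, config: dict) -> bool:
        return frozenset(config.items()) in self.results

    def record(self, config: dict, bpb: float) -> None:
        self.results[frozenset(config.items())] = bpb

    def best_config(self) -> dict:
        """New agents start from the best known configuration (lowest BPB)."""
        key = min(self.results, key=self.results.get)
        return dict(key)

lab = SharedLab()
lab.record({"layers": 12, "dim": 512}, 0.970)
lab.record({"layers": 16, "dim": 512}, 0.981)         # negative result kept too
print(lab.best_config())                              # agents resume from 12 layers
print(lab.already_tried({"layers": 16, "dim": 512}))  # True -> not repeated
```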

Derived work: auto‑discovery

The auto‑discovery project demonstrates that the same agents can outperform established systems such as AlphaEvolve, SkyDiscover, and LoongFlow on classic mathematical optimization benchmarks. Agents are also capable of scraping public repositories for optimal solutions and reading evaluation code to devise tolerance‑aware optimization strategies.

Repository links:

https://github.com/karpathy/autoresearch

https://ensue-network.ai/autoresearch?view=strategies

https://github.com/XinmingTu/auto-discovery
