How Claude Code AI Agents Generated 100 Research Papers in 10 Days
In 228 hours, the Fully Automated Research System (FARS), built on Claude Code and other AI agents, used 160 NVIDIA GPUs to produce 100 peer‑review‑level papers with an average ICLR score of 5.05 – higher than the human‑submission average – while highlighting the expanding role, limits, and safety concerns of AI‑driven scientific automation.
1. A Jaw‑Dropping Number
On 13 Feb 2026 the Analemma team (led by Sun Tianxiang, a Fudan PhD and core MOSS developer) launched the Fully Automated Research System (FARS) on a cluster of 160 NVIDIA GPUs. In 228 hours 28 minutes the system generated 244 research hypotheses and produced 100 complete papers, averaging 2 hours 17 minutes per paper. Total token consumption was 11.4 billion tokens, at a cost of roughly US$104k. Using Stanford’s Agentic Reviewer under ICLR standards, the papers received an average score of 5.05, surpassing the human‑submitted average of 4.21 (the acceptance threshold is 5.39). The entire process was livestreamed with zero human intervention, and the code is open‑source on GitLab.
Other notable milestones include Sakana AI’s AI‑Scientist‑v2 producing a fully AI‑written paper accepted at the ICLR 2025 workshop, and DeepMind’s Aletheia solving 63 of 700 Erdős‑style conjectures, four of which were previously unsolved.
2. The 2026 Landscape
After reviewing more than 20 recent reports, papers and news items, four dimensions were identified: end‑to‑end research systems, vertical‑domain breakthroughs, infrastructure, and evaluation benchmarks. The following concise overview captures the most representative systems:
Large‑scale autonomous research: FARS (Analemma) – 228 h → 100 papers, scoring above the ICLR human average.
End‑to‑end autonomous research: Robin, AI‑Scientist‑v2, Aster, 大圣 (“Great Sage”) – e.g., Robin (4‑person team + 3 agents) completed a 2.5‑month drug‑discovery pipeline.
Mathematics: Aletheia (DeepMind) – 95.1 % accuracy on IMO‑Proof Bench.
Biomedicine: Biomni (Stanford) – 800× acceleration of genome‑analysis pipelines.
Materials science: MARS (Chinese Academy of Sciences) – new material design in 3.5 h.
Drug discovery: Mozi, OrchestRA – multi‑agent pipelines with safety‑governed autonomy.
Research writing: OpenAI Prism (GPT‑5.2) – free LaTeX workbench.
Cross‑institution collaboration: Science Context Protocol (SCP) – 1 600+ interoperable tools.
The common thread is that AI is no longer a single‑point tool but a foundational infrastructure spanning the entire research workflow.
3. Five Key Tracks
3.1 End‑to‑End Autonomous Discovery
FARS employs a four‑module pipeline: Ideation → Planning → Experiment → Writing . Each module runs in parallel on a shared file system, allowing multiple projects to progress simultaneously. The system’s 160‑GPU cluster schedules both training and inference endpoints for the experiment agents.
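The staged, shared‑filesystem design described above can be sketched in a few lines. This is an illustrative mock, not FARS's actual code: the stage names come from the article, while the function signatures, file layout, and threading choice are assumptions.

```python
# Hypothetical sketch of a FARS-style staged pipeline. Stage names are from
# the article; everything else (signatures, file layout) is assumed.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

STAGES = ["ideation", "planning", "experiment", "writing"]

def run_stage(project_dir: Path, stage: str) -> None:
    # Each stage reads its predecessor's artifact from the shared file
    # system and writes its own output for the next stage.
    (project_dir / f"{stage}.md").write_text(f"{stage} output for {project_dir.name}\n")

def run_project(project_dir: Path) -> str:
    project_dir.mkdir(parents=True, exist_ok=True)
    for stage in STAGES:          # stages run sequentially within a project...
        run_stage(project_dir, stage)
    return project_dir.name

root = Path("runs")
projects = [root / f"paper_{i:03d}" for i in range(4)]
# ...but independent projects advance in parallel, as in the FARS run.
with ThreadPoolExecutor(max_workers=4) as pool:
    done = list(pool.map(run_project, projects))
print(done)
```

Because the stages only communicate through files, any number of projects can be in different stages at once, which is what lets a single cluster keep both training and inference endpoints busy.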
During the live “FARS‑100” run, the system produced 244 hypotheses and 100 short papers covering AI safety, efficient inference, multimodal learning, reinforcement learning, and more. The papers were evaluated by Stanford’s Agentic Reviewer and scored 5.05 on average, close to the 5.39 acceptance benchmark. The system also reports negative results and can incorporate them into subsequent experiments within 3 days.
Design philosophy: each paper focuses on a single clear hypothesis, reporting both positive and negative outcomes – a “factory‑grade” research paradigm.
Comparative systems:
Aster (AI Labs, Feb 2026) – pursues extreme efficiency, achieving 1/190 of the compute of the human baseline on the ZAPBench task and winning the NanoGPT Speedrun.
InternAgent‑1.5 (Shanghai AI Lab, arXiv:2602.08990) – combines generation, verification, and evolution subsystems, emphasizing long‑term memory and cross‑disciplinary generality.
“大圣” (“Great Sage”) – a China‑led effort (Shanghai AI Institute + Fudan) with 300+ skill packages and a “group memory” mechanism, already delivering a drug‑discovery project valued at 20 M CNY.
3.2 Vertical Domains
Aletheia (DeepMind) has pushed IMO‑Proof Bench accuracy from 65.7 % to 95.1 % and authored a paper on feature‑weight analysis, while also proposing an autonomy‑grading taxonomy for AI‑driven mathematics.
In experimental science, the MARS system (CAS Shenzhen Institute, led by Yu Xuefeng) orchestrates 19 LLM agents plus 16 domain tools into five functional groups, achieving a closed‑loop perovskite synthesis in only ten iterations. The work appears in Matter (Cell Press, DOI 10.1016/j.matt.2025.102577).
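The closed‑loop structure the article attributes to MARS – propose a recipe, run it, measure, and feed the result back – can be illustrated with a toy loop. The objective function and proposal rule below are stand‑ins, not the actual chemistry or agent logic.

```python
# Toy closed-loop optimization cycle in the spirit of the MARS description.
# The "measurement" is a synthetic objective, not a real synthesis result.
import random

random.seed(0)

def measure(params):
    # Stand-in for a robotic synthesis + characterization step: higher is better.
    return -(params["temp"] - 150) ** 2 - (params["ratio"] - 1.2) ** 2

def propose(best, scale):
    # Agent proposes a perturbation around the current best recipe.
    return {k: v + random.uniform(-scale, scale) for k, v in best.items()}

best = {"temp": 100.0, "ratio": 1.0}
best_score = measure(best)
for i in range(10):               # article: closed loop in ~10 iterations
    cand = propose(best, scale=20 / (i + 1))
    score = measure(cand)
    if score > best_score:        # keep improvements, discard regressions
        best, best_score = cand, score
print(round(best_score, 1))
```

The point of the real system is that each of those loop bodies is a full robotic experiment with 19 agents coordinating instruments, so ten iterations is a remarkably small budget.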
A Nature debate (Feb 2026) highlighted the tension: AI excels at well‑defined optimization tasks but struggles with ambiguous goals, non‑standard samples, or tasks requiring dexterous manipulation.
3.3 Drug Discovery
OrchestRA (arXiv:2512.21623) splits the pipeline into three specialist agents (biology, chemistry, pharmacology) that iteratively refine molecules based on a trillion‑scale knowledge graph.
Mozi (arXiv:2603.03655) adds a governance layer that forces a human‑in‑the‑loop checkpoint for high‑uncertainty decisions, a crucial safety measure for high‑risk pharmaceutical applications.
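A governance gate of the kind described for Mozi can be sketched as a simple routing rule: decisions above an uncertainty threshold are escalated to a human checkpoint instead of executing autonomously. The threshold value and field names below are assumptions for illustration.

```python
# Hedged sketch of a human-in-the-loop governance gate; not Mozi's actual
# implementation. Threshold and data fields are hypothetical.
from dataclasses import dataclass

@dataclass
class Decision:
    action: str
    uncertainty: float  # model's self-reported uncertainty, in [0, 1]

UNCERTAINTY_THRESHOLD = 0.3  # hypothetical policy value

def route(decision: Decision) -> str:
    if decision.uncertainty > UNCERTAINTY_THRESHOLD:
        return "escalate_to_human"   # block until a reviewer signs off
    return "auto_execute"

print(route(Decision("order reagent", 0.05)))          # auto_execute
print(route(Decision("advance lead compound", 0.8)))   # escalate_to_human
```

The design choice worth noting: the gate sits between the agent's decision and its execution, so autonomy degrades gracefully to supervision rather than failing open.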
3.4 Infrastructure Layer
Scattered AI tools need a unifying “operating system”. The open‑source Science Context Protocol (SCP) extends Anthropic’s MCP with structured experiment metadata, a centralized hub, intelligent workflow orchestration, and standardized device drivers, now integrating >1 600 tools across biology, physics and chemistry.
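To make "structured experiment metadata" concrete, here is what a tool descriptor in an SCP‑like registry might contain. The field names and values are illustrative assumptions, not the actual protocol schema.

```python
# Illustrative tool descriptor for an SCP-like registry. All field names
# are hypothetical; the real schema may differ.
import json

tool_descriptor = {
    "name": "hplc_runner",
    "domain": "chemistry",
    "driver": "standardized_device_driver_v1",   # assumed driver interface
    "inputs": {"sample_id": "string", "gradient": "list[float]"},
    "outputs": {"chromatogram": "file/csv"},
    "experiment_metadata": {                     # the structured layer the
        "instrument": "HPLC-01",                 # article says SCP adds on
        "calibration_date": "2026-01-15",        # top of plain MCP tools
    },
}

print(json.dumps(tool_descriptor, indent=2))
```

Carrying instrument and calibration metadata alongside the tool interface is what lets an orchestrator reason about whether a result from one lab is comparable to a result from another.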
On the personal‑tool side, OpenAI’s Prism (GPT‑5.2) offers a free AI‑native LaTeX workbench, while Claude Scholar bundles 40+ academic skills covering the full research lifecycle. Both lower the barrier from “code‑savvy technologist” to “any researcher who can describe a problem”.
3.5 Claude Code’s Role
Anthropic’s Claude ecosystem occupies a unique niche:
Base‑model selection: Stanford’s Biomni chose Claude after extensive benchmarking for superior scientific knowledge, programming ability and workflow integration, achieving an 800× speed‑up in wearable bio‑informatics analysis.
Claude Code 2.1 introduces Git Worktree parallel development, hot‑reloading of skills, and a hierarchical governance model that resolves multi‑user, multi‑task permission challenges.
Claude for Life Sciences attains a Protocol QA score of 0.83 (human baseline 0.79) and integrates 10× Genomics, Benchling, PubMed, etc., enabling end‑to‑end RNA‑seq QC to literature search within a single workflow.
Overall, Claude’s strategy is not merely “the strongest model” but a three‑pronged “model + toolchain + vertical knowledge” empowerment stack, differentiating it from OpenAI’s Prism (writing focus) and DeepMind’s Aletheia (mathematical discovery).
4. Cold‑Water Time: What AI Still Can’t Do
Meta Research’s AIRS‑Bench (arXiv:2602.06855) evaluated AI research agents on 20 tasks. AI outperformed humans on only 4 tasks (data‑processing, pattern‑matching) and lagged on the remaining 16, especially those requiring domain intuition, creative hypothesis generation or long‑term planning.
A Nature debate concluded that current lab‑automation systems excel at “optimisation with a clear scoring function” but fail when the goal is vague or requires dexterous manipulation.
Capability‑boundary table (summarised):
Literature search: fast, but with occasional citation errors – humans must judge academic value.
Data analysis: rapid and standardized – humans must interpret physical/biological meaning.
Experiment design: strong at parameter search – humans must devise novel paradigms.
Paper writing: coherent structure – humans must extract unique insights and avoid hallucinations.
Peer review: can check format and logic – humans must assess methodological soundness.
Bottom line: AI is an excellent executor but not yet a first‑principles thinker.
5. A Warning: Over‑Confidence of Wrong Answers
Large models can confidently produce incorrect results, a danger amplified in scientific contexts. Examples include:
Reporting a statistically significant (p < 0.01) policy effect on a two‑province panel without questioning external validity.
Claiming a causal positive effect from merely correlated variables in the absence of instruments.
Converging a fluid‑dynamics loss to 1e‑5 while mistakenly fixing the inlet velocity to zero, rendering the solution physically meaningless.
These errors appear perfectly formatted and logically consistent, making them hard to spot without domain expertise. The Curie system (arXiv:2502.16069) proposes a “rigor module” that inserts verification steps inside and between agents, but human review after any critical conclusion remains indispensable.
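Verification steps of the kind Curie's "rigor module" inserts can be sketched as explicit checks run before a conclusion is accepted. The two checks below mirror the failure modes listed above; they are illustrative, not Curie's actual implementation.

```python
# Sketch of rigor-module-style verification checks; hypothetical, inspired
# by the failure modes in the article, not Curie's real code.
def check_boundary_conditions(inlet_velocity: float) -> list:
    issues = []
    if inlet_velocity == 0.0:
        issues.append("inlet velocity fixed to zero: solution may be "
                      "physically meaningless")
    return issues

def check_statistics(n_units: int, p_value: float) -> list:
    issues = []
    if n_units < 30 and p_value < 0.01:
        issues.append("very small sample with strong significance: "
                      "question external validity before reporting")
    return issues

issues = check_boundary_conditions(0.0) + check_statistics(2, 0.004)
for msg in issues:
    print("FLAG:", msg)   # flagged findings are routed to a human reviewer
```

The structural idea is that checks run both inside an agent (before it commits a result) and between agents (before a downstream agent consumes it), with humans as the final gate.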
6. Hands‑On: Karpathy’s Autoresearch – 100 Rounds of Self‑Improvement
On 7 Mar 2026 Andrej Karpathy (former Tesla AI director, OpenAI co‑founder) open‑sourced autoresearch. The core idea: researchers write a program.md describing the research agenda; an AI agent iteratively edits train.py, runs experiments, and decides whether to keep or revert changes.
Project layout (simplified):
prepare.py – data download, BPE tokenizer, fixed evaluation function.
train.py – ~630‑line GPT training loop (model, optimizer, flash‑attention, etc.); AI agents are the only component allowed to modify this file.
program.md – human‑written protocol (research steps, keep/discard criteria, and a “NEVER STOP” rule).
Quick‑start (requires a single NVIDIA GPU, Python 3.10+, and the uv package manager):
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Install dependencies
uv sync
# Prepare data (≈2 min)
uv run prepare.py
# Run a baseline training (≈5 min)
uv run train.py

The baseline prints metrics such as:
val_bpb: 0.997900
training_seconds: 300.1
peak_vram_mb: 45060.2
mfu_percent: 39.80
num_params_M: 50.3

val_bpb (validation bits‑per‑byte) is the sole objective – lower is better. All subsequent experiments run for a fixed 5‑minute window to ensure comparability.
Autonomous loop (pseudo‑code):
LOOP FOREVER:
1. Propose an improvement (e.g., wider layers, new activation, learning‑rate tweak)
2. Edit train.py
3. git commit
4. uv run train.py > run.log # 5‑min training
5. Read val_bpb
6. If val_bpb lower → KEEP and advance branch
7. Else → git reset (DISCARD)
8. Log result to results.tsv
9. Repeat

The program.md enforces two principles:
NEVER STOP – the agent runs indefinitely until manually halted.
Simplicity rule – keep only clean, low‑complexity improvements (e.g., a 0.001 val_bpb gain from deleting code is acceptable, but the same gain from adding 20 lines of hacky code is not).
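The keep‑or‑revert loop above can be sketched in Python. Here the "edit train.py" and "5‑minute training" steps are mocked with a synthetic score; in autoresearch they are a real LLM edit and a real uv run train.py, and keep/revert is done with git.

```python
# Python sketch of the autoresearch loop. The proposal and training steps
# are synthetic stand-ins, not the real agent or training run.
import random

random.seed(42)

def run_experiment(quality: float) -> float:
    # Mock of the fixed 5-minute training run: returns a val_bpb reading.
    return 1.0 - quality + random.uniform(-0.01, 0.01)

best_quality, best_bpb = 0.0, run_experiment(0.0)
log = []
for step in range(20):
    cand_quality = best_quality + random.uniform(-0.02, 0.03)  # "edit train.py"
    bpb = run_experiment(cand_quality)
    kept = bpb < best_bpb              # lower val_bpb wins
    if kept:
        best_quality, best_bpb = cand_quality, bpb  # "commit and advance"
    # else: discard the change, i.e. "git reset"
    log.append((step, round(bpb, 4), kept))

keep_rate = sum(k for *_, k in log) / len(log)
print(f"best val_bpb={best_bpb:.4f} keep_rate={keep_rate:.2f}")
```

Note that the acceptance test is purely empirical: the agent never has to justify a change, only to beat the current best under the fixed compute budget.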
An accompanying analysis.ipynb loads results.tsv, computes keep‑rate, plots the running best val_bpb curve, and extracts the top‑impact code changes.
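The analysis step can be sketched as plain TSV processing. The column names and example rows below are assumptions standing in for a real results.tsv.

```python
# Sketch of the analysis.ipynb computations: keep-rate and the running
# best val_bpb. Column names and the example rows are hypothetical.
import csv, io

# Toy stand-in for a real results.tsv produced by the loop.
tsv = "step\tval_bpb\tkept\n0\t0.9979\t1\n1\t0.9991\t0\n2\t0.9950\t1\n"
rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))

keep_rate = sum(r["kept"] == "1" for r in rows) / len(rows)

best = float("inf")
running_best = []
for r in rows:
    if r["kept"] == "1":                 # only kept runs advance the best
        best = min(best, float(r["val_bpb"]))
    running_best.append(best)

print(f"keep_rate={keep_rate:.2f} running_best={running_best}")
```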
Verification checklist after a run includes confirming a baseline row, monotonic decrease of kept val_bpb, proper error messages for crashes, reasonable VRAM usage, and reproducibility of the final train.py.
7. Final Thoughts: Scientists as Designers
By March 2026 the frontier has shifted dramatically:
Analemma’s FARS produced 100 papers in 228 h with scores above human averages.
FutureHouse’s Robin delivered a drug‑discovery pipeline that entered peer review.
DeepMind’s Aletheia solved decades‑old mathematical conjectures.
Karpathy’s autoresearch lets anyone with a GPU experience autonomous research.
Claude Code and related tools are no longer just code‑generation assistants; they act as the glue that stitches disparate research stages into a self‑operating pipeline.
The core competitive edge for scientists is moving from “what can be done” to “what should be done”. With AI handling literature mining, data cleaning, code debugging and figure generation, researchers can devote mental bandwidth to formulating original questions, designing unconventional experiments, and providing deep interpretive insights.
Future labs will see scientists as “chief designers” who define the blueprint, aesthetics and safety of the research building, while AI agents become the skilled construction crew.
AI Agent Research Hub
Sharing AI, intelligent agents, and cutting-edge scientific computing