Recursive AI Takes Its First Step: Automated Research System Sets New SOTA Benchmarks

Recursive Superintelligence unveiled an open-source system that automates the AI research loop and reports state-of-the-art results on three distinct benchmarks: NanoChat autoresearch, NanoGPT speedrun, and SOL-ExecBench. The results are a concrete example of the progress toward recursive self-improvement that Anthropic has warned about.

Machine Heart

Anthropic recently published “When AI Builds Itself,” revealing that over 80% of its codebase was generated by Claude and that Claude can accelerate training code by about 52×, prompting a call for industry‑wide safeguards against uncontrolled recursive self‑improvement.

In response, the newly founded Recursive Superintelligence, co-founded by Tian Yuandong, released its first public technical artifact, First Steps Toward Automated AI Research. The system is designed to close the traditionally human-driven AI research loop (idea, code, experiment, analysis) by automatically proposing experiments, generating code, running it, learning from the results, and deciding what to search next. It supports parallel research tracks and cross-task reuse, and it embeds reward-hacking detection to keep the system from gaming its metrics.
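The loop described above can be illustrated with a toy sketch. This is not Recursive's implementation: the objective, the `eval_on_train` shortcut, and every function name here are hypothetical stand-ins chosen only to show the propose, run, check, and decide steps in order.

```python
import random


def run_experiment(config):
    """Toy stand-in for 'generate code and execute it': returns a loss."""
    loss = (config["lr"] - 0.3) ** 2 + 0.1
    if config["eval_on_train"]:
        loss -= 0.09  # a shortcut that improves the number, not the model
    return loss


def looks_like_reward_hacking(config):
    """Hypothetical check: the metric must come from the held-out split."""
    return config["eval_on_train"]


def propose(best_config, rng):
    """Propose a nearby config; occasionally the proposal takes the shortcut."""
    return {
        "lr": best_config["lr"] + rng.uniform(-0.1, 0.1),
        "eval_on_train": rng.random() < 0.2,
    }


def research_loop(steps=200, seed=0):
    rng = random.Random(seed)
    best = {"lr": 0.9, "eval_on_train": False}
    best_loss = run_experiment(best)
    rejected = 0
    for _ in range(steps):
        candidate = propose(best, rng)
        if looks_like_reward_hacking(candidate):
            rejected += 1  # discard before the score can mislead the search
            continue
        loss = run_experiment(candidate)
        if loss < best_loss:  # learn from the result, steer the next proposal
            best, best_loss = candidate, loss
    return best, best_loss, rejected


best, best_loss, rejected = research_loop()
print(best, best_loss, rejected)
```

The point of the check is ordering: a candidate that games the metric is rejected before its score can influence which direction the search explores next.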

Benchmark 1: NanoChat Autoresearch (Fixed‑Budget Small Model Training)

Following Andrej Karpathy's autoresearch challenge, the task fixes a five-minute GPU training budget and seeks the lowest validation bits-per-byte (BPB). Recursive's system started from the same code as the community baseline and lowered BPB from the previous best of 0.9372 to 0.9109, an improvement of 0.0263, equivalent to a roughly 1.3× reduction in the training time needed to reach the same quality.
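For readers unfamiliar with the metric, BPB divides the model's total cross-entropy, converted to bits, by the number of raw bytes in the evaluated text, so scores stay comparable across tokenizers. A minimal sketch, assuming the loss is available as a summed negative log-likelihood in nats (the function name is illustrative):

```python
import math


def bits_per_byte(total_nll_nats, total_bytes):
    """Convert summed cross-entropy (nats) over a text into bits per byte."""
    return total_nll_nats / (math.log(2) * total_bytes)


# Absolute and relative size of the reported improvement.
previous_best, new_best = 0.9372, 0.9109
absolute_gain = previous_best - new_best
relative_gain = absolute_gain / previous_best
print(f"absolute: {absolute_gain:.4f} BPB, relative: {relative_gain:.1%}")
```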

The gain stems from a richer short-context memory: a hash-based value path that stores bigram and trigram information together, combined through learnable gating, with the hashing differing across Transformer layers to lower the probability of collisions. This variant resembles DeepSeek Engram but is a novel deployment for the fixed-budget scenario.
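The idea of a hashed n-gram value path can be sketched in plain Python. This is an illustration under stated assumptions, not the system's code: the FNV-1a-style integer mixing, the per-layer salt, the table sizes, and the scalar sigmoid gates are all choices made here for clarity, and the tables would be learned parameters in a real model.

```python
import math
import random

FNV_OFFSET = 0xCBF29CE484222325  # 64-bit FNV-1a offset basis
FNV_PRIME = 0x100000001B3        # 64-bit FNV prime
MASK64 = (1 << 64) - 1


def ngram_bucket(context, layer, num_buckets):
    """Hash an n-gram of token ids to a bucket, salted by layer and n."""
    h = FNV_OFFSET
    for value in (layer, len(context), *context):
        h = ((h ^ value) * FNV_PRIME) & MASK64
    return h % num_buckets


def make_table(num_buckets, dim, rng):
    """Randomly initialised value table; in training these would be learned."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_buckets)]


def hashed_memory(tokens, pos, layer, tables, gate_logits):
    """Gated sum of the bigram and trigram value vectors ending at `pos`."""
    dim = len(tables[2][0])
    out = [0.0] * dim
    for n in (2, 3):
        if pos + 1 < n:
            continue  # not enough left context for this n-gram
        context = tokens[pos - n + 1 : pos + 1]
        row = tables[n][ngram_bucket(context, layer, len(tables[n]))]
        gate = 1.0 / (1.0 + math.exp(-gate_logits[n]))  # sigmoid gate
        for i in range(dim):
            out[i] += gate * row[i]
    return out


rng = random.Random(0)
tables = {2: make_table(1021, 4, rng), 3: make_table(1021, 4, rng)}
gate_logits = {2: 0.0, 3: 0.0}  # learnable scalars in a real model
tokens = [17, 42, 7, 99]
vec = hashed_memory(tokens, pos=3, layer=0, tables=tables, gate_logits=gate_logits)
```

Because the layer index is mixed into the hash, two n-grams that share a bucket in one layer are unlikely to share one in another, which is one plausible reading of how varying the hash per layer lowers the effective collision rate.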

Benchmark 2: NanoGPT Speedrun (Training‑Time Minimum)

The community’s NanoGPT speedrun aims to reach a validation loss of 3.28 on eight H100 GPUs as quickly as possible. After years of community optimization, the record stood at 79.7 seconds. Recursive’s system further cut the time to 77.5 seconds, shaving 2.2 seconds and matching the latest human‑driven improvements.

Key techniques include:

FP8 attention computation: extends FP8 precision from the model head to the full attention matrix, doubling Tensor Core throughput while keeping BF16 in back-propagation for stability.

Annealed optimizer noise: injects zero-mean Gaussian noise into the NorMuon optimizer on a linear decay schedule, encouraging broader exploration before convergence.

Fused MLP kernel: a custom Triton kernel that stores only the squared ReLU activations in the forward pass and recomputes the unsquared values during back-propagation, eliminating one full-tensor memory round-trip.
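Two of these techniques rest on simple math that can be shown without a GPU. The sketch below is scalar Python, not the Triton kernel or the NorMuon code: `noise_std` assumes the decay runs linearly to zero over the whole run, and the `relu2_*` helpers only demonstrate why the stored output y = relu(x)^2 is enough to compute the gradient, since relu(x) = sqrt(y) and dy/dx = 2·relu(x). All names are illustrative.

```python
import math


def noise_std(step, total_steps, initial_std):
    """Linearly decay the noise scale from initial_std at step 0 to zero."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return initial_std * (1.0 - frac)


def relu2_forward(x):
    """Return y = relu(x)^2; y is the only tensor kept for the backward pass."""
    return [max(v, 0.0) ** 2 for v in x]


def relu2_backward(y, grad_y):
    """Recover relu(x) as sqrt(y), then apply dy/dx = 2 * relu(x)."""
    return [2.0 * math.sqrt(v) * g for v, g in zip(y, grad_y)]


y = relu2_forward([-1.0, 0.5, 3.0])
grad_x = relu2_backward(y, [1.0, 1.0, 1.0])
print(y, grad_x)
```

Negative inputs need no special case in the backward pass: their stored output is zero, so the recovered activation and the gradient are both zero.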

Benchmark 3: SOL‑ExecBench (GPU Kernel Optimization)

SOL‑ExecBench evaluates 235 GPU kernels across matrix multiplication, reduction, normalization, attention, quantization, and fused blocks. Scores are normalized to a SOL metric where 0.5 matches a PyTorch reference and 1.0 approaches hardware limits. The previous best public score was 0.699.

Recursive's system ran all 235 kernels, reusing optimizations it discovered along the way, such as memory-movement strategies and tiling patterns, and achieved a SOL score of 0.754. That closes about 18% of the gap between the previous best score and the theoretical hardware ceiling of 1.0.
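The 18% figure follows from the two scores and the ceiling of 1.0. A short check, with a helper name chosen here for illustration:

```python
def gap_closed(old_score, new_score, ceiling=1.0):
    """Fraction of the remaining distance to the ceiling that was closed."""
    return (new_score - old_score) / (ceiling - old_score)


fraction = gap_closed(0.699, 0.754)
print(f"{fraction:.1%} of the gap to the hardware ceiling closed")
```

The gap shrinks from 1.0 - 0.699 = 0.301 to 1.0 - 0.754 = 0.246, a reduction of 0.055, or roughly 18% of the original gap.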

The authors note that kernel engineering is a highly specialized skill, and their system generated these ideas without deep kernel expertise, highlighting the power of AI‑driven discovery.

Context and Outlook

Recursive Superintelligence, founded in late 2025, raised $650 million at a $4.65 billion valuation, backed by GV, Greycroft, Nvidia Ventures, and AMD Ventures. Its mission aligns with other recent efforts—Yann LeCun’s AMI Labs and David Silver’s Ineffable Intelligence—to let AI systems autonomously generate knowledge and reduce human involvement in research.

While the current system excels in well‑defined, fast‑feedback scenarios, the authors acknowledge that extending autonomous research to open‑ended scientific problems remains a significant challenge, particularly in preventing reward‑hacking at scale.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
