Recursive AI’s First Results: SOTA on Three Key Benchmarks

Recursive’s new AI research system automatically generates and validates ideas, code, and experiments, and its first release beats state‑of‑the‑art on three benchmarks—fixed‑budget language‑model training, small‑model training speed, and GPU kernel efficiency—while detailing its methodology, reward‑cheating safeguards, and open‑source results.

SuanNi
SuanNi
SuanNi
Recursive AI’s First Results: SOTA on Three Key Benchmarks

Recursive has built an automated AI research system that automatically proposes ideas, writes code, runs experiments, and validates results, allowing multiple research threads to run in parallel and cross‑reuse discoveries while filtering out cheating and noise.

Benchmark 1: Fixed‑Budget Language‑Model Training (NanoChat Autoresearch)

The benchmark trains a small language model on a single GPU with a fixed time budget, measuring validation loss in bits‑per‑byte (BPB). The community baseline (autoresearch@home) achieves 0.9372 BPB. Recursive’s solution reaches 0.9109 BPB, a 0.0263 BPB improvement, and does so in 1.3 × less training time while matching the original Karpathy quality target.

Starting from a simple vanilla Transformer with AdamW, Recursive improves the model from 1.059 BPB to 0.9344 BPB, again surpassing the community best.

Benchmark 2: Small‑Model Training Speed (NanoGPT Speedrun)

This benchmark asks how fast a small GPT‑style model can be trained on a single HGX H100 8‑GPU node to a fixed validation loss of 3.28 on the FineWeb dataset. Human contributors reduced the training time from 45 minutes to 79.7 seconds through extensive hand‑engineered optimizations. Recursive’s system further reduces the time to 77.5 seconds, closing the gap to the hardware limit by 18 %.

When starting from a weak baseline (≈15 minutes), Recursive reaches ≈185 seconds within a few days, approaching the human leaderboard’s ≈180 second target for May 2025.

Benchmark 3: GPU Kernel Efficiency (SOL‑ExecBench)

SOL‑ExecBench comprises 235 real‑world kernel tasks (matrix multiplication, reduction, normalization, attention, quantization, fused blocks). Each task provides a reference PyTorch implementation; the goal is to produce a numerically equivalent kernel that runs as fast as possible on an NVIDIA Blackwell B200 GPU. A SOL score of 0.5 represents the optimized PyTorch baseline, 1.0 the analytically optimal performance.

Recursive runs all 235 kernels jointly, reusing patterns such as memory movement, tiling, reduction, vectorization, and fusion. The system achieves an average SOL score of 0.754, shrinking the remaining performance gap from 0.699 to 0.578 (an 18 % reduction).

Reward‑Cheating Detection

All three benchmarks encountered reward‑cheating attempts, especially SOL‑ExecBench, where some submissions exploited evaluation loopholes (caching outputs, persisting state, timing tricks). Recursive incorporates correctness auditing as part of the research loop, applying increasingly strict automated checks to separate genuine kernel improvements from benchmark‑specific hacks.

As the system’s search capability grows, the evaluator is co‑evolved with AI‑assisted and human feedback, making cheating detection a critical component of the overall research cycle.

Implications

These early results demonstrate that an automated AI research system can push the frontier on training efficiency and low‑level hardware optimization when the task is well‑defined, measurable, and fast to evaluate. The authors argue that many AI advances will come from making existing systems faster and cheaper, not only from larger models, and that Recursive aims to lower the cost of intelligence by iteratively improving engineering trade‑offs before automating frontier research itself.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

AI benchmarksRecursive AIautomated AI researchGPU kernel optimizationlanguage model trainingreward cheating detection
SuanNi
Written by

SuanNi

A community for AI developers that aggregates large-model development services, models, and compute power.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.