How an AI Agent Outperformed NVIDIA Engineers in 7‑Day GPU Kernel Optimization
This article analyzes the AVO system, an autonomous AI agent that replaces traditional evolutionary search pipelines to iteratively improve CUDA attention kernels on NVIDIA's Blackwell B200 GPU, achieving up to 10.5% higher throughput than hand‑tuned implementations after a week of nonstop optimization.
Introduction
Recent work shows that an AI agent can continuously run evolutionary search for days, automatically generating and refining CUDA kernels for attention on NVIDIA's Blackwell B200 GPU. The resulting kernels reach up to 1668 TFLOPS, which is 3.5% faster than cuDNN 9.19.1 and 10.5% faster than FlashAttention‑4.
Limitations of Traditional Evolutionary Search
Classic evolutionary search maintains a population of candidate solutions and applies hand‑crafted mutation operators (e.g., random parameter changes, segment swaps). These heuristics are insufficient for the highly complex, hardware‑aware task of GPU kernel tuning. Systems such as FunSearch and AlphaEvolve replace the mutation operator with a large language model (LLM), but the LLM remains confined to a fixed pipeline that only generates candidates while the surrounding framework controls selection, evaluation, and population management. This restriction limits performance on engineering problems that require deep hardware reasoning.
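To make the contrast concrete, here is a minimal Python sketch of that fixed pipeline. The names `llm_propose_variant` and `compile_and_score` are hypothetical stand-ins for the LLM call and the evaluation harness; neither comes from the paper.

```python
import random

def evolve(seed_kernel, llm_propose_variant, compile_and_score,
           population_size=16, generations=100):
    """FunSearch/AlphaEvolve-style loop: the LLM only mutates, while
    selection, evaluation, and population management stay in the framework.
    `llm_propose_variant` and `compile_and_score` are hypothetical stubs."""
    population = [(seed_kernel, compile_and_score(seed_kernel))]
    for _ in range(generations):
        # The framework picks the parent (here, a small tournament).
        parent, _ = max(random.sample(population, k=min(3, len(population))),
                        key=lambda p: p[1])
        # The LLM's only role: rewrite the parent into a new candidate.
        child = llm_propose_variant(parent)
        score = compile_and_score(child)  # None = failed correctness
        if score is not None:
            population.append((child, score))
            population.sort(key=lambda p: p[1], reverse=True)
            population = population[:population_size]
    return population[0]
```

Note that every control decision — who breeds, what survives, when to stop — is hard-coded; the model never sees the trajectory it is part of.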
AVO: Redefining the Mutation Operator
AVO removes the fixed pipeline and lets an AI agent act as the mutation operator itself. The agent receives the entire population, a knowledge base, and a scoring function, and decides autonomously what data to query, which code to modify, when to test, and when to restart the search.
Each candidate is a CUDA kernel (including inline PTX). The scoring function evaluates two dimensions (a sketch follows this list):
Numerical correctness – the output is compared against a reference implementation.
Throughput – measured in TFLOPS on the target hardware.
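A minimal sketch of such a two-part scoring function, assuming the candidate kernel is wrapped in a Python callable and using PyTorch's `scaled_dot_product_attention` as the reference. The paper's actual harness is not reproduced here; the tolerances and FLOP-counting convention below are assumptions.

```python
import torch
import torch.nn.functional as F

def score_kernel(candidate_attention, seq_len=4096, heads=16, dim=128,
                 causal=True, iters=50):
    """Two-part score: correctness against a reference, then TFLOPS.
    `candidate_attention` is a hypothetical callable wrapping the kernel."""
    q, k, v = (torch.randn(1, heads, seq_len, dim,
                           device="cuda", dtype=torch.bfloat16) for _ in range(3))
    # 1. Numerical correctness against the PyTorch reference.
    ref = F.scaled_dot_product_attention(q, k, v, is_causal=causal)
    out = candidate_attention(q, k, v, causal=causal)
    if not torch.allclose(out, ref, atol=2e-2, rtol=2e-2):  # BF16 tolerance
        return None  # failed candidates score nothing
    # 2. Throughput in TFLOPS, timed with CUDA events.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(5):  # warm-up
        candidate_attention(q, k, v, causal=causal)
    start.record()
    for _ in range(iters):
        candidate_attention(q, k, v, causal=causal)
    end.record()
    torch.cuda.synchronize()
    ms = start.elapsed_time(end) / iters
    flops = 4 * heads * seq_len * seq_len * dim  # QK^T + PV, 2 FLOPs per MAC
    if causal:
        flops //= 2  # common convention: count only unmasked work
    return flops / (ms * 1e-3) / 1e12
```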
The knowledge base contains the CUDA programming guide, PTX ISA documentation, Blackwell architecture specifications, and the source code of FlashAttention‑4. The deployed agent is NVIDIA’s internal general‑purpose programming agent, equipped with standard software‑engineering tools (code editing, shell command execution, filesystem navigation, document retrieval). By attaching the knowledge base and scoring function, the agent becomes a self‑contained kernel‑optimization expert.
Autonomous Optimization Loop
The agent iteratively performs an edit‑evaluate‑diagnose cycle. It examines profiling data from previous candidates, consults hardware documentation, generates code modifications, and runs the scoring function. If a candidate fails correctness or does not improve the best score, the agent diagnoses the issue, revises the code, and repeats until a satisfactory improvement is found. A self‑supervision mechanism monitors the overall evolution trajectory and redirects the search when stagnation or unproductive cycles are detected.
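A control-flow sketch of that cycle, with the agent's tool use abstracted behind hypothetical `agent.edit`, `agent.diagnose`, and `agent.restart` methods. The real agent drives these steps through shell commands and code-editing tools rather than a Python API.

```python
def optimization_loop(agent, score_kernel, seed, max_steps=10_000,
                      stagnation_limit=50):
    """Edit-evaluate-diagnose cycle with a self-supervision restart.
    All `agent.*` methods are hypothetical stand-ins."""
    best, best_score = seed, score_kernel(seed)
    current, stale = best, 0
    for _ in range(max_steps):
        candidate = agent.edit(current)       # read profiles/docs, modify code
        score = score_kernel(candidate)       # correctness gate + TFLOPS
        if score is not None and score > best_score:
            best, best_score = candidate, score
            current, stale = candidate, 0
        else:
            agent.diagnose(candidate, score)  # why did it fail or regress?
            stale += 1
        if stale >= stagnation_limit:         # self-supervision: stagnation
            current = agent.restart(best)     # redirect from the best-so-far
            stale = 0
    return best, best_score
```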
Experimental Setup
Experiments were conducted on an NVIDIA Blackwell B200 GPU using CUDA 13.1 and PyTorch 2.10.0. Baselines included cuDNN 9.19.1 and FlashAttention‑4. Benchmarks measured forward‑prefill throughput for attention heads of dimension 128, BF16 precision, and sequence lengths of 4096, 8192, 16384, and 32768. Each test used 16 heads and evaluated both causal (masked) and non‑causal modes.
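The reported grid can be expressed as a short sweep, reusing the hypothetical `score_kernel` helper sketched earlier (`candidate_attention` again stands in for the kernel under test):

```python
# Sweep implied by the setup: 16 heads, head dim 128, BF16,
# causal and non-causal, four sequence lengths.
# Sanity check on scale: at S=32768, H=16, D=128 a non-causal forward
# costs 4*16*32768^2*128 ≈ 8.8e12 FLOPs, so 1668 TFLOPS ≈ 5.3 ms/pass.
for causal in (True, False):
    for seq_len in (4096, 8192, 16384, 32768):
        tflops = score_kernel(candidate_attention, seq_len=seq_len,
                              heads=16, dim=128, causal=causal)
        print(f"causal={causal} S={seq_len}: {tflops:.0f} TFLOPS")
```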
Performance Results
Causal multi‑head attention (MHA): AVO outperformed cuDNN by 0.4%–3.5% and FlashAttention‑4 by 5.0%–10.5% across all sequence lengths.
Non‑causal MHA: Gains over cuDNN ranged from 1.8% to 2.4% on longer sequences; short sequences showed parity within measurement noise.
Grouped‑query attention (GQA) migration: The agent adapted the optimized kernels to GQA in ~30 minutes without human intervention. In causal GQA, AVO achieved up to 7.0% improvement over cuDNN and 9.3% over FlashAttention‑4; in non‑causal GQA, improvements were 6.0% and 4.5%, respectively.
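For readers unfamiliar with GQA: each K/V head is shared by a group of query heads, so a correct kernel must broadcast K/V across its group. A minimal PyTorch reference of those semantics (an illustration of the benchmark's likely reference path, not the agent's kernel):

```python
import torch
import torch.nn.functional as F

def gqa_reference(q, k, v, causal=True):
    """Grouped-query attention reference: each K/V head serves a group of
    query heads. Shapes: q [B, Hq, S, D], k and v [B, Hkv, S, D]."""
    hq, hkv = q.shape[1], k.shape[1]
    assert hq % hkv == 0
    group = hq // hkv
    # Broadcast each K/V head across its group of query heads.
    k = k.repeat_interleave(group, dim=1)
    v = v.repeat_interleave(group, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=causal)
```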
Evolution Trajectory and Key Optimizations
During the seven‑day run, AVO produced 40 confirmed kernel versions and explored >500 candidate directions. Early versions (v1‑v20) captured large‑scale gains; later versions (v21‑v40) refined performance through fine‑grained scheduling and resource allocation.
Three representative optimizations were identified:
Branch‑less accumulator scaling – the single largest gain (sketched below).
Correction/MMA pipeline overlap.
Register rebalancing across thread‑block groups.
Each required coordinated reasoning across multiple hardware subsystems (synchronization, memory ordering, pipeline scheduling, register allocation), demonstrating expert‑level micro‑architectural insight.
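The paper's kernel-level code is not reproduced here, but the idea behind branch-less accumulator scaling can be illustrated on the online-softmax recurrence that FlashAttention-style kernels use: rather than branching on whether the running row maximum changed, the accumulator is always multiplied by a correction factor exp(m_old − m_new), which equals exactly 1.0 whenever the maximum stood still. A toy PyTorch sketch under those assumptions (single head, no masking):

```python
import torch

def online_softmax_attention(q, k, v, block=128):
    """Online-softmax accumulation over K/V blocks, illustrating branch-less
    rescaling: no divergent `if max_changed:` branch, just an unconditional
    multiply by exp(m_old - m_new). Shapes: q, k, v are [S, D], one head.
    A host-side sketch of the recurrence, not the CUDA kernel itself."""
    s, d = q.shape
    m = torch.full((s, 1), float("-inf"))  # running row maximum
    denom = torch.zeros(s, 1)              # running softmax denominator
    acc = torch.zeros(s, d)                # unnormalized output accumulator
    for j in range(0, k.shape[0], block):
        scores = q @ k[j:j + block].T / d ** 0.5
        m_new = torch.maximum(m, scores.max(dim=-1, keepdim=True).values)
        correction = torch.exp(m - m_new)  # == 1.0 where the max stood still
        p = torch.exp(scores - m_new)
        denom = denom * correction + p.sum(dim=-1, keepdim=True)
        acc = acc * correction + p @ v[j:j + block]  # branch-less rescale
        m = m_new
    return acc / denom
```

On a GPU, the branching version forces warp divergence and extra synchronization; the unconditional multiply costs one FMA per element and keeps the pipeline full, which is plausibly why the paper reports it as the largest single gain.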
Generalization and Outlook
The AVO methodology treats the AI agent itself as the mutation operator, a paradigm that can be extended beyond attention kernels to other performance‑critical software, diverse hardware platforms, and domains that benefit from prolonged autonomous exploration.
Reference: https://arxiv.org/pdf/2603.24517v1