How CUDA Agent Lets Anyone Write High‑Performance CUDA Kernels, Challenging Nvidia’s AI Moat
CUDA Agent, a large‑scale reinforcement‑learning system from ByteDance and Tsinghua, automatically generates and optimizes CUDA kernels that outperform torch.compile by up to 2× on simple kernels and run roughly 40% faster than the strongest proprietary models on the hardest benchmarks. This article covers its data‑synthesis pipeline, training workflow, and current limitations.
Recent research from ByteDance Seed and Tsinghua AI introduced CUDA Agent, an AI system that generates fast, optimized CUDA kernels rather than merely correct ones. The model is trained to maximize actual GPU speed using real performance data as rewards, shifting focus from compilation success to hardware‑level efficiency.
Performance comparisons show that on simple and medium kernels CUDA Agent runs up to 2× faster than torch.compile, and on complex kernels it is about 92% faster. Even against strong proprietary models such as Claude Opus 4.5 and Gemini 3 Pro, it achieves roughly a 40% speed advantage.
CUDA Agent’s architecture consists of three core components: a scalable data‑synthesis pipeline, an enhanced CUDA development environment with skill‑based verification and performance analysis, and a reinforcement‑learning algorithm that supports long‑context training.
Data Synthesis
The team builds training tasks through a three‑stage pipeline: seed problem crawling, compositional synthesis using LLMs, and result‑based filtering.
Seed operators are extracted from torch and transformers. Each operator is represented as a Python class with __init__ and forward methods.
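In that representation, a seed operator might look like the following sketch; the class name, shapes, and computation here are illustrative stand-ins, not taken from the released dataset:

```python
import torch

# Hypothetical seed operator in the __init__/forward layout described above.
class GeluAddOp:
    def __init__(self, bias_shape=(64,)):
        # Parameters live in __init__, mirroring an nn.Module-style layout.
        self.bias = torch.zeros(bias_shape)

    def forward(self, x):
        # The computation a generated CUDA kernel would later have to fuse.
        return torch.nn.functional.gelu(x) + self.bias

op = GeluAddOp()
out = op.forward(torch.randn(8, 64))
```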
During compositional synthesis, up to five torch operators are sampled and combined into fused tasks.
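One way to picture this sampling step is the sketch below, which uses trivial pure‑Python "operators" as stand-ins for real torch modules; every name here is illustrative, and only the "sample up to five and chain them" structure comes from the article:

```python
import random

# Stand-in elementwise operators (real ones would come from torch/transformers).
def relu(x):   return [max(v, 0.0) for v in x]
def double(x): return [2.0 * v for v in x]
def neg(x):    return [-v for v in x]

SEED_OPS = [relu, double, neg]

def synthesize_task(max_ops=5, rng=random.Random(0)):
    # Sample between 2 and max_ops operators and chain their forward passes
    # into one fused task, as in the compositional-synthesis stage.
    k = rng.randint(2, max_ops)
    chain = [rng.choice(SEED_OPS) for _ in range(k)]

    def fused(x):
        for op in chain:
            x = op(x)
        return x

    return fused

task = synthesize_task()
result = task([1.0, -2.0, 3.0])
```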
The filtering stage retains tasks that run correctly in both eager and compile modes and removes those with randomness or constant outputs.
To avoid “cheating,” tasks that produce identical outputs across different inputs are discarded.
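The constant-output check can be sketched as probing a task with several distinct random inputs and discarding it if every probe returns the same answer (since a kernel could then hard-code the result); the function name and tolerances below are illustrative:

```python
import random

def is_constant_output(task, n_probes=4, size=8, tol=1e-6, seed=0):
    # Run the task on several distinct random inputs.
    rng = random.Random(seed)
    outputs = []
    for _ in range(n_probes):
        x = [rng.uniform(-1.0, 1.0) for _ in range(size)]
        outputs.append(task(x))
    # A task is "constant" (and discarded) if all probes agree elementwise.
    first = outputs[0]
    return all(
        all(abs(a - b) < tol for a, b in zip(first, out))
        for out in outputs[1:]
    )

assert is_constant_output(lambda x: [0.0] * len(x))          # discarded
assert not is_constant_output(lambda x: [2.0 * v for v in x])  # retained
```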
Workload control limits eager‑mode runtime to between 1 ms and 100 ms, and samples too similar to KernelBench are excluded to avoid test‑set contamination.
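A minimal timing gate for this workload band might look like the following; the 1 ms–100 ms bounds come from the article, while the warm-up/iteration harness is an assumption:

```python
import time

def eager_runtime_ms(task, x, warmup=2, iters=5):
    # Warm up, then average wall-clock time over a few iterations.
    for _ in range(warmup):
        task(x)
    start = time.perf_counter()
    for _ in range(iters):
        task(x)
    return (time.perf_counter() - start) / iters * 1e3

def within_workload_band(task, x, lo_ms=1.0, hi_ms=100.0):
    # Keep only tasks whose eager runtime lands inside [1 ms, 100 ms].
    t = eager_runtime_ms(task, x)
    return lo_ms <= t <= hi_ms
```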
The final dataset, CUDA‑Agent‑Ops‑6K, contains 6,000 high‑quality training samples designed for scalable reinforcement learning.
Agent Environment
The agent follows a ReAct‑style workflow, integrating code tools and a CUDA skill specification (SKILL.md), and supports an iterative code‑compile‑debug loop driven by feedback from a performance analyzer.
Standard workflow: profile the native PyTorch implementation, write the CUDA kernel and binding code, compile in a GPU sandbox, and iteratively optimize.
Goal: pass correctness checks and achieve at least a 5% speedup over torch.compile.
Reward design uses milestone‑based discrete rewards based on correctness and performance gains.
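A milestone‑style discrete reward of this shape could be sketched as follows; the article states only that rewards are milestone-based on correctness and performance against torch.compile (with a ≥5% speedup target), so the specific thresholds and values below are assumptions:

```python
def kernel_reward(compiled, correct, speedup):
    # Discrete milestones: compilation -> correctness -> speedup tiers.
    # All numeric reward values here are illustrative, not from the paper.
    if not compiled:
        return 0.0
    if not correct:
        return 0.1   # compiles but produces wrong output
    if speedup < 1.05:
        return 0.5   # correct but misses the 5% speedup target
    if speedup < 2.0:
        return 1.0   # hits the speedup milestone
    return 1.5       # bonus tier for large speedups

assert kernel_reward(True, True, 1.10) == 1.0
assert kernel_reward(True, False, 3.0) == 0.1
```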
Cheat‑prevention measures include protected verification scripts, disallowed rollback calls, multiple input checks, synchronized warm‑up before profiling, and disabling network retrieval.
Training Process
Training uses a multi‑stage design to stabilize the long‑sequence RL task of CUDA code generation.
Single‑round PPO warm‑up improves basic CUDA generation capability.
Actor initialization uses rejection fine‑tuning (RFT) on forward trajectories, filtering out inefficient loops and invalid tool calls.
Pre‑training the critic's value function provides reliable advantage estimates early in training.
The system remains stable with contexts up to 128K tokens, up to 150 training rounds, and up to 200 evaluation rounds, enabling continuous reward growth.
Core Experimental Results
On the KernelBench benchmark, CUDA Agent achieves a 96.8% acceleration rate over torch.compile and a geometric‑mean speedup of 2.11× across all levels. On Level‑3 (hardest) tasks it reaches a 90% acceleration rate, about 40 percentage points higher than the strongest proprietary baseline, and on Level‑2 it attains a 100% acceleration rate with a 2.80× geometric mean.
The study notes two main limitations: it does not compare against more complex compiler frameworks such as TVM, and the training pipeline requires substantial GPU resources and process‑level isolation, leading to high computational and engineering costs.
Overall, CUDA Agent demonstrates that large language models can acquire “hardware intuition” through reinforcement learning driven by real performance feedback, suggesting a path toward fully automated, highly optimized compute infrastructure.
Machine Learning Algorithms & Natural Language Processing