Stanford CRFM: AI-generated CUDA kernels match or beat PyTorch baselines in early KernelBench results

Stanford University’s Center for Research on Foundation Models (CRFM) has published early results showing that AI-generated CUDA kernels, written without libraries such as CUTLASS or Triton, came close to and in some cases outperformed the expert-optimized production kernels shipped in PyTorch. The work was posted earlier than the team intended: a post on the CRFM site explains that the results were good enough to share before the full research effort was complete.

All results in the post are benchmarked on an Nvidia L40S GPU, and performance is defined as reference time divided by generated kernel time, expressed as a percentage. A value above 100% means the generated kernel is faster than the PyTorch reference.
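As a minimal sketch of that metric, the helper below computes reference time divided by generated kernel time as a percentage. The function name and the example timings are illustrative assumptions, not taken from the post.

```python
def relative_performance(reference_ms: float, generated_ms: float) -> float:
    """Return reference time / generated kernel time, as a percentage."""
    if reference_ms <= 0 or generated_ms <= 0:
        raise ValueError("timings must be positive")
    return 100.0 * reference_ms / generated_ms


# Illustrative timings only (not measurements from the post).
print(f"{relative_performance(2.0, 1.0):.1f}%")  # generated kernel takes half the time
print(f"{relative_performance(2.0, 4.0):.1f}%")  # generated kernel takes twice the time
```

A generated kernel that takes half the reference time scores 200%, and one that takes twice as long scores 50%.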

What the team was trying to do

The stated goal was to generate synthetic data for training better kernel generation models. The team used the KernelBench benchmark setup, released in December 2024, as their task framing: given PyTorch code, an LLM writes custom CUDA kernels to replace the operators, aiming for a speedup. Consistent with the original KernelBench design, reference code is in default FP32, and solutions using lower precision are valid provided their outputs stay within a tolerance threshold of 1e-02. Each problem specifies fixed tensor sizes, so the benchmark measures speed at that specific size rather than rewarding a kernel that performs well across sizes.

Correctness is verified by comparing the outputs of the generated code against the PyTorch reference across many random inputs, within the stated tolerance.
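The check described here can be sketched in plain Python. This is not the KernelBench harness: the function names, the toy operators, and the choice to apply 1e-02 as both an absolute and a relative tolerance are assumptions made for illustration.

```python
import math
import random
from typing import Callable, List, Sequence

Vector = List[float]


def outputs_match(expected: Sequence[float], actual: Sequence[float],
                  tol: float = 1e-2) -> bool:
    """True if both outputs have equal length and agree elementwise within tol."""
    return len(expected) == len(actual) and all(
        math.isclose(e, a, rel_tol=tol, abs_tol=tol)
        for e, a in zip(expected, actual)
    )


def check_correctness(reference: Callable[[Vector], Vector],
                      candidate: Callable[[Vector], Vector],
                      size: int, trials: int = 5, seed: int = 0) -> bool:
    """Compare candidate against reference on several random inputs of a fixed size."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.uniform(-1.0, 1.0) for _ in range(size)]
        if not outputs_match(reference(x), candidate(x)):
            return False
    return True


# Toy stand-ins for a PyTorch operator and two generated kernels (hypothetical).
def reference_relu(x: Vector) -> Vector:
    return [max(v, 0.0) for v in x]


def lossy_relu(x: Vector) -> Vector:
    # Mimics a lower-precision kernel: small rounding error, still within tolerance.
    return [round(max(v, 0.0), 3) for v in x]


def wrong_relu(x: Vector) -> Vector:
    # Incorrect for negative inputs, so the check should reject it.
    return [abs(v) for v in x]


print(check_correctness(reference_relu, lossy_relu, size=64))
print(check_correctness(reference_relu, wrong_relu, size=64))
```

The fixed `size` argument mirrors the benchmark's fixed tensor shapes: a candidate only has to be correct and fast at the size the problem specifies.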

Limitations of the standard approach

The most common test-time scaling strategy for kernel optimization, the post notes, is sequential revision: a multi-turn loop where a model edits a kernel, checks correctness and performance, then tries again. The team identifies the main limitation as a lack of optimization idea diversity. Sequential loops, according to the post, often fall into local minima, revisiting the same classes of transformations and making inefficient use of test-time compute.
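A sequential revision loop of this kind might look like the sketch below. The `revise` and `evaluate` callables are hypothetical placeholders for a model call and a compile-and-benchmark step; neither the names nor the feedback format come from the post.

```python
from typing import Callable, Tuple

# evaluate(kernel_source) -> (is_correct, runtime_ms); hypothetical interface.
Evaluate = Callable[[str], Tuple[bool, float]]
# revise(kernel_source, feedback) -> edited kernel source; hypothetical interface.
Revise = Callable[[str, str], str]


def sequential_revision(initial: str, revise: Revise, evaluate: Evaluate,
                        turns: int) -> Tuple[str, float]:
    """Edit one kernel turn by turn, keeping the fastest correct version seen."""
    current = initial
    ok, ms = evaluate(current)
    best_src, best_ms = initial, (ms if ok else float("inf"))
    for _ in range(turns):
        feedback = f"correct={ok}, runtime_ms={ms:.3f}"
        current = revise(current, feedback)   # each turn builds on the previous edit
        ok, ms = evaluate(current)
        if ok and ms < best_ms:
            best_src, best_ms = current, ms
    return best_src, best_ms


# Toy stand-ins so the loop can run without a model or a GPU.
def toy_revise(src: str, feedback: str) -> str:
    return src + "+opt"


def toy_evaluate(src: str) -> Tuple[bool, float]:
    return True, 10.0 / (1 + src.count("+opt"))


print(sequential_revision("kernel", toy_revise, toy_evaluate, turns=3))
```

Because every turn starts from the previous turn's output, the loop follows a single chain of edits, which is the structure the post associates with limited idea diversity.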

The team’s changes

To address this, the team introduced two changes, which they describe as turning the loop from “chat with a compiler” into a structured exploratory search guided by explicit optimization hypotheses and aggressively parallel evaluation. The excerpt available does not list the two changes individually.
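One possible reading of "explicit optimization hypotheses" plus "aggressively parallel evaluation" is sketched below. It is an assumption about the general shape of such a search, not the team's implementation: `propose`, `implement`, `evaluate`, and the `keep` and `workers` parameters are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

Evaluate = Callable[[str], Tuple[bool, float]]   # kernel -> (is_correct, runtime_ms)
Propose = Callable[[str], List[str]]             # kernel -> optimization hypotheses
Implement = Callable[[str, str], str]            # (kernel, hypothesis) -> new kernel


def branching_search(seed: str, propose: Propose, implement: Implement,
                     evaluate: Evaluate, rounds: int, keep: int = 2,
                     workers: int = 8) -> Tuple[str, float]:
    """Each round: branch on several hypotheses, evaluate in parallel, keep the best."""
    ok, ms = evaluate(seed)
    frontier = [(ms if ok else float("inf"), seed)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(rounds):
            # Every surviving kernel spawns one candidate per hypothesis.
            candidates = [implement(src, idea)
                          for _, src in frontier
                          for idea in propose(src)]
            results = list(pool.map(evaluate, candidates))
            scored = [(t, src) for src, (good, t) in zip(candidates, results) if good]
            frontier = sorted(frontier + scored)[:keep]
    best_ms, best_src = frontier[0]
    return best_src, best_ms


# Toy stand-ins so the search can run without a model or a GPU.
def toy_propose(src: str) -> List[str]:
    return ["tile", "vectorize"]


def toy_implement(src: str, idea: str) -> str:
    return f"{src}+{idea}"


def toy_evaluate(src: str) -> Tuple[bool, float]:
    return True, 8.0 / (1 + 2 * src.count("tile") + src.count("vectorize"))


print(branching_search("kernel", toy_propose, toy_implement, toy_evaluate, rounds=2))
```

Keeping more than one survivor per round is what lets different lines of optimization develop side by side instead of collapsing into a single chain of edits.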

The optimization trajectory for Conv2D

The post provides a detailed round-by-round trajectory for Conv2D, where the PyTorch reference runs in 1.41 ms. Progress was not monotonic: several rounds ran slower than the one before, and later rounds recovered the lost ground:

Round 0: 7.02 ms (20.1% of reference) — basic CUDA kernel
Round 1: 7.54 ms (18.8%) — load invariant tensors with __ldg
Round 2: 3.46 ms (41.0%) — convert convolution to FP16 Tensor-Core GEMM
Round 3: 3.67 ms (38.7%) — double-buffer cp.async pipeline overlapping global-memory loads with Tensor-Core compute
Round 4: 3.46 ms (41.0%) — implicit matmul using a previously generated GEMM kernel (seeded manually)
Round 5: 1.91 ms (74.9%) — precompute and reuse k_idx-decomposed kernel/input indices in shared memory
Round 6: 1.37 ms (103.6%) — cache N-dimension GEMM indices in shared memory
Round 7: 1.38 ms (102.9%) — per-warp shared memory buffers to eliminate warp serialization
Round 8: 1.37 ms (103.6%) — cache base input coordinates in shared memory
Round 9: 1.36 ms (105.1%) — software-pipeline B-fragment loading to overlap next tile’s reads with current WMMA computations
Round 10: 1.07 ms (133.6%) — reuse N-dimension GEMM decomposition for output address calculation, removing division and modulo operations
Round 11: 1.21 ms (117.4%) — remove hi/lo decomposition in half WMMA operations
Round 12: 1.01 ms (141.2%) — overlap K-loop global memory loads with MMA computation using double buffering
Round 13: 0.795 ms (179.9%) — vectorized shared memory writes using half2

The final Conv2D kernel, reached in round 13, ran at 179.9% of the PyTorch reference, roughly 1.8 times as fast as the baseline. The post comments that the final code uses “advanced CUDA techniques that we find challenging to write ourselves.”
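The shape of that trajectory can be checked with a few lines of Python. The timings are copied from the round-by-round list above; the helper names are illustrative.

```python
from typing import List, Optional, Tuple

# (round, runtime in ms), copied from the Conv2D trajectory in the post.
TRAJECTORY: List[Tuple[int, float]] = [
    (0, 7.02), (1, 7.54), (2, 3.46), (3, 3.67), (4, 3.46), (5, 1.91), (6, 1.37),
    (7, 1.38), (8, 1.37), (9, 1.36), (10, 1.07), (11, 1.21), (12, 1.01), (13, 0.795),
]
REFERENCE_MS = 1.41  # PyTorch reference runtime reported in the post


def regressions(trajectory: List[Tuple[int, float]]) -> List[int]:
    """Rounds whose runtime is slower than the immediately preceding round."""
    return [r for (_, prev), (r, cur) in zip(trajectory, trajectory[1:]) if cur > prev]


def first_round_beating(trajectory: List[Tuple[int, float]],
                        reference_ms: float) -> Optional[int]:
    """First round whose kernel runs faster than the reference, or None."""
    return next((r for r, ms in trajectory if ms < reference_ms), None)


print(regressions(TRAJECTORY))                        # [1, 3, 7, 11]
print(first_round_beating(TRAJECTORY, REFERENCE_MS))  # 6
```

Four of the thirteen revisions were slower than their predecessor, and the generated kernel first overtook the PyTorch reference in round 6.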

Models and scale

The team ran 10 problems from KernelBench level 1 using OpenAI o3 and Gemini 2.5 Pro, for 5 rounds per problem. The post reports that most of the best-performing kernels emerged in later rounds, with the majority first found in round 4 or 5.

The team also observed that high-performing kernels clustered into recurring optimization categories, which the post notes aligns with their experience writing kernels by hand.

What it means, according to the authors

The post draws a comparison to other recent work, including AlphaEvolve and Gemini 2.5 Pro Deep Think, suggesting that “clever search and branching strategies can unlock scientific innovation and tackle complex problems” without necessarily requiring additional model training. At the same time, the team notes that the approach also produces synthetic training data that could improve future models — describing it as “both a powerful test-time scaling method and a step toward smarter, more data-efficient model development.”

The post stops short of claiming the method generalizes beyond the specific benchmark sizes tested and explicitly notes that the results were shared earlier than planned because of how promising they appeared. Additional kernels are available in a linked GitHub repository.