METR analysis of NanoGPT speedrun finds AI agents made four contributions as humans drove 31x speedup over two years

METR, the AI evaluation research organization, has published an analysis of the NanoGPT speedrun leaderboard, treating it as evidence of how AI agents contribute to AI research and development tasks. The analysis classifies each of the 77 records submitted between May 2024 and March 2026 by optimization depth and idea provenance, and examines where AI agents have begun to appear in the record history.

The NanoGPT speedrun

The NanoGPT speedrun is a public challenge with a simple goal: train a language model to a target validation loss on the FineWeb dataset using 8×H100 GPUs as fast as possible. It is a small-scale version of LLM pretraining, with a public record history showing each contributor’s code changes, descriptions, and cited sources.

The post focuses on the small track, which targets a validation loss of 3.28 starting from GPT-2-small at 124M parameters. From May 2024 to March 2026, 36 contributors submitted 77 records, cutting training time from 45 minutes to 1.43 minutes — a 31x speedup. As of April 2026, four records in the official history are attributed to AI agents rather than human contributors.
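
The headline figure can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above. The per-record average is an illustration that assumes each of the 77 records counts as one multiplicative step, which is not a claim made in the post.

```python
# Figures quoted in the post for the small track.
start_minutes = 45.0
end_minutes = 1.43
num_records = 77

# Overall speedup is the ratio of starting to final training time.
speedup = start_minutes / end_minutes

# Assumption (not from the post): treat every record as one multiplicative
# step, so the typical per-record gain is the geometric mean.
per_record = speedup ** (1 / num_records)

print(f"total speedup: {speedup:.1f}x")
print(f"geometric-mean gain per record: {(per_record - 1) * 100:.1f}%")
```

Under that assumption, a long run of single-digit percentage gains is enough to compound into the overall figure.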

Contributions from the speedrun have reached frontier-scale models. The post notes that the Muon optimizer, introduced in record 3, has since been used in GLM-4.5 and Kimi K2.

Classification methodology

With the help of Claude Code, the author reviewed each merged pull request and classified contributions along two dimensions.

Optimization depth — how non-obvious the idea was — ran from Shallow (tuning hyperparameters, upgrading a library) through Moderate (non-trivial adaptation requiring domain expertise) and Deep (novel idea or non-obvious cross-domain application) to Breakthrough (new research contribution adopted widely). The Muon optimizer was classified as Breakthrough. The Bigram Hash Embedding technique, which hashes token pairs into per-layer residual additions, was classified as Deep.
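
To make the Bigram Hash Embedding description concrete, here is a minimal pure-Python sketch of the general idea: hash each (previous token, current token) pair into a bucket, look up a vector for that bucket, and add it to the residual stream once per layer. The hash function, table size, model width, and padding rule are all hypothetical choices for illustration; the post does not describe the record's actual implementation, and a real model would run transformer blocks between the additions.

```python
import random

NUM_BUCKETS = 4096  # hypothetical hash-table size
MODEL_DIM = 8       # tiny width, for illustration only


def bigram_bucket(prev_token: int, token: int) -> int:
    """Map a (previous token, current token) pair to a bucket index."""
    # Illustrative hash only; not the hash used in the speedrun record.
    return (prev_token * 1_000_003 + token) % NUM_BUCKETS


def make_table(seed: int) -> list[list[float]]:
    """Build one randomly initialised embedding table (one per layer)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(MODEL_DIM)]
            for _ in range(NUM_BUCKETS)]


def add_bigram_embedding(residual, tokens, table):
    """Return a new residual with the hashed bigram vector added per position."""
    out = []
    for pos, (vec, tok) in enumerate(zip(residual, tokens)):
        # Assumption: position 0 has no predecessor, so pad with token id 0.
        prev = tokens[pos - 1] if pos > 0 else 0
        emb = table[bigram_bucket(prev, tok)]
        out.append([r + e for r, e in zip(vec, emb)])
    return out


tokens = [17, 4242, 17, 4242]
residual = [[0.0] * MODEL_DIM for _ in tokens]
layer_tables = [make_table(seed) for seed in range(2)]  # two toy "layers"
for table in layer_tables:
    residual = add_bigram_embedding(residual, tokens, table)
```

In this toy input, positions 1 and 3 share the bigram (17, 4242), so they receive identical additions at every layer.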

Provenance — where the idea originated — was categorized as Invented (originated in the speedrun), Adapted (built on existing work with significant modification), or Imported (directly applying a technique from a paper or library). The Muon optimizer was Invented; applying Flash Attention 3 was Imported; applying the U-net skip-connection pattern to transformers was Adapted.
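
The two-dimensional scheme maps naturally onto a small data structure. The sketch below is one possible encoding, not METR's actual dataset format; the class and field names are hypothetical. Labels the article does not state for a given example are left as None rather than guessed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Depth(Enum):
    SHALLOW = 1
    MODERATE = 2
    DEEP = 3
    BREAKTHROUGH = 4


class Provenance(Enum):
    INVENTED = "invented"
    ADAPTED = "adapted"
    IMPORTED = "imported"


@dataclass(frozen=True)
class Contribution:
    name: str
    depth: Optional[Depth] = None            # None = not stated in the article
    provenance: Optional[Provenance] = None  # None = not stated in the article


examples = [
    Contribution("Muon optimizer", Depth.BREAKTHROUGH, Provenance.INVENTED),
    Contribution("Bigram Hash Embedding", depth=Depth.DEEP),
    Contribution("Flash Attention 3", provenance=Provenance.IMPORTED),
    Contribution("U-net skip connections", provenance=Provenance.ADAPTED),
]

# Ordered depth values make "deep or better" a simple comparison.
deep_or_better = [
    c.name for c in examples
    if c.depth is not None and c.depth.value >= Depth.DEEP.value
]
print(deep_or_better)
```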

The full classification dataset is available at metr.org.

Key findings

The post identifies several patterns in the classification data.

Shallow contributions carried significant weight. Shallow and moderate contributions together drove roughly 21x of the 31x total speedup. The post cites record 12, which migrated to the newly released FlexAttention PyTorch API, as an example: the library itself is sophisticated, but the contribution was essentially migrating to it, and it achieved a 30% speedup. Other examples of shallow contributions include upgrading PyTorch to version 2.5.0 (record 7) and lowering logit softcap from 30 to 15 (record 18).
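
For readers unfamiliar with logit softcapping, the sketch below shows why changing the cap is a one-number edit. It assumes the common tanh-based form, cap * tanh(logit / cap); the post does not state which formula record 18 used, so treat the function as illustrative.

```python
import math


def softcap(logit: float, cap: float) -> float:
    """Smoothly squash a logit into the open interval (-cap, cap)."""
    # Assumed tanh-based form; not confirmed as the speedrun's exact formula.
    return cap * math.tanh(logit / cap)


# Compare the two cap values mentioned in the article on a few raw logits.
for raw in (1.0, 10.0, 50.0):
    print(raw, round(softcap(raw, 30.0), 2), round(softcap(raw, 15.0), 2))
```

Small logits pass through nearly unchanged under either cap, while large ones are compressed more aggressively when the cap is 15.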

Deep contributions appeared throughout, not only early. The post notes that deep and breakthrough ideas were not front-loaded in the timeline; they continued to appear as the speedrun matured. The Muon optimizer, a breakthrough, arrived at record 3, but deep ideas such as Paired Head Attention (record 58) appeared substantially later.

Limitations the author flags

The post lists four challenges in interpreting the speedrun as evidence for AI R&D capability.

First, contamination: ideas relevant to the task may already be in models’ training data, either from public codebases or from published ML literature, making it difficult to determine how much the agent independently contributed versus retrieved.

Second, survivorship bias: only successful ideas appear in the record. Failed attempts by both humans and agents are not visible in the public history.

Third, scale-dependence: training GPT-2-small on 8×H100s is not frontier pretraining. Techniques that speed up training at 124M parameters may differ substantially from what matters at 100B+ parameters with different architectures, such as mixture-of-experts.

Fourth, composability: pretraining is one component of the AI R&D stack. How acceleration on pretraining composes with progress on other components — such as post-training — is an open question.

The post describes the speedrun as an example of “cumulative progress on publicly tracked challenges” that is especially useful when the task maps to real AI R&D, when there is a rich record of human contributions providing a cost curve, and when agents can compete under comparable conditions. METR frames the analysis as one input to a broader research program measuring how much AI agents can accelerate AI R&D and how that is changing over time.