METR has published a review of how different statistical assumptions affect its AI time horizon metric — the measure the organisation uses to characterise how long a task an AI agent can reliably complete. The post was prompted by growing sensitivity in the results as the underlying task suite saturates, and by a recent correction to a modelling mistake involving regularisation.
That regularisation fix, the post notes, decreased recent models’ 50% time horizon results by up to 20%, while having a smaller impact on earlier language models’ 50% time horizons.
The underlying model
METR’s time horizon metric rests on a suite of approximately 230 tasks drawn from around 80 task families, each associated with a task length representing the time a human would need to complete it. Each language model attempts each task multiple times — typically eight — and the results are scored as binary pass or fail.
The statistical model estimates, for each model, the task length at which it is predicted to succeed 50% of the time (the 50% time horizon) and a slope parameter used to compute the 80% time horizon. Confidence intervals are constructed using a hierarchical bootstrap across task families, individual tasks, and attempts.
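As a rough illustration of the quantities described above, the sketch below assumes a logistic curve in log2 task length. The parameterisation, the function names, the data layout, and the numbers are assumptions made for illustration and are not taken from METR's code. It shows how an 80% horizon could follow from a 50% horizon and a slope, and how one hierarchical bootstrap replicate over families, tasks, and attempts might be drawn.

```python
import math
import random


def success_probability(task_minutes, h50_minutes, slope):
    """Assumed logistic curve in log2(task length); slope > 0."""
    z = slope * (math.log2(h50_minutes) - math.log2(task_minutes))
    return 1.0 / (1.0 + math.exp(-z))


def horizon(h50_minutes, slope, p):
    """Task length at which the assumed curve predicts success rate p."""
    logit_p = math.log(p / (1.0 - p))
    return h50_minutes * 2.0 ** (-logit_p / slope)


def hierarchical_resample(suite, rng):
    """Draw one bootstrap replicate: families, then tasks, then attempts.

    `suite` maps a family name to a list of (task_minutes, outcomes) pairs,
    where outcomes is a list of 0/1 results, one per attempt.
    Returns a flat list of (task_minutes, resampled_outcomes) pairs.
    """
    families = list(suite.values())
    replicate = []
    for _ in range(len(families)):
        tasks = rng.choice(families)                 # resample a family
        for _ in range(len(tasks)):
            minutes, outcomes = rng.choice(tasks)    # resample a task within it
            resampled = rng.choices(outcomes, k=len(outcomes))  # resample attempts
            replicate.append((minutes, resampled))
    return replicate


# Hypothetical parameters, for illustration only.
h80 = horizon(h50_minutes=120.0, slope=0.6, p=0.8)

# Hypothetical toy suite with eight attempts per task.
toy_suite = {
    "family_a": [(2.0, [1] * 8), (15.0, [1, 1, 1, 0, 1, 1, 0, 1])],
    "family_b": [(240.0, [0, 1, 0, 0, 0, 1, 0, 0])],
}
replicate = hierarchical_resample(toy_suite, random.Random(0))
```

In a full analysis, the model would be refitted on each replicate and the spread of the resulting horizons would give the confidence interval. Resampling families first means that the choice of task families, not just attempt-level noise, contributes to the interval width.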
Sources of uncertainty
Beyond the need for more tasks in the suite, the post identifies five main sources of uncertainty.
Task distribution. With roughly 230 tasks spanning one second to over 20 hours, and none exceeding 30 hours, the mix of task types at any given length has a substantial influence on results. According to the post, this is already reflected in current confidence intervals, which often span a factor of two in each direction.
Task-length to success-rate modelling. The logistic model is sensitive to how well models perform on very easy tasks, tending to fit shallow slopes that depress 80% time horizons and inflate 50% time horizons. The post states that reasonable alternative fits for recent models decrease 50% time horizons by up to 35% and increase 80% time horizons by up to 100%.
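To see why the slope matters so much for the 80% figure, consider a logistic curve in log task length. This parameterisation and the symbols $h_{50}$, $h_{80}$, and $\beta$ are assumptions introduced here for illustration, not notation from the post.

```latex
\begin{align*}
P(\text{success} \mid t) &= \sigma\bigl(\beta\,[\log_2 h_{50} - \log_2 t]\bigr),
  \qquad \sigma(z) = \frac{1}{1 + e^{-z}}, \\
\log_2 h_{80} &= \log_2 h_{50} - \frac{\operatorname{logit}(0.8)}{\beta}
  = \log_2 h_{50} - \frac{\ln 4}{\beta}, \\
\frac{h_{80}}{h_{50}} &= 2^{-\ln 4 / \beta}.
\end{align*}
```

Under this assumption the ratio depends only on the slope: $\beta = 1$ gives $h_{80} \approx 0.38\,h_{50}$, while $\beta = 0.5$ gives $h_{80} \approx 0.15\,h_{50}$. A fit whose slope is flattened by performance on very easy tasks therefore places the 80% horizon further below the 50% horizon.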
Public versus private tasks. Approximately 15% of the task suite is publicly available. Models typically perform similarly whether or not public tasks are included, with one exception: Claude Opus 4.6’s 50% time horizon falls by 40% when public tasks are excluded. The post attributes this to high scores on RE-Bench, scores that it describes as appearing legitimate.
Noise in task length estimates. The post describes this as the factor the author feels least certain about. Human-time-to-complete values are themselves estimates, and noise in them is expected to bias 80% time horizons down and 50% time horizons up for the most capable models. One approach described in the post suggests that correcting for this could reduce frontier language models’ 50% time horizons by roughly 30%, but the post stresses that the uncertainty is very high: the author cannot rule out a reduction anywhere from 0% to 60%.
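One mechanism by which noisy length estimates can distort a fit is the flattening of the fitted slope. The toy simulation below is a sketch under stated assumptions: the true curve, the uniform spread of task lengths, the noise level, and the hand-written Newton fit are all hypothetical choices, and it does not reproduce the correction method or the figures in the post. How a flatter slope then shifts the 50% and 80% horizons depends on the task distribution and on where a model sits within it.

```python
import math
import random


def fit_logistic(xs, ys, iterations=25):
    """Fit P(y=1) = sigmoid(a + b*x) by Newton-Raphson; returns (a, b)."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            g_a += y - p              # gradient of the log-likelihood
            g_b += (y - p) * x
            h_aa += w                 # entries of the negative Hessian
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b


rng = random.Random(0)
TRUE_LOG2_H50 = 6.0   # hypothetical: 50% horizon of 64 minutes
TRUE_SLOPE = 1.0      # hypothetical slope per doubling of task length
NOISE_SD = 1.5        # hypothetical noise in log2(task length) estimates

true_x, noisy_x, outcomes = [], [], []
for _ in range(4000):
    x = rng.uniform(0.0, 10.0)  # true log2(minutes), 1 minute to about 17 hours
    p = 1.0 / (1.0 + math.exp(-TRUE_SLOPE * (TRUE_LOG2_H50 - x)))
    true_x.append(x)
    noisy_x.append(x + rng.gauss(0.0, NOISE_SD))  # what the analyst observes
    outcomes.append(1 if rng.random() < p else 0)

a_clean, b_clean = fit_logistic(true_x, outcomes)
a_noisy, b_noisy = fit_logistic(noisy_x, outcomes)

# Report slopes with the sign flipped so that larger means steeper.
clean_slope, noisy_slope = -b_clean, -b_noisy
clean_h50 = 2.0 ** (-a_clean / b_clean)
noisy_h50 = 2.0 ** (-a_noisy / b_noisy)
print(f"slope with true lengths:  {clean_slope:.2f}, h50 = {clean_h50:.0f} min")
print(f"slope with noisy lengths: {noisy_slope:.2f}, h50 = {noisy_h50:.0f} min")
```

In this setup the slope fitted on noisy lengths comes out shallower than the slope fitted on the true lengths, which is the standard attenuation effect of measurement error in a predictor.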
More complex modelling. The post mentions several more complex approaches — including allowing task difficulty to differ from baseliner time and using Bayesian models — but describes these as not yet explored thoroughly enough to present.
Overall assessment
The post’s overall assessment is that reasonable alternative modelling choices generally leave results within the current (wide) confidence intervals. The author states that the most important source of uncertainty is the task distribution rather than analysis choices: the selection of tasks has a very large impact on the results, and this is what drives the wide CIs already published.
The post does not address questions about the meaning and applicability of the time horizon metric itself, which METR has covered in a separate note on time horizon limitations. Background on the metric is also available in a dedicated METR page, a March 2025 blog post, and a paper on arXiv at arxiv.org/abs/2503.14499.