Study of 14 AI agent models finds reliability gains modest despite rapid accuracy improvements

Researchers Stephan Rabanser, Sayash Kapoor, and Arvind Narayanan have published a draft paper measuring AI agent reliability with 12 metrics grouped under four dimensions, finding that capability gains have outpaced reliability improvements over an 18-month period spanning models from OpenAI, Google, and Anthropic.

TL;DR: The paper, titled “Towards a Science of AI Agent Reliability,” evaluates 14 models on two benchmarks and finds outcome consistency scores ranging from 30% to 75%, agents that are poor at recognising when they are wrong, and an industry-wide pattern in which all three major providers cluster at similar reliability levels despite diverging accuracy scores.

What it says

The authors say the work is motivated by the absence of standardised tools for measuring agent reliability. They draw on concepts from aviation, nuclear, and automotive safety engineering, which they say independently converged on four high-level dimensions: consistency, robustness, predictability, and safety, with safety defined as the frequency and severity of failures. The paper refines these four dimensions into 12 metrics.

The evaluation covered 14 models spanning 18 months of releases from OpenAI, Google, and Anthropic. Two benchmarks were used: GAIA, described as a general assistant benchmark, and TauBench, described as a customer service simulation. Each task was run five times with paraphrased instructions. The researchers also injected tool and environment faults to measure robustness, and elicited agent confidence scores to measure calibration. In total, 500 benchmark runs were executed.
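To make the repeated-run protocol concrete, the sketch below scores how often a task that an agent solves at least once is also solved on every run. This definition and the name outcome_consistency are assumptions made for illustration; the paper's own consistency metrics may be defined differently, and the sample outcomes are invented.

```python
def outcome_consistency(runs_by_task):
    """Share of tasks solved on every run, among tasks solved at least once.

    runs_by_task maps a task id to a list of booleans, one per run
    (True = the run succeeded). Hypothetical definition, not the paper's.
    """
    solved_once = [runs for runs in runs_by_task.values() if any(runs)]
    if not solved_once:
        return 0.0  # nothing was ever solved, so there is nothing to be consistent about
    solved_always = [runs for runs in solved_once if all(runs)]
    return len(solved_always) / len(solved_once)


# Invented outcomes for four tasks, five runs each.
runs = {
    "task_a": [True, True, True, True, True],       # solved every time
    "task_b": [True, False, True, False, True],     # solved intermittently
    "task_c": [False, False, False, False, False],  # never solved, excluded
    "task_d": [True, True, True, True, False],      # one failure in five
}

score = outcome_consistency(runs)  # 1 of the 3 solvable tasks, about 0.33
print(f"outcome consistency: {score:.2f}")
```

Under this definition, an agent with respectable average accuracy can still score low, because a single failed run on a task removes that task from the numerator.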

Key findings, as reported in the paper:

Consistency: Agents that can solve a task often fail on repeated attempts under identical conditions. Outcome consistency scores ranged from 30% to 75% across models. Paraphrasing instructions with the same semantic meaning caused substantial performance drops.

Predictability: The authors describe this as the weakest dimension. They write that when agents report confidence, “it often carries little signal,” and that on one benchmark most models could not distinguish their correct predictions from incorrect ones better than chance.

Robustness: Most models handled genuine technical failures such as server crashes and API timeouts, but rephrased instructions caused substantial performance drops.

Safety: The authors report that recent models are better at avoiding constraint violations, but write that financial errors such as incorrect charges “remain a common failure mode.” They define safety narrowly as bounded harm when failures occur, distinct from alignment concerns, and say they are still refining their safety metric.

Scaling: Larger models were not uniformly more reliable. The authors write that scaling improves some dimensions, including calibration and robustness, but can hurt consistency, with larger models showing more run-to-run variability.
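The predictability finding, that most models could not separate their correct answers from incorrect ones better than chance, can be illustrated with AUROC, a standard discrimination measure where 0.5 corresponds to chance and 1.0 to perfect separation. This summary does not say which statistic the authors used, so treat the sketch below as one plausible reading; the function name and the confidence values are made up.

```python
def auroc(confidences, correct):
    """Probability that a random correct answer received higher confidence
    than a random incorrect answer, counting ties as half a win."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented example 1: confidence tracks correctness perfectly.
informative = auroc([0.9, 0.8, 0.3, 0.2], [True, True, False, False])

# Invented example 2: the agent reports the same confidence every time,
# so the score carries no signal about which answers are right.
uninformative = auroc([0.9, 0.9, 0.9, 0.9], [True, False, True, False])

print(informative, uninformative)
```

The second case shows why a uniformly confident agent is unhelpful: high stated confidence tells a user nothing if it is equally high on wrong answers.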

The authors acknowledge three sources of potential error. First, the choice of dimensions involves some subjectivity. Second, they allow that very high accuracy, in the 99.9% to 99.999% range they describe as “3-5 nines,” might reduce the practical importance of reliability gaps, but argue that LLM-based agents are not on track to reach that threshold. Third, they note that projecting current linear reliability trends forward would imply reaching 100% in three years, but write that they expect each order-of-magnitude decrease in unreliability to be as difficult as the previous one.
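The “nines” shorthand is easier to follow as arithmetic: the number of nines is the negative base-10 logarithm of the failure rate, so each extra nine means a tenfold cut in failures. The helper names below are hypothetical and the figures are plain arithmetic, not results from the paper.

```python
import math


def nines(success_rate):
    """Number of nines for a success rate, e.g. 0.999 is about 3 nines."""
    if not 0.0 <= success_rate < 1.0:
        raise ValueError("success_rate must be in [0, 1); 100% has no finite nines")
    return -math.log10(1.0 - success_rate)


def failures_per_million(success_rate):
    """Expected failures in one million independent attempts."""
    return (1.0 - success_rate) * 1_000_000


for rate in (0.9, 0.99, 0.999, 0.9999, 0.99999):
    print(f"{rate:.5f}  nines={nines(rate):.1f}  "
          f"failures per million={failures_per_million(rate):,.0f}")
```

On this scale, 100% is not a finite target, since it would require unboundedly many nines. That is consistent with the authors' expectation that a linear extrapolation to 100% overstates progress if each order-of-magnitude step stays as hard as the last.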

The paper is presented through AI Snake Oil, the newsletter by Narayanan and Kapoor. The authors say they plan to launch a public AI agent reliability index to track progress across providers.