Stanford University’s Center for Research on Foundation Models (CRFM) has integrated Item Response Theory-based adaptive testing into the HELM evaluation framework, with the goal of reducing the cost of large language model benchmarking while preserving the reliability of results. The work is described in a post on the CRFM site and an accompanying paper.

The problem: evaluating LLMs is expensive

According to the post, modern benchmarks can involve hundreds of thousands of questions, and evaluating a single model against them can take hours, days, or even weeks of compute. Grading the answers can add hundreds of human annotator hours or, when LM-based judges are used, thousands of dollars. The team frames adaptive testing as a way to reduce this cost without sacrificing the accuracy of the resulting ability estimates.

The Rasch model

The approach centers on the Rasch model from Item Response Theory (IRT), a psychometric method originally developed for educational testing. The Rasch model represents each question by a difficulty parameter and each model by an ability parameter. The difference between a model’s ability and a question’s difficulty predicts the likelihood that the model answers correctly.
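As a minimal sketch of the relationship described above: in its standard form, the Rasch model passes the difference between ability and difficulty through a logistic function to obtain the probability of a correct answer. The function name and the numeric values below are illustrative assumptions, not taken from the post or from HELM.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Probability of a correct answer under the standard Rasch (1PL) model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical values on the logit scale, chosen only for illustration.
p_equal = rasch_probability(0.0, 0.0)   # ability equals difficulty
p_easy = rasch_probability(1.0, -1.0)   # question well below the model's ability
p_hard = rasch_probability(-1.0, 1.0)   # question well above the model's ability

print(p_equal, p_easy, p_hard)
```

When ability equals difficulty the predicted probability is exactly one half; it rises toward 1 as ability exceeds difficulty and falls toward 0 in the opposite case.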

The implementation has two phases. In the calibration phase, the team analyzed historical evaluation data to estimate an ability parameter for each model and a difficulty parameter for each question in a dataset. In the adaptive testing phase, these pre-calibrated difficulty parameters drive question selection: rather than giving every model the same static set of questions, the system uses Fisher information criteria to dynamically select the questions that are most informative for estimating that model’s ability.
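The selection step can be sketched as follows. Under the Rasch model, the Fisher information a question carries about ability is p(1 - p), where p is the predicted probability of a correct answer, so the most informative question is the one whose difficulty is closest to the current ability estimate. The post says only that selection is guided by Fisher information criteria; picking the single maximum-information question, along with the function names and difficulty values here, is an assumption made for illustration and not HELM's actual code.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct answer under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def fisher_information(ability, difficulty):
    """Fisher information of one Rasch item about ability: p * (1 - p)."""
    p = rasch_probability(ability, difficulty)
    return p * (1.0 - p)


def select_next_question(ability_estimate, difficulties, asked):
    """Return the index of the unasked question with maximum information."""
    candidates = [i for i in range(len(difficulties)) if i not in asked]
    return max(
        candidates,
        key=lambda i: fisher_information(ability_estimate, difficulties[i]),
    )


# Hypothetical pre-calibrated difficulties (logit scale).
difficulties = [-2.0, -0.5, 0.3, 1.8]
first_pick = select_next_question(0.4, difficulties, asked=set())
second_pick = select_next_question(0.4, difficulties, asked={first_pick})
print(first_pick, second_pick)
```

With an ability estimate of 0.4, the question with difficulty 0.3 is chosen first, followed by the one with difficulty -0.5, since those are the nearest difficulties.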

The post reports that, as a ranking measure, Rasch-estimated ability is effectively identical to conventional average-score metrics. On the Civil Comments dataset, the Pearson correlation between Rasch-estimated ability and average score was 0.99, and both metrics produced identical model rankings. The team presents this as confirmation that the Rasch model captures the same underlying signal as average scores, meaning it can serve as a proxy without distorting comparisons.

Calibration accuracy

The team validated the Rasch model’s calibration accuracy across 22 datasets drawn from five HELM leaderboards. Together, these datasets covered 183 LLMs and over 78,000 questions, encompassing both capability and safety measurements. Performance was assessed using out-of-sample prediction: how well the calibrated model predicts which LLM will correctly answer which question on a held-out test set.

On average, the Rasch model achieved an AUC-ROC of 0.85 on the training set and 0.83 on the test set, which the post describes as indicating that the model “reliably reflects the LLMs’ performance across a wide range of questions.”
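For readers unfamiliar with the metric, AUC-ROC can be read as the probability that a randomly chosen correct response was assigned a higher predicted probability than a randomly chosen incorrect one, with ties counted as half. The sketch below computes it by direct pairwise comparison. The labels and scores are toy values invented for illustration, not HELM results, and the function name is hypothetical.

```python
def auc_roc(labels, scores):
    """AUC-ROC via pairwise comparison of positive and negative scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC-ROC needs at least one positive and one negative")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy held-out data: 1 = the LLM answered correctly, 0 = it did not.
labels = [1, 1, 0, 1, 0, 0]
# Toy predicted probabilities, standing in for Rasch model predictions.
scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.2]

auc = auc_roc(labels, scores)
print(auc)
```

A value of 0.5 corresponds to chance-level prediction and 1.0 to perfect separation, which gives some context for the reported 0.85 and 0.83.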

Adaptive testing efficiency

To demonstrate the efficiency benefit of adaptive testing, the team ran an experiment on Llama 3.1 8B — a held-out model not used during calibration — using Civil Comments as the evaluation dataset. They compared random question selection against adaptive testing guided by Fisher information criteria, treating the ability estimate derived from all questions as the ground truth and measuring mean squared error (MSE) of the estimated ability over the first 200 questions selected.
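The comparison can be imitated on synthetic data. The sketch below is not the team's experiment: it uses a simulated question bank, a simulated model whose true ability is known (instead of the full-data estimate used as ground truth in the post), 30 questions instead of 200, and a simple Newton-based estimator with a normal prior to keep early estimates finite. All names, parameter values, and the estimator itself are assumptions.

```python
import math
import random


def rasch_probability(ability, difficulty):
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def fisher_information(ability, difficulty):
    p = rasch_probability(ability, difficulty)
    return p * (1.0 - p)


def estimate_ability(difficulties, responses, prior_sd=2.0, iterations=15):
    """MAP ability estimate by damped Newton steps (normal prior is an assumption)."""
    theta = 0.0
    for _ in range(iterations):
        probs = [rasch_probability(theta, b) for b in difficulties]
        gradient = sum(x - p for x, p in zip(responses, probs)) - theta / prior_sd**2
        curvature = sum(p * (1.0 - p) for p in probs) + 1.0 / prior_sd**2
        step = max(-1.0, min(1.0, gradient / curvature))  # cap step for stability
        theta = max(-6.0, min(6.0, theta + step))
    return theta


def run_trial(bank, true_ability, n_questions, adaptive, rng):
    """Administer n_questions to a simulated model and return the final estimate."""
    remaining = list(range(len(bank)))
    asked, responses = [], []
    theta = 0.0
    for _ in range(n_questions):
        if adaptive:
            pick = max(remaining, key=lambda i: fisher_information(theta, bank[i]))
        else:
            pick = rng.choice(remaining)
        remaining.remove(pick)
        # Simulate the response from the Rasch model with the true ability.
        correct = 1 if rng.random() < rasch_probability(true_ability, bank[pick]) else 0
        asked.append(bank[pick])
        responses.append(correct)
        if adaptive:
            theta = estimate_ability(asked, responses)
    return estimate_ability(asked, responses)


def mean_squared_error(bank, true_ability, n_questions, adaptive, trials, seed):
    rng = random.Random(seed)
    errors = [
        (run_trial(bank, true_ability, n_questions, adaptive, rng) - true_ability) ** 2
        for _ in range(trials)
    ]
    return sum(errors) / trials


# Synthetic bank: 201 difficulties evenly spaced from -5 to 5 (logit scale).
bank = [-5.0 + 0.05 * i for i in range(201)]
mse_random = mean_squared_error(bank, 1.0, 30, adaptive=False, trials=150, seed=0)
mse_adaptive = mean_squared_error(bank, 1.0, 30, adaptive=True, trials=150, seed=0)
print(f"random: {mse_random:.3f}  adaptive: {mse_adaptive:.3f}")
```

In this toy setup the adaptive strategy is expected to reach a lower MSE for the same question budget, because it concentrates on questions near the model's ability, where each response is most informative. The size of the gap depends on how widely the bank's difficulties are spread and says nothing about the magnitudes reported in the paper.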

According to the post, the adaptive testing method “significantly outperforms the random approach, demonstrating a more efficient and precise evaluation process.” Specific MSE values are shown in the paper’s figures but are not quoted verbatim in the post.

The difficulty parameters derived from calibration have been uploaded to Hugging Face at stair-lab/reeval-difficulty-for-helm, currently covering 22 distinct HELM datasets. The adaptive testing code has been integrated into the HELM framework, and code for question difficulty estimation is linked from the post.

Scope and application

The post describes the approach as applicable to evaluations where a set of reference models has already been evaluated on a benchmark, providing the calibration data needed to estimate question difficulty. For new benchmarks without prior model evaluations, calibration would need to be performed before adaptive testing can be applied.

The team describes the system as setting “the stage for streamlined and scalable evaluation across diverse testing scenarios,” and notes that as evaluations grow more costly — particularly with expensive LM judges — Rasch-model adaptive testing offers a way to maintain reliability at lower per-model cost.