Stanford CRFM releases HELM Long Context leaderboard, with GPT-4.1 leading at mean score 0.588

Stanford University’s Center for Research on Foundation Models (CRFM) has released the HELM Long Context leaderboard, an evaluation of how well recent large language models handle inputs stretching into the hundreds of thousands of tokens. The leaderboard was produced through a research collaboration with LVMH and funded by the HAI Industrial Affiliate Program.

The central motivation, according to the CRFM post, is that supporting long inputs does not automatically mean a model can use them well. Earlier long-context models struggled with relatively simple tasks such as needle-in-a-haystack retrieval, and pre-existing vendor-reported results lacked comparability — different versions of the same benchmarks were used across different models, and one set of results relied on an internal benchmark not available to external researchers.

Task selection

The leaderboard evaluates models on five tasks drawn from three existing benchmarks, all filtered to input instances with at most 128K tokens.

From RULER (Hsieh et al., 2024), the team selected two question-answering tasks — RULER SQuAD and RULER HotPotQA — where short passages needed for an answer are combined with distractor documents sampled from the same dataset. The post describes this as analogous to retrieval-augmented generation, where not all retrieved documents are relevant. RULER is configurable; for these tasks, the team set sequence length to 128K tokens using a whitespace tokenizer.
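The distractor setup described above can be sketched roughly as follows. This is not RULER's own code: the function name `build_context`, the greedy filling strategy, and the seed are assumptions made for illustration. Only the whitespace tokenization and the idea of mixing answer-bearing passages with same-dataset distractors come from the description.

```python
import random


def build_context(gold_passages, distractor_pool, max_tokens, seed=0):
    """Mix answer-bearing passages with distractors under a token budget.

    Tokens are counted by whitespace splitting. Hypothetical helper:
    the gold passages are always kept, even if they alone exceed the budget.
    """
    rng = random.Random(seed)

    def n_tokens(text):
        return len(text.split())

    docs = list(gold_passages)
    used = sum(n_tokens(d) for d in docs)

    candidates = list(distractor_pool)
    rng.shuffle(candidates)
    for doc in candidates:
        cost = n_tokens(doc)
        if used + cost > max_tokens:
            continue  # this distractor would overflow the budget
        docs.append(doc)
        used += cost

    # Shuffle so the gold passages do not always appear first.
    rng.shuffle(docs)
    return "\n\n".join(docs)
```

In this sketch a question would then be appended to the returned context, and the model would have to find the gold passages among the distractors.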

From ∞Bench (Zhang et al., 2024), the team selected ∞Bench En.MC and ∞Bench En.Sum. Both tasks use English novels as source material, with key entity names replaced to reduce train-test overlap. The full ∞Bench dataset contains 3,946 examples with an average length of approximately 200K tokens; instances above 128K tokens under whitespace tokenization were filtered out.

The fifth task is OpenAI-MRCR, an open-source version of the Multi-Round Co-reference Resolution task. MRCR requires locating information from multiple positions in a long synthetic conversation — the post describes it as a more demanding extension of needle-in-a-haystack. The accuracy metric is computed using Python’s difflib.SequenceMatcher; responses that omit the required prefix receive a score of zero.
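A minimal sketch of such a metric, assuming the score is `SequenceMatcher.ratio()` computed after the required prefix is stripped from both strings. The function and parameter names are hypothetical; the summary above only confirms the use of `difflib.SequenceMatcher` and the zero score for a missing prefix.

```python
from difflib import SequenceMatcher


def mrcr_score(response: str, answer: str, required_prefix: str) -> float:
    """Return a similarity score in [0, 1] between response and answer."""
    # A response that omits the required prefix scores zero.
    if not response.startswith(required_prefix):
        return 0.0
    # Assumption: the prefix is removed before comparison so that it
    # does not inflate the similarity. Requires Python 3.9+.
    response = response.removeprefix(required_prefix)
    answer = answer.removeprefix(required_prefix)
    return SequenceMatcher(None, response, answer).ratio()
```

Under this sketch, a response identical to the reference scores 1.0, while a correct answer missing its prefix scores 0.0, which is one reason scores on this task can be low.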

For each task, 100 instances were sub-sampled for use in the evaluations.
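The length filter and sub-sampling step might look roughly like the sketch below. The function name, the `"input"` field, the seed, and the reading of "128K" as 128,000 whitespace tokens are all assumptions; the post's actual pipeline may differ.

```python
import random


def select_instances(instances, max_tokens=128_000, sample_size=100, seed=0):
    """Keep instances within the token limit, then sample a fixed number.

    `instances` is assumed to be a list of dicts with an "input" string.
    Tokens are counted by whitespace splitting.
    """
    eligible = [
        inst for inst in instances
        if len(inst["input"].split()) <= max_tokens
    ]
    if len(eligible) <= sample_size:
        return eligible  # nothing to sub-sample
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    return rng.sample(eligible, sample_size)
```

Fixing the seed means every model is evaluated on the same subset, which the comparability goal requires.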

Model selection and results

The leaderboard evaluated 10 models from five organizations, with context lengths ranging from 300K to 10M tokens. The post notes that models were selected for their strong performance on the HELM Capabilities leaderboard, which measures general LLM capabilities. Of the 10 models evaluated, only the Meta Llama 4 models are open-weights; the remainder are closed-weights.

GPT-4.1 obtained the highest mean score of 0.588. It also obtained the highest scores on RULER HotPotQA, RULER SQuAD, and ∞Bench En.MC, and the second highest score on ∞Bench En.Sum. Llama 4 Scout (17Bx16E) Instruct obtained the highest score on ∞Bench En.Sum. Palmyra X5 obtained the highest score on MRCR.

Within each model family, the post observes, performance generally increased with model size, with one exception: in the Amazon Nova family, Amazon Nova Lite outperformed Amazon Nova Pro, achieving a higher mean score and higher scores on three of the five benchmarks.

The ranking on the long-context leaderboard closely tracked general-capability rankings: the Spearman rank correlation between the two leaderboards was 0.90 (p=0.00016), with GPT-4.1 topping both.
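For readers unfamiliar with the statistic, Spearman's rank correlation is the Pearson correlation of the two rank vectors. The sketch below is a plain-Python illustration, not the post's own code. It computes the coefficient only and does not reproduce the reported p-value.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

To apply it, pass two lists of per-model scores in the same model order, one per leaderboard. If SciPy is available, `scipy.stats.spearmanr` returns both the coefficient and a p-value.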

Room for improvement

Despite progress, the post emphasizes that long-context performance remains limited. The highest accuracy score on MRCR was only 0.256, even though the post describes MRCR as “a computationally simple task.”

The post also acknowledges what the leaderboard does not cover. Several recently proposed benchmarks address shortcomings of the chosen tasks but were not included. The leaderboard also lacks coverage of realistic long-context tasks that arise in industrial settings.

Pre-existing vendor benchmark results were found to be neither comprehensive nor comparable. The post notes that the Gemini 2.0 results used an internal MRCR version unavailable to outside researchers, and the Llama 4 results did not specify which NIAH version was used. HELM Long Context standardizes these evaluations across all 10 models under the same conditions.