Stanford CRFM launches HELM Arabic, a leaderboard for evaluating LLMs on Arabic benchmarks

Stanford University’s Center for Research on Foundation Models (CRFM) has released HELM Arabic, a leaderboard for evaluating large language models on Arabic-language benchmarks. The leaderboard was produced in collaboration with Arabic.AI.

HELM Arabic extends the existing HELM evaluation framework to Arabic by drawing on established Arabic-language evaluation tasks that the post describes as “widely used in the research community.” The leaderboard covers seven benchmarks in total. The publicly available excerpt does not enumerate all seven, but the citations accompanying the announcement identify the sources from which tasks were drawn, including ArabicMMLU, the AlGhafa evaluation benchmark, the EXAMS multi-subject examination dataset, AraTrust, and the JAIS model evaluation suite.

The leaderboard evaluated several leading models, which the post groups into three broad categories. The publicly available text does not give the full list of evaluated models or the criteria used to group them.

According to the post, the results show that “LLMs have made significant progress in Arabic language understanding over the last few years,” though the available excerpt does not quantify that progress.

HELM Arabic inherits the transparency properties of other HELM leaderboards: it provides full access to all model requests and responses, and results are reproducible using the open-source HELM framework. The post states the team hopes the leaderboard “will be a valuable resource for the Arabic NLP community.”

The ArabicMMLU benchmark, one of the referenced sources, was published at ACL 2024 and covers massive multitask language understanding in Arabic. AlGhafa, another referenced benchmark, was introduced at ArabicNLP 2023. AraTrust, a trustworthiness evaluation suite for Arabic LLMs, appeared at COLING 2025. JAIS, an Arabic-centric foundation model whose evaluation suite is also cited, was described in a 2023 arXiv paper.

CRFM previously released the HELM Capabilities and HELM Long Context leaderboards, which apply a similar transparency-focused evaluation methodology to English-language tasks. HELM Arabic extends that effort to a language where structured, reproducible evaluation has been less available.