QIMMA validates Arabic benchmarks before running models on them — and finds systematic problems in established datasets
Researchers from TII UAE built QIMMA, the only Arabic LLM leaderboard combining quality validation, native content, and code evaluation. A two-stage pipeline of LLM scoring and human review revealed recurring quality failures across widely used Arabic benchmarks.
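
To make the two-stage idea concrete, here is a minimal Python sketch of how LLM scoring followed by human review could be wired together. It is an illustration, not QIMMA's actual implementation: the `Sample` type, the `llm_quality_score` stub, the `triage` function, and the 0.7 threshold are all hypothetical. It also assumes that only low-scoring samples are sent to human reviewers, which the summary above does not specify. The stub replaces a real LLM judge with a toy heuristic (the share of Arabic-script letters) so the example runs offline.

```python
from dataclasses import dataclass

# Hypothetical cutoff: samples scoring below it are queued for human review.
REVIEW_THRESHOLD = 0.7


@dataclass
class Sample:
    sample_id: str
    text: str


def llm_quality_score(sample: Sample) -> float:
    """Stage 1 placeholder. A real pipeline would call an LLM judge here;
    this stub returns the fraction of letters that are in the Arabic block."""
    letters = [ch for ch in sample.text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return arabic / len(letters)


def triage(samples, threshold=REVIEW_THRESHOLD):
    """Split samples into auto-accepted ones and ones needing stage 2 (human review)."""
    accepted, needs_review = [], []
    for sample in samples:
        score = llm_quality_score(sample)
        bucket = accepted if score >= threshold else needs_review
        bucket.append((sample.sample_id, score))
    return accepted, needs_review


if __name__ == "__main__":
    batch = [
        Sample("a", "ما هي عاصمة مصر؟"),
        Sample("b", "What is the capital of Egypt?"),
        Sample("c", ""),
    ]
    accepted, needs_review = triage(batch)
    print("accepted:", accepted)
    print("needs human review:", needs_review)
```

In this sketch the human stage is just a queue of flagged sample IDs. In practice, reviewers would record a verdict per sample, and the threshold would be tuned against those verdicts.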