Arabic LLM benchmark landscape 2026
The canonical benchmarks
ArabicMMLU
What it is: ArabicMMLU (MBZUAI, ACL 2024) [2] is natively constructed from regional school exams across North Africa, the Levant, and the Gulf, covering 40 subjects in Modern Standard Arabic. The separate human-translated variant, MMLU-HT (57 subjects) [1], is what OALL v2 lists as the translated counterpart. The two are distinct benchmarks; this post does not conflate them.
Strengths: native item construction (not translated). Breadth across STEM, humanities, social sciences, and religious knowledge.
What it misses: dialect handling. Cultural alignment beyond exam-style knowledge. Code-switching. Adversarial robustness.
Reference score (illustrative; taken from the originating paper, not independently re-run by Annota8): GPT-4 zero-shot scored 72.5% on ArabicMMLU per Koto et al. 2024, Table 4 [2]. Jais-chat-30B scored 62.3% in the same evaluation [2]. For current Jais, Fanar 2.0, ALLaM, and Falcon-Arabic scores, consult the live OALL v2 leaderboard [6] for the snapshot date you want to quote.
AlGhafa
What it is: AlGhafa (TII, ArabicNLP 2023) [3] is a multiple-choice Arabic NLU benchmark suite (originally 11 native Arabic datasets, later extended with 11 translated datasets) that tests reading comprehension, reasoning, and NLU via multiple-choice QA. It is one of the core blocks on OALL v2 alongside EXAMS, Belebele, ArabicMMLU, MMLU-HT, MadinahQA, AraTrust, and ALRAGE [1].
Strengths: native Arabic items in the original 11 datasets. TII / Falcon ecosystem standard. Used as a core OALL v2 block.
What it misses: dialect coverage. Per-task stratification depth. Adversarial subset.
Reference score (illustrative; vendor-published, not independently re-run by Annota8): per TII's Falcon-Arabic announcement (May 2025) [7], Falcon-Arabic-7B-Base scored 67.17 on AlGhafa and Falcon-Arabic-7B-Instruct scored 72.40. For Jais and Fanar AlGhafa scores, consult the live OALL v2 leaderboard [6] for the snapshot date you want to quote.
ArabicaQA
What it is: ArabicaQA (Abdallah et al., SIGIR 2024) [4] is an Arabic reading-comprehension dataset built over Arabic Wikipedia, with 89,095 answerable and 3,701 unanswerable questions. (Other Arabic reading-comprehension datasets exist, such as TyDi QA-Ar, ARCD, and Arabic-SQuAD, and serve adjacent purposes.)
Strengths: native Arabic source content. Tests extractive QA and unanswerability detection.
What it misses: dialect QA. Reasoning depth. Cross-document QA. Sources beyond Wikipedia.
AraBench
What it is: AraBench (Sajjad et al., COLING 2020) [5] is an evaluation suite for dialectal Arabic-to-English machine translation, with 4 coarse / 15 fine-grained / 25 city-level dialect categories across media, chat, religion, and travel genres. It is an MT benchmark, not a multi-task NLU suite. (For multi-task Arabic NLU, look to ORCA, ALUE, or AraEval instead.)
Strengths: dialect breadth for the MT task. City-level granularity.
What it misses: non-MT tasks. Generation evaluation beyond translation. MSA-only deployments will get limited signal from it.
EXAMS, Belebele, MadinahQA, AraTrust, ALRAGE
The remaining OALL v2 blocks cover multilingual exam QA (EXAMS), multilingual reading comprehension (Belebele), native Arabic Islamic-knowledge QA (MadinahQA), Arabic trust/safety (AraTrust), and Arabic RAG evaluation (ALRAGE) [1]. Each adds a slice the others miss; together they form what is currently the closest thing to a canonical Arabic LLM benchmark suite.
What benchmark scores hide
Dialect stratification gap
A model scoring 67% aggregate on ArabicMMLU might, for illustration, break down as:
- 75% on MSA items
- 55% on Gulf-flavoured items
- 50% on Egyptian-flavoured items
- 40% on Maghrebi-flavoured items
Aggregate scores mask per-dialect-family gaps, a point DialectalArabicMMLU [8] makes explicitly. For production deployment serving customers in a specific MENA region, the per-family score matters more than the aggregate.
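One way to see how an aggregate hides per-family gaps is to compute both numbers from the same per-item results. The sketch below is hypothetical throughout: the item counts (700 MSA items, 100 for each other family) are assumptions chosen only so that the illustrative per-family figures reproduce a 67% aggregate. They are not taken from ArabicMMLU, and the function name is our own.

```python
from collections import defaultdict

def stratified_accuracy(records):
    """Return (aggregate, per_family) accuracy from (family, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for family, is_correct in records:
        totals[family] += 1
        correct[family] += int(is_correct)
    if not totals:
        raise ValueError("no records to score")
    per_family = {f: correct[f] / totals[f] for f in totals}
    aggregate = sum(correct.values()) / sum(totals.values())
    return aggregate, per_family

# Hypothetical (n_items, n_correct) per family; assumed for illustration only.
hypothetical = {
    "MSA": (700, 525),
    "Gulf": (100, 55),
    "Egyptian": (100, 50),
    "Maghrebi": (100, 40),
}
# Expand the counts into one (family, is_correct) record per item.
records = [
    (family, i < n_correct)
    for family, (n_items, n_correct) in hypothetical.items()
    for i in range(n_items)
]

aggregate, per_family = stratified_accuracy(records)
print(f"aggregate: {aggregate:.1%}")
for family, acc in per_family.items():
    print(f"{family}: {acc:.1%}")
```

Because the aggregate is weighted by item count, a test set dominated by MSA items can report a healthy headline number while the smaller dialect strata sit far below it.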
Cultural alignment isn’t measured
Standard benchmarks don’t include cultural alignment subsets. A model that scores well on ArabicMMLU might produce religiously inappropriate or culturally tone-deaf outputs in production. ArabicMMLU doesn’t catch that.
Code-switching isn’t measured
Production MENA conversations are heavily code-switched (Arabic + English in tech and business settings, Arabic + French in the Maghreb). Standard benchmarks test monolingual Arabic. A model that scores 70% on ArabicMMLU may fail on code-switched production input.
Adversarial robustness isn’t measured
Standard benchmarks use well-formed prompts. Production deployment encounters adversarial / edge-case prompts: negation traps, counterfactual reasoning, multi-step inference with cultural context. Standard benchmark scores don’t tell you how the model handles these.
Eval set leakage suspicion
Models trained on web-scraped Arabic text and then tested on benchmarks built from Arabic Wikipedia and news may have seen eval items during training. Reported scores may therefore overstate genuine capability.
How to read benchmark scores for production decisions
Question 1: What benchmarks does the lab report?
If the lab reports only ArabicMMLU + AlGhafa aggregate scores, dialect-specific deployment is risky. Demand per-family stratification.
Question 2: What’s the cultural alignment story?
If the lab doesn’t publish an explicit cultural alignment eval, the model may produce inappropriate outputs in production. Construct your own cultural alignment test set.
Question 3: What’s the code-switching story?
If the lab doesn’t show code-switching eval, the model likely handles production code-switching poorly. Test with your own code-switched samples.
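A first step in building a code-switched test set is pulling candidate samples out of your own logs. The sketch below is a rough heuristic, not an established method: it flags text that mixes Arabic-script and Latin-script letters. The function names and the 20% threshold are assumptions for illustration, and the heuristic will miss Arabic written in Latin characters (Arabizi), so flagged samples still need human review.

```python
import unicodedata

def script_shares(text):
    """Return (arabic_share, latin_share) among the letters of `text`."""
    arabic = latin = 0
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, spaces, punctuation, diacritics
        name = unicodedata.name(ch, "")
        if name.startswith("ARABIC"):
            arabic += 1
        elif name.startswith("LATIN"):
            latin += 1
    total = arabic + latin
    if total == 0:
        return 0.0, 0.0
    return arabic / total, latin / total

def looks_code_switched(text, min_share=0.2):
    """Heuristic: both scripts make up at least `min_share` of the letters."""
    arabic_share, latin_share = script_shares(text)
    return arabic_share >= min_share and latin_share >= min_share

samples = [
    "ممكن ترسل لي invoice اليوم",      # Arabic with an English insertion
    "ممكن ترسل لي الفاتورة اليوم",      # Arabic only
    "please send the invoice today",   # English only
]
for s in samples:
    print(looks_code_switched(s), s)
```

Script mixing is only a proxy for code-switching, so treat this as a filter for finding candidates rather than as a label.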
Question 4: What’s the adversarial story?
If the lab doesn’t publish adversarial / red-team eval, the model’s failure modes are unknown. Construct your own adversarial subset.
Question 5: Are the eval items published?
If the lab publishes eval items, you can verify integrity and check for leakage. If items are hidden, treat the scores with more caution.
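When eval items are public and you have a sample of the text a model may have been trained on, a simple n-gram overlap check gives a first signal of leakage. The sketch below is a minimal version under several assumptions: whitespace tokenisation with no Arabic normalisation, an arbitrary default of 8-token n-grams, and hypothetical function names. A high ratio suggests contamination; a low ratio does not rule it out, since paraphrased or normalised copies will not match.

```python
def ngrams(text, n):
    """Set of whitespace-token n-grams in `text`."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_corpus_index(corpus_docs, n):
    """Union of n-grams over a sample of (assumed) training documents."""
    index = set()
    for doc in corpus_docs:
        index |= ngrams(doc, n)
    return index

def overlap_ratio(eval_item, corpus_index, n=8):
    """Share of the eval item's n-grams that also occur in the corpus sample."""
    item_grams = ngrams(eval_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    return len(item_grams & corpus_index) / len(item_grams)

# Toy example with n=3 so the overlap is easy to check by hand.
corpus_sample = ["x a b c d y"]
index = build_corpus_index(corpus_sample, n=3)
ratio = overlap_ratio("a b c d e", index, n=3)
print(f"overlap: {ratio:.2f}")
```

In the toy example, two of the item's three trigrams appear in the corpus sample, so the ratio is two thirds.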
What we recommend for foundation-model labs and serious Arabic AI deployments
- Don’t rely on aggregate scores for deployment decisions. Aggregate masks failure modes.
- Construct custom eval per your use case. ArabicMMLU isn’t a deployment readiness test for, say, customer service AI.
- Add cultural alignment + code-switching + adversarial subsets. These are non-negotiable for responsible deployment.
- Use PhD-linguist ground-truth labelling on your eval set. Crowd-sourced labels with around 15% noise can mask real differences between models.
- Run inter-lab comparison on the same custom test set. Different labs report different benchmarks; comparable scores require shared eval.
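Once two models have been run on the same custom test set, the comparison can go beyond two headline accuracies. The sketch below uses an exact McNemar test on the items where the models disagree. This is our choice of technique for illustration, not a method prescribed by any of the benchmarks above, and the function name and example data are hypothetical.

```python
from math import comb

def mcnemar_exact(a_correct, b_correct):
    """Exact two-sided McNemar test on paired per-item correctness flags.

    Returns (a_only, b_only, p_value), where a_only counts items that only
    model A answered correctly and b_only counts items only model B did.
    """
    if len(a_correct) != len(b_correct):
        raise ValueError("both models must be scored on the same items")
    a_only = sum(1 for a, b in zip(a_correct, b_correct) if a and not b)
    b_only = sum(1 for a, b in zip(a_correct, b_correct) if b and not a)
    n = a_only + b_only
    if n == 0:
        return a_only, b_only, 1.0  # the models never disagree
    k = min(a_only, b_only)
    # Two-sided binomial tail with p = 0.5 over the discordant items.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return a_only, b_only, min(1.0, 2 * tail)

# Hypothetical results on 10 shared items where the models disagree.
model_a = [True] * 8 + [False] * 2
model_b = [False] * 8 + [True] * 2
a_only, b_only, p_value = mcnemar_exact(model_a, model_b)
print(a_only, b_only, p_value)
```

The test only works because both models answered the same items, which is the practical reason a shared eval set matters: scores from different benchmarks cannot be paired this way.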
How Annota8 helps
Annota8 constructs eval sets that close the gaps in standard benchmarks:
- Dialect-stratified per family + sub-family
- Cultural alignment subset (Islamic + regional + family + gender + political)
- Code-switching subset
- Adversarial / red-team subset
- Native Arabic item writing (not translated)
- PhD-linguist ground-truth labelling
For foundation-model labs and serious deployment teams, the full framework is detailed in our Arabic NLP eval methodology guide and the Arabic NLP eval methodology whitepaper.
References
[1] Open Arabic LLM Leaderboard v2 (OALL v2) blog, Hugging Face. Confirms the canonical benchmark set (ArabicMMLU native = 40 tasks, MMLU-HT = 57 tasks, AlGhafa, EXAMS, Belebele, MadinahQA, AraTrust, ALRAGE). https://huggingface.co/blog/leaderboard-arabic-v2
[2] Koto et al., “ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic” (MBZUAI, ACL 2024). Natively constructed from regional school exams across North Africa, the Levant, and the Gulf; 40 subjects in MSA. GPT-4 zero-shot = 72.5%, Jais-chat-30B = 62.3% (Table 4). https://arxiv.org/abs/2402.12840 Repo: https://github.com/mbzuai-nlp/ArabicMMLU
[3] Almazrouei et al., “AlGhafa Evaluation Benchmark for Arabic Language Models” (TII, ArabicNLP 2023). Multiple-choice Arabic NLU benchmark suite, 11 native + 11 translated datasets. https://aclanthology.org/2023.arabicnlp-1.21/ Repo: https://gitlab.com/tiiuae/alghafa
[4] Abdallah et al., “ArabicaQA: A Comprehensive Dataset for Arabic Question Answering” (SIGIR 2024). Built over Arabic Wikipedia, 89,095 answerable + 3,701 unanswerable questions. https://arxiv.org/abs/2403.17848 Repo: https://github.com/DataScienceUIBK/ArabicaQA
[5] Sajjad et al., “AraBench: Benchmarking Dialectal Arabic-English Machine Translation” (COLING 2020). Dialectal MT benchmark, 4 coarse / 15 fine-grained / 25 city-level dialect categories. https://aclanthology.org/2020.coling-main.447/ Resources: https://alt.qcri.org/resources1/mt/arabench/
[6] Open Arabic LLM Leaderboard v2 (OALL v2), live leaderboard, Hugging Face Spaces. https://huggingface.co/spaces/OALL/Open-Arabic-LLM-Leaderboard
[7] TII Falcon-Arabic announcement (May 2025). Falcon-Arabic-7B-Base = 67.17 on AlGhafa, Falcon-Arabic-7B-Instruct = 72.40 on AlGhafa. https://falcon-lm.github.io/blog/falcon-arabic/
[8] DialectalArabicMMLU paper (arXiv 2510.27543). Confirms aggregate ArabicMMLU scores mask per-dialect-family gaps. https://arxiv.org/abs/2510.27543