26 May 2026

Arabic LLM benchmark landscape 2026

The canonical benchmarks

ArabicMMLU

What it is: ArabicMMLU (MBZUAI, ACL 2024)² is natively constructed from regional school exams across North Africa, the Levant, and the Gulf, covering 40 subjects in Modern Standard Arabic. The separate human-translated variant, MMLU-HT (57 subjects)¹, is what OALL v2 lists as the translated counterpart. The two are distinct benchmarks; this post does not conflate them.

Strengths: native item construction (not translated); breadth across STEM, humanities, social sciences, and religious knowledge.

What it misses: dialect handling, cultural alignment beyond exam-style knowledge, code-switching, and adversarial robustness.

Reference score (illustrative; taken from the originating paper, not independently re-run by Annota8): GPT-4 zero-shot scored 72.5% on ArabicMMLU per Koto et al. 2024, Table 4.² Jais-chat-30B scored 62.3% in the same evaluation.² For current Jais, Fanar 2.0, ALLaM, and Falcon-Arabic scores, consult the live OALL v2 leaderboard⁶ for the snapshot date you want to quote.

AlGhafa

What it is: AlGhafa (TII, ArabicNLP 2023)³ is a multiple-choice Arabic NLU benchmark suite, originally 11 native Arabic datasets and later extended with 11 translated datasets, that tests reading comprehension and reasoning. It is one of the core blocks on OALL v2 alongside EXAMS, Belebele, ArabicMMLU, MMLU-HT, MadinahQA, AraTrust, and ALRAGE.¹

Strengths: native Arabic items in the original 11 datasets; the standard benchmark in the TII / Falcon ecosystem; a core OALL v2 block.

What it misses: dialect coverage, per-task stratification depth, and an adversarial subset.

Reference score (illustrative; vendor-published, not independently re-run by Annota8): per TII’s Falcon-Arabic announcement (May 2025)⁷, Falcon-Arabic-7B-Base scored 67.17 on AlGhafa and Falcon-Arabic-7B-Instruct scored 72.40. For Jais and Fanar AlGhafa scores, consult the live OALL v2 leaderboard⁶ for the snapshot date you want to quote.

ArabicaQA

What it is: ArabicaQA (Abdallah et al., SIGIR 2024)⁴ is an Arabic reading-comprehension dataset built over Arabic Wikipedia, with 89,095 answerable and 3,701 unanswerable questions. (Other Arabic reading-comprehension datasets exist, such as TyDi QA-Ar, ARCD, and Arabic-SQuAD, and serve adjacent purposes.)

Strengths: native Arabic source content; tests extractive QA and unanswerability detection.

What it misses: dialect QA, reasoning depth, cross-document QA, and sources beyond Wikipedia.
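To make the unanswerability point concrete, here is a minimal scoring sketch in Python. It is not the official ArabicaQA scoring script: the record fields (`gold`, `pred`), the convention that an empty prediction means "abstain", and the plain exact-match rule are all assumptions made for illustration. The question counts are the ones quoted above, and the four example records are invented. Reporting the two buckets separately matters because unanswerable questions are only about 4% of the set, so a model that never abstains can still look strong on a pooled score.

```python
# Hypothetical record format: "gold" is a list of accepted answers
# (an empty list marks an unanswerable question); "pred" is the model
# output, with "" meaning the model abstained.

def exact_match_report(examples):
    """Return exact-match accuracy separately for answerable and unanswerable items."""
    buckets = {"answerable": [0, 0], "unanswerable": [0, 0]}  # [hits, total]
    for ex in examples:
        pred = ex["pred"].strip()
        gold = [g.strip() for g in ex["gold"]]
        if gold:
            key = "answerable"
            hit = pred in gold
        else:
            key = "unanswerable"
            hit = pred == ""  # correct only if the model abstains
        buckets[key][0] += int(hit)
        buckets[key][1] += 1
    return {k: (hits / n if n else None) for k, (hits, n) in buckets.items()}

# Share of unanswerable questions, from the counts quoted above.
unanswerable_share = 3701 / (89095 + 3701)

# Invented examples, for illustration only.
examples = [
    {"gold": ["القاهرة"], "pred": "القاهرة"},        # answerable, correct
    {"gold": ["الرباط"], "pred": "الدار البيضاء"},   # answerable, wrong
    {"gold": [], "pred": "دمشق"},                    # unanswerable, model answered anyway
    {"gold": [], "pred": ""},                        # unanswerable, model abstained
]
report = exact_match_report(examples)
print(report, round(unanswerable_share, 3))
```

A production scorer would likely also normalise Arabic text (diacritics, alef variants) before matching; that step is left out here to keep the sketch short.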

AraBench

What it is: AraBench (Sajjad et al., COLING 2020)⁵ is an evaluation suite for dialectal Arabic-to-English machine translation, with 4 coarse, 15 fine-grained, and 25 city-level dialect categories across media, chat, religion, and travel genres. It is an MT benchmark, not a multi-task NLU suite. (For multi-task Arabic NLU, look to ORCA, ALUE, or AraEval instead.)

Strengths: dialect breadth for the MT task; city-level granularity.

What it misses: non-MT tasks and generation evaluation beyond translation. MSA-only deployments will get limited signal from it.

EXAMS, Belebele, MadinahQA, AraTrust, ALRAGE

The remaining OALL v2 blocks cover multilingual exam QA (EXAMS), multilingual reading comprehension (Belebele), native Arabic Islamic-knowledge QA (MadinahQA), Arabic trust/safety (AraTrust), and Arabic RAG evaluation (ALRAGE).¹ Each adds a slice the others miss; together they form what is currently the closest thing to a canonical Arabic LLM benchmark suite.

What benchmark scores hide

Dialect stratification gap

A model scoring 67% aggregate on ArabicMMLU might, hypothetically, break down as:

75% on MSA items
55% on Gulf-flavoured items
50% on Egyptian-flavoured items
40% on Maghrebi-flavoured items

Aggregate scores mask per-dialect-family gaps — a point DialectalArabicMMLU⁸ makes explicitly. For production deployment serving customers in a specific MENA region, the per-family score matters more than the aggregate.
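The masking effect is plain arithmetic, and a short Python sketch shows it. The per-family accuracies are the hypothetical ones listed above; the item counts (700 MSA items and 100 each for the other families) are an assumption chosen so that the pooled score lands on 67%. They are not taken from ArabicMMLU or any other benchmark.

```python
from collections import defaultdict

def stratified_accuracy(results):
    """results: iterable of (dialect_family, is_correct) pairs."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for family, is_correct in results:
        total[family] += 1
        correct[family] += int(is_correct)
    per_family = {family: correct[family] / total[family] for family in total}
    aggregate = sum(correct.values()) / sum(total.values())
    return per_family, aggregate

def synthetic_results(mix):
    """Expand {family: (n_items, n_correct)} into per-item result rows."""
    rows = []
    for family, (n_items, n_correct) in mix.items():
        rows += [(family, True)] * n_correct
        rows += [(family, False)] * (n_items - n_correct)
    return rows

# Hypothetical item mix: (number of items, number answered correctly).
mix = {
    "MSA": (700, 525),       # 75%
    "Gulf": (100, 55),       # 55%
    "Egyptian": (100, 50),   # 50%
    "Maghrebi": (100, 40),   # 40%
}
per_family, aggregate = stratified_accuracy(synthetic_results(mix))
print(per_family)
print(f"aggregate = {aggregate:.2%}")
```

Because MSA items dominate the assumed mix, the pooled 67% sits close to the MSA score and says little about the 40% Maghrebi result. The same function can be run on real per-item results once each item carries a dialect-family tag.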

Cultural alignment isn’t measured

Standard benchmarks don’t include cultural alignment subsets. A model that scores well on ArabicMMLU might produce religiously inappropriate or culturally tone-deaf outputs in production. ArabicMMLU doesn’t catch that.

Code-switching isn’t measured

Production MENA conversations are heavily code-switched (Arabic and English in tech and business settings, Arabic and French in Maghrebi contexts). Standard benchmarks test monolingual Arabic. A model that scores 70% on ArabicMMLU may fail at code-switched production input.
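One cheap way to start a code-switched test subset is to flag production samples that mix Arabic and Latin script. The Python sketch below is a heuristic under stated assumptions, not a validated detector: the function names and the 15% minimum-share threshold are hypothetical, and script mixing is only a proxy. It will miss Arabizi (Arabic written in Latin letters) and loanwords written in Arabic script, so flagged samples still need human review.

```python
import unicodedata

def script_shares(text):
    """Share of Arabic-script vs Latin-script letters among the letters in text."""
    counts = {"arabic": 0, "latin": 0}
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, spaces, punctuation, combining marks
        name = unicodedata.name(ch, "")
        if name.startswith("ARABIC"):
            counts["arabic"] += 1
        elif name.startswith("LATIN"):
            counts["latin"] += 1
    total = counts["arabic"] + counts["latin"]
    if total == 0:
        return {"arabic": 0.0, "latin": 0.0}
    return {script: n / total for script, n in counts.items()}

def is_code_switched(text, min_share=0.15):
    """Flag text where both scripts reach min_share (threshold is an assumption)."""
    return min(script_shares(text).values()) >= min_share

# Invented samples, for illustration only.
samples = [
    "أرسل لي invoice قبل meeting",   # Arabic + English
    "أريد تجديد الاشتراك",           # Arabic only
    "Please renew my subscription",  # English only
]
flags = [is_code_switched(s) for s in samples]
print(flags)
```

Running a filter like this over a sample of real traffic also gives a rough estimate of how much production input is code-switched, which helps size the subset.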

Adversarial robustness isn’t measured

Standard benchmarks use well-formed prompts. Production deployment encounters adversarial / edge-case prompts: negation traps, counterfactual reasoning, multi-step inference with cultural context. Standard benchmark scores don’t tell you how the model handles these.

Eval set leakage suspicion

Models trained on web-scraped Arabic and then tested on benchmarks built from Arabic Wikipedia and news may have seen eval items during training. Where such leakage exists, reported scores may overstate genuine capability.
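Where a sample of the training corpus, or a public proxy such as a Wikipedia dump, is available, a word n-gram overlap check is a common first screen for leakage. The Python sketch below assumes whitespace tokenisation and no Arabic normalisation. The function names are hypothetical, and the example strings are invented English stand-ins used for readability. A high ratio is a reason to inspect an item, not proof of contamination.

```python
def word_ngrams(text, n):
    """Set of word n-grams in text, using simple whitespace tokenisation."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_index(corpus_docs, n):
    """Union of all word n-grams across the corpus sample."""
    index = set()
    for doc in corpus_docs:
        index |= word_ngrams(doc, n)
    return index

def overlap_ratio(eval_item, index, n):
    """Fraction of the eval item's n-grams that also occur in the corpus index."""
    grams = word_ngrams(eval_item, n)
    if not grams:
        return 0.0  # item shorter than n tokens
    return len(grams & index) / len(grams)

# Invented toy data; n=3 only because the strings are short.
corpus = ["the capital of egypt is cairo and it lies on the nile"]
index = build_index(corpus, n=3)
seen = overlap_ratio("the capital of egypt is cairo", index, n=3)
unseen = overlap_ratio("what is the largest city in morocco", index, n=3)
print(seen, unseen)
```

On real data a longer window (for example 8 or more tokens) reduces false positives from common phrases; the right value is a judgment call for each corpus.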

How to read benchmark scores for production decisions

Question 1: What benchmarks does the lab report?

If the lab reports only ArabicMMLU + AlGhafa aggregate scores, dialect-specific deployment is risky. Demand per-family stratification.

Question 2: What’s the cultural alignment story?

If the lab doesn’t publish an explicit cultural alignment eval, the model may produce inappropriate outputs in production. Construct your own cultural alignment test set.

Question 3: What’s the code-switching story?

If the lab doesn’t show code-switching eval, the model likely handles production code-switching poorly. Test with your own code-switched samples.

Question 4: What’s the adversarial story?

If the lab doesn’t publish adversarial / red-team eval, the model’s failure modes are unknown. Construct your own adversarial subset.

Question 5: Eval set publication transparency?

If the lab publishes eval items, you can verify their integrity and check for leakage. If items are hidden, treat the scores with more caution.

Don’t rely on aggregate scores for deployment decisions. Aggregate masks failure modes.
Construct custom eval per your use case. ArabicMMLU isn’t a deployment readiness test for, say, customer service AI.
Add cultural alignment + code-switching + adversarial subsets. These are non-negotiable for responsible deployment.
Use PhD-linguist ground-truth labelling on your eval set. Label noise at the level of, say, 15% in crowd-sourced annotation can mask real differences between models.
Run inter-lab comparison on the same custom test set. Different labs report different benchmarks; comparable scores require shared eval.
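The label-noise point in the list above can be made concrete with a small worked model. The Python sketch below rests on simplifying assumptions: four-option multiple choice, each gold label flipped with probability `noise_rate` to a uniformly random wrong option, model errors spread evenly over the wrong options, and noise independent of the model. The 70% and 66% "true" accuracies are invented. Under these assumptions the observed gap between two models shrinks by a fixed factor.

```python
def observed_accuracy(true_acc, noise_rate, n_options=4):
    """Expected accuracy measured against noisy gold labels."""
    # A correct answer still scores only if the gold label was not flipped.
    kept = true_acc * (1 - noise_rate)
    # A wrong answer scores if the gold label was flipped onto that same option.
    lucky = (1 - true_acc) * noise_rate / (n_options - 1)
    return kept + lucky

def gap_shrink_factor(noise_rate, n_options=4):
    """Factor by which a true accuracy gap is scaled under this noise model."""
    return 1 - noise_rate * n_options / (n_options - 1)

# Invented true accuracies for two models, with 15% label noise.
model_a = observed_accuracy(0.70, 0.15)
model_b = observed_accuracy(0.66, 0.15)
factor = gap_shrink_factor(0.15)
print(f"observed: {model_a:.3f} vs {model_b:.3f}")
print(f"true gap 0.040 -> observed gap {model_a - model_b:.3f} (factor {factor:.2f})")
```

In this model a 4-point true gap reads as roughly 3.2 points at 15% noise. The sketch covers only the expected value; noisy labels also add variance, which makes small gaps harder to separate on a modest test set.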

How Annota8 helps

Annota8 constructs eval sets that close the gaps in standard benchmarks:

Dialect-stratified per family + sub-family
Cultural alignment subset (Islamic + regional + family + gender + political)
Code-switching subset
Adversarial / red-team subset
Native Arabic item writing (not translated)
PhD-linguist ground-truth labelling

For foundation-model labs and serious deployment teams, our Arabic NLP eval methodology guide and Arabic NLP eval methodology whitepaper detail the full framework.


References

1. Open Arabic LLM Leaderboard v2 (OALL v2) blog, Hugging Face. Confirms the canonical benchmark set (ArabicMMLU native = 40 tasks, MMLU-HT = 57 tasks, AlGhafa, EXAMS, Belebele, MadinahQA, AraTrust, ALRAGE). https://huggingface.co/blog/leaderboard-arabic-v2
2. Koto et al., “ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic” (MBZUAI, ACL 2024). Natively constructed from regional school exams across North Africa, the Levant, and the Gulf; 40 subjects in MSA. GPT-4 zero-shot = 72.5%, Jais-chat-30B = 62.3% (Table 4). https://arxiv.org/abs/2402.12840. Repo: https://github.com/mbzuai-nlp/ArabicMMLU
3. Almazrouei et al., “AlGhafa Evaluation Benchmark for Arabic Language Models” (TII, ArabicNLP 2023). Multiple-choice Arabic NLU benchmark suite, 11 native + 11 translated datasets. https://aclanthology.org/2023.arabicnlp-1.21/. Repo: https://gitlab.com/tiiuae/alghafa
4. Abdallah et al., “ArabicaQA: A Comprehensive Dataset for Arabic Question Answering” (SIGIR 2024). Built over Arabic Wikipedia, 89,095 answerable + 3,701 unanswerable questions. https://arxiv.org/abs/2403.17848. Repo: https://github.com/DataScienceUIBK/ArabicaQA
5. Sajjad et al., “AraBench: Benchmarking Dialectal Arabic-English Machine Translation” (COLING 2020). Dialectal MT benchmark, 4 coarse / 15 fine-grained / 25 city-level dialect categories. https://aclanthology.org/2020.coling-main.447/. Resources: https://alt.qcri.org/resources1/mt/arabench/
6. Open Arabic LLM Leaderboard v2 (OALL v2), live leaderboard, Hugging Face Spaces. https://huggingface.co/spaces/OALL/Open-Arabic-LLM-Leaderboard
7. TII Falcon-Arabic announcement (May 2025). Falcon-Arabic-7B-Base = 67.17 on AlGhafa, Falcon-Arabic-7B-Instruct = 72.40 on AlGhafa. https://falcon-lm.github.io/blog/falcon-arabic/
8. DialectalArabicMMLU paper (arXiv 2510.27543). Confirms aggregate ArabicMMLU scores mask per-dialect-family gaps. https://arxiv.org/abs/2510.27543

Limitations & disclaimer

Limitations of this analysis. This post reflects Annota8's reading of publicly available evidence as of its last-modified date. Vendor positioning, regulatory frameworks, benchmark numbers, and program scope can change without notice. Where numeric ranges are cited, those numbers are reproducible from the source linked in the post's References section — Annota8 has not independently re-run the benchmarks unless explicitly stated in the post.

Privacy & legal posture. Annota8 is an early-stage AI data operations company in soft launch. We do not currently hold SOC 2, ISO 27001, PDPL certification, or any other third-party security or privacy certification. We design with PDPL principles in mind and can sign a DPA modelled on the EU SCC template. Specific compliance posture for your engagement is available on request from [email protected].

Nothing in this post is legal, tax, or investment advice. Regulatory citations should be verified with counsel in your jurisdiction. Vendor names mentioned in this post are referenced as industry-landscape context only — Annota8 is not asserting a comparative product claim, a customer relationship, or any other affiliation with any platform named, unless that affiliation is explicitly stated.

Reach the team:[email protected] · annota8.ai