Why Arabic LLMs fail in commercial use — a diagnosis
Why this deserves a diagnosis, not a complaint
As Annota8’s founder, and as a buyer of training-data services in prior roles (where we were customers of V7, Kognic, and Scale AI before building our own), I’ve watched the same pattern repeat with every new model wave. A lab announces an Arabic model that beats GPT-4 on ArabicMMLU by two points. A product team at a Gulf bank deploys it behind a customer-service bot. Three weeks later the support tickets arrive: replies sound stilted and overly formal, the bot doesn’t understand “I want to switch to my other account” in dialect, and every time a product name appears in English the model hallucinates a reference number.
This isn’t a single failure story. It’s a structural gap in how Arabic LLMs are built and evaluated in 2024–2026. Below are the seven root causes I see most often, each with a practical recommendation.
Cause 1: training data is MSA, production is dialect
The bulk of Arabic text available at industrial scale — Wikipedia, cleaned Common Crawl, classical books, news — is Modern Standard Arabic (MSA). Customer conversations, WhatsApp messages pasted into a chatbot, social comments, even call-center transcripts — those are dialect.
Practical proportions I’ve measured on customer samples in 2025–2026 (internal estimates; vendors rarely disclose this breakdown):
| Variety | Share in typical Arabic-LLM training | Share in typical MENA production volume |
|---|---|---|
| MSA | 80–95% | 10–20% |
| Gulf | 1–5% | 25–40% |
| Egyptian | 2–6% | 20–35% |
| Levantine | 1–4% | 10–20% |
| Maghrebi | <1% | 5–15% |
A model trained on that mix behaves like an Arabic-major who graduated with honors from an international university and was then dropped into a Kuwait support desk. The language in its head is not the language in the street.
Recommendation: ask any model vendor to disclose dialect-family distribution. If they can’t, expect a meaningful production drop on dialect-heavy traffic: dialect-specific evals such as DialectalArabicMMLU and the UI-level ALLaM 34B evaluation show double-digit gaps between MSA and dialect performance [3].
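To make the mismatch concrete, here is a minimal sketch that takes the midpoints of the ranges in the table above and computes how far apart the two distributions are. The midpoints (including 0.5% for the "<1%" Maghrebi training share) and the choice of total variation distance are illustrative assumptions, not vendor-reported figures.

```python
# Midpoints of the table ranges, in percent. These are illustrative
# assumptions derived from the ranges above, not measured values.
TRAINING = {"MSA": 87.5, "Gulf": 3.0, "Egyptian": 4.0, "Levantine": 2.5, "Maghrebi": 0.5}
PRODUCTION = {"MSA": 15.0, "Gulf": 32.5, "Egyptian": 27.5, "Levantine": 15.0, "Maghrebi": 10.0}


def normalize(shares):
    """Rescale shares so they sum to 1."""
    total = sum(shares.values())
    return {name: value / total for name, value in shares.items()}


def total_variation(train, prod):
    """Total variation distance: 0 means identical mixes, 1 means disjoint."""
    p, q = normalize(train), normalize(prod)
    varieties = set(p) | set(q)
    return 0.5 * sum(abs(p.get(v, 0.0) - q.get(v, 0.0)) for v in varieties)


def underrepresentation(train, prod):
    """How many times larger each variety's production share is than its training share."""
    p, q = normalize(train), normalize(prod)
    return {v: q[v] / p[v] for v in q if p.get(v, 0.0) > 0}


if __name__ == "__main__":
    print(f"total variation distance: {total_variation(TRAINING, PRODUCTION):.2f}")
    for variety, factor in sorted(underrepresentation(TRAINING, PRODUCTION).items()):
        print(f"{variety}: production share is {factor:.1f}x the training share")
```

Running the same calculation on a vendor's disclosed distribution and your own traffic logs gives a single number to track, plus a per-variety factor showing where the gap is largest.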
Cause 2: evals are translated from English MMLU, not built natively
The Arabic benchmarks that preceded the native ArabicMMLU were essentially direct translations of English MMLU. That creates three problems:
- Translation artifacts: an Arabic question about the US electoral system doesn’t measure understanding of an electoral system — it measures ability to back-translate.
- Cultural bias: items on US law, US sports, and US foods are not appropriate proxies for competence in regional Arabic.
- Possible leakage: the Arabic web contains translations of MMLU questions. A model trained on a broad scrape may have seen the items verbatim.
ArabicMMLU (the native version by Koto et al. 2024) and ArabicMMLU-Pro tried to correct this by authoring native Arabic items from regional school exams across North Africa, the Levant, and the Gulf [2], but commercial reliance on the translated aggregate score persists in lab pitches.
Recommendation: build your own eval set — 200–500 native Arabic items from your industry context, labeled by a PhD linguist. That’s a truer signal than any public board.
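The leakage concern applies to a custom eval set too, so it is worth checking new items against whatever corpus sample you can inspect. The sketch below is one simple approach, word n-gram overlap after light Arabic normalization; the function names, the normalization rules, and the 8-word window are assumptions chosen for illustration, not an established contamination standard.

```python
import re

# Tashkeel marks, dagger alef, and tatweel are stripped before comparison.
_STRIP = re.compile(r"[\u064B-\u0652\u0670\u0640]")
# Alef with madda / hamza above / hamza below are folded to bare alef.
_ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")


def normalize_ar(text):
    """Light normalization so trivial spelling variants still match."""
    text = _STRIP.sub("", text)
    text = _ALEF_VARIANTS.sub("\u0627", text)
    return re.sub(r"\s+", " ", text).strip()


def ngrams(text, n=8):
    """Set of word n-grams in the normalized text."""
    words = normalize_ar(text).split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_rate(eval_item, corpus_ngrams, n=8):
    """Fraction of the item's n-grams that also occur in the corpus sample."""
    item_ngrams = ngrams(eval_item, n)
    if not item_ngrams:
        return 0.0  # item shorter than n words: nothing to compare
    return len(item_ngrams & corpus_ngrams) / len(item_ngrams)
```

An item with a high overlap rate is a candidate for removal or rewording; a low rate does not prove the item is unseen, only that this particular sample does not contain it.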
Cause 3: low-quality SFT — “chosen” responses machine-translated
The typical SFT (Supervised Fine-Tuning) chain starts with an English instruction set (Alpaca, ShareGPT, Anthropic HH-RLHF) and machine-translates it to Arabic with NLLB or similar. The translated outputs are then used as “chosen” responses in SFT or DPO [4].
What happens in production:
- The model learns a machine-translation style, not a native Arabic writer’s style
- Punctuation, tanwin usage, sentence length — all English wrapped in Arabic glyphs
- Technical terms are translated literally where the user expects the English term (“بطاقة الرسوميّات” instead of GPU)
A banking customer asking “what’s my balance” gets a reply that reads like 2018 Google Translate. That destroys trust before it destroys accuracy.
Recommendation: ask any vendor for SFT samples from your use case and have a human linguist score the “chosen” quality. A sample of 300–500 items is enough to judge.
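A quick way to sanity-check the 300–500 figure is the margin of error on a pass rate. The sketch below assumes the linguist's judgment is reduced to pass/fail per sample and uses the standard normal approximation for a proportion at 95% confidence; both are simplifying assumptions.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """95% margin of error for an observed pass rate p over n samples.

    p=0.5 is the worst case (widest interval); z=1.96 is the 95% normal quantile.
    """
    return z * math.sqrt(p * (1 - p) / n)


if __name__ == "__main__":
    for n in (100, 300, 500):
        print(f"n={n}: +/- {margin_of_error(n) * 100:.1f} percentage points")
```

At 300 samples the worst-case margin is under six percentage points and at 500 it is under five, which is tight enough to separate a vendor whose “chosen” responses mostly pass from one whose responses mostly fail.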
Cause 4: religious and cultural sensitivity — an unclosed gap
A model trained on the open web carries:
- Strong opinions on religions and sects
- Jokes that demean religious or regional groups
- Wrong information on Islamic financial instruments (takaful, murabaha, ijara)
- Unauthenticated Qur’anic or hadith translations
In Arabic production the cost of one error of this kind is far higher than a typical factual error: it can become a news item on a local platform, a support ticket that reaches a ministry, or a product suspension. Yet very few of the publicly released Arabic models in 2026 publish an explicit religious-cultural red-teaming report on their model card; the practice is still early-stage across the field rather than an industry standard. (Fanar publishes a moderation component, FanarGuard, but most other 2026-era model cards do not include a dedicated religious-cultural red-teaming section.) [1]
Recommendation: any end-user deployment in GCC or North Africa needs an explicit cultural-alignment layer — RLHF with local labelers, or at minimum a pre-output filter. Generic SFT is insufficient.
Cause 5: tashkeel (diacritization) in TTS and the pronunciation gap
Most Arabic TTS systems rely on diacritized text (fatha, damma, kasra) to produce correct pronunciation [5]. In the great majority of cases, the text coming out of a commercial Arabic LLM has no diacritics. In a unified LLM→TTS pipeline:
User: What's the share price today?
LLM (undiacritized): سعر السهم سجل ارتفاعا ملحوظا اليوم.
TTS without tashkeel: degraded, ambiguous pronunciation
TTS with diacritized input: natural, accurate pronunciation
The gap isn’t in how TTS is trained — it’s that the LLM doesn’t emit pronounceable text. Anyone building an Arabic voice experience needs either a TTS model that doesn’t rely on diacritics (rare and inaccurate), a tashkeel layer between LLM and TTS, or fine-tuning the LLM to emit diacritized output for names and ambiguity cases.
Recommendation: if the use case is voice, treat tashkeel as a pipeline component, not a grammatical nicety.
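As a rough illustration of treating tashkeel as a pipeline stage, the sketch below measures how diacritized a string already is and only calls a diacritizer when needed. The `diacritize` callable is a hypothetical hook for whatever tashkeel model you plug in, and the 0.3 threshold is an arbitrary assumption to tune on your own data.

```python
from typing import Callable


def _is_arabic_letter(ch: str) -> bool:
    # Basic Arabic letters U+0621..U+064A, excluding tatweel (U+0640).
    return "\u0621" <= ch <= "\u064A" and ch != "\u0640"


def _is_tashkeel(ch: str) -> bool:
    # Tanwin, fatha, damma, kasra, shadda, sukun (U+064B..U+0652) and dagger alef.
    return "\u064B" <= ch <= "\u0652" or ch == "\u0670"


def tashkeel_ratio(text: str) -> float:
    """Diacritic marks per Arabic letter; 0.0 means fully undiacritized."""
    letters = sum(1 for ch in text if _is_arabic_letter(ch))
    marks = sum(1 for ch in text if _is_tashkeel(ch))
    return marks / letters if letters else 0.0


def prepare_for_tts(
    text: str,
    diacritize: Callable[[str], str],  # hypothetical hook: your tashkeel model
    min_ratio: float = 0.3,            # assumed threshold, tune on real traffic
) -> str:
    """Pass already-diacritized text through; send the rest to the diacritizer."""
    if tashkeel_ratio(text) >= min_ratio:
        return text
    return diacritize(text)
```

Keeping the check separate from the diacritizer means the same stage works whether the LLM emits plain text, partially diacritized names, or fully diacritized output after fine-tuning.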
Cause 6: code-switching isn’t tested
A real Gulf or Egyptian conversation looks like:
"يا أخ ودّي أعمل reset لـ password بتاع الـ account
لأنّي نسيته من فترة، تقدر تساعدني؟"
Three registers in one sentence: Arabic, technical English, and dialect spelled the way it is pronounced. Standard eval sets test monolingual Arabic. The result:
- The model parses “reset password” as two separate phrases
- It produces a formal “لإعادة ضبط كلمة المرور…” reply where the customer wrote casually
- Downstream intent classifiers fail because they were trained on clean Arabic intents
Recommendation: build a code-switched eval set — at least 200 samples that mirror the code-switch ratio in your actual production volume. Measure accuracy separately on this set from your MSA-only baseline.
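One way to operationalize this is to measure the share of Latin-script tokens per utterance and report accuracy per bucket. The sketch below is a simple script-based heuristic; the function names and the 0.1 threshold are assumptions, and it will not catch English words transliterated into Arabic script.

```python
import re
from collections import defaultdict

_LATIN = re.compile(r"[A-Za-z]")
_ARABIC = re.compile(r"[\u0600-\u06FF]")


def code_switch_ratio(utterance):
    """Share of whitespace tokens written purely in Latin script."""
    tokens = [t for t in utterance.split() if _LATIN.search(t) or _ARABIC.search(t)]
    if not tokens:
        return 0.0
    latin = sum(1 for t in tokens if _LATIN.search(t) and not _ARABIC.search(t))
    return latin / len(tokens)


def accuracy_by_bucket(results, threshold=0.1):
    """results: iterable of (utterance, is_correct) pairs.

    An utterance counts as code-switched when its Latin share is at least
    `threshold` but below 1.0, so pure-English input stays out of that bucket.
    """
    hits, totals = defaultdict(int), defaultdict(int)
    for utterance, is_correct in results:
        ratio = code_switch_ratio(utterance)
        bucket = "code_switched" if threshold <= ratio < 1.0 else "monolingual"
        totals[bucket] += 1
        hits[bucket] += int(is_correct)
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}
```

Running `code_switch_ratio` over a sample of production logs gives the target ratio for the eval set; `accuracy_by_bucket` then keeps the code-switched score from being averaged away by the MSA-only baseline.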
Cause 7: tokenizer inefficiency on Arabic morphology
Arabic is a highly derivational-agglutinative language: one word can carry a root + prefix + suffix + pronoun + plural marker. A typical multilingual-web tokenizer splits a single agglutinative Arabic word like “وسيكتبونها” (“and they will write it”) into many subwords (typically 5–9 across common multilingual tokenizers), where the equivalent English phrase takes fewer [6].
The practical impact is threefold:
- Inference cost roughly 1.4–2x higher for equivalent content vs. English (rough estimate; the exact multiplier varies by tokenizer) [6]
- Smaller effective context window for Arabic content vs. equivalent English content
- Weaker embedding quality per semantic unit
ALLaM, Jais, and Fanar have tried to mitigate this with Arabic-adapted tokenizers [7], but deploying on top of GPT-4 or Claude APIs without a preprocessing layer pushes the overhead to the buyer.
Recommendation: when comparing models, compute cost per 1,000 Arabic words, not per 1,000 tokens. The difference can invert the economic decision.
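The arithmetic behind that comparison is short enough to sketch. Below, `count_tokens` stands in for whichever tokenizer each vendor exposes (a hypothetical hook), and the prices and tokens-per-word figures in the example are made-up numbers chosen only to show how the ranking can flip.

```python
def tokens_per_word(texts, count_tokens):
    """Average tokens per whitespace-delimited word over a sample of texts.

    count_tokens is a caller-supplied function (hypothetical hook) that
    returns the token count of a string under a given model's tokenizer.
    """
    words = sum(len(t.split()) for t in texts)
    tokens = sum(count_tokens(t) for t in texts)
    return tokens / words


def cost_per_1k_words(price_per_1m_tokens, fertility):
    """Cost of 1,000 words given a per-million-token price and tokens per word."""
    return price_per_1m_tokens * fertility * 1000 / 1_000_000


if __name__ == "__main__":
    # Illustrative numbers only: model A is cheaper per token but splits
    # Arabic words into more tokens than model B.
    model_a = cost_per_1k_words(price_per_1m_tokens=0.50, fertility=3.0)
    model_b = cost_per_1k_words(price_per_1m_tokens=0.80, fertility=1.5)
    print(f"model A: {model_a:.4f} per 1,000 words")
    print(f"model B: {model_b:.4f} per 1,000 words")
```

Measure `tokens_per_word` on a sample of your own production text rather than on MSA news, since dialect and code-switched input can tokenize differently.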
Diagnosis summary
| Cause | Where it shows up | Fix |
|---|---|---|
| MSA vs. dialect | Customer chat | Dialect-weighted training data |
| Translated eval | Benchmark boards | Native Arabic custom eval |
| Machine-translated SFT | Response style | Native human-authored “chosen” |
| Cultural gap | Public incidents | RLHF alignment + filter |
| Tashkeel gap | TTS pipeline | Tashkeel layer or fine-tune |
| Code-switching | Intent + replies | Code-switched eval set |
| Tokenizer efficiency | Cost + context | Arabic-adapted tokenizer |
How we help at Annota8
We don’t build foundation models — we produce the data that makes their commercial performance hold up. Our QA tier is built on PhD-level linguists in Cairo who annotate SFT/DPO natively in Arabic, build code-switched evals, and run religious-cultural red-teaming. That’s the difference between a model that tops a leaderboard and a model that ships in a Saudi bank without a support ticket in the first week.
References
1. Arabic LLMs referenced. ALLaM (SDAIA/NCAI, 7B/13B/34B/70B): https://huggingface.co/ALLaM-AI/ALLaM-7B-Instruct-preview and https://arxiv.org/abs/2508.17378 ; Jais (Inception/MBZUAI/Cerebras, 13B): https://huggingface.co/inceptionai/jais-13b ; Fanar (QCRI, 9B/27B): https://arxiv.org/pdf/2501.13944 and https://huggingface.co/QCRI/Fanar-2-27B-Instruct ; Falcon Arabic (TII, May 2025): https://falconllm.tii.ae/falcon-arabic.html ; Karnak (AIC Egypt/ITIDA, Feb 2026, 30B–80B): https://huggingface.co/Applied-Innovation-Center/Karnak ; FanarGuard moderation component: https://qcai.qcri.org/uncategorized/fanar/
2. Koto et al., “ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic” (ACL 2024 Findings). 40 tasks, 14,575 MCQs constructed natively from regional school exams across North Africa, Levant, and Gulf. https://arxiv.org/abs/2402.12840 ; https://aclanthology.org/2024.findings-acl.334/ ; https://github.com/mbzuai-nlp/ArabicMMLU
3. Dialect / code-switching evaluation gap. DialectalArabicMMLU (arXiv:2510.27543) was created precisely because mainstream Arabic benchmarks under-evaluate dialect; UI-level evaluation of ALLaM 34B reports dialect fidelity (4.21/5) below MSA (4.74/5). See also AL-QASIDA and “A Review of Arabic Post-Training Datasets and Their Limitations”. https://arxiv.org/pdf/2510.27543 ; https://arxiv.org/abs/2508.17378 ; https://arxiv.org/html/2507.14688v2 ; https://arxiv.org/html/2412.04193
4. Machine-translated SFT recipes. “A Review of Arabic Post-Training Datasets and Their Limitations” (arXiv:2507.14688) and AceGPT methodology document that Arabic SFT pipelines commonly rely on translated English instruction sets (e.g., AceGPT translated Alpaca via GPT-4) with documented quality artifacts. https://arxiv.org/html/2507.14688v2
5. Arabic TTS / tashkeel requirement. Multiple Arabic TTS systems and pipelines explicitly require or strongly benefit from tashkeel input (Telnyx NaturalHD, Arabic F5-TTS-v2, Sadeed, CATT). https://telnyx.com/resources/arabic-tts-with-tashkeel-support ; https://huggingface.co/IbrahimSalah/Arabic-F5-TTS-v2 ; https://arxiv.org/html/2504.21635v1 ; https://arxiv.org/html/2407.03236v1
6. Arabic tokenizer fertility / cost overhead. “The Token Tax” (arXiv:2509.05486) and AraToken (arXiv:2512.18399) document that English-first tokenizers produce substantially more tokens per Arabic word than per English word; AraToken reports ~1.2 tokens/word as a normalized baseline. https://arxiv.org/html/2509.05486v1 ; https://arxiv.org/pdf/2512.18399 ; https://arxiv.org/pdf/2106.07540
7. Arabic-adapted tokenizers. ALLaM uses vocabulary expansion for Arabic morphology; Jais uses an Arabic-tuned BPE; Fanar 2.0 continually pretrains on Gemma with Arabic-centric vocabulary adaptation. https://huggingface.co/ALLaM-AI/ALLaM-7B-Instruct-preview ; https://huggingface.co/inceptionai/jais-13b ; https://qcai.qcri.org/uncategorized/fanar/