Why Arabic LLMs fail in commercial use — a diagnosis

Why this deserves a diagnosis, not a complaint

As Annota8’s founder, and as a buyer of training-data services in prior roles (we were customers of V7, Kognic, and Scale AI before building our own), I’ve watched the same pattern repeat with every new model wave. A lab announces an Arabic model that beats GPT-4 on ArabicMMLU by two points. A product team at a Gulf bank deploys it behind a customer-service bot. Three weeks later the support tickets arrive: replies sound stilted and overly formal, the bot doesn’t understand “I want to switch to my other account” in dialect, and every time a product name appears in English the model hallucinates a reference number.

This isn’t a single failure story. It’s a structural gap in how Arabic LLMs are built and evaluated in 2024–2026. Below are the seven root causes I see most often, each with a practical recommendation.

Cause 1: training data is MSA, production is dialect

The bulk of Arabic text available at industrial scale — Wikipedia, cleaned Common Crawl, classical books, news — is Modern Standard Arabic (MSA). Customer conversations, WhatsApp messages pasted into a chatbot, social comments, even call-center transcripts — those are dialect.

Practical proportions I’ve measured on customer samples in 2025–2026 (internal estimates; vendors rarely disclose this breakdown):

| Variety | Share in typical Arabic-LLM training | Share in typical MENA production volume |
|---|---|---|
| MSA | 80–95% | 10–20% |
| Gulf | 1–5% | 25–40% |
| Egyptian | 2–6% | 20–35% |
| Levantine | 1–4% | 10–20% |
| Maghrebi | <1% | 5–15% |

A model trained on that mix behaves like an Arabic-major who graduated with honors from an international university and was then dropped into a Kuwait support desk. The language in its head is not the language in the street.

Recommendation: ask any model vendor to disclose dialect-family distribution. If they can’t, expect a meaningful production drop on dialect-heavy traffic: dialect-specific evals such as DialectalArabicMMLU and the UI-level ALLaM 34B evaluation show double-digit gaps between MSA and dialect performance. [3]
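To see why the share mismatch matters, it helps to weight per-variety accuracy by production traffic instead of quoting a single headline score. The sketch below is illustrative only: the production shares are picked from inside the ranges in the table above, and the per-variety accuracy figures are invented placeholders, not measurements of any model. The function name is my own.

```python
# Assumed production traffic shares, chosen from within the ranges in the table above.
PRODUCTION_SHARE = {
    "MSA": 0.15,
    "Gulf": 0.35,
    "Egyptian": 0.30,
    "Levantine": 0.15,
    "Maghrebi": 0.05,
}

# Placeholder accuracies (hypothetical, NOT measured on any real model).
ACCURACY = {
    "MSA": 0.90,
    "Gulf": 0.72,
    "Egyptian": 0.75,
    "Levantine": 0.70,
    "Maghrebi": 0.60,
}


def traffic_weighted_accuracy(share, accuracy):
    """Average accuracy weighted by how much traffic each variety receives."""
    total = sum(share.values())
    if total <= 0:
        raise ValueError("traffic shares must sum to a positive number")
    return sum(share[v] * accuracy[v] for v in share) / total


if __name__ == "__main__":
    weighted = traffic_weighted_accuracy(PRODUCTION_SHARE, ACCURACY)
    print(f"MSA-only accuracy:         {ACCURACY['MSA']:.3f}")
    print(f"Traffic-weighted accuracy: {weighted:.3f}")
```

Swap in your own traffic shares and your own per-variety eval scores; the point is only that the number a customer experiences is the weighted one, not the MSA one.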

Cause 2: evals are translated from English MMLU, not built natively

Translated-MMLU style Arabic benchmarks — the predecessors to the native ArabicMMLU — were essentially direct translations of English MMLU. That creates three problems:

  1. Translation artifacts: an Arabic question about the US electoral system doesn’t measure understanding of an electoral system — it measures ability to back-translate.
  2. Cultural bias: topics around US law, US sports, US foods — not appropriate proxies for competence in regional Arabic.
  3. Possible leakage: the Arabic web contains translations of MMLU questions. A model trained on a broad scrape may have seen the items verbatim.

ArabicMMLU (the native version by Koto et al. 2024) and ArabicMMLU-Pro tried to correct this by authoring native Arabic items from regional school exams across North Africa, the Levant, and the Gulf [2], but commercial reliance on the translated aggregate score persists in lab pitches.

Recommendation: build your own eval set — 200–500 native Arabic items from your industry context, labeled by a PhD linguist. That’s a truer signal than any public board.

Cause 3: low-quality SFT — “chosen” responses machine-translated

The typical SFT (Supervised Fine-Tuning) chain starts with an English instruction set (Alpaca, ShareGPT, Anthropic HH-RLHF) and machine-translates it to Arabic with NLLB or similar. The translated outputs are then used as “chosen” responses in SFT or DPO. [4]

What happens in production:

A banking customer asking “what’s my balance” gets a reply that reads like 2018 Google Translate. That destroys trust before it destroys accuracy.

Recommendation: ask any vendor for SFT samples from your use case and have a human linguist score the quality of the “chosen” responses. A sample of 300–500 is enough to judge.
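A rough way to sanity-check that sample size: if the linguist marks each response as acceptable or not, the share of acceptable responses has a margin of error you can estimate. This is a minimal sketch using the normal approximation for a proportion, assuming the samples are drawn at random and using the worst case p = 0.5; the function name is hypothetical.

```python
import math


def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of a normal-approximation confidence interval for a proportion.

    n: number of scored samples
    p: assumed true share of acceptable responses (0.5 is the worst case)
    z: z-score for the confidence level (1.96 is roughly 95%)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    return z * math.sqrt(p * (1.0 - p) / n)


if __name__ == "__main__":
    for n in (300, 500):
        print(f"n={n}: +/- {margin_of_error(n) * 100:.1f} percentage points")
```

Under those assumptions the margin is in the range of four to six percentage points for 300 to 500 samples, which is tight enough to separate a clean dataset from a machine-translated one.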

Cause 4: religious and cultural sensitivity — an unclosed gap

A model trained on open web carries:

In Arabic production the cost of one error of this kind is far higher than that of a typical factual error: it can become a news item on a local platform, a support ticket that reaches a ministry, or a product suspension. Yet very few of the publicly released Arabic models in 2026 publish an explicit religious-cultural red-teaming report on their model card; the practice is still early-stage across the field rather than an industry standard. (Fanar publishes a moderation component, FanarGuard, but most other 2026-era model cards do not include a dedicated religious-cultural red-teaming section.) [1]

Recommendation: any end-user deployment in GCC or North Africa needs an explicit cultural-alignment layer — RLHF with local labelers, or at minimum a pre-output filter. Generic SFT is insufficient.

Cause 5: tashkeel (diacritization) in TTS and the pronunciation gap

Most Arabic TTS systems rely on diacritized text (fatha, damma, kasra) to produce correct pronunciation. [5] The text coming out of a commercial Arabic LLM has, in the great majority of cases, no diacritics. In a unified LLM→TTS pipeline:

User: What's the share price today?
LLM (undiacritized): سعر السهم سجل ارتفاعا ملحوظا اليوم.
TTS without tashkeel: degraded, ambiguous pronunciation
TTS with diacritized input: natural, accurate pronunciation

The gap isn’t in how TTS is trained — it’s that the LLM doesn’t emit pronounceable text. Anyone building an Arabic voice experience needs either a TTS model that doesn’t rely on diacritics (rare and inaccurate), a tashkeel layer between LLM and TTS, or fine-tuning the LLM to emit diacritized output for names and ambiguity cases.

Recommendation: if the use case is voice, treat tashkeel as a pipeline component, not a grammatical nicety.
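As a starting point for that pipeline component, you can at least measure how much tashkeel the LLM output carries before it reaches TTS. The sketch below is an assumption-laden check, not a diacritizer: it counts the common harakat (U+064B to U+0652, plus the superscript alef U+0670) against basic Arabic letters, and the function names are my own.

```python
import re

# Common Arabic diacritics: fathatan..sukun (U+064B-U+0652) and superscript alef (U+0670).
TASHKEEL = re.compile(r"[\u064B-\u0652\u0670]")
# Basic Arabic letters, skipping the tatweel (U+0640), which is not a letter.
ARABIC_LETTER = re.compile(r"[\u0621-\u063A\u0641-\u064A]")


def strip_tashkeel(text):
    """Return the text with the diacritics above removed."""
    return TASHKEEL.sub("", text)


def tashkeel_ratio(text):
    """Diacritics per Arabic letter; 0.0 means fully undiacritized (or no Arabic)."""
    letters = len(ARABIC_LETTER.findall(text))
    if letters == 0:
        return 0.0
    return len(TASHKEEL.findall(text)) / letters


if __name__ == "__main__":
    diacritized = "\u0633\u064E\u062C\u0651\u064E\u0644\u064E"  # the verb "sajjala", fully marked
    print(tashkeel_ratio(diacritized))
    print(tashkeel_ratio(strip_tashkeel(diacritized)))
```

A ratio near zero on LLM output is the signal to route the text through a tashkeel layer before synthesis; where to set the cut-off is something to tune on your own data.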

Cause 6: code-switching isn’t tested

A real Gulf or Egyptian conversation looks like:

"يا أخ ودّي أعمل reset لـ password بتاع الـ account
لأنّي نسيته من فترة، تقدر تساعدني؟"

Three layers in one sentence: Arabic, technical English, and dialect spelled the way it is spoken. Standard eval sets test monolingual Arabic. The result:

Recommendation: build a code-switched eval set — at least 200 samples that mirror the code-switch ratio in your actual production volume. Measure accuracy separately on this set from your MSA-only baseline.
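To match the code-switch ratio in your production volume you first have to measure it. A crude but serviceable proxy is the share of Latin-script tokens among all Arabic-script and Latin-script tokens in a message. The sketch below assumes that proxy; it will miss Arabizi (Arabic typed in Latin letters) and dialect mixing inside Arabic script, and the names are hypothetical.

```python
import re

# Runs of Arabic-script characters (letters, tatweel, diacritics) and runs of Latin letters.
ARABIC_TOKEN = re.compile(r"[\u0621-\u0655\u0670]+")
LATIN_TOKEN = re.compile(r"[A-Za-z]+")


def code_switch_ratio(text):
    """Share of Latin-script tokens among Arabic-script plus Latin-script tokens."""
    arabic = len(ARABIC_TOKEN.findall(text))
    latin = len(LATIN_TOKEN.findall(text))
    total = arabic + latin
    return latin / total if total else 0.0


def corpus_code_switch_ratio(messages):
    """Mean per-message ratio over a sample of production messages."""
    messages = list(messages)
    if not messages:
        return 0.0
    return sum(code_switch_ratio(m) for m in messages) / len(messages)


if __name__ == "__main__":
    sample = ["\u0623\u0631\u064A\u062F reset", "reset password", "\u0623\u0631\u064A\u062F"]
    print(corpus_code_switch_ratio(sample))
```

Run it over a few thousand anonymized production messages, then sample your eval set so its ratio lands close to the same figure.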

Cause 7: tokenizer inefficiency on Arabic morphology

Arabic is a highly derivational-agglutinative language: one word can carry a root + prefix + suffix + pronoun + plural marker. A typical multilingual-web tokenizer splits a single agglutinative Arabic word like “وسيكتبونها” into many subwords (typically 5–9 across common multilingual tokenizers), where the equivalent English phrase takes fewer. [6]

The practical impact is threefold:

  1. Inference cost roughly 1.4–2x higher for equivalent content vs. English (order-of-magnitude estimate; exact multiplier varies by tokenizer) [6]
  2. Smaller effective context window for Arabic content vs. equivalent English content
  3. Weaker embedding quality per semantic unit

ALLaM, Jais, and Fanar have tried to mitigate this with Arabic-adapted tokenizers [7], but deploying on top of GPT-4 or Claude APIs without a preprocessing layer pushes the overhead to the buyer.

Recommendation: when comparing models, compute cost per 1,000 Arabic words, not per 1,000 tokens. The difference can invert the economic decision.
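The arithmetic is simple: cost per 1,000 words equals price per 1,000 tokens times tokens per word. The sketch below shows how the ranking can flip. Both the prices and the tokens-per-word figures are made-up placeholders, not quotes from any vendor; measure tokens per word on your own Arabic text with each vendor's tokenizer.

```python
def cost_per_1k_words(price_per_1k_tokens, tokens_per_word):
    """Cost of processing 1,000 words, given a per-token price and tokenizer fertility."""
    if price_per_1k_tokens < 0 or tokens_per_word <= 0:
        raise ValueError("price must be non-negative and tokens_per_word positive")
    return price_per_1k_tokens * tokens_per_word


# Hypothetical models: A is cheaper per token, B has an Arabic-adapted tokenizer.
MODELS = {
    "model_a": {"price_per_1k_tokens": 0.010, "tokens_per_word": 3.0},
    "model_b": {"price_per_1k_tokens": 0.015, "tokens_per_word": 1.5},
}


def cheapest_per_word(models):
    """Name of the model with the lowest cost per 1,000 words."""
    return min(models, key=lambda name: cost_per_1k_words(**models[name]))


if __name__ == "__main__":
    for name, spec in MODELS.items():
        print(f"{name}: {cost_per_1k_words(**spec):.4f} per 1,000 words")
    print("cheapest per word:", cheapest_per_word(MODELS))
```

With these placeholder numbers the model that looks 50% more expensive per token comes out cheaper per word, which is the inversion the recommendation warns about.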

Diagnosis summary

| Cause | Where it shows up | Fix |
|---|---|---|
| MSA vs. dialect | Customer chat | Dialect-weighted training data |
| Translated eval | Benchmark boards | Native Arabic custom eval |
| Machine-translated SFT | Response style | Native human-authored “chosen” |
| Cultural gap | Public incidents | RLHF alignment + filter |
| Tashkeel gap | TTS pipeline | Tashkeel layer or fine-tune |
| Code-switching | Intent + replies | Code-switched eval set |
| Tokenizer efficiency | Cost + context | Arabic-adapted tokenizer |

How we help at Annota8

We don’t build foundation models — we produce the data that makes their commercial performance hold up. Our QA tier is built on PhD-level linguists in Cairo who annotate SFT/DPO natively in Arabic, build code-switched evals, and run religious-cultural red-teaming. That’s the difference between a model that tops a leaderboard and a model that ships in a Saudi bank without a support ticket in the first week.

References

  1. Arabic LLMs referenced. ALLaM (SDAIA/NCAI, 7B/13B/34B/70B): https://huggingface.co/ALLaM-AI/ALLaM-7B-Instruct-preview and https://arxiv.org/abs/2508.17378 ; Jais (Inception/MBZUAI/Cerebras, 13B): https://huggingface.co/inceptionai/jais-13b ; Fanar (QCRI, 9B/27B): https://arxiv.org/pdf/2501.13944 and https://huggingface.co/QCRI/Fanar-2-27B-Instruct ; Falcon Arabic (TII, May 2025): https://falconllm.tii.ae/falcon-arabic.html ; Karnak (AIC Egypt/ITIDA, Feb 2026, 30B–80B): https://huggingface.co/Applied-Innovation-Center/Karnak ; FanarGuard moderation component: https://qcai.qcri.org/uncategorized/fanar/

  2. Koto et al., “ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic” (ACL 2024 Findings). 40 tasks, 14,575 MCQs constructed natively from regional school exams across North Africa, Levant, and Gulf. https://arxiv.org/abs/2402.12840 ; https://aclanthology.org/2024.findings-acl.334/ ; https://github.com/mbzuai-nlp/ArabicMMLU

  3. Dialect / code-switching evaluation gap. DialectalArabicMMLU (arXiv:2510.27543) was created precisely because mainstream Arabic benchmarks under-evaluate dialect; UI-level evaluation of ALLaM 34B reports dialect fidelity (4.21/5) below MSA (4.74/5). See also AL-QASIDA and “A Review of Arabic Post-Training Datasets and Their Limitations”. https://arxiv.org/pdf/2510.27543 ; https://arxiv.org/abs/2508.17378 ; https://arxiv.org/html/2507.14688v2 ; https://arxiv.org/html/2412.04193

  4. Machine-translated SFT recipes. “A Review of Arabic Post-Training Datasets and Their Limitations” (arXiv:2507.14688) and the AceGPT methodology document that Arabic SFT pipelines commonly rely on translated English instruction sets (e.g., AceGPT translated Alpaca via GPT-4) with documented quality artifacts. https://arxiv.org/html/2507.14688v2

  5. Arabic TTS / tashkeel requirement. Multiple Arabic TTS systems and pipelines explicitly require or strongly benefit from tashkeel input (Telnyx NaturalHD, Arabic F5-TTS-v2, Sadeed, CATT). https://telnyx.com/resources/arabic-tts-with-tashkeel-support ; https://huggingface.co/IbrahimSalah/Arabic-F5-TTS-v2 ; https://arxiv.org/html/2504.21635v1 ; https://arxiv.org/html/2407.03236v1

  6. Arabic tokenizer fertility / cost overhead. “The Token Tax” (arXiv:2509.05486) and AraToken (arXiv:2512.18399) document that English-first tokenizers produce substantially more tokens per Arabic word than per English word; AraToken reports ~1.2 tokens/word as a normalized baseline. https://arxiv.org/html/2509.05486v1 ; https://arxiv.org/pdf/2512.18399 ; https://arxiv.org/pdf/2106.07540

  7. Arabic-adapted tokenizers. ALLaM uses vocabulary expansion for Arabic morphology; Jais uses an Arabic-tuned BPE; Fanar 2.0 continually pretrains on Gemma with Arabic-centric vocabulary adaptation. https://huggingface.co/ALLaM-AI/ALLaM-7B-Instruct-preview ; https://huggingface.co/inceptionai/jais-13b ; https://qcai.qcri.org/uncategorized/fanar/