What makes Arabic NLP annotation different from English
TL;DR for AI leaders
If your annotation vendor speaks of Arabic as “ar-SA” or “ar-EG locale support”, they likely don’t understand Arabic. Real Arabic annotation pipelines distinguish:
- MSA vs dialect at the token level
- Dialect family at the span level (Gulf, Levantine, Egyptian, Maghrebi — plus sub-dialects)
- Script mode (Arabic native, Latin transliteration, mixed code-switching)
- Diacritic state (with / without tashkeel — affects tokenisation)
- Morphological segmentation (ال + base + suffix breakdowns)
If those distinctions don’t appear in your annotation guidelines, your training data will mislead the model.
1. Diglossia — MSA and dialect coexist
Arabic speakers operate in two registers simultaneously.1 Modern Standard Arabic (MSA) is the formal written + broadcast register. Dialect is the spoken vernacular — Gulf in KSA / UAE, Levantine in Lebanon / Jordan / Syria / Palestine (and Israeli Arab populations + Southern Turkey), Egyptian in Egypt + Sudan, Maghrebi in Morocco / Algeria / Tunisia / Libya.2
The same speaker in the same message can shift register mid-sentence:
“أكدت وزارة الصحة أن… بس فعلاً، الناس مش حاسة بفرق” (MSA opener — formal Ministry of Health claim. Egyptian dialect closer — “but seriously, people don’t feel a difference”)
A model trained only on MSA will misunderstand the second half. A model trained only on dialect will misunderstand the first half. The annotation pipeline must label register transitions explicitly.
Operational implication: annotation guidelines must specify whether labels are register-aware. For sentiment, the register often signals tone (formal Arabic = institutional voice; dialect = personal voice).
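One way to make register transitions explicit is to store them as character-offset spans over the raw message. The sketch below is illustrative only: the `RegisterSpan` record, the `MSA` / `DIALECT` / `EGY` label names, and both helper functions are assumptions made for this example, not an Annota8 schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class RegisterSpan:
    start: int                    # code-point offset, inclusive
    end: int                      # code-point offset, exclusive
    register: str                 # "MSA" or "DIALECT" (hypothetical label set)
    family: Optional[str] = None  # e.g. "EGY"; None when register is MSA


def validate_spans(text: str, spans: List[RegisterSpan]) -> None:
    """Raise ValueError unless spans are in range, ordered and non-overlapping."""
    prev_end = 0
    for span in spans:
        if not (0 <= span.start < span.end <= len(text)):
            raise ValueError(f"span out of range: {span}")
        if span.start < prev_end:
            raise ValueError(f"span overlaps or precedes the previous one: {span}")
        prev_end = span.end


def register_transitions(spans: List[RegisterSpan]) -> List[Tuple[int, str, str]]:
    """Return (offset, from_register, to_register) wherever the register changes."""
    return [
        (b.start, a.register, b.register)
        for a, b in zip(spans, spans[1:])
        if a.register != b.register
    ]


text = "أكدت وزارة الصحة أن… بس فعلاً، الناس مش حاسة بفرق"
split = text.index("بس")  # first dialect token in the example sentence
spans = [
    RegisterSpan(0, split, "MSA"),
    RegisterSpan(split, len(text), "DIALECT", family="EGY"),
]
validate_spans(text, spans)
print(register_transitions(spans))
```

Storing offsets against the raw string, rather than against a tokenised or normalised copy, keeps the register labels usable whichever tokeniser a downstream task picks.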
2. Four dialect families — not mutually intelligible
Treating “Arabic” as one language is like treating “Romance” as one language.3 The four major dialect families share roots but diverge meaningfully:4
| Family | Speakers | Example phrase (“I want this”) |
|---|---|---|
| Gulf (Khaleeji)5 | ~10–40M (varies by taxonomy) | أبا هذا / أبي هذا |
| Levantine (Shami)6 | ~60M | بدي هاد |
| Egyptian (Masri)7 | ~85M L1 / ~120M total | عايز ده |
| Maghrebi (Darija)8 | ~90M | بغيت هاد |
A Saudi speaker often cannot follow rapid Moroccan Darija. An Egyptian doesn’t naturally use Gulf vocabulary.9 Models trained on one dialect family generalise poorly to others.
Operational implication: dialect annotation must be tagged at minimum at the family level, often at the sub-dialect level (Cairene vs Upper Egyptian, Najdi vs Hijazi).10 Cross-dialect eval sets are the only way to verify generalisation.
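A cross-dialect check can be as simple as reporting the metric per family instead of one pooled number. The following is a minimal sketch; the function name, the family codes, and the toy rows are invented for illustration and are not real evaluation results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_family(
    examples: Iterable[Tuple[str, str, str]]
) -> Dict[str, float]:
    """examples yields (dialect_family, gold_label, predicted_label) triples."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for family, gold, pred in examples:
        total[family] += 1
        correct[family] += int(gold == pred)
    return {family: correct[family] / total[family] for family in total}


# Made-up rows, only to show the shape of the input.
toy_eval = [
    ("EGY", "positive", "positive"),
    ("EGY", "negative", "negative"),
    ("MAGHREBI", "positive", "negative"),
    ("MAGHREBI", "negative", "negative"),
]
print(accuracy_by_family(toy_eval))
```

A pooled accuracy over these rows would hide that one family is at chance level, which is the failure mode the family-level tag is meant to expose.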
3. RTL rendering breaks English-default tools
Right-to-left script changes everything in the UI layer:
- Text selection follows logical (storage) order, which no longer matches left-to-right visual order
- Cursor placement becomes ambiguous wherever logical and visual order diverge, typically at RTL / LTR boundaries
- Mixed RTL + LTR text (Arabic alongside digits or Latin-script brand names) requires the Unicode Bidirectional Algorithm
- Word boundary detection differs from English whitespace tokenisation
- Copy-paste between RTL-native and LTR-default tools often corrupts the data
Annotation platforms designed primarily for English handle RTL as a locale flag rather than a design centre. Tokens get misaligned. Bounding boxes drift on right-aligned text. Span boundaries shift between display and storage.
Operational implication: in Annota8’s internal operational experience, annotation tools that aren’t RTL-native introduce silent annotation drift on Arabic data. Spot-check raw exports byte-by-byte rather than through the rendered UI.
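A spot check of raw exports might look like the sketch below. It assumes the export stores spans as code-point offsets plus the annotated surface string; that layout and all three function names are assumptions for illustration. The set of directional control characters is taken from the Unicode Bidirectional Algorithm.

```python
from typing import List, Tuple

# Invisible directional controls that copy-paste can inject into stored text.
BIDI_CONTROLS = {
    "\u200e", "\u200f", "\u061c",                      # LRM, RLM, ALM
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
}


def find_bidi_controls(text: str) -> List[Tuple[int, str]]:
    """Return (offset, code point) for every directional control in text."""
    return [
        (i, f"U+{ord(ch):04X}") for i, ch in enumerate(text) if ch in BIDI_CONTROLS
    ]


def span_matches(text: str, start: int, end: int, surface: str) -> bool:
    """True if the stored offsets reproduce the surface string the annotator saw."""
    return text[start:end] == surface


def utf8_offsets(text: str, start: int, end: int) -> Tuple[int, int]:
    """Convert code-point offsets to UTF-8 byte offsets for byte-level tools."""
    return len(text[:start].encode("utf-8")), len(text[:end].encode("utf-8"))


raw = "iPhone 15 Pro Max بسعر 4,799 ريال"
start = raw.index("بسعر")
end = start + len("بسعر")
print(span_matches(raw, start, end, "بسعر"))
print(utf8_offsets(raw, start, end))
print(find_bidi_controls(raw))
```

Code-point and byte offsets diverge as soon as Arabic letters appear, because each takes two bytes in UTF-8, so a tool that mixes the two conventions shifts every span after the first Arabic word.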
4. Tashkeel — diacritics collide with tokenisation
Arabic text appears in two forms:
- Undiacritised (default for modern web + business content): الكتاب
- Diacritised (Quranic, classical, educational): الكِتَابُ
Diacritics carry phonological + grammatical information.11 The same consonant skeleton (KTB → ك ت ب) yields several words depending on diacritics: the undiacritised string كتب can be read as كَتَبَ (he wrote), كُتِبَ (it was written), or كُتُب (books).
Most modern Arabic text is undiacritised.12 The reader fills in diacritics by context. Models must do the same.
Operational implication: tokenisation pipelines must handle both diacritised + undiacritised inputs. Annotation tools must preserve diacritics where present + not silently strip them. Eval sets must include both modes.
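A diacritic policy of this kind can be sketched with the standard library alone. The character range below (U+064B to U+0652 plus the superscript alef U+0670) is an assumed stripping policy chosen for illustration; a real guideline may also need to cover U+0653 to U+065F or Quranic annotation marks. The function and field names are hypothetical.

```python
import re
from typing import Dict, Union

# Assumed policy: the eight common harakat/tanween/shadda/sukun marks
# (U+064B-U+0652) plus superscript alef (U+0670).
TASHKEEL = re.compile("[\u064B-\u0652\u0670]")


def has_tashkeel(text: str) -> bool:
    """True if the text contains at least one diacritic covered by the policy."""
    return TASHKEEL.search(text) is not None


def strip_tashkeel(text: str) -> str:
    """Return a copy of text with the covered diacritics removed."""
    return TASHKEEL.sub("", text)


def to_record(text: str) -> Dict[str, Union[str, bool]]:
    """Keep the raw string untouched and store the normalised view beside it."""
    return {
        "raw": text,
        "normalised": strip_tashkeel(text),
        "diacritised": has_tashkeel(text),
    }


print(to_record("الكِتَابُ"))
print(to_record("الكتاب"))
```

Keeping `raw` alongside `normalised` means stripping is an explicit, reversible choice made at training time rather than something the annotation tool does silently.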
5. Code-switching with Latin script
In MENA business + tech contexts, code-switching is widespread and often the default communication mode.13 Examples:
- “حجزت لكم meeting بكرة الساعة 3 PM في الـ conference room”
- “iPhone 15 Pro Max بسعر 4,799 ريال”
- “ال CEO قال إن الخطة الـ Q3 لازم نراجعها”
Annotation guidelines must specify how to handle the Latin tokens — are they preserved as English entities, transliterated, lemmatised separately? The choice affects downstream model behaviour.
Operational implication: code-switched data needs token-level language identification. A pipeline that treats the entire string as “Arabic” or “English” mishandles half the tokens.
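As a rough first pass, per-token tagging can key off the Unicode script of each letter. The sketch below is a script detector, not a full language identifier: Arabic written in Latin transliteration would still be tagged `latin` and needs a separate model. The tag names and the `token_script` function are assumptions for illustration.

```python
import unicodedata
from typing import List, Tuple


def token_script(token: str) -> str:
    """Return 'ar', 'latin', 'mixed', 'other', or 'neutral' for one token."""
    scripts = set()
    for ch in token:
        if not ch.isalpha():
            continue  # digits, punctuation and combining marks carry no script signal
        name = unicodedata.name(ch, "")
        if name.startswith("ARABIC"):
            scripts.add("ar")
        elif name.startswith("LATIN"):
            scripts.add("latin")
        else:
            scripts.add("other")
    if not scripts:
        return "neutral"  # e.g. "3" or "4,799"
    if len(scripts) == 1:
        return next(iter(scripts))
    return "mixed"


def tag_tokens(text: str) -> List[Tuple[str, str]]:
    """Whitespace-split the text and tag each token with its script."""
    return [(tok, token_script(tok)) for tok in text.split()]


for token, tag in tag_tokens("حجزت لكم meeting بكرة الساعة 3 PM في الـ conference room"):
    print(tag, token)
```

Tokens tagged `neutral` (digits, times, prices) are left for the guideline to assign, since they belong to neither language on script evidence alone.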
6. Morphological complexity — many surface forms per lemma
Arabic is highly inflected. A single root can produce 50+ common surface forms, hundreds of inflected forms in practice, and close to 2,000 theoretical verbal forms, via prefixes, suffixes, conjugation, and derived nouns.14 Examples from root كتب (k-t-b → write):15
- كتب — he wrote
- كتبت — she wrote / I wrote / you (m/f) wrote
- يكتب — he writes
- مكتوب — written
- مكتب — desk / office
- كاتب — writer
- مكتبة — library
- كتاب — book
English annotation tools that tokenise on whitespace handle this poorly, because one whitespace-delimited Arabic word can bundle a conjunction, a preposition, the article, and the base. Stemming + lemmatisation therefore matter. Some downstream tasks (search, classification) work better with surface forms; others (sentiment, NER) work better with normalised lemmas.
Operational implication: the annotation guideline must specify whether labels apply to surface forms or normalised lemmas. Inconsistency here silently degrades training data quality.
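The gap between a whitespace token and its base can be shown with a toy proclitic splitter. This is a deliberately naive sketch under stated assumptions: the three clitic slots, the tiny lexicon, and the `segment` function are all hypothetical, and the approach ignores suffixes and orthographic changes (for example, ل + ال is written لل, which this code does not handle). A production pipeline would use a proper morphological analyser.

```python
from itertools import product
from typing import FrozenSet, List

# Hypothetical clitic slots, in attachment order; "" means the slot is empty.
CONJUNCTIONS = ("", "و", "ف")
PREPOSITIONS = ("", "ب", "ك", "ل")
ARTICLES = ("", "ال")

# Toy lexicon of known bases, taken from the k-t-b examples above.
TOY_LEXICON: FrozenSet[str] = frozenset({"كتاب", "مكتب", "مكتبة"})


def segment(word: str, lexicon: FrozenSet[str] = TOY_LEXICON) -> List[str]:
    """Split leading clitics only when the remainder is a known base.

    Returns [word] unchanged when no lexicon-backed split exists, so
    unknown words are never over-segmented.
    """
    for conj, prep, art in product(CONJUNCTIONS, PREPOSITIONS, ARTICLES):
        prefix = conj + prep + art
        base = word[len(prefix):]
        if word.startswith(prefix) and base in lexicon:
            return [part for part in (conj, prep, art) if part] + [base]
    return [word]


for word in ["وبالكتاب", "الكتاب", "مكتب", "ولد"]:
    print(" + ".join(segment(word)))
```

Whether a label attaches to the whole whitespace token or only to the base segment is exactly the choice the guideline has to state up front.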
What good Arabic annotation looks like
Operationally, a serious Arabic annotation pipeline includes:
- Native RTL annotation UI — not English-default with locale flag
- Register tagging (MSA vs dialect at minimum)
- Dialect-family tagging (and sub-dialect where relevant)
- Code-switching tokenisation with per-token language ID
- Diacritic preservation or explicit normalisation policy
- Morphological segmentation where labels apply at the lemma level
- PhD-linguist QA on a sample for calibration + edge cases
- Cross-dialect eval sets stratified by family + topic
If your current vendor offers fewer than five of these, your Arabic training data is leaking quality.
What Annota8 does
Annota8 was built around these realities. The text annotation page details the full Arabic NLP capability stack: Cairo PhD-linguist QA leadership, four-dialect-family coverage, a native RTL platform, diacritic-aware tokenisation, token-level language ID for code-switching, and a morphological segmentation toolkit.
See also:
- Arabic NLP glossary
- Egyptian dialect glossary
- Diglossia glossary
- Solutions: foundation-model labs
- Resources: ALLaM
References
Footnotes
1. Ferguson, C. A. (1959). “Diglossia.” Word 15:325–340. Arabic is the canonical defining case (H = Classical/MSA, L = colloquial). See also Bassiouney, Arabic Sociolinguistics (2nd ed., Edinburgh University Press): https://edinburghuniversitypress.com/media/resources/9781474457361_Arabic_Sociolinguistics_-_Chapter_1.pdf and https://en.wikipedia.org/wiki/Diglossia
2. Four-family Gulf/Levantine/Egyptian/Maghrebi grouping is a common pedagogical and industry convention; some linguistic sources distinguish 5+ groups (adding Mesopotamian and/or Sudanese) and treat “Peninsular” rather than “Gulf” as the umbrella. See https://en.wikipedia.org/wiki/Varieties_of_Arabic and https://en.wikipedia.org/wiki/Maghrebi_Arabic. Levantine geography per https://en.wikipedia.org/wiki/Levantine_Arabic (includes Lebanon, Jordan, Syria, Palestine, Israeli Arab populations, Southern Turkey).
3. Romance-language analogy is a standard sociolinguistic framing: “linguistic distance… is at least as large as between Germanic languages or Romance languages.” https://en.wikipedia.org/wiki/Varieties_of_Arabic
4. Cross-dialect divergence summary: https://en.wikipedia.org/wiki/Varieties_of_Arabic
5. Gulf Arabic speaker counts vary substantially by taxonomy. Narrow Ethnologue Gulf Arabic (AFB) cites ~11M; broader “Peninsular Arabic” inclusive of Najdi and Hijazi may reach 35–40M. https://en.wikipedia.org/wiki/Gulf_Arabic
6. Levantine speaker count: “over 60 million speakers” (58M L1 + 2.9M L2, 2008–2024 estimates). Endonym “Shami” is the universally used local name. https://en.wikipedia.org/wiki/Levantine_Arabic
7. Egyptian Arabic: 84M L1 + 35M L2 ≈ 119M total speakers (2024 Wikipedia/Ethnologue estimates). Endonym “Masri” (مصرى). https://en.wikipedia.org/wiki/Egyptian_Arabic
8. Maghrebi (Darija) native speakers ≈ 88M (2020–2022). Darija is the standard endonym across Morocco / Algeria / Tunisia. https://en.wikipedia.org/wiki/Maghrebi_Arabic
9. “Extremely difficult for Moroccans and Iraqis, each speaking their own variety, to understand each other.” https://en.wikipedia.org/wiki/Varieties_of_Arabic
10. Cairene = Lower Egyptian prestige variety; Sa’idi = Upper Egyptian. Najdi and Hijazi are recognized prominent sub-dialect groups within Saudi Arabia / Peninsular Arabic. https://en.wikipedia.org/wiki/Cairene_Arabic, https://en.wikipedia.org/wiki/Sa%CA%BDidi_Arabic, https://en.wikipedia.org/wiki/Najdi_Arabic, https://en.wikipedia.org/wiki/Hejazi_Arabic
11. K-T-B is the canonical Arabic linguistics teaching example for root-and-pattern morphology. See https://en.wikipedia.org/wiki/Arabic_verbs and https://qalamquest.com/grammar_theory/understanding-root-patterns-in-arabic-morphology/
12. “Arabic is typically written without diacritics” and the absence of tashkeel poses substantial NLP obstacles. “Transformer-based automatic Arabic text diacritization”: https://sei.ardascience.com/index.php/journal/article/download/305/211/1498 and https://arxiv.org/pdf/2401.04848
13. Code-switching is widespread and well-documented in MENA business/professional contexts. “A Survey of Code-switched Arabic NLP”: https://arxiv.org/html/2501.13419v1; Egyptian talk-show code-switching study: https://www.researchgate.net/publication/349256579; Tunisian business code-switching: https://www.academia.edu/7719660
14. Maximum theoretical verbal forms per Arabic root can reach ~1,989 (13 person/number/gender × 9 tense/mood × 17 form/voice) per https://en.wikipedia.org/wiki/Arabic_verbs. Habash, “Challenge of Arabic for NLP/MT”: https://aclanthology.org/2006.bcs-1.5.pdf. Arabic is morphologically rich, with hundreds of inflected forms per root in practice.
15. K-T-B derivation glosses match canonical Arabic morphology pedagogy. https://qalamquest.com/grammar_theory/understanding-root-patterns-in-arabic-morphology/