What makes Arabic NLP annotation different from English

TL;DR for AI leaders

If your annotation vendor speaks of Arabic as “ar-SA” or “ar-EG locale support”, they likely don’t understand Arabic. Real Arabic annotation pipelines distinguish:

  1. MSA vs dialect at the token level
  2. Dialect family at the span level (Gulf, Levantine, Egyptian, Maghrebi — plus sub-dialects)
  3. Script mode (Arabic native, Latin transliteration, mixed code-switching)
  4. Diacritic state (with / without tashkeel — affects tokenisation)
  5. Morphological segmentation (ال + base + suffix breakdowns)

If those distinctions don’t appear in your annotation guidelines, your training data will mislead the model.

1. Diglossia — MSA and dialect coexist

Arabic speakers operate in two registers simultaneously.[1] Modern Standard Arabic (MSA) is the formal written + broadcast register. Dialect is the spoken vernacular: Gulf in KSA / UAE, Levantine in Lebanon / Jordan / Syria / Palestine (and Israeli Arab populations + Southern Turkey), Egyptian in Egypt + Sudan, Maghrebi in Morocco / Algeria / Tunisia / Libya.[2]

The same speaker in the same message can shift register mid-sentence:

“أكدت وزارة الصحة أن… بس فعلاً، الناس مش حاسة بفرق” (The MSA opener is a formal Ministry of Health claim: “The Ministry of Health confirmed that…”. The Egyptian dialect closer means “but seriously, people don’t feel a difference”.)

A model trained only on MSA will misunderstand the second half. A model trained only on dialect will misunderstand the first half. The annotation pipeline must label register transitions explicitly.

Operational implication: annotation guidelines must specify whether labels are register-aware. For sentiment, the register often signals tone (formal Arabic = institutional voice; dialect = personal voice).
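
One way to make register transitions explicit is to store them as labelled character spans over the raw string. The sketch below is a minimal illustration, not Annota8’s schema: the `RegisterSpan` class, the label names `MSA` and `EGY`, and the `check_spans` helper are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisterSpan:
    start: int      # inclusive code-point offset into the stored string
    end: int        # exclusive code-point offset
    register: str   # hypothetical label set, e.g. "MSA" or "EGY"


def check_spans(text: str, spans: list[RegisterSpan]) -> list[str]:
    """Return a list of problems: gaps, overlaps, or bad offsets."""
    problems = []
    cursor = 0
    for span in sorted(spans, key=lambda s: s.start):
        if span.start < cursor:
            problems.append(f"overlap at {span.start}")
        elif text[cursor:span.start].strip():
            problems.append(f"unlabelled text at {cursor}-{span.start}")
        if span.end > len(text) or span.end <= span.start:
            problems.append(f"bad range {span.start}-{span.end}")
        cursor = max(cursor, span.end)
    if text[cursor:].strip():
        problems.append(f"unlabelled text at {cursor}-{len(text)}")
    return problems


text = "أكدت وزارة الصحة أن… بس فعلاً، الناس مش حاسة بفرق"
switch = text.index("بس")  # offset where the dialect closer begins
spans = [
    RegisterSpan(0, switch, "MSA"),
    RegisterSpan(switch, len(text), "EGY"),
]
print(check_spans(text, spans))  # an empty list means full, non-overlapping coverage
```

Storing the transition as offsets rather than as a single sentence-level label keeps both halves of the message usable for training, and the coverage check catches sentences where an annotator labelled only one register.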

2. Four dialect families — not mutually intelligible

Treating “Arabic” as one language is like treating “Romance” as one language.[3] The four major dialect families share roots but diverge meaningfully:[4]

| Family | Speakers | Example phrase (“I want this”) |
|---|---|---|
| Gulf (Khaleeji) [5] | ~10–40M (varies by taxonomy) | أبا هذا / أبي هذا |
| Levantine (Shami) [6] | ~60M | بدي هاد |
| Egyptian (Masri) [7] | ~85M L1 / ~120M total | عايز ده |
| Maghrebi (Darija) [8] | ~90M | بغيت هاد |

A Saudi speaker often cannot follow rapid Moroccan Darija. An Egyptian doesn’t naturally use Gulf vocabulary.[9] Models trained on one dialect family generalise poorly to others.

Operational implication: dialect must be tagged at the family level at minimum, and often at the sub-dialect level (Cairene vs Upper Egyptian, Najdi vs Hijazi).[10] Cross-dialect eval sets are the only way to verify generalisation.
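
A cross-dialect eval report can be as simple as accuracy broken out per family. The sketch below assumes each eval record carries a family tag; the `accuracy_by_family` function, the family names, and the toy predictions are made up purely to show the shape of the report.

```python
from collections import defaultdict


def accuracy_by_family(records):
    """records: iterable of (dialect_family, gold_label, predicted_label)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for family, gold, pred in records:
        total[family] += 1
        correct[family] += int(gold == pred)
    return {family: correct[family] / total[family] for family in total}


# Toy, invented predictions; not real model results.
records = [
    ("egyptian", "neg", "neg"),
    ("egyptian", "pos", "pos"),
    ("gulf", "pos", "neg"),
    ("gulf", "neg", "neg"),
    ("maghrebi", "pos", "neg"),
    ("levantine", "neg", "neg"),
]
for family, acc in sorted(accuracy_by_family(records).items()):
    print(f"{family:10s} {acc:.2f}")
```

A single pooled accuracy figure would hide exactly the gap this section describes, which is why the breakdown depends on the family tag being present in the annotation.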

3. RTL rendering breaks English-default tools

Right-to-left script changes everything in the UI layer.

Annotation platforms designed primarily for English handle RTL as a locale flag rather than a design centre. Tokens get misaligned. Bounding boxes drift on right-aligned text. Span boundaries shift between display and storage.

Operational implication: in Annota8’s internal operational experience, annotation tools that aren’t RTL-native produce measurable, silent annotation drift on Arabic data. Spot-check raw exports byte-by-byte, not via the rendered UI.
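
A raw-export spot check might look like the sketch below. It assumes span offsets are stored as code-point indices (exports that use UTF-8 byte or UTF-16 offsets would need the same check on the encoded form), and the helper names `find_bidi_controls` and `span_matches` are hypothetical.

```python
import unicodedata

# Bidi control characters that are invisible in most UIs:
# LRM, RLM, ALM, the embedding/override set, and the isolate set.
BIDI_CONTROLS = {
    "\u200e", "\u200f", "\u061c",
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
    "\u2066", "\u2067", "\u2068", "\u2069",
}


def find_bidi_controls(text: str) -> list[tuple[int, str]]:
    """Return (offset, Unicode name) for each invisible bidi control."""
    return [
        (i, unicodedata.name(ch, f"U+{ord(ch):04X}"))
        for i, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]


def span_matches(text: str, start: int, end: int, expected: str) -> bool:
    """Check that a stored span still selects the text the annotator saw."""
    return text[start:end] == expected


# Illustrative export: a right-to-left mark was prepended somewhere in the tool.
exported = "\u200f" + "وزارة الصحة"
print(find_bidi_controls(exported))
print(span_matches(exported, 6, 11, "الصحة"))  # offsets computed without the mark
print(span_matches(exported, 7, 12, "الصحة"))  # offsets shifted by one
```

The rendered UI shows the same two words in both cases, so only a comparison against the stored string reveals that the span has shifted by one character.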

4. Tashkeel — diacritics collide with tokenisation

Arabic text appears in two forms: diacritised (with tashkeel) and undiacritised (without).

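Diacritics carry phonological + grammatical information.[11] The same undiacritised string كتب (consonant skeleton K-T-B, ك ت ب) can be read as several words depending on the diacritics: كَتَبَ (he wrote), كُتِبَ (it was written), or كُتُب (books). Related words such as كاتِب (writer) share the root but add a letter, so they stay distinct even without diacritics.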

Most modern Arabic text is undiacritised.[12] The reader fills in diacritics by context. Models must do the same.

Operational implication: tokenisation pipelines must handle both diacritised + undiacritised inputs. Annotation tools must preserve diacritics where present + not silently strip them. Eval sets must include both modes.
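
The sketch below shows one way to keep diacritics intact while still producing an undiacritised view for evaluation. It assumes “tashkeel” means the code points U+064B to U+0652 plus U+0670; other guidelines may draw the line elsewhere, and the function names are hypothetical.

```python
import re

# Assumption: tashkeel = U+064B..U+0652 (fathatan through sukun)
# plus U+0670 (superscript alef). Adjust to match your own guideline.
TASHKEEL = re.compile("[\u064b-\u0652\u0670]")


def has_tashkeel(text: str) -> bool:
    """True if the string contains at least one diacritic mark."""
    return TASHKEEL.search(text) is not None


def strip_tashkeel(text: str) -> str:
    """Explicit, opt-in normalisation; the raw string is left untouched."""
    return TASHKEEL.sub("", text)


def eval_views(text: str) -> dict[str, str]:
    """Keep the raw string and add the undiacritised view alongside it."""
    return {"raw": text, "undiacritised": strip_tashkeel(text)}


books = "\u0643\u064f\u062a\u064f\u0628"  # the diacritised word for "books"
print(has_tashkeel(books))
print(eval_views(books)["undiacritised"])
```

Keeping both views in the record means the normalisation is a documented choice rather than something a tool did silently during import.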

5. Code-switching with Latin script

In MENA business + tech contexts, code-switching is the default communication mode.[13]

Annotation guidelines must specify how to handle the Latin tokens — are they preserved as English entities, transliterated, lemmatised separately? The choice affects downstream model behaviour.

Operational implication: code-switched data needs token-level language identification. A pipeline that treats the entire string as “Arabic” or “English” mishandles half the tokens.
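
As a first pass, tokens can be tagged by script using only the standard library. This is a sketch under a strong assumption: script is a proxy for language, so Arabic written in Latin transliteration would be tagged `latin` here and still needs a real language-ID model. The function names and the sample sentence are invented for illustration.

```python
import unicodedata


def script_of(token: str) -> str:
    """Classify a token by the scripts of its letters (a proxy, not true LID)."""
    scripts = set()
    for ch in token:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and combining marks
        name = unicodedata.name(ch, "")
        if name.startswith("ARABIC"):
            scripts.add("arabic")
        elif name.startswith("LATIN"):
            scripts.add("latin")
        else:
            scripts.add("other")
    if not scripts:
        return "none"
    return scripts.pop() if len(scripts) == 1 else "mixed"


def tag_tokens(text: str) -> list[tuple[str, str]]:
    """Whitespace-split the text and tag each token with its script."""
    return [(tok, script_of(tok)) for tok in text.split()]


sample = "عندنا meeting بكرة"  # made-up example: "we have a meeting tomorrow"
print(tag_tokens(sample))
```

Tokens tagged `mixed` (for example an Arabic article attached to an English noun) are the ones most worth routing to a human annotator, since neither a whole-string “Arabic” nor a whole-string “English” label fits them.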

6. Morphological complexity — many surface forms per lemma

Arabic is highly inflected. A single root can produce 50+ surface forms in everyday text, and hundreds more in theory, via prefixes, suffixes, conjugation, and derived nouns.[14] Examples from root كتب (k-t-b, “write”) include كتاب (book), كاتب (writer), مكتوب (written), and مكتبة (library).[15]

English annotation tools that tokenise on whitespace handle this poorly. Stemming + lemmatisation matter. Some downstream tasks (search, classification) work better with surface forms; others (sentiment, NER) work better with normalised lemmas.

Operational implication: the annotation guideline must specify whether labels apply to surface forms or normalised lemmas. Inconsistency here silently degrades training data quality.
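
The difference between the two label levels can be seen with a toy example. The lookup table below is hand-written and hypothetical; a real pipeline would use a morphological analyser rather than a fixed dictionary, and `count_labels` is an invented helper.

```python
from collections import Counter

# Hypothetical hand-written lookup for illustration only.
TOY_LEMMAS = {
    "كتب": "كتب",      # he wrote
    "يكتب": "كتب",     # he writes
    "كتبت": "كتب",     # she wrote / I wrote
    "كاتب": "كاتب",    # writer
    "الكاتب": "كاتب",  # the writer
}


def count_labels(tokens, level):
    """Count tokens at the level the guideline declares: 'surface' or 'lemma'."""
    if level not in ("surface", "lemma"):
        raise ValueError(f"unknown label level: {level!r}")
    if level == "surface":
        return Counter(tokens)
    return Counter(TOY_LEMMAS.get(tok, tok) for tok in tokens)


tokens = ["كتب", "يكتب", "كتبت", "الكاتب", "كاتب"]
surface_counts = count_labels(tokens, "surface")
lemma_counts = count_labels(tokens, "lemma")
print(len(surface_counts), len(lemma_counts))
```

Five surface forms collapse to two lemmas here, so a batch where some annotators labelled surface forms and others labelled lemmas would mix two incompatible label spaces. Requiring the level as an explicit argument makes that choice visible.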

What good Arabic annotation looks like

Operationally, a serious Arabic annotation pipeline includes:

  1. Native RTL annotation UI — not English-default with locale flag
  2. Register tagging (MSA vs dialect at minimum)
  3. Dialect-family tagging (and sub-dialect where relevant)
  4. Code-switching tokenisation with per-token language ID
  5. Diacritic preservation or explicit normalisation policy
  6. Morphological segmentation where labels apply at the lemma level
  7. PhD-linguist QA on a sample for calibration + edge cases
  8. Cross-dialect eval sets stratified by family + topic

If your current vendor offers fewer than 5 of these, your Arabic training data is leaking quality.

What Annota8 does

Annota8 was built around these realities. The text annotation page details the full Arabic NLP capability stack: Cairo PhD-linguist QA leadership, coverage of all four dialect families, a native RTL platform, diacritic-aware tokenisation, token-level language ID for code-switching, and a morphological segmentation toolkit.


References

  1. Ferguson, C. A. (1959). “Diglossia.” Word 15:325–340. Arabic is the canonical defining case (H = Classical/MSA, L = colloquial). See also Bassiouney, Arabic Sociolinguistics (2nd ed., Edinburgh University Press): https://edinburghuniversitypress.com/media/resources/9781474457361_Arabic_Sociolinguistics_-_Chapter_1.pdf and https://en.wikipedia.org/wiki/Diglossia

  2. Four-family Gulf/Levantine/Egyptian/Maghrebi grouping is a common pedagogical and industry convention; some linguistic sources distinguish 5+ groups (adding Mesopotamian and/or Sudanese) and treat “Peninsular” rather than “Gulf” as the umbrella. See https://en.wikipedia.org/wiki/Varieties_of_Arabic and https://en.wikipedia.org/wiki/Maghrebi_Arabic. Levantine geography per https://en.wikipedia.org/wiki/Levantine_Arabic (includes Lebanon, Jordan, Syria, Palestine, Israeli Arab populations, Southern Turkey).

  3. Romance-language analogy is a standard sociolinguistic framing — “linguistic distance… is at least as large as between Germanic languages or Romance languages.” https://en.wikipedia.org/wiki/Varieties_of_Arabic

  4. Cross-dialect divergence summary: https://en.wikipedia.org/wiki/Varieties_of_Arabic

  5. Gulf Arabic speaker counts vary substantially by taxonomy. Narrow Ethnologue Gulf Arabic (AFB) cites ~11M; broader “Peninsular Arabic” inclusive of Najdi and Hijazi may reach 35–40M. https://en.wikipedia.org/wiki/Gulf_Arabic

  6. Levantine speaker count: “over 60 million speakers” (58M L1 + 2.9M L2, 2008–2024 estimates). Endonym “Shami” is the universally used local name. https://en.wikipedia.org/wiki/Levantine_Arabic

  7. Egyptian Arabic: 84M L1 + 35M L2 ≈ 119M total speakers (2024 Wikipedia/Ethnologue estimates). Endonym “Masri” (مصرى). https://en.wikipedia.org/wiki/Egyptian_Arabic

  8. Maghrebi (Darija) native speakers ≈ 88M (2020–2022). Darija is the standard endonym across Morocco / Algeria / Tunisia. https://en.wikipedia.org/wiki/Maghrebi_Arabic

  9. “Extremely difficult for Moroccans and Iraqis, each speaking their own variety, to understand each other.” https://en.wikipedia.org/wiki/Varieties_of_Arabic

  10. Cairene = Lower Egyptian prestige variety; Sa’idi = Upper Egyptian. Najdi and Hijazi are recognized prominent sub-dialect groups within Saudi Arabia / Peninsular Arabic. https://en.wikipedia.org/wiki/Cairene_Arabic, https://en.wikipedia.org/wiki/Sa%CA%BDidi_Arabic, https://en.wikipedia.org/wiki/Najdi_Arabic, https://en.wikipedia.org/wiki/Hejazi_Arabic

  11. K-T-B is the canonical Arabic linguistics teaching example for root-and-pattern morphology. See https://en.wikipedia.org/wiki/Arabic_verbs and https://qalamquest.com/grammar_theory/understanding-root-patterns-in-arabic-morphology/

  12. “Arabic is typically written without diacritics” and the absence of tashkeel poses substantial NLP obstacles. “Transformer-based automatic Arabic text diacritization”: https://sei.ardascience.com/index.php/journal/article/download/305/211/1498 and https://arxiv.org/pdf/2401.04848

  13. Code-switching is widespread and well-documented in MENA business/professional contexts. “A Survey of Code-switched Arabic NLP”: https://arxiv.org/html/2501.13419v1; Egyptian talk-show code-switching study: https://www.researchgate.net/publication/349256579; Tunisian business code-switching: https://www.academia.edu/7719660

  14. Maximum theoretical verbal forms per Arabic root can reach ~1,989 (13 person/number/gender × 9 tense/mood × 17 form/voice) per https://en.wikipedia.org/wiki/Arabic_verbs. Habash, “Challenge of Arabic for NLP/MT”: https://aclanthology.org/2006.bcs-1.5.pdf — Arabic is morphologically rich with hundreds of inflected forms per root in practice.

  15. K-T-B derivation glosses match canonical Arabic morphology pedagogy. https://qalamquest.com/grammar_theory/understanding-root-patterns-in-arabic-morphology/