Hejazi vs Najdi Arabic NLP: the Saudi-internal depth most vendors miss
What “Saudi Arabic” actually contains
Saudi Arabia is roughly the size of Western Europe. Treating its Arabic as one variety is like treating “European Romance” as one language. Any commercial NLP pipeline serving the Kingdom has to handle four major regional clusters separately:
| Cluster | Major cities | Approximate speaker base | Sub-dialect families inside |
|---|---|---|---|
| Hejazi | Jeddah, Mecca, Madinah, Taif, Yanbu | ~8-10M | Urban Hejazi, Old (Mecca/Madinah) Hejazi, Bedouin Hejazi |
| Najdi | Riyadh, Qassim, Hail, Sudair | ~12-14M | Central (Riyadh urban), Northern Najdi (Hail), Qassimi, Bedouin Najdi |
| Sharqiyah (Eastern) | Dammam, Khobar, Hofuf, Jubail | ~5M | Urban Eastern (close to Bahraini/Kuwaiti Gulf), Hasawi (Hofuf-area), and the locally distinct community varieties present in the region |
| Southern | Abha, Khamis Mushait, Jizan, Najran | ~4-5M | Asiri proper, Tihami, Jizani, Najrani (transitions toward Yemeni Arabic) |
Speaker base figures are working estimates and overlap heavily — many speakers move between varieties depending on the addressee.
Phonology: where the dialect borders are loudest
The realization of the letter ق is the most-cited contrast in Arabic dialectology, so it is the easiest entry point. Inside Saudi Arabia, though, it mostly marks agreement: ق is realized as /g/ in both Hejazi and Najdi, and the major Saudi-internal contrasts sit on ج, ك, lexicon, and intonation. The bullets below show the agreement on ق first, then the contrast on the other segments.
- Najdi: ق → /g/ in nearly all positions[^2]. “I said” = gilt (قلت → قِلت / كِلت in everyday speech).
- Urban Hejazi: ق → /g/ is the defining reflex in everyday speech[^1] — Hejazi and Najdi both have /g/ in most positions, which is why traditional dialectology groups them together against the /q/-retaining and /ʔ/-fronting varieties (Cairene, Levantine). What separates Urban Hejazi from Najdi is not ق but ج, ك, lexicon, and intonation. (Speakers shift to /q/ in code-switched MSA or formal contexts — that is register, not dialect.)
- Old Hejazi (Mecca/Madinah): ق → /g/ in nearly all native lexicon, with /q/ retained only in MSA-borrowed religious and classical vocabulary.
- Sharqiyah: ق → /g/ across most varieties, with a frequent further shift to a palatalized realization (commonly rendered as the affricate [dʒ], i.e. an English “j” sound) in the conditioned environment of front vowels[^5] — garib “near” can become jarib in some speakers. The Eastern Province has several locally distinct community varieties; ASR teams should treat them as separate acoustic populations rather than one block.
- Asiri/Tihami: ق → /g/, with substantial retention of pharyngealization patterns documented for the southern highlands[^9].
The letter ج is the second giveaway:
- Najdi: ج → /dʒ/ (affricate, English “j”).
- Hejazi: ج → /dʒ/ in most environments, with [ʒ] (French “j”) realizations attested for some speakers and some lexical items[^1]. The sociolinguistic causation — often anecdotally attributed to media exposure — is not something we treat as established without dedicated sociolinguistic fieldwork.
- Sharqiyah: ج → /j/ (palatal, English “y”) in conditioned environments — a feature shared with Kuwaiti and Bahraini varieties[^6].
The letter ك adds a third layer that almost never gets handled:
- Najdi: ك → /tʃ/ (“ch”) when preceding a front vowel (the kashkasha feature). “Your house” is baytich to a feminine addressee and baytik to a masculine one; the ch realization is canonical in casual Najdi and absent in MSA[^3].
- Hejazi: ك → /k/ retained. The form is baytik; baytich does not occur natively[^4].
For an ASR system, these are not minor accent variants. They are different segments at the acoustic-model level. A model that has only learned /q/ for ق (from MSA-heavy training data) will systematically mis-recognize Najdi/Hejazi galb (heart, with /g/). This is the canonical failure, and we have watched it cause more demo-room embarrassment than any other single error. The fix is dialect-stratified training data with /g/-realization heavily represented, not more MSA hours.
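To make the "different segments" point concrete, the reflexes listed above can be written as a small lookup table and diffed between the variety a model was trained on and the variety a caller speaks. This is a minimal sketch, not a production lexicon: `REFLEXES`, `reflex`, and `mismatches` are hypothetical names, the MSA row is a simplifying assumption, and only the front-vowel conditions stated in the bullets are modeled (the Sharqiyah ج → /j/ environment and the Hejazi [ʒ] variants are left out because the text does not specify them).

```python
# Each entry is (default reflex, reflex before a front vowel or None).
# Values follow the bullets above; the MSA row is an assumed baseline.
REFLEXES = {
    "msa":          {"ق": ("q", None), "ج": ("dʒ", None), "ك": ("k", None)},
    "najdi":        {"ق": ("g", None), "ج": ("dʒ", None), "ك": ("k", "tʃ")},
    "urban_hejazi": {"ق": ("g", None), "ج": ("dʒ", None), "ك": ("k", None)},
    "sharqiyah":    {"ق": ("g", "dʒ"), "ج": ("dʒ", None), "ك": ("k", None)},
}


def reflex(variety, letter, before_front_vowel=False):
    """Return the expected phone for a letter in a given variety and context."""
    default, fronted = REFLEXES[variety][letter]
    if before_front_vowel and fronted is not None:
        return fronted
    return default


def mismatches(model_variety, speaker_variety):
    """List (letter, context, expected, heard) where the two varieties disagree."""
    out = []
    for letter in REFLEXES[model_variety]:
        for fronted in (False, True):
            expected = reflex(model_variety, letter, fronted)
            heard = reflex(speaker_variety, letter, fronted)
            if expected != heard:
                context = "front" if fronted else "default"
                out.append((letter, context, expected, heard))
    return out


if __name__ == "__main__":
    # What an MSA-trained model gets wrong about a Najdi caller.
    for row in mismatches("msa", "najdi"):
        print(row)
```

Diffing `najdi` against `urban_hejazi` in this table leaves only the ك row, which matches the claim that the two varieties agree on ق and separate elsewhere.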
Lexicon: same meaning, different word
The short reference table below lets a buyer eyeball how much surface drift sits between the major varieties.
| Concept | MSA | Najdi | Urban Hejazi | Sharqiyah | Asiri |
|---|---|---|---|---|---|
| “now” | الآن (al-ān) | الحين (al-ḥīn) | دلحين / دحين (dalḥīn / daḥīn), and increasingly Egyptian-borrowed دلوقت (dilwaqt) among media-exposed speakers | الحين (al-ḥīn) | الحين / ذحين |
| “I want” | أريد (urīd) | أبغى (abḡā) / أبي (abī)[^7] | أبغى (abḡā) / أبا (abā)[^7] | أبا (abā) / أبي (abī) | أبا (abā) |
| “good” | جيّد (jayyid) | زين (zēn)[^8] | كويّس (kuwayyis) — Egyptian loan now mainstream[^8] | زين (zēn) | زين (zēn) |
| “how are you” | كيف الحال (kayf al-ḥāl) | كيفك / شلونك (shlōnak) | كيف حالك / إزّيّك (izzayyak — Egyptian-borrowed) | شلونك (shlōnak) | كيف حالك |
| “boy” | ولد (walad) | ولد (walad) | واد (wād) / صبي (ṣabī) | ولد (walad) | عيّل (ʿayyel) |
| “money” | مال (māl) | فلوس (flūs) / دراهم | فلوس (flūs) / مصاري (maṣārī, a Levantine-origin loan) | فلوس (flūs) | فلوس (flūs) |
| “car” | سيّارة (sayyāra) | سيّارة / موتر (mōtar) | عربيّة (ʿarabiyya — Egyptian) / سيّارة | سيّارة | سيّارة |
| “no” | لا (lā) | لا / ما (ma) | لا / ما (ma) / مو (mu) | لا / ما | لا |
Two patterns leap out. First, the urban Hejazi column carries heavy Egyptian-Arabic loan presence — a consequence of a century of Egyptian media saturation plus the Hejaz’s historical role as a cosmopolitan pilgrimage corridor. Second, Najdi and Sharqiyah share much of their lexical core with each other (and with the wider Gulf varieties of Kuwait, Bahrain, and Qatar), while Hejazi sits as a partial outlier.
A sentiment classifier trained on Najdi-heavy Twitter data and asked to label Hejazi product reviews will read the Egyptian-borrowed vocabulary as out-of-distribution, drop confidence, and default to neutral. We see this in evaluation runs repeatedly.
Morphology: where the model breaks silently
Phonology mismatches at least produce visibly wrong transcripts. Morphology mismatches produce transcripts that look right and mean the wrong thing.
The negation system is the cleanest example.
- Najdi: ma + verb. Ma adri “I don’t know.” Ma abḡā “I don’t want.”
- Urban Hejazi: ma + verb for verbal negation; mu (مو) as the copular/predicate negator[^4]. Mu kuwayyis “not good.” Despite heavy Egyptian lexical borrowing in Hejazi, the grammar of negation is conservative: Hejazi has resisted the Egyptian ma-…-sh circumfix and does not use mish natively (those are Egyptian/Levantine, not Hejazi).
- Sharqiyah: ma + verb dominates; mu + adjective (“not [adjective]”) common.
- Asiri: ma + verb; some conservative lam-style negators retained from earlier varieties.
The Saudi-internal negation systems are actually quite similar on the construction side: ma + verb is shared across Najdi, Hejazi, Sharqiyah, and Asiri. The downstream sentiment-classifier failure mode is therefore not about the negation construction itself but about the lexicon being negated. A Najdi-trained sentiment model has not seen the Egyptian-borrowed Hejazi vocabulary that appears inside the negated phrase (e.g., kuwayyis “good”, kida “like this”, ʿarabiyya “car”), so it treats the phrase as out-of-distribution and drops to neutral. The real failure mode is lexical OOV inside a familiar grammatical frame.
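A cheap way to see this failure mode before it reaches a classifier is to measure the out-of-vocabulary rate of incoming text against the training vocabulary. The sketch below is illustrative only: `oov_tokens` and `oov_rate` are hypothetical helpers, the vocabulary is a toy set built from the transliterations used in this article, and a real pipeline would work on tokenized Arabic script rather than romanized words.

```python
def oov_tokens(tokens, vocab):
    """Return the tokens that never appeared in the training vocabulary."""
    return [t for t in tokens if t not in vocab]


def oov_rate(tokens, vocab):
    """Fraction of tokens that are out-of-vocabulary (0.0 for empty input)."""
    if not tokens:
        return 0.0
    return len(oov_tokens(tokens, vocab)) / len(tokens)


# Toy Najdi-leaning training vocabulary (assumed, for illustration).
najdi_vocab = {"ma", "adri", "abḡā", "zēn", "flūs", "sayyāra"}

# "I don't want a car" with the Egyptian-borrowed Hejazi word for "car".
# The negation frame (ma + verb) is familiar; only the noun is unseen.
hejazi_phrase = ["ma", "abḡā", "ʿarabiyya"]

if __name__ == "__main__":
    print(oov_tokens(hejazi_phrase, najdi_vocab))
    print(oov_rate(hejazi_phrase, najdi_vocab))
```

Tracking this rate per variety on live traffic gives an early signal of where a model trained on one cluster is likely to fall back to neutral.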
Pronoun systems differ too. The second-person feminine suffix:
- Najdi: -ich / -ik / -ish depending on phonological environment.
- Hejazi: -ik (no affrication).
- Sharqiyah: -ich / -ish (often palatalized).
A voice-biometric pipeline that assumes a uniform pronoun morphology will mis-segment the trailing morpheme and degrade speaker-modeling features in subtle ways that show up as elevated false-accept rate on cross-region traffic.
What this does to commercial AI
ASR word-error rate
The following table reflects internal Annota8 benchmark observations from production speech baselines we evaluate for foundation-model and telco customers (Whisper-large-v3, ALLaM-derived speech stacks, the major cloud Arabic ASR APIs), on read-prompt + spontaneous conversational test sets. Public Arabic ASR benchmarks (e.g. Talafha et al. 2023 on N-shot Whisper) do not yet publish Saudi-internal cluster splits at this granularity, so these ranges should be read as our operational estimate, not a peer-reviewed number:
| Variety | Annota8 internal WER range on production baselines |
|---|---|
| Najdi (Riyadh urban) | 12-18% |
| Hejazi (Jeddah urban) | 18-25% |
| Sharqiyah (Dammam/Khobar urban) | 14-20% |
| Asiri / southern | 22-32% |
| Bedouin tribal varieties (any region) | 30%+ |
The Najdi advantage is, in our reading, not because Najdi is “simpler”. It is because Najdi-origin speakers tend to dominate the Saudi-government recordings that in turn dominate publicly available Saudi corpora, and the major baselines were trained on what was available. Hejazi sits worse because it is under-represented relative to its share of the population. Asiri sits worst among the four regional clusters because it is under-represented in absolute terms, not just relative to its population. We treat the gap, not the absolute numbers, as the operationally important observation.
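For readers who want to reproduce this kind of table on their own data, per-variety WER can be computed with a word-level edit distance grouped by a variety label. This is a minimal sketch under stated assumptions: `word_errors` and `wer_by_variety` are hypothetical names, the sample format `(variety, reference, hypothesis)` is invented for illustration, and no text normalization (diacritics, hamza variants, punctuation) is applied, which a real Arabic WER pipeline would need.

```python
from collections import defaultdict


def word_errors(reference, hypothesis):
    """Word-level Levenshtein distance and reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1], len(ref)


def wer_by_variety(samples):
    """samples: iterable of (variety, reference, hypothesis).

    Returns (per-variety WER, macro average over varieties, pooled WER).
    """
    errors, words = defaultdict(int), defaultdict(int)
    for variety, reference, hypothesis in samples:
        e, n = word_errors(reference, hypothesis)
        errors[variety] += e
        words[variety] += n
    per_variety = {v: errors[v] / words[v] for v in errors if words[v]}
    if not per_variety:
        return {}, 0.0, 0.0
    macro = sum(per_variety.values()) / len(per_variety)
    pooled = sum(errors.values()) / sum(words.values())
    return per_variety, macro, pooled
```

Reporting the pooled number next to the per-variety cells makes the skew visible: a test set dominated by one variety can show a low pooled WER while a smaller variety sits far above it.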
Sentiment and intent classification
A vendor that ships a single “Saudi Arabic” intent classifier — trained predominantly on Najdi data because that is where the public data lives — will silently degrade on Hejazi and Sharqiyah traffic. The degradation pattern repeats:
- Hejazi reviews with Egyptian-borrowed vocabulary drift toward “neutral” because the model treats the Egyptian tokens as out-of-distribution.
- Sharqiyah community-specific religious phrasing gets misclassified as off-topic, because the community register doesn’t appear at training-time frequency.
- Asiri regional vocabulary triggers OOV-driven low confidence and dumps to the human fallback queue at 3-4x the rate of Najdi traffic — making the cost of running the system regionally uneven, which is a finance problem as well as an accuracy problem.
For aspect-based sentiment specifically — see our dialect-stratified sentiment breakdown — the Saudi-internal slicing matters as much as the cross-dialect (Saudi vs Egyptian vs Levantine) slicing the industry already talks about.
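The fallback-queue skew described above is straightforward to monitor if each prediction is logged with a variety label and a confidence score. The sketch below assumes exactly that logging format; `fallback_rates` is a hypothetical helper and the 0.6 threshold is an arbitrary placeholder, not a recommended value.

```python
from collections import Counter


def fallback_rates(predictions, threshold=0.6):
    """predictions: iterable of (variety, confidence).

    Returns, per variety, the share of traffic whose confidence falls
    below the threshold and would be routed to the human queue.
    """
    total, routed = Counter(), Counter()
    for variety, confidence in predictions:
        total[variety] += 1
        if confidence < threshold:
            routed[variety] += 1
    return {variety: routed[variety] / total[variety] for variety in total}


def relative_to(rates, baseline):
    """Express each variety's fallback rate as a multiple of a baseline variety."""
    base = rates.get(baseline)
    if not base:
        return {}
    return {variety: rate / base for variety, rate in rates.items()}
```

Feeding the output of `relative_to(rates, "najdi")` into a cost model turns the accuracy gap into the per-region operating cost the finance side actually sees.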
Voice-biometric fraud risk
This one is the most operationally severe. Voice-biometric enrollment typically happens once, at account opening. Subsequent verification happens dozens of times over the account’s life.
If a customer enrolls in Hejazi register (calling from home in Jeddah on a Friday) and verifies in Najdi-shifted register (calling from a work trip in Riyadh, switched register toward the addressee), an under-trained speaker-verification system reads the within-speaker variation as cross-speaker variation and rejects.
The inverse is worse. A model that has only learned Najdi-baseline speaker embeddings can mis-score Hejazi imposters as legitimate, because the model treats unfamiliar phonological patterns as identity-irrelevant noise. We have seen this produce documented false-accept events in commercial deployments — and it is the kind of failure mode that does not get published in a vendor datasheet.
The mitigation is dialect-stratified enrollment data and dialect-aware speaker-modeling features. The mitigation is not in any off-the-shelf cloud API today.
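Before any mitigation, the first step is to measure both error types per register pair rather than in aggregate. The sketch below shows one way to slice false-reject and false-accept rates by (enrollment variety, verification variety). It assumes a hypothetical trial log of `(enroll_variety, test_variety, same_speaker, score)` tuples and a single global threshold; `error_rates_by_pair` is an illustrative name, not part of any vendor API.

```python
from collections import defaultdict


def error_rates_by_pair(trials, threshold):
    """Compute FRR and FAR for each (enroll_variety, test_variety) cell.

    trials: iterable of (enroll_variety, test_variety, same_speaker, score).
    A trial is accepted when score >= threshold.
    """
    counts = defaultdict(lambda: {"genuine": 0, "false_reject": 0,
                                  "impostor": 0, "false_accept": 0})
    for enroll_variety, test_variety, same_speaker, score in trials:
        cell = counts[(enroll_variety, test_variety)]
        accepted = score >= threshold
        if same_speaker:
            cell["genuine"] += 1
            cell["false_reject"] += int(not accepted)
        else:
            cell["impostor"] += 1
            cell["false_accept"] += int(accepted)

    rates = {}
    for pair, cell in counts.items():
        rates[pair] = {
            # None marks a cell with no trials of that type.
            "frr": cell["false_reject"] / cell["genuine"] if cell["genuine"] else None,
            "far": cell["false_accept"] / cell["impostor"] if cell["impostor"] else None,
        }
    return rates
```

The off-diagonal cells (enroll in one variety, verify in another) are where the two failure modes described above would show up, so an evaluation set needs enough trials in those cells to make the rates meaningful.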
What Annota8 does about it
A short, concrete list of what our pipeline does differently on Saudi work specifically — not a sales pitch, just the operating shape.
- Riyadh + Jeddah workforce splits. Annotators in our Saudi network are tagged by city-of-residence + variety-of-fluency. Najdi audio routes to Najdi-fluent annotators, Hejazi audio routes to Jeddah-network annotators, and we maintain explicit headcount in both rather than treating it as one pool. (See our notes on the Riyadh + Cairo workforce split for the cost-and-sovereignty tradeoffs.)
- Dialect-stratified evaluation sets, not a single Saudi holdout. Every Saudi-customer evaluation set we build has per-variety F1/WER cells and a macro number. The macro number alone is what gets buyers in trouble.
- Cairo PhD-linguist tier with Saudi sub-dialect specialization. The adjudication and decision-log layer sits in our Cairo team, where Arabic linguistics PhDs are economically available, including specialists trained on specific Saudi varieties. See the Cairo PhD-linguist economic model for why this is structurally available to us in Egypt.
- Explicit code-switching tags. Every transcript carries token-level tags for variety identity: Hejazi-with-Egyptian-loanword versus Hejazi-with-MSA-borrowing versus pure Hejazi. Downstream models can route on this. Code-switching handling at the token level is the unit of work.
- Honest sub-dialect coverage maps shared with the customer. Where our coverage is thin (Bedouin tribal varieties, Najran-region Yemeni-transition speech) we say so on the spec sheet. Buying a “Saudi-complete” claim from a vendor that has not published a coverage map is buying air.
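As an illustration of what token-level code-switching tags can look like in a delivered transcript, the sketch below models each token with a variety and a lexical layer. The schema is hypothetical: `TaggedToken`, the `layer` values, and `layer_profile` are names invented for this example and do not describe Annota8's actual delivery format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedToken:
    text: str
    variety: str  # e.g. "hejazi", "najdi"
    layer: str    # assumed labels: "native", "egyptian_loan", "msa_borrowing"


def layer_profile(tokens):
    """Share of tokens in each lexical layer, for routing or reporting."""
    if not tokens:
        return {}
    counts = Counter(token.layer for token in tokens)
    return {layer: count / len(tokens) for layer, count in counts.items()}


# "mu kuwayyis" ("not good"): native Hejazi negator plus an Egyptian loan.
utterance = [
    TaggedToken("mu", "hejazi", "native"),
    TaggedToken("kuwayyis", "hejazi", "egyptian_loan"),
]

if __name__ == "__main__":
    print(layer_profile(utterance))
```

A downstream router could, for example, send utterances with a high `egyptian_loan` share to a model that has seen Egyptian vocabulary, which is the kind of routing the tags are meant to enable.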
The honest limit
Even with the above, Annota8 does not yet have full Bedouin sub-dialect coverage. The Bedouin-origin varieties present across Najd, the Hejaz, and the southern regions each carry phonological and lexical features distinct from the urban varieties of the same region — a fact acknowledged across mainstream Arabic dialectology. Building production-grade ASR + sentiment for these requires fieldwork-grade annotator networks we are still expanding into. Today we mark Bedouin-origin speech as such in delivery and we explicitly do not claim production accuracy on it.
We mention this on purpose. A vendor that says “we cover everything” is either lying or unaware. Saying out loud what we don’t do yet is the same operational honesty that gets us asked back to the next quarter’s evaluation.
What this means for an AI buyer
If you are an AI lead at a MENA telco or contact-center operator running Saudi-customer traffic, these are the practical asks to make of any speech or foundation-model vendor before you sign:
- Show me per-variety WER on a holdout that splits at least Najdi / Hejazi / Sharqiyah / Asiri.
- Show me the annotator network composition by variety — not just country.
- Show me your dialect identification confusion matrix between the four Saudi clusters. The number of vendors who can produce this is small.
- Tell me what you do not cover. Vendors who can name their gap usually have less of one.
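The confusion-matrix ask in the list above is cheap to check on your own labeled sample if the vendor exposes dialect-identification output. The sketch below is one way to build it in plain Python; the cluster labels, `confusion_matrix`, and `per_class_recall` are illustrative names, and the input is assumed to be `(true_label, predicted_label)` pairs.

```python
CLUSTERS = ["najdi", "hejazi", "sharqiyah", "asiri"]


def confusion_matrix(pairs, labels=CLUSTERS):
    """Rows are true clusters, columns are predicted clusters."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true_label, predicted_label in pairs:
        matrix[index[true_label]][index[predicted_label]] += 1
    return matrix


def per_class_recall(matrix, labels=CLUSTERS):
    """Share of each true cluster that was identified correctly."""
    recall = {}
    for i, (label, row) in enumerate(zip(labels, matrix)):
        total = sum(row)
        recall[label] = row[i] / total if total else None
    return recall
```

Reading across a row shows where a given cluster's traffic leaks to; a dialect-ID model trained mostly on Najdi data would be expected to pull the other rows toward the Najdi column.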
The model that wins Saudi commercial deployments over the next two years will not be the biggest. It will be the one measured on this internal slicing — and willing to publish the per-variety table without asterisks.