
Hejazi vs Najdi Arabic NLP: the Saudi-internal depth most vendors miss

What “Saudi Arabic” actually contains

Saudi Arabia is the size of Western Europe. Treating its Arabic as one variety is the equivalent of treating “European Romance” as one language. Any commercial NLP pipeline serving the Kingdom has to handle four major regional clusters separately:

| Cluster | Major cities | Approximate speaker base | Sub-dialect families inside |
| --- | --- | --- | --- |
| Hejazi | Jeddah, Mecca, Madinah, Taif, Yanbu | ~8-10M | Urban Hejazi, Old (Mecca/Madinah) Hejazi, Bedouin Hejazi |
| Najdi | Riyadh, Qassim, Hail, Sudair | ~12-14M | Central (Riyadh urban), Northern Najdi (Hail), Qassimi, Bedouin Najdi |
| Sharqiyah (Eastern) | Dammam, Khobar, Hofuf, Jubail | ~5M | Urban Eastern (close to Bahraini/Kuwaiti Gulf), Hasawi (Hofuf-area), and the locally distinct community varieties present in the region |
| Southern | Abha, Khamis Mushait, Jizan, Najran | ~4-5M | Asiri proper, Tihami, Jizani, Najrani (transitions toward Yemeni Arabic) |

Speaker base figures are working estimates and overlap heavily — many speakers move between varieties depending on the addressee.

Phonology: where the dialect borders are loudest

The single most-cited feature, the realization of the letter ق, is the easiest entry point.

Note: ق is realized as /g/ in both Hejazi and Najdi — the major Saudi-internal contrast is on ج, ك, lexicon, and intonation, not on ق. The bullets below emphasise agreement on ق and contrast on the other segments.

The letter ج is the second giveaway:

The letter ك adds a third layer that almost never gets handled:

For an ASR system, these are not minor accent variants. They are different segments at the acoustic-model level. A model that has only learned /q/ for ق (from MSA-heavy training data) will systematically mis-recognize Najdi/Hejazi galb (heart, with /g/) — the canonical demo failure I have watched produce more demo-room embarrassment than any other single error. The fix is dialect-stratified training data with /g/-realization heavily represented, not more MSA hours.

Lexicon: same meaning, different word

A short reference table that lets a buyer eyeball how much surface drift sits between the major varieties.

| Concept | MSA | Najdi | Urban Hejazi | Sharqiyah | Asiri |
| --- | --- | --- | --- | --- | --- |
| “now” | الآن (al-ān) | الحين (al-ḥīn) | دلحين / دحين (dalḥīn / daḥīn); increasingly Egyptian-borrowed دلوقت (dilwaqt) among media-exposed speakers | الحين (al-ḥīn) | الحين / ذحين |
| “I want” | أريد (urīd) | أبغى (abḡā) / أبي (abī)[^7] | أبغى (abḡā) / أبا (abā)[^7] | أبا (abā) / أبي (abī) | أبا (abā) |
| “good” | جيّد (jayyid) | زين (zēn)[^8] | كويّس (kuwayyis), Egyptian loan now mainstream[^8] | زين (zēn) | زين (zēn) |
| “how are you” | كيف الحال (kayf al-ḥāl) | كيفك / شلونك (shlōnak) | كيف حالك / إزّيّك (izzayyak, Egyptian-borrowed) | شلونك (shlōnak) | كيف حالك |
| “boy” | ولد (walad) | ولد (walad) | واد (wād) / صبي (ṣabī) | ولد (walad) | عيّل (ʿayyel) |
| “money” | مال (māl) | فلوس (flūs) / دراهم | فلوس (flūs) / مصاري (maṣārī, Egyptian) | فلوس (flūs) | فلوس (flūs) |
| “car” | سيّارة (sayyāra) | سيّارة / موتر (mōtar) | عربيّة (ʿarabiyya, Egyptian) / سيّارة | سيّارة | سيّارة |
| “no” | لا (lā) | لا / مَ (ma) | لا / ما (ma) / مو (mu) | لا / ما | لا |

Two patterns leap out. First, the urban Hejazi column carries heavy Egyptian-Arabic loan presence — a consequence of a century of Egyptian media saturation plus the Hejaz’s historical role as a cosmopolitan pilgrimage corridor. Second, Najdi and Sharqiyah share much of their lexical core with each other (and with the wider Gulf varieties of Kuwait, Bahrain, and Qatar), while Hejazi sits as a partial outlier.

A sentiment classifier trained on Najdi-heavy Twitter data and asked to label Hejazi product reviews will read the Egyptian-borrowed vocabulary as out-of-distribution, drop confidence, and default to neutral. We see this in evaluation runs repeatedly.

Morphology: where the model breaks silently

Phonology mismatches at least produce visibly wrong transcripts. Morphology mismatches produce transcripts that look right and mean the wrong thing.

The negation system is the cleanest example.

The Saudi-internal negation systems are actually quite similar on the construction side: ma + verb is shared across Najdi, Hejazi, Sharqiyah, and Asiri. The downstream sentiment-classifier failure mode is therefore not about the negation construction itself but about the lexicon being negated. A Najdi-trained sentiment model has not seen the Egyptian-borrowed Hejazi words that appear under negation (e.g., kuwayyis, kida, ʿarabiyya), so it treats the negated phrase as out-of-distribution and drops to neutral. The real failure mode is lexical OOV inside a familiar grammatical frame.
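One rough way to check for this failure mode before deployment is to measure how often the word right after a negator falls outside the training vocabulary. The sketch below is illustrative only: it assumes pre-tokenized, transliterated input, and the function names, the `NEGATORS` set, and the toy vocabulary are hypothetical rather than part of any Annota8 pipeline.

```python
# Toy check for "lexical OOV inside a familiar grammatical frame".
# Assumption: tokens are already segmented and transliterated.
NEGATORS = {"ma", "mu", "la"}  # hypothetical token forms for the negation particles


def oov_rate(tokens: list[str], train_vocab: set[str]) -> float:
    """Fraction of tokens that never appeared in the training vocabulary."""
    if not tokens:
        return 0.0
    return sum(tok not in train_vocab for tok in tokens) / len(tokens)


def negated_oov_tokens(tokens: list[str], train_vocab: set[str]) -> list[str]:
    """Tokens that directly follow a negator and are unknown to the model."""
    flagged = []
    for prev, tok in zip(tokens, tokens[1:]):
        if prev in NEGATORS and tok not in train_vocab:
            flagged.append(tok)
    return flagged


if __name__ == "__main__":
    najdi_vocab = {"ma", "zēn", "al-ḥīn", "abī"}   # toy Najdi-heavy vocabulary
    hejazi_review = ["ma", "kuwayyis"]             # known frame, unseen word
    print(oov_rate(hejazi_review, najdi_vocab))
    print(negated_oov_tokens(hejazi_review, najdi_vocab))
```

A high share of flagged tokens on one variety's traffic suggests the classifier's neutral outputs there come from missing vocabulary rather than genuinely neutral text.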

Pronoun systems differ too. The second-person feminine suffix:

A voice-biometric pipeline that assumes a uniform pronoun morphology will mis-segment the trailing morpheme and degrade speaker-modeling features in subtle ways that show up as elevated false-accept rate on cross-region traffic.

What this does to commercial AI

ASR word-error rate

The following table reflects internal Annota8 benchmark observations from production speech baselines we evaluate for foundation-model and telco customers (Whisper-large-v3, ALLaM-derived speech stacks, the major cloud Arabic ASR APIs), on read-prompt + spontaneous conversational test sets. Public Arabic ASR benchmarks (e.g. Talafha et al. 2023 on N-shot Whisper) do not yet publish Saudi-internal cluster splits at this granularity, so these ranges should be read as our operational estimate, not a peer-reviewed number:

| Variety | Annota8 internal WER range on production baselines |
| --- | --- |
| Najdi (Riyadh urban) | 12-18% |
| Hejazi (Jeddah urban) | 18-25% |
| Sharqiyah (Dammam/Khobar urban) | 14-20% |
| Asiri / southern | 22-32% |
| Bedouin tribal varieties (any region) | 30%+ |

The Najdi advantage is, in our reading, not because Najdi is “simpler” — it is because Najdi-origin speakers tend to dominate the Saudi-government recordings that in turn dominate publicly available Saudi corpora, and the major baselines were trained on what was available. Hejazi sits worse because it is under-represented relative to its share of the population. Asiri sits worst because it is under-represented relative to anything. We treat the gap, not the absolute numbers, as the operationally important observation.
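For readers who want to reproduce this kind of slicing on their own test sets, here is a minimal sketch of per-variety WER with a macro average. It uses the standard definition (substitutions + deletions + insertions, divided by reference word count) and assumes whitespace tokenization with no Arabic-specific normalization. The function names and the sample data are hypothetical, not our benchmark harness.

```python
from collections import defaultdict


def word_errors(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance between reference and hypothesis."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def per_variety_wer(samples):
    """samples: iterable of (variety, reference_text, hypothesis_text).

    Returns (wer_by_variety, macro_wer). Errors and words are pooled
    within each variety; the macro number is the unweighted mean.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for variety, ref, hyp in samples:
        ref_tokens = ref.split()
        errors[variety] += word_errors(ref_tokens, hyp.split())
        words[variety] += len(ref_tokens)
    wer = {v: errors[v] / words[v] for v in words if words[v] > 0}
    macro = sum(wer.values()) / len(wer) if wer else 0.0
    return wer, macro


if __name__ == "__main__":
    # Placeholder transcripts; real test sets would carry a variety label per utterance.
    samples = [
        ("najdi", "a b c d", "a b c d"),
        ("hejazi", "a b c d", "a x c"),
    ]
    print(per_variety_wer(samples))
```

Reporting the per-variety dictionary next to the macro number keeps the gap between varieties visible instead of averaging it away.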

Sentiment and intent classification

A vendor that ships a single “Saudi Arabic” intent classifier — trained predominantly on Najdi data because that is where the public data lives — will silently degrade on Hejazi and Sharqiyah traffic. The degradation pattern repeats:

For aspect-based sentiment specifically — see our dialect-stratified sentiment breakdown — the Saudi-internal slicing matters as much as the cross-dialect (Saudi vs Egyptian vs Levantine) slicing the industry already talks about.

Voice-biometric fraud risk

This one is the most operationally severe. Voice-biometric enrollment typically happens once, at account opening. Subsequent verification happens dozens of times over the account’s life.

If a customer enrolls in Hejazi register (calling from home in Jeddah on a Friday) and verifies in Najdi-shifted register (calling from a work trip in Riyadh, switched register toward the addressee), an under-trained speaker-verification system reads the within-speaker variation as cross-speaker variation and rejects.

The inverse is worse. A model that has only learned Najdi-baseline speaker embeddings can mis-score Hejazi imposters as legitimate, because the model treats unfamiliar phonological patterns as identity-irrelevant noise. We have seen this produce documented false-accept events in commercial deployments — and it is the kind of failure mode that does not get published in a vendor datasheet.

The mitigation is dialect-stratified enrollment data and dialect-aware speaker-modeling features. The mitigation is not in any off-the-shelf cloud API today.
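Before any mitigation, the cross-region problem has to be visible in the metrics. A minimal sketch, assuming each verification trial is labeled with the variety used at enrollment and the variety used at verification, breaks false-reject and false-accept rates out per cell. The `Trial` fields and function name are hypothetical and do not describe a specific vendor's log format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    enroll_variety: str   # variety of the enrolled voiceprint
    verify_variety: str   # variety heard on the verification call
    is_genuine: bool      # True if the caller really is the enrolled speaker
    accepted: bool        # the system's decision


def error_rates_by_cell(trials):
    """Return {(enroll, verify): {"frr": ..., "far": ...}}.

    A rate is None when the cell has no trials of the relevant kind.
    """
    counts = defaultdict(lambda: {"genuine": 0, "fr": 0, "impostor": 0, "fa": 0})
    for t in trials:
        c = counts[(t.enroll_variety, t.verify_variety)]
        if t.is_genuine:
            c["genuine"] += 1
            c["fr"] += int(not t.accepted)   # genuine caller rejected
        else:
            c["impostor"] += 1
            c["fa"] += int(t.accepted)       # impostor accepted
    return {
        cell: {
            "frr": c["fr"] / c["genuine"] if c["genuine"] else None,
            "far": c["fa"] / c["impostor"] if c["impostor"] else None,
        }
        for cell, c in counts.items()
    }


if __name__ == "__main__":
    trials = [
        Trial("hejazi", "najdi", True, False),
        Trial("hejazi", "najdi", True, True),
        Trial("najdi", "hejazi", False, True),
        Trial("najdi", "hejazi", False, False),
    ]
    for cell, rates in error_rates_by_cell(trials).items():
        print(cell, rates)
```

Comparing the off-diagonal cells (enrollment and verification in different varieties) against the diagonal ones is what surfaces the register-shift rejections and cross-variety false accepts described above.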

What Annota8 does about it

A short, concrete list of what our pipeline does differently on Saudi work specifically — not a sales pitch, just the operating shape.

  1. Riyadh + Jeddah workforce splits. Annotators in our Saudi network are tagged by city-of-residence + variety-of-fluency. Najdi audio routes to Najdi-fluent annotators, Hejazi audio routes to Jeddah-network annotators, and we maintain explicit headcount in both rather than treating it as one pool. (See our notes on the Riyadh + Cairo workforce split for the cost-and-sovereignty tradeoffs.)

  2. Dialect-stratified evaluation sets, not a single Saudi holdout. Every Saudi-customer evaluation set we build has per-variety F1/WER cells and a macro number. The macro number alone is what gets buyers in trouble.

  3. Cairo PhD-linguist tier with Saudi sub-dialect specialization. The adjudication and decision-log layer sits in our Cairo team, where Arabic linguistics PhDs are economically available — including specialists trained on specific Saudi varieties. See the Cairo PhD-linguist economic model for why this is structurally available to us in Egypt.

  4. Explicit code-switching tags. Every transcript carries token-level tags for variety identity — Hejazi-with-Egyptian-loanword versus Hejazi-with-MSA-borrowing versus pure Hejazi. Downstream models can route on this. Code-switching handling at the token level is the unit of work.

  5. Honest sub-dialect coverage maps shared with the customer. Where our coverage is thin (Bedouin tribal varieties, Najran-region Yemeni-transition speech) we say so on the spec sheet. Buying a “Saudi-complete” claim from a vendor that has not published a coverage map is buying air.
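To make the token-level tagging in item 4 concrete, here is a small illustrative sketch of what a tagged transcript and a downstream routing rule might look like. The tag names, the `TaggedToken` type, and the 25% threshold are all assumptions made for the example, not our delivery schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedToken:
    text: str
    variety: str  # hypothetical tag set, e.g. "hejazi", "egyptian_loan", "msa"


def variety_mix(tokens: list[TaggedToken]) -> dict[str, float]:
    """Share of tokens carrying each variety tag in one utterance."""
    counts = Counter(t.variety for t in tokens)
    total = sum(counts.values())
    return {v: n / total for v, n in counts.items()} if total else {}


def route(tokens: list[TaggedToken], base: str = "hejazi",
          mixed_threshold: float = 0.25) -> str:
    """Pick a downstream model bucket from the share of non-base tokens.

    The threshold is an arbitrary illustrative value.
    """
    if not tokens:
        return f"{base}_pure"
    non_base_share = 1.0 - variety_mix(tokens).get(base, 0.0)
    return f"{base}_mixed" if non_base_share >= mixed_threshold else f"{base}_pure"


if __name__ == "__main__":
    utterance = [
        TaggedToken("ma", "hejazi"),
        TaggedToken("kuwayyis", "egyptian_loan"),
    ]
    print(variety_mix(utterance))
    print(route(utterance))
```

Keeping the tag on each token, rather than one label per utterance, is what lets a downstream system separate Hejazi with Egyptian loanwords from Hejazi with MSA borrowing.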

The honest limit

Even with the above, Annota8 does not yet have full Bedouin sub-dialect coverage. The Bedouin-origin varieties present across Najd, the Hejaz, and the southern regions each carry phonological and lexical features distinct from the urban varieties of the same region — a fact acknowledged across mainstream Arabic dialectology. Building production-grade ASR + sentiment for these requires fieldwork-grade annotator networks we are still expanding into. Today we mark Bedouin-origin speech as such in delivery and we explicitly do not claim production accuracy on it.

We mention this on purpose. A vendor that says “we cover everything” is either lying or unaware. Saying out loud what we don’t do yet is the same operational honesty that gets us asked back to the next quarter’s evaluation.

What this means for an AI buyer

If you are an AI lead at a MENA telco or contact-center operator running Saudi-customer traffic, these are the practical asks to make of any speech or foundation-model vendor before you sign:

The model that wins Saudi commercial deployments over the next two years will not be the biggest. It will be the one measured on this internal slicing — and willing to publish the per-variety table without asterisks.
