26 May 2026

Hejazi vs Najdi Arabic NLP: the Saudi-internal depth most vendors miss

TL;DR

When a vendor datasheet says “Saudi Arabic supported,” the buyer almost always reads that as one thing. It is at least four. Hejazi (the western coast: Jeddah, Mecca, Madinah, Taif), Najdi (the central plateau: Riyadh and surrounding Najd), Sharqiyah (Eastern Province: Dammam, Khobar, Hofuf), and the southern varieties of Asir and Jizan are not minor variants of a single dialect; they differ in phonology, lexicon, and morphology in ways that we observe internally to move ASR WER by 6-13 percentage points between them on production baselines (Whisper-large-v3, ALLaM-derived speech stacks). Sentiment and intent classifiers trained on one Saudi variety silently misclassify the others. Voice-biometric systems trained on Najdi enrollment flows produce false rejections when the same speaker phones in while traveling and shifts register. This is the depth most vendors lump away. This piece is what we do about it operationally, and where we are honest that even Annota8 does not yet have full Bedouin tribal sub-dialect coverage.

What “Saudi Arabic” actually contains

Saudi Arabia is the size of Western Europe. Treating its Arabic as one variety is the equivalent of treating “European Romance” as one language. The four major regional clusters that any commercial NLP pipeline serving the Kingdom has to handle separately:

Cluster	Major cities	Approximate speaker base	Sub-dialect families inside
Hejazi	Jeddah, Mecca, Madinah, Taif, Yanbu	~8-10M	Urban Hejazi, Old (Mecca/Madinah) Hejazi, Bedouin Hejazi
Najdi	Riyadh, Qassim, Hail, Sudair	~12-14M	Central (Riyadh urban), Northern Najdi (Hail), Qassimi, Bedouin Najdi
Sharqiyah (Eastern)	Dammam, Khobar, Hofuf, Jubail	~5M	Urban Eastern (close to Bahraini/Kuwaiti Gulf), Hasawi (Hofuf-area), and the locally distinct community varieties present in the region
Southern	Abha, Khamis Mushait, Jizan, Najran	~4-5M	Asiri proper, Tihami, Jizani, Najrani (transitions toward Yemeni Arabic)

Speaker base figures are working estimates and overlap heavily — many speakers move between varieties depending on the addressee.

Phonology: where the dialect borders are loudest

The most-cited feature, the realization of the letter ق, is the easiest entry point, mostly for what it does not separate. ق is realized as /g/ in both Hejazi and Najdi; the major Saudi-internal contrasts are on ج, ك, lexicon, and intonation. The bullets below show agreement on ق and contrast on the other segments.

Najdi: ق → /g/ in nearly all positions[^2]. “I said” = gilt (قلت, with ق pronounced /g/ in everyday speech).
Urban Hejazi: ق → /g/ is the defining reflex in everyday speech[^1]. Hejazi and Najdi both have /g/ in most positions, which is why traditional dialectology groups them together against the varieties that retain /q/ and those that realize ق as a glottal stop /ʔ/ (Cairene, urban Levantine). What separates Urban Hejazi from Najdi is not ق but ج, ك, lexicon, and intonation. (Speakers shift to /q/ in code-switched MSA or formal contexts; that is register, not dialect.)
Old Hejazi (Mecca/Madinah): ق → /g/ in nearly all native lexicon, with /q/ retained only in MSA-borrowed religious and classical vocabulary.
Sharqiyah: ق → /g/ across most varieties, with a frequent further shift to a palatalized realization (commonly rendered as the affricate [dʒ], i.e. an English “j” sound) in the conditioned environment of front vowels[^5] — garib “near” can become jarib in some speakers. The Eastern Province has several locally distinct community varieties; ASR teams should treat them as separate acoustic populations rather than one block.
Asiri/Tihami: ق → /g/, with substantial retention of pharyngealization patterns documented for the southern highlands[^9].

The letter ج is the second giveaway:

Najdi: ج → /dʒ/ (affricate, English “j”).
Hejazi: ج → /dʒ/ in most environments, with [ʒ] (French “j”) realizations attested for some speakers and some lexical items[^1]. The sociolinguistic causation — often anecdotally attributed to media exposure — is not something we treat as established without dedicated sociolinguistic fieldwork.
Sharqiyah: ج → /j/ (palatal, English “y”) in conditioned environments — a feature shared with Kuwaiti and Bahraini varieties[^6].

The letter ك adds a third layer that almost never gets handled:

Najdi: ك → /tʃ/ (“ch”) next to a front vowel (the kashkasha feature). “Your house” is baytich (feminine addressee) versus baytik (masculine); the ch realization is canonical in casual Najdi and absent in MSA[^3].
Hejazi: ك → /k/ retained. The form is baytik; baytich does not occur natively[^4].

For an ASR system, these are not minor accent variants. They are different segments at the acoustic-model level. A model that has only learned /q/ for ق (from MSA-heavy training data) will systematically mis-recognize Najdi/Hejazi galb (heart, with /g/). This is the failure I have watched cause more demo-room embarrassment than any other single error. The fix is dialect-stratified training data with /g/-realization heavily represented, not more MSA hours.

Lexicon: same meaning, different word

A short reference table that lets a buyer eyeball how much surface drift sits between the major varieties.

Concept	MSA	Najdi	Urban Hejazi	Sharqiyah	Asiri
“now”	الآن (al-ān)	الحين (al-ḥīn)	دلحين / دحين (dalḥīn / daḥīn); increasingly also Egyptian-borrowed دلوقت (dilwaqt) among media-exposed speakers	الحين (al-ḥīn)	الحين / ذحين
“I want”	أريد (urīd)	أبغى (abḡā) / أبي (abī)[^7]	أبغى (abḡā) / أبا (abā)[^7]	أبا (abā) / أبي (abī)	أبا (abā)
“good”	جيّد (jayyid)	زين (zēn)[^8]	كويّس (kuwayyis) — Egyptian loan now mainstream[^8]	زين (zēn)	زين (zēn)
“how are you”	كيف الحال (kayf al-ḥāl)	كيفك / شلونك (shlōnak)	كيف حالك / إزّيّك (izzayyak — Egyptian-borrowed)	شلونك (shlōnak)	كيف حالك
“boy”	ولد (walad)	ولد (walad)	واد (wād) / صبي (ṣabī)	ولد (walad)	عيّل (ʿayyel)
“money”	مال (māl)	فلوس (flūs) / دراهم	فلوس (flūs) / مصاري (maṣārī, a Levantine form)	فلوس (flūs)	فلوس (flūs)
“car”	سيّارة (sayyāra)	سيّارة / موتر (mōtar)	عربيّة (ʿarabiyya — Egyptian) / سيّارة	سيّارة	سيّارة
“no”	لا (lā)	لا / ما (ma)	لا / ما (ma) / مو (mu)	لا / ما	لا

Two patterns leap out. First, the urban Hejazi column carries heavy Egyptian-Arabic loan presence — a consequence of a century of Egyptian media saturation plus the Hejaz’s historical role as a cosmopolitan pilgrimage corridor. Second, Najdi and Sharqiyah share much of their lexical core with each other (and with the wider Gulf varieties of Kuwait, Bahrain, and Qatar), while Hejazi sits as a partial outlier.

A sentiment classifier trained on Najdi-heavy Twitter data and asked to label Hejazi product reviews will read the Egyptian-borrowed vocabulary as out-of-distribution, drop confidence, and default to neutral. We see this in evaluation runs repeatedly.

Morphology: where the model breaks silently

Phonology mismatches at least produce visibly wrong transcripts. Morphology mismatches produce transcripts that look right and mean the wrong thing.

The negation system is the cleanest example.

Najdi: ma + verb. Ma adri “I don’t know.” Ma abḡā “I don’t want.”
Urban Hejazi: ma + verb for verbal negation; mu (مو) as the copular/predicate negator[^4]. Mu kuwayyis “not good.” Despite heavy Egyptian lexical borrowing in Hejazi, the grammar of negation is conservative: Hejazi has resisted the Egyptian ma-…-sh circumfix and does not use mish natively (those are Egyptian/Levantine, not Hejazi).
Sharqiyah: ma + verb dominates; mu + adjective (“not [adjective]”) common.
Asiri: ma + verb; some conservative lam-style negators retained from earlier varieties.

The Saudi-internal negation systems are actually quite similar on the construction side: ma + verb is shared across Najdi, Hejazi, Sharqiyah, and Asiri. The downstream sentiment-classifier failure mode is therefore not about the negation construction itself but about the lexicon being negated. A Najdi-trained sentiment model has not seen the Egyptian-borrowed Hejazi vocabulary that appears inside negated phrases (e.g., kuwayyis, kida, ʿarabiyya), so it treats the negated phrase as out-of-distribution and drops to neutral. Lexical OOV inside a familiar grammatical frame is the real failure mode.
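
One way to check for this before training anything is to measure how much of a target variety's evaluation text falls outside the training vocabulary. The sketch below is a minimal illustration under stated assumptions: whitespace tokenization, a hypothetical `oov_rate` helper, and toy romanized token lists assembled from forms mentioned in this post rather than real corpus data. A production pipeline would use its own tokenizer on Arabic-script text.

```python
def oov_rate(train_texts, eval_texts):
    """Share of evaluation tokens that never appear in the training texts.

    Assumes whitespace tokenization, which is a simplification.
    """
    vocab = {tok for text in train_texts for tok in text.split()}
    tokens = [tok for text in eval_texts for tok in text.split()]
    if not tokens:
        return 0.0
    unseen = sum(1 for tok in tokens if tok not in vocab)
    return unseen / len(tokens)


# Toy romanized data (illustrative only, not attested utterances).
train_side = ["ma adri", "ma abḡā", "mu zēn"]
eval_side = ["mu kuwayyis", "ma abḡā kida"]

rate = oov_rate(train_side, eval_side)
print(f"OOV rate: {rate:.2f}")
```

In this toy case the negators and the verb are all in vocabulary; only the borrowed items are unseen, which is the "familiar frame, unfamiliar lexicon" pattern described above.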

Pronoun systems differ too. The second-person feminine suffix:

Najdi: -ich / -ik / -ish depending on phonological environment.
Hejazi: -ik (no affrication).
Sharqiyah: -ich / -ish (often palatalized).

A voice-biometric pipeline that assumes a uniform pronoun morphology will mis-segment the trailing morpheme and degrade speaker-modeling features in subtle ways that show up as elevated false-accept rate on cross-region traffic.

What this does to commercial AI

ASR word-error rate

The following table reflects internal Annota8 benchmark observations from production speech baselines we evaluate for foundation-model and telco customers (Whisper-large-v3, ALLaM-derived speech stacks, the major cloud Arabic ASR APIs), on read-prompt + spontaneous conversational test sets. Public Arabic ASR benchmarks (e.g. Talafha et al. 2023 on N-shot Whisper) do not yet publish Saudi-internal cluster splits at this granularity, so these ranges should be read as our operational estimate, not a peer-reviewed number:

Variety	Annota8 internal WER range on production baselines
Najdi (Riyadh urban)	12-18%
Hejazi (Jeddah urban)	18-25%
Sharqiyah (Dammam/Khobar urban)	14-20%
Asiri / southern	22-32%
Bedouin tribal varieties (any region)	30%+

The Najdi advantage is, in our reading, not because Najdi is “simpler” — it is because Najdi-origin speakers tend to dominate the Saudi-government recordings that in turn dominate publicly available Saudi corpora, and the major baselines were trained on what was available. Hejazi sits worse because it is under-represented relative to its share of the population. Asiri sits worst because it is under-represented relative to anything. We treat the gap, not the absolute numbers, as the operationally important observation.
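
To make the "gap, not the absolute number" point concrete, here is a minimal sketch of a per-variety WER computation. It assumes each test utterance carries a variety label; the function names (`word_errors`, `per_variety_wer`, `pooled_wer`) and the placeholder transcripts are ours for illustration and do not describe any specific vendor tooling. Real evaluations would also normalize Arabic orthography before scoring, which is omitted here.

```python
from collections import defaultdict


def word_errors(reference, hypothesis):
    """Return (word-level edit distance, reference length in words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, keeping only two rows of the table.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def per_variety_wer(samples):
    """samples: iterable of (variety, reference, hypothesis) tuples."""
    errors, words = defaultdict(int), defaultdict(int)
    for variety, ref, hyp in samples:
        e, n = word_errors(ref, hyp)
        errors[variety] += e
        words[variety] += n
    return {v: errors[v] / words[v] for v in errors if words[v] > 0}


def pooled_wer(samples):
    """Single WER over all utterances, ignoring variety labels."""
    totals = [word_errors(ref, hyp) for _, ref, hyp in samples]
    n_words = sum(n for _, n in totals)
    return sum(e for e, _ in totals) / n_words if n_words else 0.0


# Placeholder transcripts; letters stand in for words.
samples = [
    ("najdi", "a b c d", "a b c d"),
    ("najdi", "a b c d", "a b x d"),
    ("hejazi", "a b c d", "a x y d"),
]
by_variety = per_variety_wer(samples)
macro = sum(by_variety.values()) / len(by_variety)
print(by_variety, pooled_wer(samples), macro)
```

On this toy set the pooled WER is 0.25 while the Hejazi cell is 0.50: a test set dominated by one variety can report a comfortable single number while another variety sits at twice that error rate.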

Sentiment and intent classification

A vendor that ships a single “Saudi Arabic” intent classifier — trained predominantly on Najdi data because that is where the public data lives — will silently degrade on Hejazi and Sharqiyah traffic. The degradation pattern repeats:

Hejazi reviews with Egyptian-borrowed vocabulary drift toward “neutral” because the model treats the Egyptian tokens as out-of-distribution.
Sharqiyah community-specific religious phrasing gets misclassified as off-topic, because the community register doesn’t appear at training-time frequency.
Asiri regional vocabulary triggers OOV-driven low confidence and dumps to the human fallback queue at 3-4x the rate of Najdi traffic — making the cost of running the system regionally uneven, which is a finance problem as well as an accuracy problem.
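
The uneven fallback cost can be measured directly from classifier logs. The sketch below assumes each logged prediction has a variety label and a confidence score, and that anything under a fixed threshold goes to the human queue; the 0.6 threshold, the function names, and the sample scores are all hypothetical.

```python
from collections import Counter


def fallback_rates(predictions, threshold=0.6):
    """predictions: iterable of (variety, confidence) pairs.

    Returns the share of each variety's traffic that falls below the
    confidence threshold and would be sent to human review.
    """
    total, fallback = Counter(), Counter()
    for variety, confidence in predictions:
        total[variety] += 1
        if confidence < threshold:
            fallback[variety] += 1
    return {v: fallback[v] / total[v] for v in total}


def relative_to(rates, baseline):
    """Express each variety's fallback rate as a multiple of the baseline's."""
    base = rates[baseline]
    if base == 0:
        return {v: None for v in rates}  # ratio undefined without baseline fallbacks
    return {v: r / base for v, r in rates.items()}


# Made-up confidence scores for illustration.
predictions = [
    ("najdi", 0.9), ("najdi", 0.5), ("najdi", 0.8), ("najdi", 0.7),
    ("asiri", 0.4), ("asiri", 0.5), ("asiri", 0.3), ("asiri", 0.9),
]
rates = fallback_rates(predictions)
multiples = relative_to(rates, baseline="najdi")
print(rates, multiples)
```

Multiplying each variety's rate by its traffic volume and the per-item review cost gives the regional cost split that finance teams tend to ask about.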

For aspect-based sentiment specifically — see our dialect-stratified sentiment breakdown — the Saudi-internal slicing matters as much as the cross-dialect (Saudi vs Egyptian vs Levantine) slicing the industry already talks about.

Voice-biometric fraud risk

This one is the most operationally severe. Voice-biometric enrollment typically happens once, at account opening. Subsequent verification happens dozens of times over the account’s life.

If a customer enrolls in Hejazi register (calling from home in Jeddah on a Friday) and verifies in Najdi-shifted register (calling from a work trip in Riyadh, switched register toward the addressee), an under-trained speaker-verification system reads the within-speaker variation as cross-speaker variation and rejects.

The inverse is worse. A model that has only learned Najdi-baseline speaker embeddings can mis-score Hejazi imposters as legitimate, because the model treats unfamiliar phonological patterns as identity-irrelevant noise. We have seen this produce documented false-accept events in commercial deployments — and it is the kind of failure mode that does not get published in a vendor datasheet.

The mitigation is dialect-stratified enrollment data and dialect-aware speaker-modeling features. To our knowledge, no off-the-shelf cloud API offers this today.
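
Both failure directions show up if verification trials are scored per (enrollment variety, verification variety) pair instead of in aggregate. The following is a sketch under assumptions: trials are labeled with both varieties, scores are similarity values where higher means "more likely the same speaker", and the 0.5 threshold and trial data are invented for illustration.

```python
from collections import defaultdict


def far_frr_by_pair(trials, threshold):
    """trials: iterable of (enroll_variety, verify_variety, same_speaker, score).

    Returns false-accept and false-reject rates for each variety pair.
    A rate is None when the pair has no trials of the relevant kind.
    """
    counts = defaultdict(lambda: {"fa": 0, "impostor": 0, "fr": 0, "genuine": 0})
    for enroll, verify, same_speaker, score in trials:
        c = counts[(enroll, verify)]
        accepted = score >= threshold
        if same_speaker:
            c["genuine"] += 1
            c["fr"] += int(not accepted)   # genuine speaker rejected
        else:
            c["impostor"] += 1
            c["fa"] += int(accepted)       # impostor accepted
    report = {}
    for pair, c in counts.items():
        far = c["fa"] / c["impostor"] if c["impostor"] else None
        frr = c["fr"] / c["genuine"] if c["genuine"] else None
        report[pair] = {"FAR": far, "FRR": frr}
    return report


# Invented trials: the second and third are the same speaker shifting register.
trials = [
    ("hejazi", "hejazi", True, 0.9),
    ("hejazi", "najdi", True, 0.4),
    ("hejazi", "najdi", True, 0.8),
    ("najdi", "hejazi", False, 0.7),
    ("najdi", "hejazi", False, 0.2),
    ("najdi", "najdi", False, 0.1),
]
report = far_frr_by_pair(trials, threshold=0.5)
for pair, rates in sorted(report.items()):
    print(pair, rates)
```

The cross-variety cells are the ones to watch: an aggregate FAR or FRR averages them together with the much larger same-variety traffic.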

What Annota8 does about it

A short, concrete list of what our pipeline does differently on Saudi work specifically — not a sales pitch, just the operating shape.

Riyadh + Jeddah workforce splits. Annotators in our Saudi network are tagged by city-of-residence + variety-of-fluency. Najdi audio routes to Najdi-fluent annotators, Hejazi audio routes to Jeddah-network annotators, and we maintain explicit headcount in both rather than treating it as one pool. (See our notes on the Riyadh + Cairo workforce split for the cost-and-sovereignty tradeoffs.)
Dialect-stratified evaluation sets, not a single Saudi holdout. Every Saudi-customer evaluation set we build has per-variety F1/WER cells and a macro number. The macro number alone is what gets buyers in trouble.
Cairo PhD-linguist tier with Saudi sub-dialect specialization. The adjudication and decision-log layer sits in our Cairo team, where Arabic linguistics PhDs are economically available — including specialists trained on specific Saudi varieties. See the Cairo PhD-linguist economic model for why this is structurally available to us in Egypt.
Explicit code-switching tags. Every transcript carries token-level tags for variety identity: Hejazi-with-Egyptian-loanword versus Hejazi-with-MSA-borrowing versus pure Hejazi. Downstream models can route on this. The token, not the utterance, is the unit of work for code-switching.
Honest sub-dialect coverage maps shared with the customer. Where our coverage is thin (Bedouin tribal varieties, Najran-region Yemeni-transition speech) we say so on the spec sheet. Buying a “Saudi-complete” claim from a vendor that has not published a coverage map is buying air.
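
As an illustration of what routing on token-level tags could look like, the sketch below uses a hypothetical three-value tag set (`hejazi`, `egyptian_loan`, `msa`), a hypothetical `TaggedToken` record, and made-up route names and threshold. It is not our delivery schema, only one plausible shape for it.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedToken:
    text: str
    tag: str  # hypothetical tag set: "hejazi", "egyptian_loan", "msa"


def tag_shares(tokens):
    """Fraction of tokens carrying each tag; empty dict for empty input."""
    counts = Counter(tok.tag for tok in tokens)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()} if total else {}


def route(tokens, threshold=0.25):
    """Pick a downstream model name based on the utterance's tag mix."""
    shares = tag_shares(tokens)
    if shares.get("egyptian_loan", 0.0) >= threshold:
        return "hejazi_loan_aware"
    if shares.get("msa", 0.0) >= threshold:
        return "hejazi_msa_aware"
    return "hejazi_default"


# Romanized example built from forms used earlier in this post.
utterance = [TaggedToken("mu", "hejazi"), TaggedToken("kuwayyis", "egyptian_loan")]
print(tag_shares(utterance), route(utterance))
```

Keeping the tag on the token rather than on the utterance means the same transcript can feed an utterance-level router like this one and a token-level feature extractor without re-annotation.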

The honest limit

Even with the above, Annota8 does not yet have full Bedouin sub-dialect coverage. The Bedouin-origin varieties present across Najd, the Hejaz, and the southern regions each carry phonological and lexical features distinct from the urban varieties of the same region — a fact acknowledged across mainstream Arabic dialectology. Building production-grade ASR + sentiment for these requires fieldwork-grade annotator networks we are still expanding into. Today we mark Bedouin-origin speech as such in delivery and we explicitly do not claim production accuracy on it.

We mention this on purpose. A vendor that says “we cover everything” is either lying or unaware. Saying out loud what we don’t do yet is the same operational honesty that gets us asked back to the next quarter’s evaluation.

What this means for an AI buyer

If you are an AI lead at a MENA telco or contact-center operator running Saudi-customer traffic, these are the practical asks to put to any speech or foundation-model vendor before you sign:

Show me per-variety WER on a holdout that splits at least Najdi / Hejazi / Sharqiyah / Asiri.
Show me the annotator network composition by variety — not just country.
Show me your dialect identification confusion matrix between the four Saudi clusters. The number of vendors who can produce this is small.
Tell me what you do not cover. Vendors who can name their gap usually have less of one.
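
For buyers who want to know what the confusion-matrix ask looks like in practice, here is a minimal sketch. It assumes a labeled holdout of (true cluster, predicted cluster) pairs; the cluster keys, function names, and sample labels are placeholders, not results from any real system.

```python
CLUSTERS = ["najdi", "hejazi", "sharqiyah", "southern"]


def confusion_matrix(pairs):
    """pairs: iterable of (true_cluster, predicted_cluster).

    Returns a nested dict: matrix[true][predicted] -> count.
    """
    matrix = {t: {p: 0 for p in CLUSTERS} for t in CLUSTERS}
    for true, predicted in pairs:
        matrix[true][predicted] += 1
    return matrix


def per_cluster_recall(matrix):
    """Share of each true cluster that was identified correctly."""
    recall = {}
    for true, row in matrix.items():
        total = sum(row.values())
        recall[true] = row[true] / total if total else None
    return recall


# Placeholder labels for illustration.
pairs = [
    ("najdi", "najdi"), ("najdi", "najdi"), ("najdi", "sharqiyah"),
    ("hejazi", "hejazi"), ("hejazi", "najdi"),
    ("sharqiyah", "najdi"), ("sharqiyah", "sharqiyah"),
    ("southern", "najdi"), ("southern", "southern"),
]
matrix = confusion_matrix(pairs)
recall = per_cluster_recall(matrix)

# Print rows as true cluster, columns as predicted cluster.
print("true/pred".ljust(12) + "".join(c.ljust(11) for c in CLUSTERS))
for true in CLUSTERS:
    print(true.ljust(12) + "".join(str(matrix[true][p]).ljust(11) for p in CLUSTERS))
```

The off-diagonal cells matter more than the headline accuracy: a column that absorbs errors from every other row (here, the Najdi column) is the signature of a model trained mostly on one variety.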

The model that wins Saudi commercial deployments over the next two years will not be the biggest. It will be the one measured on this internal slicing — and willing to publish the per-variety table without asterisks.


Limitations & disclaimer

Limitations of this analysis. This post reflects Annota8's reading of publicly available evidence as of its last-modified date. Vendor positioning, regulatory frameworks, benchmark numbers, and program scope can change without notice. Where numeric ranges are cited, those numbers are reproducible from the source linked in the post's References section — Annota8 has not independently re-run the benchmarks unless explicitly stated in the post.

Privacy & legal posture. Annota8 is an early-stage AI data operations company in soft launch. We do not currently hold SOC 2, ISO 27001, PDPL certification, or any other third-party security or privacy certification. We design with PDPL principles in mind and can sign a DPA modelled on the EU SCC template. Specific compliance posture for your engagement is available on request from [email protected].

Nothing in this post is legal, tax, or investment advice. Regulatory citations should be verified with counsel in your jurisdiction. Vendor names mentioned in this post are referenced as industry-landscape context only — Annota8 is not asserting a comparative product claim, a customer relationship, or any other affiliation with any platform named, unless that affiliation is explicitly stated.
