Building Arabic dialect ASR — annotation lessons

Why Arabic dialect ASR is structurally hard

1. The publicly-available training corpus is MSA-skewed

Most Arabic-language audio publicly available for training is MSA — broadcast news, Quranic recitation, lectures, formal speeches. Spoken dialect is rare in scraped corpora.[2]

Models pretrained on this skewed corpus learn MSA well but perform poorly on dialect.

2. Dialect families are not mutually intelligible

Speakers from different Arabic dialect families often struggle to understand each other.[1] From the model’s perspective, treating Egyptian + Gulf + Levantine + Maghrebi as one language is like treating Spanish + Italian + French + Romanian as one language.

3. Code-switching is common

In MENA tech + business contexts, code-switching with English (or French in the Maghreb) tokens is common. The ASR must handle:

4. Diglossia complicates orthographic transcription

Spoken dialect is rarely written in formal registers, so it has no settled spelling. Egyptian “إيه أخبارك؟” doesn’t appear in MSA literature.[3] Transcription therefore requires explicit dialect orthography conventions.

5. Phonetic variation is huge

Arabic dialect phonetic inventory differs across families:

Phonetic ground-truth annotation must account for this.

What good dialect ASR annotation looks like

Dialect-stratified data sourcing

Don’t lump dialects together. Source separately per family + sub-family:

Annotation guidelines must include dialect-family + sub-dialect tagging.

Orthographic transcription convention per dialect

Choose + document an orthographic convention per dialect. Two main approaches:

Mixing conventions across the corpus produces inconsistent ASR. Pick one + enforce it.

Code-switching token-level language ID

For mixed Arabic + Latin utterances, tag each token’s language identity. The ASR can then route Latin tokens to a different acoustic model + language model than Arabic tokens.
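A first-pass token tagger can be sketched from Unicode script names alone. This is a minimal illustration, not a production language identifier: the function names and the `arabic`/`latin`/`numeric`/`other` label set are assumptions chosen for this example, and a script heuristic will mislabel Arabic written in Latin characters, so human review of the tags is still needed.

```python
import unicodedata


def token_language(token: str) -> str:
    """Tag one token by the script of most of its characters.

    Returns 'arabic', 'latin', 'numeric', or 'other' (hypothetical label set).
    """
    counts = {"arabic": 0, "latin": 0, "numeric": 0}
    for ch in token:
        if ch.isdigit():
            # Covers both ASCII digits and Arabic-Indic digits.
            counts["numeric"] += 1
        elif ch.isalpha():
            name = unicodedata.name(ch, "")
            if name.startswith("ARABIC"):
                counts["arabic"] += 1
            elif name.startswith("LATIN"):
                counts["latin"] += 1
    best = max(counts, key=counts.get)
    # Tokens with no letters or digits (e.g. lone punctuation) fall through.
    return best if counts[best] > 0 else "other"


def tag_utterance(text: str) -> list[tuple[str, str]]:
    """Split on whitespace and attach a language tag to every token."""
    return [(tok, token_language(tok)) for tok in text.split()]


if __name__ == "__main__":
    for token, lang in tag_utterance("عندي meeting الساعة 3"):
        print(lang, token)
```

Whitespace tokenisation is a simplification. Clitics attached to a borrowed word (an Arabic article prefixed to an English noun, for instance) would need a policy in the annotation guideline rather than a script check.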

PhD-linguist transcription QA

For long-form transcription QA (e.g. meeting transcripts, conversational AI training data), PhD-linguist QA materially improves eval results. Crowd-sourced transcription of Arabic dialect typically produces 5-15% error rates, and those errors compound at training time.[8]

Time-aligned segmentation

For dialect ASR, segment-level time alignment matters. Code-switching boundaries align with token boundaries, and word-level + phoneme-level alignment is the gold standard for serious dialect ASR.
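One way to represent word-level alignment so that code-switching boundaries fall out of the data is sketched below. The `AlignedToken` fields and both helper functions are hypothetical names for this example, not a format the post prescribes; times are assumed to be in seconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignedToken:
    text: str
    start: float  # seconds from the start of the segment
    end: float
    lang: str     # token-level language tag, e.g. "arabic" or "latin"


def validate_alignment(tokens: list[AlignedToken]) -> list[str]:
    """Return a list of problems: empty durations or overlapping tokens."""
    problems = []
    prev_end = 0.0
    for i, tok in enumerate(tokens):
        if tok.end <= tok.start:
            problems.append(f"token {i}: non-positive duration")
        if tok.start < prev_end:
            problems.append(f"token {i}: overlaps previous token")
        prev_end = max(prev_end, tok.end)
    return problems


def switch_points(tokens: list[AlignedToken]) -> list[tuple[float, str, str]]:
    """Times at which the language tag changes between adjacent tokens."""
    return [
        (cur.start, prev.lang, cur.lang)
        for prev, cur in zip(tokens, tokens[1:])
        if prev.lang != cur.lang
    ]


if __name__ == "__main__":
    segment = [
        AlignedToken("عندي", 0.0, 0.4, "arabic"),
        AlignedToken("meeting", 0.45, 0.9, "latin"),
        AlignedToken("بكرة", 0.95, 1.3, "arabic"),
    ]
    print(validate_alignment(segment))
    print(switch_points(segment))
```

Because each switch point is the start time of a token, the boundary can only be as accurate as the word-level alignment underneath it.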

Speaker turn + diarisation handling

For multi-speaker dialect data (interviews, podcasts, conversational AI training), speaker diarisation must be reliable. Manual speaker-turn validation is part of the QA layer for serious dialect ASR.
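Manual validation is easier to schedule when suspicious turns are surfaced automatically. The sketch below flags two patterns for a human reviewer; the `Turn` structure, the `flag_for_review` name, and the 0.3-second threshold are all assumptions made for illustration, and the checks are heuristics rather than a measure of diarisation accuracy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    speaker: str
    start: float  # seconds
    end: float


def flag_for_review(turns: list[Turn], min_duration: float = 0.3) -> list[tuple[Turn, str]]:
    """Flag turns a human should check: very short turns and speaker overlaps.

    Only adjacent turns (after sorting by start time) are compared, so a long
    turn that overlaps several later ones is only partially covered.
    """
    flagged = []
    ordered = sorted(turns, key=lambda t: t.start)
    for i, turn in enumerate(ordered):
        if turn.end - turn.start < min_duration:
            flagged.append((turn, "very short turn"))
        if i > 0:
            prev = ordered[i - 1]
            if turn.start < prev.end and turn.speaker != prev.speaker:
                flagged.append((turn, "overlaps previous speaker"))
    return flagged


if __name__ == "__main__":
    turns = [Turn("A", 0.0, 2.0), Turn("B", 1.8, 3.0), Turn("A", 3.0, 3.1)]
    for turn, reason in flag_for_review(turns):
        print(turn.speaker, turn.start, reason)
```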

Annotation guideline anchors

Specific anchors that should appear in dialect ASR annotation guidelines:

  1. Dialect-family tag mandatory — per-utterance + per-segment
  2. Sub-dialect tag optional but encouraged for stratified eval
  3. Orthographic convention — explicit + documented + enforced
  4. Code-switching token language ID — per-token Arabic/Latin/numeric tag
  5. Disfluency handling — explicit policy on filled pauses, false starts, repairs
  6. Numbers + acronyms — explicit transcription policy (written-as-spoken vs normalised)
  7. Background noise + non-speech — explicit tagging conventions
  8. Overlap + crosstalk — explicit conventions

Without these, the training data quality drifts silently across annotators.
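Several of these anchors can be checked mechanically before a record enters the training set. The following is a sketch under assumed names: the record layout (`dialect_family`, `orthography`, `tokens`, `lang`), the family list, and the `"coda"` convention label are illustrative, not a schema defined in this post.

```python
# Hypothetical label sets; adjust to the project's own guideline.
DIALECT_FAMILIES = {"egyptian", "gulf", "levantine", "maghrebi", "msa"}
TOKEN_LANGS = {"arabic", "latin", "numeric"}


def validate_record(record: dict, corpus_convention: str) -> list[str]:
    """Return guideline violations for one annotated utterance."""
    errors = []
    # Anchor 1: the dialect-family tag is mandatory.
    if record.get("dialect_family") not in DIALECT_FAMILIES:
        errors.append("missing or unknown dialect_family")
    # Anchor 2: sub_dialect is optional, so it is deliberately not checked.
    # Anchor 3: one orthographic convention, enforced across the corpus.
    if record.get("orthography") != corpus_convention:
        errors.append("orthography differs from corpus convention")
    # Anchor 4: every token carries a language tag.
    tokens = record.get("tokens", [])
    if not tokens:
        errors.append("no tokens")
    for i, tok in enumerate(tokens):
        if tok.get("lang") not in TOKEN_LANGS:
            errors.append(f"token {i}: missing or unknown lang")
    return errors


if __name__ == "__main__":
    record = {
        "dialect_family": "egyptian",
        "orthography": "coda",
        "tokens": [{"text": "عندي", "lang": "arabic"}, {"text": "meeting"}],
    }
    print(validate_record(record, corpus_convention="coda"))
```

Policies for disfluencies, numbers, noise, and overlap are harder to verify automatically; a check like this only confirms that the tags exist, not that annotators applied them consistently.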

Eval set construction

Dialect-stratified eval sets are non-negotiable. Recommended composition:

| Component | % of eval | Purpose |
| --- | --- | --- |
| Per-dialect family hold-out | 5% per family | Per-family WER measurement |
| Code-switching hold-out | 5% | Code-switching robustness |
| Heavy disfluency hold-out | 5% | Real conversational robustness |
| MSA control | 10% | MSA baseline comparison |
| Cross-dialect generalisation | 5% | Train-on-A test-on-B robustness |

Without stratified eval, you can’t see where the model fails.
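Reporting WER per stratum rather than as one pooled number is what makes those failures visible. A minimal sketch, assuming each eval sample carries a stratum label and that whitespace tokenisation is acceptable (the function names and the sample tuple layout are illustrative):

```python
from collections import defaultdict


def word_errors(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def wer_by_stratum(samples) -> dict[str, float]:
    """samples: iterable of (stratum, reference_text, hypothesis_text).

    WER per stratum is total word errors divided by total reference words.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for stratum, ref, hyp in samples:
        ref_words, hyp_words = ref.split(), hyp.split()
        errors[stratum] += word_errors(ref_words, hyp_words)
        words[stratum] += len(ref_words)
    return {s: errors[s] / words[s] for s in words if words[s] > 0}


if __name__ == "__main__":
    # Placeholder tokens stand in for real transcripts.
    samples = [
        ("egyptian", "a b c d", "a b x d"),
        ("egyptian", "e f", "e f"),
        ("gulf", "a b", "a"),
    ]
    for stratum, wer in wer_by_stratum(samples).items():
        print(f"{stratum}: {wer:.3f}")
```

Summing errors and reference words within a stratum, instead of averaging per-utterance WER, keeps short utterances from dominating the score.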

Where Annota8 fits

Annota8 was built for Arabic dialect work. Capability stack:

See audio annotation modality for the full capability detail.

FAQ

How many hours of dialect audio do I need to train usable ASR?
Depends on baseline. Fine-tuning a pretrained multilingual model (Whisper-class) on dialect-stratified data gets to usable quality on the target dialect. Training from scratch needs thousands of hours.
Can I use translated MSA transcripts for dialect training?
Generally no. Translated MSA loses the dialect surface features the model needs to learn. Native dialect transcription is required.
What's the WER gap between MSA-trained and dialect-trained ASR on dialect input?
MSA-trained ASR shows materially higher WER on dialect speech than purpose-built dialect ASR. The gap narrows for closely-related dialects (e.g. KSA Najdi vs UAE Gulf) and widens for distant dialects (e.g. an MSA-trained model on Maghrebi).
How is code-switching ASR different from monolingual Arabic ASR?
Code-switching ASR requires joint Arabic-English (or Arabic-French for Maghrebi speech) acoustic + language modelling. Token-level language ID is part of the annotation. The model must handle phoneme inventories from both languages.
Does Annota8 work with major Arabic ASR providers?
Yes. Annota8's API-based workflow integrates with custom data pipelines. We support customers training internal Arabic ASR + customers fine-tuning Whisper-class multilingual models on Arabic dialect data.

References

  1. “Mutual Intelligibility of Arabic Dialects,” University of Arizona. https://journals.librarypublishing.arizona.edu/cms/article/6549/galley/6052/view/

  2. Alhmoud et al., “Advancing Arabic Speech Recognition Through Large-Scale Weakly Supervised Learning,” arXiv:2504.12254. https://arxiv.org/html/2504.12254v2 ; “Overcoming Data Scarcity in Multi-Dialectal Arabic ASR via Whisper Fine-Tuning,” arXiv:2506.02627. https://arxiv.org/html/2506.02627v1

  3. Habash, Diab, Rambow, “Conventional Orthography for Dialectal Arabic,” LREC 2012. https://aclanthology.org/L12-1328/ ; CAMeL CODA guidelines. https://camel-guidelines.readthedocs.io/en/latest/orthography/

  4. Najdi historical phonology, Theory & Practice in Language Studies. https://tpls.academypublication.com/index.php/tpls/article/download/5631/4523/15580

  5. Lebanese Arabic linguistic features (urban /q/ → /ʔ/). https://lebanese-arabic.com/lebanese-arabic-linguistic-features/

  6. “Egyptian Arabic phonology” overview (https://en.wikipedia.org/wiki/Egyptian_Arabic_phonology) ; Al-Kindi, “Social Stratification of Qaf in Egyptian Arabic,” IJLLT. https://al-kindipublisher.com/index.php/ijllt/article/download/2017/1709/4744

  7. Benkato, “Maghrebi Arabic,” Language Science Press. https://langsci-press.org/catalog/view/235/1811/1837-1 ; Oxford Academic, “Maghrebi dialects of Arabic.” https://academic.oup.com/book/26748/chapter/195613681

  8. Bouamor et al., “Best Practices for Crowdsourcing Dialectal Arabic Speech Transcription.” https://www.academia.edu/26822430/Best_Practices_for_Crowdsourcing_Dialectal_Arabic_Speech_Transcription