Fine-tuning Whisper on Arabic dialect — annotation lessons
Why out-of-the-box Whisper underperforms on Arabic dialect
1. Pretraining corpus is MSA-heavy
Whisper large-v3 was trained on roughly 1M hours of weakly-labelled audio plus 4M hours of pseudo-labelled audio[^2] (the original Whisper paper described 680K hours;[^1] large-v2 reused that corpus, and large-v3 expanded it). OpenAI has not published a breakdown of the Arabic subset, but it most plausibly draws on:
- News broadcasts (Al Jazeera, Al Arabiya) — almost entirely MSA
- Lectures + religious content — heavy MSA
- YouTube + podcasts — mixed dialect but unbalanced
- Translation pairs (English-Arabic) — mostly MSA written form
The result: Whisper handles MSA well + dialect poorly.
2. Dialect families weren’t balanced
Even within the dialect portion, the corpus doesn’t balance Gulf vs Levantine vs Egyptian vs Maghrebi. Egyptian is over-represented in YouTube; Maghrebi is under-represented; sub-dialects (Najdi vs Hijazi, Cairene vs Upper Egyptian) are not stratified at all.
3. Code-switching is treated as monolingual
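Whisper conditions decoding on a single language token per 30-second window, so it effectively assigns one language per utterance. Real MENA business + tech speech mixes Arabic + English (or French in Maghrebi speech) at the token level, often word by word within one sentence. The model tends to respond in one of three ways: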
- Mis-recognising Latin tokens as Arabic (gibberish)
- Skipping Latin tokens entirely
- Translating to one language unintentionally
4. Phonetic variation is uncalibrated
Gulf /q/ → /g/,[^4] Levantine /q/ → /ʔ/,[^5] Egyptian /ʤ/ → /g/[^6] — these phonetic differences aren’t explicitly handled. Without dialect-stratified fine-tuning data, the acoustic model can’t disambiguate.
What fine-tuning achieves
The big lessons:
- MSA-only fine-tuning doesn’t improve dialect — dialect-stratified data is essential
- A code-switching subset materially improves code-switching WER
- Beyond a few hundred hours of dialect-stratified data, returns diminish but quality continues climbing slowly
What good Arabic Whisper fine-tuning data looks like
Component 1: Dialect-stratified audio + transcript pairs
Minimum stratification (roughly 60-210 hours total, summing the ranges below):
| Dialect family | Hours | Notes |
|---|---|---|
| MSA | 20-50 | Broadcast + lecture + formal |
| Gulf | 10-40 | Najdi + Hijazi + Khaleeji blend |
| Levantine | 10-40 | Lebanese + Syrian + Jordanian |
| Egyptian | 10-40 | Cairene + Upper Egyptian + Alexandrian |
| Maghrebi | 5-20 | Moroccan + Algerian + Tunisian |
| Code-switching | 5-20 | Arabic-English (and Arabic-French for Maghrebi) |
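
As a rough aid for planning against the table above, the ranges can be encoded and a proposed hour budget checked against them. This is only a sketch: the `plan` dictionary and the helper names are hypothetical, and the ranges are copied from the table rather than derived from any benchmark.

```python
# Hour ranges (min, max) per stratum, copied from the table above.
HOUR_RANGES = {
    "MSA": (20, 50),
    "Gulf": (10, 40),
    "Levantine": (10, 40),
    "Egyptian": (10, 40),
    "Maghrebi": (5, 20),
    "Code-switching": (5, 20),
}


def total_range(ranges=HOUR_RANGES):
    """Return (min_total, max_total) hours summed across all strata."""
    low = sum(lo for lo, _ in ranges.values())
    high = sum(hi for _, hi in ranges.values())
    return low, high


def out_of_range(plan, ranges=HOUR_RANGES):
    """Return strata whose planned hours are missing or outside the table's range."""
    problems = {}
    for name, (lo, hi) in ranges.items():
        hours = plan.get(name, 0)  # a missing stratum counts as 0 hours
        if not lo <= hours <= hi:
            problems[name] = hours
    return problems
```

A plan that leaves out the code-switching stratum, or gives Maghrebi only 2 hours, would show up in the dictionary returned by `out_of_range`.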
Component 2: Orthographic convention
Pick one + enforce it:
- CODA (Conventional Orthography for Dialectal Arabic) — standardised academic[^3]
- Native dialect orthography — let speakers write the way they would on social media
Mixed conventions confuse the model. Decide before annotation.
Component 3: Token-level language ID for code-switching
For mixed-language utterances, tag each token’s language:
"حجزت لكم MEETING بكرة في الـ CONFERENCE ROOM"
ar ar en ar ar ar en en
Whisper fine-tuned on language-tagged tokens handles code-switching far better than Whisper fine-tuned on untagged data.
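
One way to bootstrap these tags before human review is to guess each token's language from its script. The sketch below is a heuristic under stated assumptions, not an annotation standard: the function names are hypothetical, Latin script is mapped to `en` by default, and Arabic written in Latin characters (Arabizi) or French tokens would be mislabelled, so a linguist pass is still needed.

```python
import unicodedata


def script_tag(token, latin_tag="en"):
    """Tag one token as 'ar', latin_tag, or 'other' by counting letter scripts."""
    arabic = latin = 0
    for ch in token:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and combining diacritics
        name = unicodedata.name(ch, "")
        if name.startswith("ARABIC"):
            arabic += 1
        elif name.startswith("LATIN"):
            latin += 1
    if arabic == 0 and latin == 0:
        return "other"
    return "ar" if arabic >= latin else latin_tag


def tag_utterance(text, latin_tag="en"):
    """Split on whitespace and pair every token with its script-based tag."""
    return [(tok, script_tag(tok, latin_tag)) for tok in text.split()]
```

Passing `latin_tag="fr"` would be the obvious adjustment for a Maghrebi Arabic-French batch, although telling English from French inside one utterance needs more than a script check.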
Component 4: Time-aligned transcription
For higher-quality fine-tuning, especially for streaming ASR use cases, time-aligned word + phoneme transcription helps. Forced-alignment tools (Montreal Forced Aligner, WhisperX)[^7] get most of the way; PhD-linguist QA on edge cases gets the rest.
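
To decide which aligned clips deserve expert attention, a simple sanity check over word timestamps can flag likely alignment failures. The sketch below assumes a hypothetical `Word` record with start and end times in seconds; it is not the output format of Montreal Forced Aligner or WhisperX, and the 2-second ceiling is an arbitrary illustrative threshold.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from clip start
    end: float    # seconds from clip start


def alignment_issues(words, clip_duration, max_word_seconds=2.0):
    """Return (index, reason) pairs for words whose timestamps look suspicious."""
    issues = []
    prev_end = 0.0
    for i, w in enumerate(words):
        if w.end <= w.start:
            issues.append((i, "non-positive duration"))
        elif w.end - w.start > max_word_seconds:
            issues.append((i, "implausibly long"))
        if w.start < prev_end:
            issues.append((i, "overlaps previous word"))
        if w.end > clip_duration:
            issues.append((i, "runs past clip end"))
        prev_end = max(prev_end, w.end)
    return issues
```

Clips with a non-empty result are natural candidates for the linguist QA queue.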
Component 5: Disfluency + non-speech handling policy
Explicit conventions for:
- Filled pauses (إإإ، اَمم، طيب)
- False starts + repairs (“بدّي… يعني بدّي أقول”)
- Backchannel (يلا، صح، طيب، عافاكي)
- Laughter, applause, background music
- Multi-speaker overlap
Without explicit conventions, annotators differ + the model can’t learn consistent behavior.
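
A convention is easier to enforce when it is machine-checkable. As one possible illustration, non-speech events and disfluencies could be written as bracketed tags drawn from a closed set, with a validator that reports anything outside it. The tag names and the bracket syntax here are hypothetical choices, not an established standard.

```python
import re

# Hypothetical closed tag set; a real project would define its own.
ALLOWED_TAGS = {
    "filled_pause",
    "false_start",
    "backchannel",
    "laughter",
    "applause",
    "music",
    "overlap",
}

# Matches the text inside one pair of square brackets.
TAG_PATTERN = re.compile(r"\[([^\[\]]*)\]")


def unknown_tags(transcript, allowed=ALLOWED_TAGS):
    """Return bracketed tags, in order of appearance, that are not in the allowed set."""
    return [tag for tag in TAG_PATTERN.findall(transcript) if tag not in allowed]
```

Running this over every submitted transcript catches annotators who drift into private variants such as `[LAUGH]` or `[noise]` before those variants reach training.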
Component 6: PhD-linguist transcription QA
On a 5-10% sample of the corpus:
- Native dialect verification
- Code-switching token assignment verification
- Time-alignment spot-check
- Cultural / register appropriateness check
Crowd-sourced Arabic dialect transcription accumulates label noise that compounds during fine-tuning. Targeted PhD-linguist QA reduces that noise.
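
Drawing the QA sample per dialect, instead of from the pooled corpus, keeps small strata such as Maghrebi from being skipped by chance. The following is a minimal sketch that assumes each clip is a dictionary with a `dialect` key; that schema and the function name are hypothetical.

```python
import random
from collections import defaultdict


def qa_sample(clips, percent=5, seed=0):
    """Pick ceil(percent% of n) clips from every dialect stratum for expert review."""
    by_dialect = defaultdict(list)
    for clip in clips:
        by_dialect[clip["dialect"]].append(clip)

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = []
    for dialect in sorted(by_dialect):
        group = by_dialect[dialect]
        k = (len(group) * percent + 99) // 100  # integer ceiling of percent% of the group
        sample.extend(rng.sample(group, min(k, len(group))))
    return sample
```

Because the count is rounded up per stratum, even a dialect with a handful of clips contributes at least one clip to the review queue.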
Training recipe (rough)
For a serious Arabic dialect Whisper fine-tune:
- Pre-process audio: 16 kHz mono, VAD-segmented into 5-30 second clips, volume-normalised
- Split data: 60% train / 20% validation / 20% test, stratified by dialect family
- Tokenisation: Whisper’s tokeniser handles Arabic but is suboptimal for dialect; consider extending the vocabulary
- Fine-tune: usually 1-3 epochs on dialect-stratified data, with a lower LR (1e-5 typical) and a longer warmup
- Eval: WER per dialect family, per sub-dialect where possible, plus a code-switching subset
- Iterate: identify the worst-performing dialect family, expand that subset, retrain
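
The eval and iterate steps can be sketched with a plain word-level edit distance. The sample schema (`dialect`, `reference`, `hypothesis`) and the function names below are assumptions made for illustration, and a real pipeline would normalise orthography before scoring, which this sketch does not do.

```python
from collections import defaultdict


def word_errors(reference, hypothesis):
    """Return (word-level edit distance, reference length) for one utterance."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1], len(ref)


def wer_by_dialect(samples):
    """Corpus-level WER per dialect: total errors divided by total reference words."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for s in samples:
        e, n = word_errors(s["reference"], s["hypothesis"])
        errors[s["dialect"]] += e
        words[s["dialect"]] += n
    return {d: errors[d] / words[d] for d in words if words[d] > 0}


def worst_dialect(samples):
    """Return the dialect with the highest WER, the candidate for more data."""
    scores = wer_by_dialect(samples)
    return max(scores, key=scores.get)
```

Pooling errors and reference words per dialect, instead of averaging per-utterance WER, stops very short utterances from dominating the score.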
Common pitfalls
Pitfall 1: Training only on MSA, expecting dialect improvement
Doesn’t work. Dialect requires dialect-stratified data.
Pitfall 2: Crowd-sourcing dialect transcription without QA
Produces label noise that the model can’t overcome.
Pitfall 3: Mixed orthographic conventions
The model learns inconsistency, not language structure. Pick one + enforce.
Pitfall 4: No code-switching subset
Code-switching is the default in MENA tech speech. A model fine-tuned without it will fail in production.
Pitfall 5: Imbalanced dialect representation
If 80% of fine-tuning is Egyptian, the model overfits to Egyptian + underperforms on Gulf / Levantine / Maghrebi.
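
A quick guard against this pitfall is to compute each dialect's share of training hours and flag anything above a ceiling before training starts. The 40% ceiling below is an arbitrary illustrative value, and the function names are hypothetical.

```python
def dialect_shares(hours):
    """Return each dialect's fraction of total training hours."""
    total = sum(hours.values())
    if total <= 0:
        raise ValueError("total hours must be positive")
    return {dialect: h / total for dialect, h in hours.items()}


def over_represented(hours, max_share=0.4):
    """Return dialects whose share of training hours exceeds max_share, sorted by name."""
    shares = dialect_shares(hours)
    return sorted(d for d, share in shares.items() if share > max_share)
```

For the 80% Egyptian case described above, `over_represented` would return `["Egyptian"]`, which is the signal to source more hours for the other families or to down-weight Egyptian clips.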
Pitfall 6: Forgetting MSA after dialect fine-tuning
Without MSA in the fine-tuning data, the model can lose MSA capability. Always include an MSA subset.
Where Annota8 helps
Annota8 builds dialect-stratified Whisper fine-tuning datasets across all six components above:
- Dialect-stratified audio sourcing (KSA + Egyptian + Levantine + Maghrebi)
- Native-dialect transcription with CODA or native orthography
- Token-level language ID for code-switching
- Time-aligned transcription with phoneme-level for hard cases
- Disfluency + non-speech convention enforcement
- Cairo PhD-linguist QA on 5-10% sample
Common engagements:
- 50-200 hours dialect-stratified Whisper fine-tune corpus
- 5-20 hours code-switching subset construction
- Dialect-stratified eval set (1-5 hours per family for benchmarking)
- Ongoing edge-case mining + active learning