Fine-tuning Whisper on Arabic dialect — annotation lessons

Why out-of-the-box Whisper underperforms on Arabic dialect

1. Pretraining corpus is MSA-heavy

Whisper large-v3 was trained on roughly 1M hours of weakly-labelled audio plus 4M hours of pseudo-labelled audio[^2] (the original Whisper paper described 680K hours;[^1] large-v2 reused that corpus with longer training, and large-v3 expanded it). The Arabic subset includes:

The result: Whisper handles MSA well and dialect poorly.

2. Dialect families weren’t balanced

Even within the dialect portion, the corpus doesn’t balance Gulf vs Levantine vs Egyptian vs Maghrebi. Egyptian is over-represented in YouTube; Maghrebi is under-represented; sub-dialects (Najdi vs Hijazi, Cairene vs Upper Egyptian) are not stratified at all.

3. Code-switching is treated as monolingual

Whisper conditions decoding on a single language token per 30-second window, so each segment is treated as one language. Real MENA business and tech speech mixes Arabic and English (or French in Maghrebi speech) at the token level. The model handles this by either:

4. Phonetic variation is uncalibrated

Dialects differ systematically in pronunciation: Gulf /q/ → /g/,[^4] Levantine /q/ → /ʔ/,[^5] Egyptian /ʤ/ → /g/.[^6] None of these differences is handled explicitly. Without dialect-stratified fine-tuning data, the model can’t reliably map the variant pronunciations to the same written word.

What fine-tuning achieves

The big lessons:

What good Arabic Whisper fine-tuning data looks like

Component 1: Dialect-stratified audio + transcript pairs

Minimum stratification (roughly 60-210 hours total, summing the ranges below):

| Dialect family | Hours | Notes |
| --- | --- | --- |
| MSA | 20-50 | Broadcast + lecture + formal |
| Gulf | 10-40 | Najdi + Hijazi + Khaleeji blend |
| Levantine | 10-40 | Lebanese + Syrian + Jordanian |
| Egyptian | 10-40 | Cairene + Upper Egyptian + Alexandrian |
| Maghrebi | 5-20 | Moroccan + Algerian + Tunisian |
| Code-switching | 5-20 | Arabic-English (and Arabic-French for Maghrebi) |
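
One way to check a corpus against these minimums is to sum clip durations per dialect family. The sketch below is illustrative only: the manifest format, the `dialect` and `duration_s` field names, and the lowercase dialect keys are assumptions, not part of any Whisper tooling. The thresholds are the lower bounds from the table above.

```python
from collections import defaultdict

# Lower bounds in hours, taken from the table above.
# The dictionary keys are hypothetical labels for this sketch.
MIN_HOURS = {
    "msa": 20, "gulf": 10, "levantine": 10,
    "egyptian": 10, "maghrebi": 5, "code_switching": 5,
}

def hours_by_dialect(manifest):
    """Sum clip durations (in seconds) per dialect label and convert to hours."""
    totals = defaultdict(float)
    for clip in manifest:
        totals[clip["dialect"]] += clip["duration_s"] / 3600.0
    return dict(totals)

def shortfalls(manifest, minimums=MIN_HOURS):
    """Return {dialect: missing_hours} for every family below its minimum."""
    totals = hours_by_dialect(manifest)
    return {
        dialect: round(minimum - totals.get(dialect, 0.0), 2)
        for dialect, minimum in minimums.items()
        if totals.get(dialect, 0.0) < minimum
    }

# Hypothetical manifest: one entry per clip (or per pre-aggregated batch).
manifest = [
    {"path": "a.wav", "dialect": "msa", "duration_s": 72000},   # 20 hours
    {"path": "b.wav", "dialect": "gulf", "duration_s": 18000},  # 5 hours
]
print(shortfalls(manifest))
```

Running this kind of check before annotation starts makes gaps visible while they are still cheap to fill.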

Component 2: Orthographic convention

Pick one and enforce it:

Mixed conventions confuse the model. Decide before annotation.
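
Enforcement is easier when the convention is executable. The sketch below assumes one possible convention, chosen purely for illustration: undiacritised text, no tatweel, and hamza-carrying alef variants folded to bare alef. A real project may well decide differently (for example, keeping hamza), in which case the rules change but the checking pattern stays the same.

```python
import re
import unicodedata

# Assumed convention (illustrative, not prescribed by this post):
# strip diacritics, strip tatweel, fold alef variants to bare alef.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")  # tanween, short vowels, shadda, sukun, dagger alef
TATWEEL = "\u0640"
ALEF_VARIANTS = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above
    "\u0625": "\u0627",  # alef with hamza below
    "\u0622": "\u0627",  # alef with madda
})

def normalise(text: str) -> str:
    """Rewrite a transcript into the assumed convention."""
    text = unicodedata.normalize("NFC", text)
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    text = text.translate(ALEF_VARIANTS)
    return " ".join(text.split())  # collapse runs of whitespace

def violates_convention(text: str) -> bool:
    """True if the transcript would change under normalisation."""
    return normalise(text) != text
```

Running `violates_convention` over incoming transcripts flags annotators who have drifted from the agreed convention before their output reaches the training set.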

Component 3: Token-level language ID for code-switching

For mixed-language utterances, tag each token’s language:

"حجزت لكم MEETING بكرة في الـ CONFERENCE ROOM"
   ar     ar  en   ar    ar  ar  en          en

A Whisper model fine-tuned on language-tagged tokens handles code-switching far better than one fine-tuned on untagged data.
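
A first-pass tagger can be bootstrapped from script alone, leaving annotators to correct the output. The sketch below is a rough heuristic rather than a language identifier: it labels every Latin-script token `en`, so it would mislabel French in Maghrebi data and cannot handle Arabic written in Latin characters. The function names are hypothetical.

```python
def tag_token(token: str) -> str:
    """Return 'ar', 'en', or 'other' based on the script of the token's characters."""
    arabic = sum(1 for ch in token if "\u0600" <= ch <= "\u06FF")
    latin = sum(1 for ch in token if ch.isascii() and ch.isalpha())
    if arabic == 0 and latin == 0:
        return "other"  # digits, punctuation, symbols
    return "ar" if arabic >= latin else "en"

def tag_utterance(text: str):
    """Split on whitespace and pair each token with its script-based tag."""
    return [(token, tag_token(token)) for token in text.split()]

utterance = "حجزت لكم MEETING بكرة في الـ CONFERENCE ROOM"
for token, tag in tag_utterance(utterance):
    print(tag, token)
```

Pre-tagging like this turns the annotation task from labelling every token into reviewing exceptions.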

Component 4: Time-aligned transcription

For higher-quality fine-tuning, especially for streaming ASR use cases, time-aligned word + phoneme transcription helps. Forced-alignment tools (Montreal Forced Aligner, WhisperX)[^7] get most of the way; PhD-linguist QA on edge cases gets the rest.

Component 5: Disfluency + non-speech handling policy

Explicit conventions for:

Without explicit conventions, annotators diverge and the model can’t learn consistent behavior.

Component 6: PhD-linguist transcription QA

On a 5-10% sample of the corpus:

Crowd-sourced Arabic dialect transcription accumulates label noise that compounds during fine-tuning. Targeted PhD-linguist QA reduces that noise.
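
Drawing the QA sample per dialect family, rather than from the corpus as a whole, keeps small subsets such as Maghrebi from being skipped. A minimal sketch, assuming each clip is a dictionary with a `dialect` field (a hypothetical format) and using a fixed seed so the sample is reproducible:

```python
import random
from collections import defaultdict

def qa_sample(clips, percent=5, seed=0):
    """Pick ceil(percent% of n) clips from each dialect family for linguist review."""
    rng = random.Random(seed)
    by_dialect = defaultdict(list)
    for clip in clips:
        by_dialect[clip["dialect"]].append(clip)

    sample = []
    for dialect in sorted(by_dialect):  # sorted for a deterministic order
        group = by_dialect[dialect]
        # Integer ceiling division avoids floating-point rounding surprises.
        k = max(1, -(-len(group) * percent // 100))
        sample.extend(rng.sample(group, k))
    return sample
```

With `percent=5`, a family with 100 clips contributes 5 to the review queue and a family with 40 clips contributes 2.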

Training recipe (rough)

For a serious Arabic dialect Whisper fine-tune:

  1. Pre-process audio: 16 kHz mono, VAD-segmented into 5-30 second clips, volume-normalised
  2. Split data: 60% train / 20% validation / 20% test, stratified by dialect family
  3. Tokenisation: Whisper’s tokeniser handles Arabic but is suboptimal for dialect; consider extending the vocabulary
  4. Fine-tune: usually 1-3 epochs on dialect-stratified data, with a low learning rate (1e-5 is typical) and a longer warmup
  5. Eval: WER per dialect family, per sub-dialect where possible, and on the code-switching subset
  6. Iterate: identify the worst-performing dialect family, expand that subset, retrain
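
Step 2 of the recipe can be sketched in a few lines of standard-library Python. The clip format and the `dialect` field are assumptions for illustration. The sketch also ignores speaker identity; in practice it is usually worth keeping each speaker inside a single split so the test set measures generalisation to unseen voices.

```python
import random
from collections import defaultdict

def stratified_split(clips, ratios=(60, 20, 20), seed=0):
    """Split clips into train/validation/test, preserving dialect proportions."""
    rng = random.Random(seed)
    by_dialect = defaultdict(list)
    for clip in clips:
        by_dialect[clip["dialect"]].append(clip)

    splits = {"train": [], "validation": [], "test": []}
    for dialect in sorted(by_dialect):
        group = by_dialect[dialect][:]  # copy so the input is not mutated
        rng.shuffle(group)
        n_train = len(group) * ratios[0] // 100
        n_val = len(group) * ratios[1] // 100
        splits["train"] += group[:n_train]
        splits["validation"] += group[n_train:n_train + n_val]
        splits["test"] += group[n_train + n_val:]  # remainder goes to test
    return splits
```

Because the split is made inside each dialect family, a family that makes up 10% of the corpus also makes up roughly 10% of every split.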

Common pitfalls

Pitfall 1: Training only on MSA, expecting dialect improvement

Doesn’t work. Dialect requires dialect-stratified data.

Pitfall 2: Crowd-sourcing dialect transcription without QA

Produces label noise that the model can’t overcome.

Pitfall 3: Mixed orthographic conventions

The model learns inconsistency, not language structure. Pick one and enforce it.

Pitfall 4: No code-switching subset

Code-switching is the default in MENA tech speech. A model fine-tuned without it will fail in production.

Pitfall 5: Imbalanced dialect representation

If 80% of the fine-tuning data is Egyptian, the model overfits to Egyptian and underperforms on Gulf, Levantine, and Maghrebi.

Pitfall 6: Forgetting MSA after dialect fine-tuning

Without MSA in the fine-tuning data, the model can lose MSA capability. Always include an MSA subset.
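
A simple guard against this is to compare per-dialect WER before and after fine-tuning and flag any family that got worse. The sketch below computes word-level WER with a plain Levenshtein distance. The `(dialect, reference, hypothesis)` row format and the function names are assumptions for illustration, and transcripts are assumed to be already normalised to a single orthographic convention.

```python
from collections import defaultdict

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer_by_dialect(rows):
    """rows: iterable of (dialect, reference, hypothesis). Returns {dialect: WER}."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for dialect, ref, hyp in rows:
        ref_tokens = ref.split()
        errors[dialect] += edit_distance(ref_tokens, hyp.split())
        words[dialect] += len(ref_tokens)
    return {d: errors[d] / words[d] for d in words if words[d] > 0}

def regressions(before, after, tolerance=0.0):
    """Return {dialect: (before, after)} where WER rose by more than tolerance."""
    return {
        d: (before[d], after[d])
        for d in before
        if d in after and after[d] > before[d] + tolerance
    }
```

If `regressions` reports `msa` after a dialect-only fine-tune, that is the signal to add MSA hours back into the training mix.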

Where Annota8 helps

Annota8 builds dialect-stratified Whisper fine-tuning datasets across all six components above:

Common engagements:
