26 May 2026 Whisper Arabic fine-tuning

Fine-tuning Whisper on Arabic dialect — annotation lessons

Why out-of-the-box Whisper underperforms on Arabic dialect

1. Pretraining corpus is MSA-heavy

Whisper v3 was trained on roughly 1M hours of weakly-labelled audio plus 4M hours of pseudo-labelled audio[^2] (the original Whisper paper described 680K hours;[^1] the corpus was expanded for v3, while v2 reused the original data with longer training). The Arabic subset includes:

News broadcasts (Al Jazeera, Al Arabiya) — almost entirely MSA
Lectures + religious content — heavy MSA
YouTube + podcasts — mixed dialect but unbalanced
Translation pairs (English-Arabic) — mostly MSA written form

The result: Whisper handles MSA well + dialect poorly.

2. Dialect families weren’t balanced

Even within the dialect portion, the corpus doesn’t balance Gulf vs Levantine vs Egyptian vs Maghrebi. Egyptian is over-represented in YouTube; Maghrebi is under-represented; sub-dialects (Najdi vs Hijazi, Cairene vs Upper Egyptian) are not stratified at all.

3. Code-switching is treated as monolingual

Whisper’s language detection assigns one language per utterance. Real MENA business + tech speech mixes Arabic + English (or French in Maghrebi speech) at the token level. The model typically fails in one of three ways:

Mis-recognising Latin tokens as Arabic (gibberish)
Skipping Latin tokens entirely
Translating to one language unintentionally

4. Phonetic variation is uncalibrated

Gulf /q/ → /g/,[^4] Levantine /q/ → /ʔ/,[^5] Egyptian /ʤ/ → /g/[^6] — these phonetic differences aren’t explicitly handled. Without dialect-stratified fine-tuning data, the acoustic model can’t disambiguate.

What fine-tuning achieves

The big lessons:

MSA-only fine-tuning doesn’t improve dialect — dialect-stratified data is essential
A code-switching subset materially improves code-switching WER
Beyond a few hundred hours of dialect-stratified data, returns diminish but quality continues climbing slowly

What good Arabic Whisper fine-tuning data looks like

Component 1: Dialect-stratified audio + transcript pairs

Minimum stratification (roughly 50-200 hours total; the per-family ranges below sum to 60-210):

Dialect family	Hours	Notes
MSA	20-50	Broadcast + lecture + formal
Gulf	10-40	Najdi + Hijazi + Khaleeji blend
Levantine	10-40	Lebanese + Syrian + Jordanian
Egyptian	10-40	Cairene + Upper Egyptian + Alexandrian
Maghrebi	5-20	Moroccan + Algerian + Tunisian
Code-switching	5-20	Arabic-English (and Arabic-French for Maghrebi)
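
As a quick sanity check on a collection plan, the table can be encoded and compared against planned hours. This is a minimal sketch: the hour ranges are copied from the table above, while the function names and the example plan are hypothetical. Note that the per-family minimums and maximums add up to 60 and 210 hours.

```python
# Per-family hour ranges copied from the table above: (min_hours, max_hours).
HOUR_RANGES = {
    "MSA": (20, 50),
    "Gulf": (10, 40),
    "Levantine": (10, 40),
    "Egyptian": (10, 40),
    "Maghrebi": (5, 20),
    "Code-switching": (5, 20),
}


def total_range(ranges):
    """Return (sum of minimums, sum of maximums) across all families."""
    low = sum(lo for lo, _ in ranges.values())
    high = sum(hi for _, hi in ranges.values())
    return low, high


def under_minimum(plan, ranges=HOUR_RANGES):
    """Return {family: planned_hours} for families below the table's minimum.

    Families missing from the plan count as zero hours.
    """
    return {
        family: plan.get(family, 0)
        for family, (lo, _) in ranges.items()
        if plan.get(family, 0) < lo
    }


# Hypothetical collection plan, in hours.
example_plan = {
    "MSA": 30, "Gulf": 20, "Levantine": 20,
    "Egyptian": 20, "Maghrebi": 2, "Code-switching": 10,
}
print(total_range(HOUR_RANGES))
print(under_minimum(example_plan))
```

Running a check like this before sourcing audio makes an under-collected family visible early, when it is still cheap to fix.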

Component 2: Orthographic convention

Pick one + enforce it:

CODA (Conventional Orthography for Dialectal Arabic): a standardised academic convention[^3]
Native dialect orthography: let speakers write the way they would on social media

Mixed conventions confuse the model. Decide before annotation.

Component 3: Token-level language ID for code-switching

For mixed-language utterances, tag each token’s language:

"حجزت لكم MEETING بكرة في الـ CONFERENCE ROOM"
   ar     ar  en   ar    ar  ar  en          en

Whisper fine-tuned on language-tagged tokens handles code-switching far better than Whisper fine-tuned on untagged data.
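
A first-pass tagger for the scheme above can be sketched with Unicode script ranges. It is only a pre-annotation aid under stated assumptions: the function names are hypothetical, any Latin-script token is assumed to be English unless a different label is passed (script alone cannot tell English from French), and tokens that mix scripts are tagged as Arabic. Human verification is still needed.

```python
import re

# Arabic and Arabic Supplement Unicode blocks (letters, tatweel, diacritics).
ARABIC_CHAR = re.compile(r"[\u0600-\u06FF\u0750-\u077F]")
# Basic Latin letters plus common accented letters (useful for French tokens).
LATIN_CHAR = re.compile(r"[A-Za-zÀ-ÖØ-öø-ÿ]")


def tag_token(token: str, latin_label: str = "en") -> str:
    """Tag one whitespace-delimited token by script.

    Assumption: Latin script means `latin_label` ("en" by default). Pass "fr"
    for Arabic-French corpora; script cannot separate English from French.
    """
    if ARABIC_CHAR.search(token):
        return "ar"
    if LATIN_CHAR.search(token):
        return latin_label
    return "other"  # digits, punctuation, symbols


def tag_utterance(text: str, latin_label: str = "en"):
    """Return a list of (token, tag) pairs for a whitespace-tokenised utterance."""
    return [(tok, tag_token(tok, latin_label)) for tok in text.split()]


utterance = "حجزت لكم MEETING بكرة في الـ CONFERENCE ROOM"
for token, tag in tag_utterance(utterance):
    print(tag, token)
```

On the example utterance this reproduces the tag sequence shown above (ar ar en ar ar ar en en). Annotators then only need to correct the cases script cannot resolve, such as French versus English or Arabizi written in Latin letters.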

Component 4: Time-aligned transcription

For higher-quality fine-tuning, especially for streaming ASR use cases, time-aligned word + phoneme transcription helps. Forced-alignment tools (Montreal Forced Aligner, WhisperX)[^7] get most of the way; PhD-linguist QA on edge cases gets the rest.
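
One way to decide which aligned segments go to human QA is a cheap sanity check on the word timestamps. The sketch below assumes a hypothetical record format (word, start, end in seconds) and hypothetical duration thresholds; it does not call any aligner's API and would need adapting to whatever your aligner actually emits.

```python
from dataclasses import dataclass


@dataclass
class AlignedWord:
    word: str
    start: float  # seconds from clip start
    end: float    # seconds from clip start


def flag_alignment_issues(words, min_word_s=0.02, max_word_s=2.0):
    """Return (index, issue) pairs for words whose timing looks suspicious.

    The thresholds are illustrative assumptions, not tuned values.
    """
    issues = []
    prev_end = 0.0
    for i, w in enumerate(words):
        duration = w.end - w.start
        if duration < min_word_s:
            issues.append((i, "too_short"))
        elif duration > max_word_s:
            issues.append((i, "too_long"))
        if w.start < prev_end:
            issues.append((i, "overlaps_previous"))
        prev_end = max(prev_end, w.end)
    return issues


clip = [
    AlignedWord("حجزت", 0.10, 0.55),
    AlignedWord("لكم", 0.50, 0.80),      # starts before the previous word ends
    AlignedWord("MEETING", 0.85, 3.40),  # implausibly long
    AlignedWord("بكرة", 3.40, 3.40),     # zero duration
]
print(flag_alignment_issues(clip))
```

Clips with any flagged word can be routed to the linguist QA queue, while clean clips pass through unreviewed.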

Component 5: Disfluency + non-speech handling policy

Explicit conventions for:

Filled pauses (إإإ، اَمم، طيب)
False starts + repairs (“بدّي… يعني بدّي أقول”)
Backchannel (يلا، صح، طيب، عافاكي)
Laughter, applause, background music
Multi-speaker overlap

Without explicit conventions, annotators differ + the model can’t learn consistent behavior.
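
Conventions are easier to enforce when they are machine-checkable. As an illustration only, the sketch below assumes non-speech events are written as bracketed tags and that the project has agreed on a fixed tag inventory. The specific tags are hypothetical, not a standard.

```python
import re

# Hypothetical project tag inventory; replace with your own agreed convention.
ALLOWED_TAGS = {"[laughter]", "[applause]", "[music]", "[overlap]", "[filled_pause]"}

# Matches any bracketed span that contains no nested brackets.
TAG_PATTERN = re.compile(r"\[[^\[\]]*\]")


def unknown_tags(transcript: str):
    """Return bracketed tags that are not in the agreed inventory."""
    return [t for t in TAG_PATTERN.findall(transcript) if t not in ALLOWED_TAGS]


def has_unbalanced_brackets(transcript: str) -> bool:
    """Detect a tag that was opened but never closed, or the reverse."""
    return transcript.count("[") != transcript.count("]")


line = "طيب [laughter] بدّي أقول [LAUGH] يلا"
print(unknown_tags(line))
print(has_unbalanced_brackets(line))
```

A check like this catches annotator drift (for example `[LAUGH]` versus `[laughter]`) before inconsistent tags reach the training set.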

Component 6: PhD-linguist transcription QA

On a 5-10% sample of the corpus:

Native dialect verification
Code-switching token assignment verification
Time-alignment spot-check
Cultural / register appropriateness check

Crowd-sourced Arabic dialect transcription accumulates label noise that compounds during fine-tuning. Targeted PhD-linguist QA reduces that noise.
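
A QA sample drawn uniformly from the whole corpus can leave a small dialect family with no reviewed utterances at all. The sketch below draws the sample per family instead; it assumes each utterance carries a dialect-family label, and the function name and seed handling are hypothetical.

```python
import math
import random
from collections import defaultdict


def stratified_qa_sample(items, rate=0.05, seed=0):
    """Pick a QA sample per dialect family.

    items: iterable of (utterance_id, dialect_family) pairs.
    rate:  fraction of each family to review (0.05 to 0.10 in the text above).
    Every family gets at least one utterance, however small it is.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_family = defaultdict(list)
    for utt_id, family in items:
        by_family[family].append(utt_id)

    sample = {}
    for family, ids in by_family.items():
        k = max(1, math.ceil(rate * len(ids)))
        sample[family] = sorted(rng.sample(ids, k))
    return sample


corpus = (
    [(f"egy-{i}", "Egyptian") for i in range(95)]
    + [(f"glf-{i}", "Gulf") for i in range(42)]
    + [(f"mag-{i}", "Maghrebi") for i in range(8)]
)
qa = stratified_qa_sample(corpus, rate=0.1, seed=7)
print({family: len(ids) for family, ids in qa.items()})
```

Rounding up per family slightly over-samples the small families, which is usually the desired direction for under-represented dialects.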

Training recipe (rough)

For a serious Arabic dialect Whisper fine-tune:

Pre-process audio: 16 kHz mono, VAD-segmented to 5-30 second clips, normalise volume
Stratify data: 60% train / 20% validation / 20% test, stratified by dialect family
Tokenisation: Whisper’s tokeniser handles Arabic but is suboptimal for dialect; consider extending the vocabulary
Fine-tune: usually 1-3 epochs on dialect-stratified data, lower LR (1e-5 typical), longer warmup
Eval: WER per dialect family, per sub-dialect where possible, plus a code-switching subset
Iterate: identify the worst-performing dialect family, expand that subset, retrain
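
The evaluation and iteration steps both depend on WER broken down by dialect family. A dependency-free sketch is shown below. It assumes whitespace tokenisation and that references and hypotheses have already been normalised to the same orthographic convention; the function names are hypothetical, and a maintained library would be a reasonable substitute in practice.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer_by_family(samples):
    """samples: iterable of (dialect_family, reference, hypothesis) strings.

    Returns {family: total word errors / total reference words}.
    """
    errors = defaultdict(int)
    ref_words = defaultdict(int)
    for family, ref, hyp in samples:
        r, h = ref.split(), hyp.split()
        errors[family] += edit_distance(r, h)
        ref_words[family] += len(r)
    return {f: errors[f] / n for f, n in ref_words.items() if n > 0}


def worst_family(samples):
    """Return the family with the highest WER, the candidate for more data."""
    scores = wer_by_family(samples)
    return max(scores, key=scores.get)
```

Pooling errors and reference words per family, rather than averaging per-utterance WER, keeps short utterances from dominating the score.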

Common pitfalls

Pitfall 1: Training only on MSA, expecting dialect improvement

Doesn’t work. Dialect requires dialect-stratified data.

Pitfall 2: Crowd-sourcing dialect transcription without QA

Produces label noise that the model can’t overcome.

Pitfall 3: Mixed orthographic conventions

The model learns inconsistency, not language structure. Pick one + enforce.

Pitfall 4: No code-switching subset

Code-switching is the default in MENA tech speech. A model fine-tuned without it will fail in production.

Pitfall 5: Imbalanced dialect representation

If 80% of fine-tuning is Egyptian, the model overfits to Egyptian + underperforms on Gulf / Levantine / Maghrebi.
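
Imbalance of this kind can be caught before training by computing each family's share of total hours. In the sketch below the 50% ceiling is an arbitrary illustrative threshold, not a recommendation from this post, and the function names are hypothetical.

```python
def dialect_shares(hours_by_family):
    """Return each family's fraction of the total fine-tuning hours."""
    total = sum(hours_by_family.values())
    if total <= 0:
        raise ValueError("corpus has no audio hours")
    return {family: hours / total for family, hours in hours_by_family.items()}


def dominant_families(hours_by_family, max_share=0.5):
    """List families whose share exceeds max_share (an assumed threshold)."""
    shares = dialect_shares(hours_by_family)
    return sorted(f for f, share in shares.items() if share > max_share)


skewed = {"Egyptian": 80, "Gulf": 10, "Levantine": 6, "Maghrebi": 4}
print(dialect_shares(skewed)["Egyptian"])
print(dominant_families(skewed))
```

If a family is flagged, the usual options are to collect more audio for the other families or to down-sample the dominant one during training.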

Pitfall 6: Forgetting MSA after dialect fine-tuning

Without MSA in the fine-tuning data, the model can lose MSA capability. Always include an MSA subset.

Where Annota8 helps

Annota8 builds dialect-stratified Whisper fine-tuning datasets across all six components above:

Dialect-stratified audio sourcing (KSA + Egyptian + Levantine + Maghrebi)
Native-dialect transcription with CODA or native orthography
Token-level language ID for code-switching
Time-aligned transcription with phoneme-level alignment for hard cases
Disfluency + non-speech convention enforcement
Cairo PhD-linguist QA on a 5-10% sample

Common engagements:

50-200 hours dialect-stratified Whisper fine-tune corpus
5-20 hours code-switching subset construction
Dialect-stratified eval set (1-5 hours per family for benchmarking)
Ongoing edge-case mining + active learning


Limitations & disclaimer

Limitations of this analysis. This post reflects Annota8's reading of publicly available evidence as of its last-modified date. Vendor positioning, regulatory frameworks, benchmark numbers, and program scope can change without notice. Where numeric ranges are cited, those numbers are reproducible from the source linked in the post's References section — Annota8 has not independently re-run the benchmarks unless explicitly stated in the post.

Privacy & legal posture. Annota8 is an early-stage AI data operations company in soft launch. We do not currently hold SOC 2, ISO 27001, PDPL certification, or any other third-party security or privacy certification. We design with PDPL principles in mind and can sign a DPA modelled on the EU SCC template. Specific compliance posture for your engagement is available on request from [email protected].

Nothing in this post is legal, tax, or investment advice. Regulatory citations should be verified with counsel in your jurisdiction. Vendor names mentioned in this post are referenced as industry-landscape context only — Annota8 is not asserting a comparative product claim, a customer relationship, or any other affiliation with any platform named, unless that affiliation is explicitly stated.

Reach the team: [email protected] · annota8.ai