
How MENA foundation-model labs source training data

The five MENA foundation-model programs

| Model | Sponsor | Country | Specialty |
|---|---|---|---|
| ALLaM | SDAIA | KSA | National LLM, MSA + Saudi dialect, government use cases[^1] |
| Jais | G42 / Inception / Cerebras / MBZUAI | UAE | First production Arabic LLM (2023). Jais 2 (70B), released December 2025, is the current UAE flagship open-weight Arabic LLM[^2] |
| Fanar | QCRI | Qatar | Arabic-Islamic specialised LLM, quality-over-quantity thesis. Fanar 2.0 (late 2025)[^3] |
| Falcon | TII | UAE | Falcon-Arabic / Falcon-H1 Arabic: open-weight LLM family with native Arabic coverage[^4] |
| Karnak | Egypt AIC | Egypt | Officially launched February 2026 at AI Everything MEA Cairo; live on Hugging Face at Applied-Innovation-Center/Karnak[^5] |

Each program differs on dialect coverage, base architecture, training corpus, and intended deployment.

Training data sourcing strategies

Web-scale pretraining corpus

All five labs begin with a web-scale Arabic pretraining corpus. Typical sources:

Coverage in these corpora skews heavily toward Modern Standard Arabic (MSA); dialect coverage is thin.
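One way to make that skew measurable is to tally dialect labels over a sample of documents. The following is a minimal sketch, assuming each document already carries a dialect label from some upstream classifier or annotator. The function name, the label names, and the sample itself are hypothetical and are not real corpus statistics.

```python
from collections import Counter


def dialect_shares(labels):
    """Return the fraction of documents carrying each dialect label."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


# Hypothetical labels for a 10-document sample (illustrative only).
sample_labels = ["MSA"] * 8 + ["Egyptian", "Gulf"]
shares = dialect_shares(sample_labels)

for label, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{label}: {share:.0%}")
```

Tracking these shares per source makes it easier to see which sources, if any, contribute dialect text at all.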

Curated SFT (supervised fine-tuning) corpus

After pretraining, models are fine-tuned on instruction-following pairs. Sourcing strategies:

The translation-only approach produces models that sound like translated English. Native Arabic SFT is the differentiator. This is where curated human annotation matters most.
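To keep native and translated instructions distinguishable, an SFT corpus can carry a provenance tag on every pair. This is a minimal sketch under stated assumptions: the record layout, the `origin` and `dialect` fields, and the placeholder records are all hypothetical and do not describe any lab's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SFTExample:
    instruction: str
    response: str
    dialect: str  # hypothetical label set, e.g. "MSA", "Egyptian"
    origin: str   # hypothetical provenance tag: "native" or "translated"


def native_only(examples):
    """Keep only pairs that were written natively in Arabic."""
    return [ex for ex in examples if ex.origin == "native"]


# Placeholder records; real entries would hold Arabic text.
corpus = [
    SFTExample("<instruction written in Arabic>", "<response>", "MSA", "native"),
    SFTExample("<instruction translated from English>", "<response>", "MSA", "translated"),
    SFTExample("<instruction written in Egyptian Arabic>", "<response>", "Egyptian", "native"),
]

native = native_only(corpus)
print(f"{len(native)} of {len(corpus)} pairs are natively written")
```

With the tag in place, the native share of the corpus can be reported and audited rather than assumed.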

RLHF preference corpus

For instruction following and safety, models are aligned via RLHF (reinforcement learning from human feedback). Sourcing:

Eval set construction

Separately from training, each lab constructs evaluation sets:

What each program needs

ALLaM (SDAIA)

Jais (G42 / Inception)

Fanar (QCRI)

Falcon (TII)

Karnak (Egypt AIC)

The quality-over-quantity shift

Fanar 2.0 (QCRI, late 2025) was a watershed.[^3] Despite using 8× fewer pre-training tokens than Fanar 1.0, it delivered substantial benchmark improvements (Arabic knowledge, language, dialects, and English).[^6] The signal: curated high-quality Arabic data beats undifferentiated web scrape.

This shift changes the annotation demand profile:

This is the work Annota8 was built for.

Where curated workforce changes the model

Three places where curated workforce + PhD-linguist QA materially moves the eval needle:

  1. SFT native Arabic instruction writing — translated instructions sound like translations. Natively-written Arabic instructions produce models that sound native. Cairo PhD-linguists with technical domain coverage write 5-10x better instructions than crowdsourced translators.

  2. RLHF preference ranking with cultural calibration — preference rankings that don’t account for Arabic cultural context produce misaligned models. Cairo + Riyadh annotators with explicit cultural calibration produce alignment data that holds up.

  3. Dialect-stratified eval set construction — eval sets that lump Arabic dialects together hide model weaknesses. Dialect-family-stratified eval sets surface the gaps.
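The third point can be illustrated with a small sketch that compares a pooled accuracy score against per-dialect scores. It assumes each eval item has already been graded and tagged with a dialect family. The function names, dialect labels, and result counts are hypothetical and chosen only to show how a pooled number can mask a weak dialect.

```python
from collections import defaultdict


def pooled_accuracy(results):
    """Accuracy over all items, ignoring dialect."""
    results = list(results)
    if not results:
        raise ValueError("no results to score")
    return sum(int(ok) for _, ok in results) / len(results)


def accuracy_by_dialect(results):
    """Accuracy per dialect family from (dialect, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for dialect, ok in results:
        totals[dialect] += 1
        correct[dialect] += int(ok)
    return {d: correct[d] / totals[d] for d in totals}


# Hypothetical graded results: 20 MSA items and 5 Maghrebi items.
results = (
    [("MSA", True)] * 18 + [("MSA", False)] * 2
    + [("Maghrebi", True)] * 1 + [("Maghrebi", False)] * 4
)

pooled = pooled_accuracy(results)
per_dialect = accuracy_by_dialect(results)
print(f"pooled: {pooled:.2f}")
for dialect, acc in per_dialect.items():
    print(f"{dialect}: {acc:.2f}")
```

In this made-up sample the pooled score looks acceptable while the Maghrebi subset scores far lower, which is the kind of gap a lumped eval set hides.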

If your foundation-model program is not getting these three from your annotation vendor, you are leaving model quality on the table.

What Annota8 is designed to offer MENA foundation-model labs

Annota8 is being designed for MENA national foundation-model engagements. Capability targets — each scoped per engagement, not pre-certified blanket claims:

Annota8 is in early-stage operations and does not hold formal compliance certifications today. Engagement starts with a controls-mapping conversation with the lab’s compliance team.
