How MENA foundation-model labs source training data
The five MENA foundation-model programs
| Model | Sponsor | Country | Specialty |
|---|---|---|---|
| ALLaM | SDAIA | KSA | National LLM, MSA + Saudi dialect, government use cases[^1] |
| Jais | G42 / Inception / Cerebras / MBZUAI | UAE | First production Arabic LLM (2023). Jais 2 (70B) released December 2025 — current UAE flagship open-weight Arabic LLM[^2] |
| Fanar | QCRI | Qatar | Arabic-Islamic specialised LLM, quality-over-quantity thesis. Fanar 2.0 (late 2025)[^3] |
| Falcon | TII | UAE | Falcon-Arabic / Falcon-H1 Arabic — open-weight LLM family with native Arabic coverage[^4] |
| Karnak | Egypt AIC | Egypt | Officially launched February 2026 at AI Everything MEA Cairo; live on Hugging Face at Applied-Innovation-Center/Karnak[^5] |
The programs differ in dialect coverage, base architecture, training corpus, and intended deployment.
Training data sourcing strategies
Web-scale pretraining corpus
All five labs begin with a web-scale Arabic pretraining corpus. Typical sources:
- Common Crawl — filtered for Arabic language detection + quality
- Arabic Wikipedia — high-quality MSA, ~1.2M articles[^7]
- News + media archives — Al Jazeera, Al Arabiya, Asharq Al-Awsat, others
- Government open data — laws, regulations, official communications
- Books + literature — public domain + licensed corpora
- Religious texts — Quranic + Hadith corpora for cultural grounding
- Scientific + technical — Arabic academic publications, technical translations
Coverage in these sources skews heavily toward MSA; dialect coverage is thin.
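As an illustration of the filtering and deduplication steps applied to web-scale sources, here is a minimal sketch in Python. It is not any lab's actual pipeline. The script-ratio heuristic stands in for a real language-identification model, and the `MIN_ARABIC_RATIO` threshold and all function names are assumptions made for this example.

```python
import hashlib
import re
import unicodedata

# Hypothetical threshold; a real pipeline would tune it on labelled samples.
MIN_ARABIC_RATIO = 0.6

# Arabic diacritics (tashkeel), superscript alef, and tatweel, stripped before hashing.
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0670\u0640]")


def arabic_ratio(text: str) -> float:
    """Share of alphabetic characters that fall in the main Arabic Unicode block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return arabic / len(letters)


def normalize(text: str) -> str:
    """Canonical form used only for duplicate detection, not for training."""
    text = unicodedata.normalize("NFKC", text)
    text = _DIACRITICS.sub("", text)
    return " ".join(text.split())  # collapse runs of whitespace


def filter_and_dedup(docs):
    """Keep mostly-Arabic documents, dropping exact duplicates after normalization."""
    seen = set()
    kept = []
    for doc in docs:
        if arabic_ratio(doc) < MIN_ARABIC_RATIO:
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept
```

This catches only exact duplicates after normalization. Near-duplicate detection and classifier-based quality scoring would be separate stages. A script-ratio check also cannot tell MSA from dialect, which is one reason dialect coverage needs explicit targets.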
Curated SFT (supervised fine-tuning) corpus
After pretraining, models are fine-tuned on instruction-following pairs. Sourcing strategies:
- Translation from English — translate Alpaca-class or ShareGPT-class datasets into Arabic
- Native Arabic instruction generation — write Arabic instructions + responses from scratch
- Hybrid — translate for breadth, native for cultural/dialect fidelity
The translation-only approach produces models that sound like translated English. Native Arabic SFT is the differentiator. This is where curated human annotation matters most.
RLHF preference corpus
For instruction following and safety, models are aligned via RLHF (reinforcement learning from human feedback). The pipeline runs in four steps:
- Generate candidate responses
- Human annotators rank candidates
- Train reward model on rankings
- Use the reward model to fine-tune the policy model (typically the SFT checkpoint rather than the raw base model)
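The ranking and reward-model steps above can be sketched as follows. This assumes a Bradley-Terry pairwise objective, a common choice for reward models; the source does not say which objective any of these labs uses. The function names and the plain-float scores are hypothetical stand-ins for a real model's outputs.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_ids):
    """Expand one annotator ranking (best first) into (preferred, rejected) pairs."""
    return list(combinations(ranked_ids, 2))


def pairwise_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood: -log(sigmoid(s_preferred - s_rejected))."""
    margin = score_preferred - score_rejected
    return math.log1p(math.exp(-margin))


def mean_loss(pairs, scores):
    """Average pairwise loss, given a mapping from response id to reward score."""
    return sum(pairwise_loss(scores[w], scores[l]) for w, l in pairs) / len(pairs)


# Usage: one ranking of three candidate responses, best first.
pairs = ranking_to_pairs(["resp_a", "resp_b", "resp_c"])
agrees = {"resp_a": 2.0, "resp_b": 1.0, "resp_c": 0.0}
disagrees = {"resp_a": 0.0, "resp_b": 1.0, "resp_c": 2.0}
print(mean_loss(pairs, agrees) < mean_loss(pairs, disagrees))
```

The loss falls as the reward model scores the annotator-preferred response higher. That makes the quality of the human rankings the effective training signal.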
Eval set construction
Separately from training, each lab constructs evaluation sets:
- Translated benchmarks (MMLU, HellaSwag, ARC translated to Arabic)
- Native Arabic benchmarks (ArabicMMLU, AraBench, AlGhafa)[^8]
- Dialect-stratified eval sets
- Cultural alignment eval sets (Islamic cultural understanding, gender, politics)
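To show why dialect stratification matters for reporting, here is a minimal sketch of per-dialect scoring. The record fields (`dialect`, `gold`, `prediction`) and the dialect labels are assumptions for the example, not the schema of any named benchmark.

```python
from collections import defaultdict


def accuracy_by_dialect(records):
    """Return accuracy per dialect label instead of one pooled number."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        totals[r["dialect"]] += 1
        correct[r["dialect"]] += int(r["prediction"] == r["gold"])
    return {d: correct[d] / totals[d] for d in totals}


def pooled_accuracy(records):
    """Single accuracy figure over all records, ignoring dialect."""
    return sum(r["prediction"] == r["gold"] for r in records) / len(records)


# Hypothetical eval records: the model is right on both MSA items
# but on only one of the two Egyptian items.
records = [
    {"dialect": "MSA", "gold": "A", "prediction": "A"},
    {"dialect": "MSA", "gold": "B", "prediction": "B"},
    {"dialect": "Egyptian", "gold": "A", "prediction": "A"},
    {"dialect": "Egyptian", "gold": "C", "prediction": "B"},
]
print(pooled_accuracy(records))      # 0.75
print(accuracy_by_dialect(records))  # {'MSA': 1.0, 'Egyptian': 0.5}
```

The pooled figure of 0.75 hides the fact that the model is right on only half of the Egyptian items. The per-dialect breakdown shows it directly.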
What each program needs
ALLaM (SDAIA)
- Saudi dialect coverage strengthening
- MSA + Saudi government domain
- Saudi cultural alignment eval sets
- PDPL-aware in-Kingdom annotation pipeline
- See ALLaM page
Jais (G42 / Inception)
- UAE Gulf dialect strengthening
- Multilingual Arabic-English code-switching
- Enterprise alignment for UAE corporate use
- See Jais page
Fanar (QCRI)
- Curated quality > quantity thesis
- Islamic cultural domain depth
- Educational + religious alignment
- See Fanar page
Falcon (TII)
- Open-weight model family with Arabic coverage
- Multilingual breadth
- Open-source community engagement
- See Falcon page
Karnak (Egypt AIC)
- Egyptian cultural alignment
- Arabic-Egyptian bilingual code-switching
- See Karnak page
The quality-over-quantity shift
Fanar 2.0 (QCRI, late 2025) was a watershed.[^3] Despite using 8× fewer pre-training tokens than Fanar 1.0, it delivered substantial benchmark improvements (Arabic knowledge, language, dialects, and English).[^6] The signal: curated high-quality Arabic data beats undifferentiated web scrape.
This shift changes the annotation demand profile:
- Pretraining curation matters more — dedup, quality filtering, classifier-based selection
- SFT quality matters more — native Arabic instruction writing by PhD-linguists, not crowdsourced translation
- Dialect stratification matters more — explicit coverage targets per family + sub-family
- Cultural alignment matters more — explicit eval sets, explicit guideline-level cultural sensitivity
This is the work Annota8 was built for.
Where curated workforce changes the model
A curated workforce with PhD-linguist QA materially moves eval results in three places:
- SFT native Arabic instruction writing: translated instructions sound like translations. Natively-written Arabic instructions produce models that sound native. Cairo PhD-linguists with technical domain coverage write 5-10x better instructions than crowdsourced translators.
- RLHF preference ranking with cultural calibration: preference rankings that don’t account for Arabic cultural context produce misaligned models. Cairo + Riyadh annotators with explicit cultural calibration produce alignment data that holds up.
- Dialect-stratified eval set construction: eval sets that lump Arabic dialects together hide model weaknesses. Dialect-family-stratified eval sets surface the gaps.
If your foundation-model program is not getting these three from your annotation vendor, you are leaving model quality on the table.
What Annota8 is designed to offer MENA foundation-model labs
Annota8 is being designed for MENA national foundation-model engagements. Capability targets — each scoped per engagement, not pre-certified blanket claims:
- Pretraining-corpus curation (dedup, quality filtering, classifier-based selection)
- SFT native Arabic instruction writing (Cairo PhD-linguist tier)
- RLHF preference ranking with cultural calibration
- Dialect-stratified eval set construction
- Cultural alignment eval sets
- Sovereign deployment patterns (KSA cloud or on-premise) as design targets
- PDPL-aware design with MENA-resident operator team
Annota8 is in early-stage operations and does not hold formal compliance certifications today. Engagement starts with a controls-mapping conversation with the lab’s compliance team.