26 May 2026 MENA foundation model training data

How MENA foundation-model labs source training data

The five MENA foundation-model programs

Model	Sponsor	Country	Specialty
ALLaM	SDAIA	KSA	National LLM, MSA + Saudi dialect, government use cases[^1]
Jais	G42 / Inception / Cerebras / MBZUAI	UAE	First production Arabic LLM (2023). Jais 2 (70B) released December 2025 — current UAE flagship open-weight Arabic LLM[^2]
Fanar	QCRI	Qatar	Arabic-Islamic specialised LLM, quality-over-quantity thesis. Fanar 2.0 (late 2025)[^3]
Falcon	TII	UAE	Falcon-Arabic / Falcon-H1 Arabic — open-weight LLM family with native Arabic coverage[^4]
Karnak	Egypt AIC	Egypt	Officially launched February 2026 at AI Everything MEA Cairo; live on Hugging Face at `Applied-Innovation-Center/Karnak`[^5]

Each program differs on dialect coverage, base architecture, training corpus, and intended deployment.

Training data sourcing strategies

Web-scale pretraining corpus

All five labs begin with a web-scale Arabic pretraining corpus. Typical sources:

Common Crawl — filtered for Arabic language detection + quality
Arabic Wikipedia — high-quality MSA, ~1.2M articles[^7]
News + media archives — Al Jazeera, Al Arabiya, Asharq Al-Awsat, others
Government open data — laws, regulations, official communications
Books + literature — public domain + licensed corpora
Religious texts — Quranic + Hadith corpora for cultural grounding
Scientific + technical — Arabic academic publications, technical translations

Coverage across these sources skews heavily toward MSA, and dialect coverage is thin.
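As a rough illustration of the "filtered for Arabic language detection + quality" step above, here is a minimal sketch of a script-ratio filter. It is not any lab's actual pipeline. The function names and the `min_ratio` and `min_words` thresholds are assumptions chosen for the example, and production pipelines generally use a trained language-identification model rather than a character heuristic.

```python
import re

# Main Arabic Unicode block plus the Arabic Supplement block.
ARABIC_CHARS = re.compile(r"[\u0600-\u06FF\u0750-\u077F]")

def arabic_ratio(text: str) -> float:
    """Fraction of alphabetic characters that fall in the Arabic script ranges."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if ARABIC_CHARS.match(ch))
    return arabic / len(letters)

def keep_document(text: str, min_ratio: float = 0.7, min_words: int = 20) -> bool:
    """Keep a page only if it is mostly Arabic script and not trivially short.

    Both thresholds are illustrative assumptions, not published values.
    """
    return len(text.split()) >= min_words and arabic_ratio(text) >= min_ratio
```

A script-ratio check cannot separate Arabic from other languages written in Arabic script, such as Persian or Urdu, and it says nothing about MSA versus dialect, so it is at best a cheap first pass before a real language identifier and quality classifier.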

Curated SFT (supervised fine-tuning) corpus

After pretraining, models are fine-tuned on instruction-following pairs. Sourcing strategies:

Translation from English — translate Alpaca-class or ShareGPT-class datasets into Arabic
Native Arabic instruction generation — write Arabic instructions + responses from scratch
Hybrid — translate for breadth, native for cultural/dialect fidelity

The translation-only approach produces models that sound like translated English. Native Arabic SFT is the differentiator. This is where curated human annotation matters most.

RLHF preference corpus

For instruction following and safety, models are aligned via RLHF (Reinforcement Learning from Human Feedback). The preference data is sourced and used in four steps:

Generate candidate responses
Human annotators rank candidates
Train reward model on rankings
Use reward model to fine-tune base model
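The ranking and reward-model steps above can be sketched as follows. This is a simplified illustration under stated assumptions: a single annotator ranking is expanded into preferred/rejected pairs, and the reward model is scored with a Bradley-Terry style pairwise loss, which is a common choice but is not confirmed by the source as what any of these labs use. The function names are hypothetical.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked_ids):
    """Expand one annotator ranking (best first) into (preferred, rejected) pairs."""
    # combinations() preserves input order, so the first element of each
    # pair is always the response the annotator ranked higher.
    return list(combinations(ranked_ids, 2))

def pairwise_loss(reward_preferred: float, reward_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(r_preferred - r_rejected))."""
    margin = reward_preferred - reward_rejected
    # Two algebraically equal forms, chosen to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Example: three candidate responses ranked best to worst by one annotator.
pairs = ranking_to_pairs(["resp_a", "resp_b", "resp_c"])
```

The loss is small when the reward model scores the preferred response well above the rejected one and grows when it gets the order wrong, which is what pushes the reward model toward the annotators' rankings.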

Eval set construction

Separately from training, each lab constructs evaluation sets:

Translated benchmarks (MMLU, HellaSwag, ARC translated to Arabic)
Native Arabic benchmarks (ArabicMMLU, AraBench, AlGhafa)[^8]
Dialect-stratified eval sets
Cultural alignment eval sets (Islamic cultural understanding, gender, politics)
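One way to build a dialect-stratified eval set is to sample a fixed quota from each dialect bucket instead of sampling from the pooled data. The sketch below assumes each item is a dict with a `dialect` key. That schema, the function name, and the quota parameter are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_sample(items, per_dialect, seed=0):
    """Draw up to `per_dialect` items from each dialect bucket."""
    buckets = defaultdict(list)
    for item in items:
        buckets[item["dialect"]].append(item)
    rng = random.Random(seed)  # fixed seed so the eval set is reproducible
    sample = []
    for dialect in sorted(buckets):
        pool = buckets[dialect]
        k = min(per_dialect, len(pool))  # small buckets contribute what they have
        sample.extend(rng.sample(pool, k))
    return sample
```

Sampling per bucket keeps a large MSA pool from crowding out low-resource dialects, and a bucket that cannot fill its quota is itself a useful signal about where more data collection is needed.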

What each program needs

ALLaM (SDAIA)

Saudi dialect coverage strengthening
MSA + Saudi government domain
Saudi cultural alignment eval sets
PDPL-aware in-Kingdom annotation pipeline
See ALLaM page

Jais (G42 / Inception)

UAE Gulf dialect strengthening
Multilingual Arabic-English code-switching
Enterprise alignment for UAE corporate use
See Jais page

Fanar (QCRI)

Curated data in line with its quality-over-quantity thesis
Islamic cultural domain depth
Educational + religious alignment
See Fanar page

Falcon (TII)

Open-weight model family with Arabic coverage
Multilingual breadth
Open-source community engagement
See Falcon page

Karnak (Egypt AIC)

Egyptian cultural alignment
Arabic-Egyptian bilingual code-switching
See Karnak page

The quality-over-quantity shift

Fanar 2.0 (QCRI, late 2025) was a watershed.[^3] Despite using 8× fewer pretraining tokens than Fanar 1.0, it delivered substantial benchmark improvements (Arabic knowledge, language, dialects, and English).[^6] The signal: curated, high-quality Arabic data beats undifferentiated web scrapes.

This shift changes the annotation demand profile:

Pretraining curation matters more — dedup, quality filtering, classifier-based selection
SFT quality matters more — native Arabic instruction writing by PhD-linguists, not crowdsourced translation
Dialect stratification matters more — explicit coverage targets per family + sub-family
Cultural alignment matters more — explicit eval sets, explicit guideline-level cultural sensitivity

This is the work Annota8 was built for.
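To make the dedup item in the curation list above concrete, here is a minimal sketch of exact-match deduplication after light Arabic normalization (removing diacritics and tatweel, collapsing whitespace). This is an assumption about one reasonable first pass, not a description of how Fanar or any other program deduplicates. Near-duplicate detection such as MinHash is a separate, heavier step that is not shown.

```python
import hashlib
import re

# Arabic short-vowel and related marks (U+064B to U+0652) and tatweel (U+0640).
DIACRITICS_AND_TATWEEL = re.compile(r"[\u064B-\u0652\u0640]")
WHITESPACE = re.compile(r"\s+")

def normalize(text: str) -> str:
    """Strip diacritics and tatweel, then collapse runs of whitespace."""
    text = DIACRITICS_AND_TATWEEL.sub("", text)
    return WHITESPACE.sub(" ", text).strip()

def dedup(docs):
    """Keep the first occurrence of each document, compared after normalization."""
    seen = set()
    kept = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)  # the original, un-normalized text is kept
    return kept
```

Normalizing before hashing means a vocalized and an unvocalized copy of the same passage count as one document, which matters for sources that are often republished with and without diacritics.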

Where curated workforce changes the model

Three places where a curated workforce plus PhD-linguist QA materially moves the eval needle:

SFT native Arabic instruction writing — translated instructions sound like translations. Natively-written Arabic instructions produce models that sound native. Cairo PhD-linguists with technical domain coverage write 5-10x better instructions than crowdsourced translators.
RLHF preference ranking with cultural calibration — preference rankings that don’t account for Arabic cultural context produce misaligned models. Cairo + Riyadh annotators with explicit cultural calibration produce alignment data that holds up.
Dialect-stratified eval set construction — eval sets that lump Arabic dialects together hide model weaknesses. Dialect-family-stratified eval sets surface the gaps.

If your foundation-model program is not getting these three from your annotation vendor, you are leaving model quality on the table.
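The third point, that pooled eval sets hide dialect weaknesses, is easy to show with a per-dialect breakdown. The sketch below assumes eval results arrive as `(dialect, is_correct)` pairs, which is a hypothetical format, and the numbers in the example are made up purely for illustration.

```python
from collections import defaultdict

def accuracy_by_dialect(results):
    """Return (pooled_accuracy, {dialect: accuracy}) from (dialect, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for dialect, is_correct in results:
        totals[dialect] += 1
        correct[dialect] += int(is_correct)
    if not totals:
        return 0.0, {}
    per_dialect = {d: correct[d] / totals[d] for d in totals}
    pooled = sum(correct.values()) / sum(totals.values())
    return pooled, per_dialect

# Made-up illustrative results: 100 MSA items and 10 Maghrebi items.
toy_results = (
    [("MSA", True)] * 90 + [("MSA", False)] * 10
    + [("Maghrebi", True)] * 4 + [("Maghrebi", False)] * 6
)
pooled, per_dialect = accuracy_by_dialect(toy_results)
```

With these toy numbers the pooled accuracy is about 0.85, while the Maghrebi accuracy is 0.40. A single headline score would hide that gap entirely because the MSA items dominate the pool.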

What Annota8 is designed to offer MENA foundation-model labs

Annota8 is being designed for MENA national foundation-model engagements. Capability targets — each scoped per engagement, not pre-certified blanket claims:

Pretraining-corpus curation (dedup, quality filtering, classifier-based selection)
SFT native Arabic instruction writing (Cairo PhD-linguist tier)
RLHF preference ranking with cultural calibration
Dialect-stratified eval set construction
Cultural alignment eval sets
Sovereign deployment patterns (KSA cloud or on-premise) as design targets
PDPL-aware design with MENA-resident operator team

Annota8 is in early-stage operations and does not hold formal compliance certifications today. Engagement starts with a controls-mapping conversation with the lab’s compliance team.


Limitations & disclaimer

Limitations of this analysis. This post reflects Annota8's reading of publicly available evidence as of its last-modified date. Vendor positioning, regulatory frameworks, benchmark numbers, and program scope can change without notice. Where numeric ranges are cited, those numbers are reproducible from the source linked in the post's References section — Annota8 has not independently re-run the benchmarks unless explicitly stated in the post.

Privacy & legal posture. Annota8 is an early-stage AI data operations company in soft launch. We do not currently hold SOC 2, ISO 27001, PDPL certification, or any other third-party security or privacy certification. We design with PDPL principles in mind and can sign a DPA modelled on the EU SCC template. Specific compliance posture for your engagement is available on request from [email protected].

Nothing in this post is legal, tax, or investment advice. Regulatory citations should be verified with counsel in your jurisdiction. Vendor names mentioned in this post are referenced as industry-landscape context only — Annota8 is not asserting a comparative product claim, a customer relationship, or any other affiliation with any platform named, unless that affiliation is explicitly stated.

Reach the team: [email protected] · annota8.ai