ALLaM v2 + Karnak + Fanar: a practitioner comparison of MENA training labs in 2026

Why these three, why now

After two years of an Arabic foundation-model race, the market has three tiers: government-backed national models (ALLaM, Fanar, Karnak), commercial-regional open-weight models (Jais, Falcon), and global frontier models with an Arabic layer (GPT-4o, Claude, Gemini, Llama 4). This article dissects the first tier because it is the one being purchased inside sovereign deals + public-sector integration mandates.

From a practitioner angle I will pin down the architecture, the data sourcing posture, the alignment strategy, the dialect coverage, the deployment options, and the actual gap between launch-day benchmarks and what shows up in production. Then I will show where curated labeling work fits in.

Quick-spec comparison table

| Dimension | ALLaM (SDAIA / HUMAIN) | Karnak (AIC Egypt) | Fanar 2.0 (QCRI) |
| --- | --- | --- | --- |
| Sponsor | Saudi Data & AI Authority; operationalized via HUMAIN[1] | Applied Innovation Center, Egypt (MCIT / ITIDA)[2] | Qatar Computing Research Institute, HBKU[4] |
| Public release | ALLaM technical paper July 2024[5]; HUMAIN Chat with ALLaM 34B live 2025[1] | Launched February 11, 2026 at Ai Everything MEA Cairo; live on Hugging Face at Applied-Innovation-Center/Karnak[2][6] | Fanar 1.0 (2024); Fanar 2.0 announced December 9, 2025[7] |
| Base architecture | Autoregressive decoder-only with Arabic vocabulary expansion, bilingual pretraining[5] | Qwen3-30B-A3B-Instruct-2507, depth-extended to ~40B; Arabic-optimized tokenizer[6] | Continual pretraining of Gemma-3-27B[4] |
| Primary language coverage | Arabic (MSA + dialect coverage including Saudi dialects via speech input)[1] + English | Arabic and English; model card lists both, does not explicitly call out dialect specialization[6]. Egyptian colloquial is served by a separate AIC product, BelMasry[3] | MSA + dialects (Gulf, Levantine, Egyptian) per model card[4] |
| License | Proprietary / national gateway; ALLaM-2-7B-instruct available on Azure AI[8] | Apache 2.0 (Hugging Face model card)[6] | Apache 2.0 (Hugging Face model card)[4] |
| Deployment | HUMAIN Chat consumer app; ALLaM-2-7B-instruct on Microsoft Azure AI Foundry[1][8] | Hugging Face weights; Egyptian deployment via AIC ecosystem applications[2] | Open weights on Hugging Face + Fanar platform[4][7] |
| Training scale (public) | ALLaM-2-7B: 4T English tokens pretraining + 1.2T mixed Arabic/English tokens[9]; ALLaM 13B: 3T total[9] | Multi-stage pipeline (continued pretraining + SFT); total token count not published in the model card[6] | Fanar 2.0: ~166B continual-pretraining tokens (Arabic + English + code), about 6× fewer than Fanar 1.0’s ~1T[4][10] |
| Published benchmarks | State-of-the-art on MMLU Arabic, ACVA, Arabic Exams per ALLaM paper[5] | None published in model card as of this article[6] | ArabicMMLU 74.67%, MMLU 78.89%, Belebele 86.81%, GSM8K 93.70%[4] |

Caveat: none of the three labs has fully published its dialect mixture or its SFT/RLHF data distribution. What is public is limited to high-level claims in the model cards and technical papers.

ALLaM — deep dive

Architecture + corpus. ALLaM is a family of Arabic-centric models from SDAIA. The 2024 technical paper describes an autoregressive decoder-only architecture with vocabulary expansion for Arabic morphology and bilingual pretraining[5]. ALLaM-2-7B-instruct’s published pretraining recipe is two steps: 4T English tokens, then 1.2T mixed Arabic/English tokens[9]. SDAIA also publicized a 500B-token Arabic dataset as part of the broader effort[9].

Operational ownership. With HUMAIN’s establishment in May 2025 (a merger that brought together SCAI, SDAIA’s AI model team, and other national technology units), HUMAIN now owns the ALLaM product roadmap[1]. ALLaM 34B powers HUMAIN Chat, which features real-time web search, speech input across multiple Arabic dialects, and Arabic-English code-switching[1].

Instruction tuning. SFT on Arabic instruction-response pairs + cleaned translations; human preference alignment incorporated into the published training recipe[5]. Exact ratios of native-Arabic vs. translated SFT data have not been disclosed.

Published benchmarks. The ALLaM paper reports state-of-the-art results on MMLU Arabic, ACVA, and Arabic Exams at time of publication[5].

What moves the needle in production (the gap):

Deployment. HUMAIN Chat as the consumer surface; ALLaM-2-7B-instruct is also published on Microsoft Azure AI Foundry’s model catalog[8].

Commercial use cases. Government assistants, citizen-service portals, Arabic content responses inside Saudi apps, ICD/SNOMED linkage from Arabic clinical reports as an infrastructure layer.

Karnak — deep dive

Architecture + corpus. Karnak comes from Egypt’s Applied Innovation Center (AIC). It is a depth-extended causal language model built on top of Qwen/Qwen3-30B-A3B-Instruct-2507, with the model card listing ~40B parameters after depth extension, an Arabic-optimized tokenizer, and a safe context window up to 20,000 tokens[6]. Training is a multi-stage pipeline: pre-trained weights → depth extension → continued pretraining → SFT[6]. The corpus is described only as “high-quality, filtered data through a rigorous pipeline” — exact corpus composition is not published[6].
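One practical consequence of the 20,000-token safe context is that long inputs have to be split before they reach the model. The following is a minimal sketch of greedy, order-preserving packing under that budget. The function name `pack_chunks`, the 1,000-token output reservation, and the assumption that token counts have already been computed with the model's tokenizer are all illustrative and not taken from the model card.

```python
SAFE_CONTEXT_TOKENS = 20_000  # safe context window cited from the Karnak model card


def pack_chunks(chunk_token_counts, reserved_for_output=1_000,
                limit=SAFE_CONTEXT_TOKENS):
    """Greedily pack chunks, in order, into batches that fit the context budget.

    chunk_token_counts: token count per chunk, assumed to be precomputed
    with the model's own tokenizer (not done here).
    Returns a list of batches; each batch is a list of chunk indices.
    """
    budget = limit - reserved_for_output  # hypothetical room left for the reply
    batches, current, used = [], [], 0
    for index, count in enumerate(chunk_token_counts):
        if count > budget:
            raise ValueError(f"chunk {index} alone exceeds the budget of {budget}")
        if used + count > budget:
            # Close the current batch and start a new one.
            batches.append(current)
            current, used = [], 0
        current.append(index)
        used += count
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    print(pack_chunks([8_000, 9_000, 5_000, 12_000, 7_000]))
```

Keeping chunks in their original order matters for documents such as contracts or clinical reports, where reordering would break references between sections.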

Dialect posture. The model card describes Karnak as an Arabic-and-English model and does not claim specialization for Egyptian colloquial[6]. In the AIC product line, Egyptian colloquial NLP is handled by a separate product, BelMasry, announced alongside Karnak at Ai Everything MEA 2026[3]. Treating Karnak as “the Egyptian-dialect model” misstates its positioning.

Companion applications. Egypt’s MCIT announced a suite of Karnak-powered applications at launch: SIA (Arabic language and Egyptian history tutor), an AI legal and regulatory assistant, AcQua (call-center auditing), healthcare AI engines, Torgoman (translation), and Loghat (English education)[2].

Published benchmarks. The Karnak Hugging Face model card lists no public benchmark scores as of this writing[6]. Until AIC publishes a technical report and/or third-party Open Arabic LLM Leaderboard (OALL) results, buyers have no published scores on ArabicMMLU, AlGhafa (TII), or other canonical Arabic evals to compare against, and must run those evals themselves on the open weights.

What moves the needle in production (the gap):

Deployment. Apache 2.0 weights on Hugging Face[6]; in-Egypt sovereign deployment through AIC ecosystem applications.

Commercial use cases. Egyptian education applications, citizen-service AI, public-sector legal/regulatory assistants, translation, call-center auditing.

Fanar 2.0 — deep dive

Architecture + corpus. Fanar 2.0 from QCRI (announced December 9, 2025 at World Summit AI Doha) is the most interesting of the three Arabic national models from a practitioner angle[7]. It is a 27B-parameter model built by continual pretraining of google/gemma-3-27b-pt on approximately 166B tokens of curated Arabic, English, and code data. That is roughly 6× fewer continual-pretraining tokens than Fanar 1.0’s reported ~1T (1T / 166B ≈ 6.0); QCRI’s own communications describe the reduction as “approximately eight times fewer”[4][10]. The ~6× ratio is a comparison against Fanar 1.0, not against peer models. Total compute was ~75,000 H100 GPU-hours[4].
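The token ratio above is easy to check by hand, and the same arithmetic gives a rough feel for what 75,000 GPU-hours means in calendar time. The sketch below uses only the approximate figures quoted in this section. The 256-GPU cluster size is a hypothetical chosen for illustration and is not a figure QCRI has published.

```python
# Approximate figures as quoted in this article.
FANAR_1_TOKENS = 1.0e12   # ~1T continual-pretraining tokens (Fanar 1.0)
FANAR_2_TOKENS = 166e9    # ~166B continual-pretraining tokens (Fanar 2.0)
H100_GPU_HOURS = 75_000   # reported total compute for Fanar 2.0


def token_ratio(old_tokens, new_tokens):
    """How many times fewer tokens the newer run used."""
    return old_tokens / new_tokens


def wall_clock_days(gpu_hours, num_gpus):
    """Calendar days needed if the job ran on num_gpus with perfect utilization."""
    return gpu_hours / num_gpus / 24


if __name__ == "__main__":
    print(f"token ratio: {token_ratio(FANAR_1_TOKENS, FANAR_2_TOKENS):.2f}x")
    # Hypothetical cluster size, for illustration only.
    print(f"days on 256 GPUs: {wall_clock_days(H100_GPU_HOURS, 256):.1f}")
```

Under the hypothetical 256-GPU assumption this works out to roughly twelve days, which is the scale of budget a well-funded regional lab can plausibly repeat for domain-specific variants.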

Instruction tuning. SFT on ~4M instructions; DPO on ~280K preference pairs[4].

Published benchmarks. Per the Fanar-2-27B-Instruct model card: ArabicMMLU 74.67%, MMLU (English) 78.89%, Belebele 86.81%, GSM8K 93.70%[4]. QCRI also reports +7.32 pts ArabicMMLU, +3.55 pts Belebele, and +7.57 pts MMLU vs Fanar 1.0[4][10].

Dialect coverage. The Fanar 2.0 model card explicitly lists support for MSA + Gulf, Levantine, and Egyptian dialects[4].

What moves the needle in production (the gap):

Deployment. Apache 2.0 weights on Hugging Face + the Fanar platform[4][7].

Commercial use cases. Arabic educational content, Islamic-knowledge services, legal and Sharia text comprehension, Arabic academic content, translation.

Jais and Falcon — regional context

You cannot discuss Arabic foundation models without Jais (Inception / G42 / MBZUAI / Cerebras, UAE) and Falcon (TII, UAE).

Many MENA product teams use Falcon or Jais as a base and then fine-tune with a domain-specific corpus. The three national models above serve a different market — sovereign deals + public sector + data-residency requirements.

What benchmarks measure vs. what moves the needle

| What is measured | What is not measured (but matters) |
| --- | --- |
| ArabicMMLU (MBZUAI, multitask Arabic knowledge) | Deep dialect comprehension inside one family |
| MMLU (English, for cross-lingual comparison) | Cultural calibration in responses |
| AlGhafa (TII, native Arabic tasks) | Arabic-English code-switching behavior |
| Belebele (dialectal Arabic reading comprehension) | Arabic jailbreak resistance |
| OALL leaderboard (aggregated Arabic LLM eval) | Response quality in a specific use case (legal, medical, financial) |
| GSM8K / math evals (translated) | Multi-turn dialogue coherence in Arabic |
| Translated HellaSwag / ARC (commonsense, reasoning) | Response to instructions delivered in Saudi, Egyptian, or Levantine tone |

Practitioner takeaway: an ArabicMMLU of 75% vs. 70% does not tell you which model will serve your application better. You need an application-specific eval set built from data that resembles your production traffic. That is eval set construction — and it is curated labeling work.
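To make the takeaway concrete, here is a minimal sketch of slice-level scoring on an application-specific eval set. The two models, the slice names, and every graded result are made up for illustration. The point is only that the model with the better aggregate score can be the worse one on the slice your application depends on.

```python
from collections import defaultdict


def accuracy_by_slice(records):
    """Return {slice_name: accuracy} for (slice_name, is_correct) pairs."""
    totals, hits = defaultdict(int), defaultdict(int)
    for slice_name, is_correct in records:
        totals[slice_name] += 1
        hits[slice_name] += int(is_correct)
    return {name: hits[name] / totals[name] for name in totals}


def overall_accuracy(records):
    """Return the unweighted accuracy over all records."""
    return sum(int(ok) for _, ok in records) / len(records)


def toy_results(general_hits, legal_hits):
    """Build made-up graded results: 20 general items and 10 legal items."""
    return ([("general", i < general_hits) for i in range(20)]
            + [("legal", i < legal_hits) for i in range(10)])


# Hypothetical graded outputs from two candidate models.
model_a = toy_results(general_hits=19, legal_hits=3)
model_b = toy_results(general_hits=14, legal_hits=7)

if __name__ == "__main__":
    for name, results in (("model_a", model_a), ("model_b", model_b)):
        print(name, round(overall_accuracy(results), 3), accuracy_by_slice(results))
```

In this toy data model_a wins on the aggregate (22 of 30 against 21 of 30) but gets only 3 of 10 legal items right against model_b's 7 of 10. A legal-assistant team choosing on the headline number would pick the wrong model.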

Where Annota8’s labeling work fits

From our experience serving foundation-model labs in MENA, four categories of labeling work materially move the production-evaluation needle:

  1. Native-Arabic supervised fine-tuning (SFT). Instruction-response pairs written natively in Arabic from scratch by trained linguists — not crowd-sourced translation. For Saudi, Saudis write. For Egypt, Egyptians write. MSA is written by trained MSA linguists. See SFT.

  2. Culturally calibrated RLHF preference pairs. Preference pairs rated by annotators who understand Arabic cultural context — polite vs. rude, religiously appropriate vs. inappropriate, professionally phrased vs. colloquial. This is what shifts an RLHF-tuned model toward locally appropriate responses. See RLHF.

  3. Dialect-stratified evaluation set construction. Eval sets that carry explicit per-dialect coverage targets: what share is MSA, Gulf, Egyptian, Levantine, Maghrebi. Eval sets that lump all dialects together hide model weakness. See eval set construction.

  4. Arabic adversarial jailbreak red-team work. Arabic is vulnerable to different jailbreak patterns than English — code-switching, transliteration, religiously framed requests, tribally framed requests. Jailbreak red-team labeling builds an adversarial set to test alignment robustness in Arabic. This is a documented research gap across all three of these models.

All four are needed by every national model to varying degrees.
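As an illustration of the third category, the sketch below checks a labeled eval set against explicit dialect coverage targets. The target shares, the 5-point tolerance, and the function name `coverage_gaps` are hypothetical choices made for this example, not a recommended mixture.

```python
from collections import Counter

# Hypothetical coverage targets; a real project would set these from production traffic.
TARGET_SHARES = {
    "MSA": 0.30,
    "Gulf": 0.20,
    "Egyptian": 0.20,
    "Levantine": 0.20,
    "Maghrebi": 0.10,
}


def coverage_gaps(labels, targets, tolerance=0.05):
    """Return {dialect: (actual_share, target_share)} for dialects off target.

    labels: one dialect label per eval item, as assigned by annotators.
    A dialect is reported when its share differs from the target by more
    than `tolerance` (absolute difference in share).
    """
    counts = Counter(labels)
    total = len(labels)
    gaps = {}
    for dialect, target in targets.items():
        actual = counts.get(dialect, 0) / total if total else 0.0
        if abs(actual - target) > tolerance:
            gaps[dialect] = (actual, target)
    return gaps


if __name__ == "__main__":
    # A made-up eval set that is heavily skewed toward MSA.
    labels = ["MSA"] * 60 + ["Gulf"] * 20 + ["Egyptian"] * 18 + ["Levantine"] * 2
    for dialect, (actual, target) in coverage_gaps(labels, TARGET_SHARES).items():
        print(f"{dialect}: {actual:.0%} of items, target {target:.0%}")
```

Running a check like this before scoring any model shows whether a strong headline number is mostly an MSA number.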

How to choose between them for your application

If you are building inside KSA for a government entity or a Saudi enterprise with in-Kingdom residency requirements: start with ALLaM via HUMAIN Chat surfaces or ALLaM-2-7B on Azure AI Foundry.

If you are building in Egypt with a general MSA + English use case: Karnak’s Apache 2.0 weights on Hugging Face are an open starting point. For Egyptian-colloquial workloads (call centers, social-media analytics, dialectal chatbots), pair Karnak with BelMasry or use Jais/Falcon as a base with curated Egyptian SFT layers.

If you are building an educational content app, Islamic-knowledge services, or legal/Sharia text comprehension: Fanar 2.0’s published benchmark profile and curated-data thesis make it a strong candidate.

If you are building a multi-dialect general Arabic application and want a permissively licensed base: Falcon Arabic + a domain-specific SFT layer is a credible path.

If you are building an enterprise app in UAE + Gulf and want a foundational open Arabic reference: Jais remains a frequently cited model.

In every case: layer your own labeling on top for your application domain. The base model matters, but the curated data specific to your application matters more.

References

  1. Middle East AI News, “HUMAIN Chat goes live powered by ALLaM 34B LLM” (2025). Supports HUMAIN ownership of ALLaM roadmap, ALLaM 34B powering HUMAIN Chat, real-time web search and multi-dialect speech features.
  2. ITIDA, “Egypt Launches Karnak: National AI Language Model at Ai Everything MEA 2026” (Feb 11, 2026). Supports Karnak launch date, AIC sponsor, companion applications list (SIA, BelMasry, AcQua, Torgoman, Loghat).
  3. Middle East AI News coverage referencing BelMasry as AIC’s NLP engines for Egyptian colloquial Arabic (Feb 2026). Supports BelMasry as a separate AIC product for Egyptian colloquial, distinct from Karnak.
  4. QCRI, “Fanar-2-27B-Instruct” model card on Hugging Face (Dec 2025). Supports Fanar 2.0 base model (Gemma-3-27B), 27B parameters, ~166B continual pretraining tokens, 75,000 H100 GPU-hours, ArabicMMLU 74.67%, MMLU 78.89%, Belebele 86.81%, GSM8K 93.70%, Apache 2.0 license, MSA + Gulf/Levantine/Egyptian dialect coverage, SFT 4M / DPO 280K.
  5. Bari et al., “ALLaM: Large Language Models for Arabic and English”, arXiv:2407.15390 (July 2024). Supports ALLaM architecture (autoregressive decoder-only with Arabic vocabulary expansion, bilingual pretraining), state-of-the-art results on MMLU Arabic, ACVA, Arabic Exams.
  6. Applied Innovation Center, “Karnak” model card on Hugging Face (2026). Supports Karnak base model (Qwen3-30B-A3B-Instruct-2507), ~40B parameter count after depth extension, Arabic-optimized tokenizer, 20,000-token safe context, Apache 2.0 license, multi-stage training pipeline.
  7. Middle East AI News, “Qatar announces Fanar 2.0 Arabic AI model” (Dec 9, 2025). Supports Fanar 2.0 announcement date and World Summit AI Doha venue.
  8. Microsoft, “Introducing SDAIA and Their Latest Arabic LLM on Azure AI Model Catalog”. Supports ALLaM-2-7B-instruct availability on Microsoft Azure AI Foundry.
  9. Microsoft Azure AI Foundry, “ALLaM-2-7b-instruct” model catalog page. Supports ALLaM-2-7B pretraining recipe (4T English tokens + 1.2T mixed Arabic/English) and the 500B-token Arabic dataset claim.
  10. Middle East AI News, “Qatar’s national AI platform’s powerful upgrade explained”. Supports Fanar 2.0’s “~8x fewer tokens than Fanar 1.0” framing and the benchmark deltas (ArabicMMLU +7.3, Belebele +3.5, MMLU +7.6, Belebele dialectal).
  11. Inception (G42), “G42 Sets New Benchmark for Arabic Large Language Models with the Release of JAIS 30B” (Nov 9, 2023). Supports Jais 30B release date and training token composition (126B Arabic + 251B English + 50B code).
  12. TII, “Falcon 3” announcement and Falcon Arabic page. Supports Falcon 3 release date (Dec 17, 2024), 14T training tokens, Falcon Arabic built on Falcon 3-7B with MSA + dialect data, TII Falcon License terms.