26 May 2026 ALLaM Karnak Fanar comparison

ALLaM v2 + Karnak + Fanar: a practitioner comparison of MENA training labs in 2026

TL;DR

Three Arabic national foundation models have reached production maturity as of mid-2026:

ALLaM (SDAIA / HUMAIN, Saudi Arabia): an Arabic-centric LLM family; the latest production deployment is ALLaM 34B, which powers HUMAIN Chat^[1].
Karnak (AIC, Egypt): a depth-extended Qwen3-30B-A3B-Instruct-2507 model (~40B params) optimized for MSA + English, publicly launched February 11, 2026 at Ai Everything MEA Cairo^[2]. Egyptian colloquial coverage is handled by a separate AIC product, BelMasry^[3].
Fanar 2.0 (QCRI, Qatar): a 27B model continually pretrained from Gemma-3-27B on ~166B curated tokens, roughly 6× fewer continual-pretraining tokens than Fanar 1.0’s ~1T, while improving on ArabicMMLU and Belebele^[4].

Jais and Falcon (UAE) remain important reference points. What moves the needle in production is not corpus size but native Arabic SFT, culturally calibrated RLHF preference data, dialect-stratified eval sets, and Arabic-native jailbreak red-team work. Every number in this post comes from the public model cards, lab papers, and announcements listed under References.

Why these three, why now

After two years of an Arabic foundation-model race, the market has three tiers: government-backed national models (ALLaM, Fanar, Karnak), commercial-regional open-weight models (Jais, Falcon), and global frontier models with an Arabic layer (GPT-4o, Claude, Gemini, Llama 4). This article dissects the first tier because it is the one being purchased inside sovereign deals + public-sector integration mandates.

From a practitioner angle I will pin down the architecture, the data sourcing posture, the alignment strategy, the dialect coverage, the deployment options, and the actual gap between launch-day benchmarks and what shows up in production. Then I will show where curated labeling work fits in.

Quick-spec comparison table

Dimension	ALLaM (SDAIA / HUMAIN)	Karnak (AIC Egypt)	Fanar 2.0 (QCRI)
Sponsor	Saudi Data & AI Authority; operationalized via HUMAIN^[1]	Applied Innovation Center, Egypt (MCIT / ITIDA)^[2]	Qatar Computing Research Institute, HBKU^[4]
Public release	ALLaM technical paper July 2024^[5]; HUMAIN Chat with ALLaM 34B live 2025^[1]	Launched February 11, 2026 at Ai Everything MEA Cairo; live on Hugging Face at `Applied-Innovation-Center/Karnak`^[2]^[6]	Fanar 1.0 (2024); Fanar 2.0 announced December 9, 2025^[7]
Base architecture	Autoregressive decoder-only with Arabic vocabulary expansion, bilingual pretraining^[5]	Qwen3-30B-A3B-Instruct-2507, depth-extended to ~40B; Arabic-optimized tokenizer^[6]	Continual pretraining of Gemma-3-27B^[4]
Primary language coverage	Arabic (MSA + dialect coverage including Saudi dialects via speech input)^[1] + English	Arabic and English; model card lists both, does not explicitly call out dialect specialization^[6]. Egyptian colloquial is served by a separate AIC product, BelMasry^[3]	MSA + dialects (Gulf, Levantine, Egyptian) per model card^[4]
License	Proprietary / national gateway; ALLaM-2-7B-instruct available on Azure AI^[8]	Apache 2.0 (Hugging Face model card)^[6]	Apache 2.0 (Hugging Face model card)^[4]
Deployment	HUMAIN Chat consumer app; ALLaM-2-7B-instruct on Microsoft Azure AI Foundry^[1]^[8]	Hugging Face weights; Egyptian deployment via AIC ecosystem applications^[2]	Open weights on Hugging Face + Fanar platform^[4]^[7]
Training scale (public)	ALLaM-2-7B: 4T English tokens pretraining + 1.2T mixed Arabic/English tokens^[9]; ALLaM 13B: 3T total^[9]	Multi-stage pipeline (continued pretraining + SFT); total token count not published in the model card^[6]	Fanar 2.0: ~166B continual-pretraining tokens (Arabic + English + code), about 6× fewer than Fanar 1.0’s ~1T^[4]^[10]
Published benchmarks	State-of-the-art on MMLU Arabic, ACVA, Arabic Exams per ALLaM paper^[5]	None published in model card as of this article^[6]	ArabicMMLU 74.67%, MMLU 78.89%, Belebele 86.81%, GSM8K 93.70%^[4]

Caveat: dialect mixtures and SFT/RLHF distributions have not been fully published by any of the three labs. What is public is a high-level claim in the model card + technical paper.

ALLaM — deep dive

Architecture + corpus. ALLaM is a family of Arabic-centric models from SDAIA. The 2024 technical paper describes an autoregressive decoder-only architecture with vocabulary expansion for Arabic morphology and bilingual pretraining^[5]. ALLaM-2-7B-instruct’s published pretraining recipe is two steps: 4T English tokens, then 1.2T mixed Arabic/English tokens^[9]. SDAIA also publicized a 500B-token Arabic dataset as part of the broader effort^[9].

Operational ownership. With HUMAIN’s establishment in May 2025 (a merger that brought together SCAI, SDAIA’s AI model team, and other national technology units), HUMAIN now owns the ALLaM product roadmap^[1]. ALLaM 34B powers HUMAIN Chat, which features real-time web search, speech input across multiple Arabic dialects, and Arabic-English code-switching^[1].

Instruction tuning. SFT on Arabic instruction-response pairs + cleaned translations; human preference alignment incorporated into the published training recipe^[5]. Exact ratios of native-Arabic vs. translated SFT data have not been disclosed.

Published benchmarks. The ALLaM paper reports state-of-the-art results on MMLU Arabic, ACVA, and Arabic Exams at time of publication^[5].

What moves the needle in production (the gap):

ArabicMMLU is useful, but it does not measure Saudi-dialect conversational understanding. Models that score top on ArabicMMLU still need dialect-stratified eval sets to validate Najdi or Hejazi inputs.
Government-content alignment requires use-case-specific samples. Generalization from generic benchmarks does not produce ministry-specific responses.
Code-switching behavior between Arabic and English is the dominant input pattern for many Gulf customer-facing applications and is not well-covered by launch-day Arabic benchmarks.

Deployment. HUMAIN Chat as the consumer surface; ALLaM-2-7B-instruct is also published on Microsoft Azure AI Foundry’s model catalog^[8].

Commercial use cases. Government assistants, citizen-service portals, Arabic-language response generation inside Saudi apps, and ICD/SNOMED code linkage from Arabic clinical reports as an infrastructure layer.

Karnak — deep dive

Architecture + corpus. Karnak comes from Egypt’s Applied Innovation Center (AIC). It is a depth-extended causal language model built on top of Qwen/Qwen3-30B-A3B-Instruct-2507, with the model card listing ~40B parameters after depth extension, an Arabic-optimized tokenizer, and a safe context window up to 20,000 tokens^[6]. Training is a multi-stage pipeline: pre-trained weights → depth extension → continued pretraining → SFT^[6]. The corpus is described only as “high-quality, filtered data through a rigorous pipeline” — exact corpus composition is not published^[6].
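The 20,000-token safe context figure is easy to overrun once retrieved documents are stuffed into a prompt. Below is a minimal sketch of a pre-flight budget check. It assumes the limit covers prompt plus generated tokens (the model card wording quoted above does not spell this out), and the function names and the whitespace tokenizer are hypothetical stand-ins, not part of any AIC tooling.

```python
from typing import Callable, Sequence

# "Safe context window" figure from the Karnak model card, as cited in this post.
SAFE_CONTEXT_TOKENS = 20_000


def fits_safe_context(
    prompt: str,
    max_new_tokens: int,
    tokenize: Callable[[str], Sequence],
    limit: int = SAFE_CONTEXT_TOKENS,
) -> bool:
    """Return True if prompt tokens plus the generation budget stay within the limit.

    Assumption: the limit applies to prompt and output combined.
    """
    return len(tokenize(prompt)) + max_new_tokens <= limit


def whitespace_tokenize(text: str) -> list[str]:
    """Stand-in tokenizer for the example only; real token counts will differ."""
    return text.split()


if __name__ == "__main__":
    print(fits_safe_context("short prompt", 512, whitespace_tokenize))
```

In a real pipeline the `tokenize` argument would wrap the model’s own tokenizer, for example one loaded with `transformers.AutoTokenizer.from_pretrained("Applied-Innovation-Center/Karnak")`, assuming the repository loads that way. Arabic text usually splits into more tokens than whitespace counting suggests, so the stand-in should never be used for real budgeting.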

Dialect posture. The model card describes Karnak as an Arabic-and-English model and does not claim specialization for Egyptian colloquial^[6]. In the AIC product line, Egyptian colloquial NLP is handled by a separate product, BelMasry, announced alongside Karnak at Ai Everything MEA 2026^[3]. Treating Karnak as “the Egyptian-dialect model” misstates its positioning.

Companion applications. Egypt’s MCIT announced a suite of Karnak-powered applications at launch: SIA (Arabic language and Egyptian history tutor), an AI legal and regulatory assistant, AcQua (call-center auditing), healthcare AI engines, Torgoman (translation), and Loghat (English education)^[2].

Published benchmarks. The Karnak Hugging Face model card lists no public benchmark scores as of this writing^[6]. Until AIC publishes a technical report or third-party Open Arabic LLM Leaderboard (OALL) results appear, buyers have no published scores for comparing Karnak with peers on ArabicMMLU, AlGhafa (TII), or other canonical Arabic evals. Because the weights are open, they can run those evals themselves.

What moves the needle in production (the gap):

The MSA-general-purpose model and the colloquial product are separate. A practitioner picking the right AIC tool for a use case needs to choose between Karnak (general-purpose Arabic + English) and BelMasry (Egyptian colloquial NLP).
No published ArabicMMLU / AlGhafa / OALL scores yet, so buyers must run their own evals on the open weights to compare Karnak against peers.
Egyptian financial + legal use cases require additional domain-specific SFT layers on top.

Deployment. Apache 2.0 weights on Hugging Face^[6]; in-Egypt sovereign deployment through AIC ecosystem applications.

Commercial use cases. Egyptian education applications, citizen-service AI, public-sector legal/regulatory assistants, translation, call-center auditing.

Fanar 2.0 — deep dive

Architecture + corpus. Fanar 2.0 from QCRI (announced December 9, 2025 at World Summit AI Doha) is, from a practitioner angle, the most interesting of the three national models because its data budget shrank while its published scores rose^[7]. It is a 27B-parameter model built by continual pretraining of google/gemma-3-27b-pt on approximately 166B tokens of curated Arabic, English, and code data, roughly 6× fewer continual-pretraining tokens than Fanar 1.0’s reported ~1T^[4]^[10]. QCRI’s own communications describe the reduction as “approximately eight times fewer”; the ~6× figure used here is simply ~1T divided by ~166B^[10]. Either ratio compares Fanar 2.0 against Fanar 1.0, not against peer models. Total compute was ~75,000 H100 GPU-hours^[4].
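The token-reduction ratio is plain division, so it can be checked directly. A minimal sketch using only the approximate token counts cited in this post; the constant and function names are illustrative, and the second helper shows what Fanar 1.0 baseline an “eight times fewer” framing would imply for the same ~166B figure.

```python
# Approximate public figures as cited in this post.
FANAR_1_TOKENS = 1_000_000_000_000  # ~1T continual-pretraining tokens (Fanar 1.0)
FANAR_2_TOKENS = 166_000_000_000    # ~166B continual-pretraining tokens (Fanar 2.0)


def reduction_factor(old_tokens: int, new_tokens: int) -> float:
    """How many times fewer tokens the newer run used."""
    return old_tokens / new_tokens


def implied_baseline(new_tokens: int, factor: int) -> int:
    """Baseline token count that a given 'N times fewer' claim would imply."""
    return new_tokens * factor


if __name__ == "__main__":
    print(round(reduction_factor(FANAR_1_TOKENS, FANAR_2_TOKENS), 2))  # 6.02
    print(implied_baseline(FANAR_2_TOKENS, 8))  # 1328000000000, i.e. ~1.33T
```

Both inputs are rounded public numbers, so the result is only good to about one significant figure. That is enough to see that ~6× follows from ~1T and ~166B, while an 8× framing would correspond to a baseline nearer ~1.33T.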

Instruction tuning. SFT on ~4M instructions; DPO on ~280K preference pairs^[4].
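For readers who have not handled preference data, the following sketch shows one plausible shape for a single DPO training record and how several annotators’ votes might be collapsed into it. It is not QCRI’s pipeline: the `PairJudgment` class, the two-thirds agreement threshold, and the prompt/chosen/rejected field names (a layout commonly used by open-source DPO tooling) are all assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class PairJudgment:
    """One prompt, two candidate responses, and one vote per annotator."""
    prompt: str
    response_a: str
    response_b: str
    votes: list  # each vote is "a" or "b"


def to_preference_record(j: PairJudgment, min_agreement: float = 2 / 3):
    """Return a prompt/chosen/rejected dict, or None if agreement is too low."""
    counts = Counter(j.votes)
    if not counts:
        return None  # nobody rated this pair
    winner, n = counts.most_common(1)[0]
    if n / len(j.votes) < min_agreement:
        return None  # annotators disagree; send back for review instead of training
    if winner == "a":
        chosen, rejected = j.response_a, j.response_b
    else:
        chosen, rejected = j.response_b, j.response_a
    return {"prompt": j.prompt, "chosen": chosen, "rejected": rejected}


if __name__ == "__main__":
    judgment = PairJudgment("<prompt>", "<polite reply>", "<curt reply>", ["a", "a", "b"])
    print(to_preference_record(judgment))
```

Dropping low-agreement pairs rather than forcing a majority label is a design choice: pairs where culturally informed annotators split are often the ones where the guideline itself needs work.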

Published benchmarks. Per the Fanar-2-27B-Instruct model card: ArabicMMLU 74.67%, MMLU (English) 78.89%, Belebele 86.81%, GSM8K 93.70%^[4]. QCRI also reports +7.32 pts ArabicMMLU, +3.55 pts Belebele, and +7.57 pts MMLU vs Fanar 1.0^[4]^[10].

Dialect coverage. The Fanar 2.0 model card explicitly lists support for MSA + Gulf, Levantine, and Egyptian dialects^[4].

What moves the needle in production (the gap):

The “quality over quantity” thesis is supported by published benchmarks, but production performance on dialect-specific workloads still depends on application-specific evaluation.
Limited published Arabic red-team work across all three labs — the Arabic jailbreak surface (code-switching, transliteration, religiously framed prompts, tribally framed prompts) remains under-documented.

Deployment. Apache 2.0 weights on Hugging Face + the Fanar platform^[4]^[7].

Commercial use cases. Arabic educational content, Islamic-knowledge services, legal and Sharia text comprehension, Arabic academic content, translation.

Jais and Falcon — regional context

You cannot discuss Arabic foundation models without Jais (Inception / G42 / MBZUAI / Cerebras, UAE) and Falcon (TII, UAE):

Jais 30B was released November 9, 2023, trained on 126B Arabic tokens + 251B English tokens + 50B code tokens on Cerebras’ Condor Galaxy-1 supercomputer^[11]. It remains a foundational reference for production Arabic LLMs with an Arabic-native design.
Falcon is TII’s family of open-weight models; Falcon 3 (released December 17, 2024) was trained on 14T tokens, and TII subsequently released Falcon Arabic built on the Falcon 3-7B architecture with native Arabic training data spanning MSA and regional dialects^[12]. Falcon Arabic targets the top of the Open Arabic LLM Leaderboard among regionally available models^[12]. Licensing is the TII Falcon License — an Apache-2.0-derived license with a commercial-revenue threshold^[12].

Many MENA product teams use Falcon or Jais as a base and then fine-tune with a domain-specific corpus. The three national models above serve a different market — sovereign deals + public sector + data-residency requirements.

What benchmarks measure vs. what moves the needle

What is measured	What is not measured (but matters)
ArabicMMLU (MBZUAI, multitask Arabic knowledge)	Deep dialect comprehension inside one family
MMLU (English, for cross-lingual comparison)	Cultural calibration in responses
AlGhafa (TII, native Arabic tasks)	Arabic-English code-switching behavior
Belebele (dialectal Arabic reading comprehension)	Arabic jailbreak resistance
OALL leaderboard (aggregated Arabic LLM eval)	Response quality in a specific use case (legal, medical, financial)
GSM8K / math evals (translated)	Multi-turn dialogue coherence in Arabic
Translated HellaSwag / ARC (commonsense, reasoning)	Response to instructions delivered in Saudi, Egyptian, or Levantine tone

Practitioner takeaway: an ArabicMMLU of 75% vs. 70% does not tell you which model will serve your application better. You need an application-specific eval set built from data that resembles your production traffic. That is eval set construction — and it is curated labeling work.
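To make the takeaway concrete, here is a small sketch of how a single headline accuracy can hide a weak dialect. The results are synthetic numbers invented for the example, not measurements of any model named in this post, and the function name is hypothetical.

```python
from collections import defaultdict


def accuracy_by_dialect(results):
    """Compute overall and per-dialect accuracy.

    results: iterable of (dialect_label, is_correct) pairs for one model.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for dialect, is_correct in results:
        totals[dialect] += 1
        correct[dialect] += int(is_correct)
    if not totals:
        raise ValueError("no results to score")
    per_dialect = {d: correct[d] / totals[d] for d in totals}
    overall = sum(correct.values()) / sum(totals.values())
    return overall, per_dialect


# Synthetic, made-up results purely to illustrate the arithmetic.
toy_results = (
    [("MSA", True)] * 9 + [("MSA", False)] * 1
    + [("Gulf", True)] * 3 + [("Gulf", False)] * 2
    + [("Egyptian", True)] * 2 + [("Egyptian", False)] * 3
)

if __name__ == "__main__":
    overall, per_dialect = accuracy_by_dialect(toy_results)
    print(overall)      # 0.7
    print(per_dialect)  # {'MSA': 0.9, 'Gulf': 0.6, 'Egyptian': 0.4}
```

In this toy set the model looks like a 70% model overall, yet it answers only 40% of the Egyptian items correctly. If production traffic is mostly Egyptian, the headline number is the wrong one to buy on.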

Where Annota8’s labeling work fits

From our experience serving foundation-model labs in MENA, four categories of labeling work materially move the production-evaluation needle:

Native-Arabic supervised fine-tuning (SFT). Instruction-response pairs written natively in Arabic from scratch by trained linguists, not crowd-sourced translation. For Saudi, Saudis write. For Egypt, Egyptians write. MSA is written by trained MSA linguists.
Culturally calibrated RLHF preference pairs. Preference pairs rated by annotators who understand Arabic cultural context: polite vs. rude, religiously appropriate vs. inappropriate, professionally phrased vs. colloquial. This is what shifts an RLHF-tuned model toward locally appropriate responses.
Dialect-stratified evaluation set construction. Eval sets that carry explicit per-dialect coverage targets: what share is MSA, Gulf, Egyptian, Levantine, Maghrebi. Eval sets that lump all dialects together hide model weakness.
Arabic adversarial jailbreak red-team work. Arabic is vulnerable to different jailbreak patterns than English: code-switching, transliteration, religiously framed requests, tribally framed requests. Jailbreak red-team labeling builds an adversarial set to test alignment robustness in Arabic. Published red-team work of this kind is limited for all three of these models.

All four are needed by every national model to varying degrees.
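As an illustration of what explicit coverage targets can mean in practice for a dialect-stratified eval set, the sketch below draws a fixed-size sample that hits a stated share per dialect and fails loudly when a dialect is under-supplied. The target shares, field names, and function name are hypothetical and are not a description of any lab’s or Annota8’s actual tooling.

```python
import random
from collections import defaultdict


def stratified_sample(pool, targets, total, seed=0):
    """Draw about `total` items from `pool`, matching each dialect's target share.

    pool:    list of dicts with at least a "dialect" key
    targets: dict mapping dialect -> target share (shares must sum to 1.0)
    """
    if abs(sum(targets.values()) - 1.0) > 1e-9:
        raise ValueError("target shares must sum to 1.0")
    by_dialect = defaultdict(list)
    for item in pool:
        by_dialect[item["dialect"]].append(item)
    rng = random.Random(seed)  # fixed seed keeps the eval set reproducible
    sample = []
    for dialect, share in targets.items():
        needed = round(total * share)
        available = by_dialect.get(dialect, [])
        if len(available) < needed:
            # Fail loudly instead of silently under-covering a dialect.
            raise ValueError(f"{dialect}: need {needed}, only {len(available)} available")
        sample.extend(rng.sample(available, needed))
    return sample


# Hypothetical coverage targets and a placeholder candidate pool.
TARGETS = {"MSA": 0.30, "Gulf": 0.25, "Egyptian": 0.20, "Levantine": 0.15, "Maghrebi": 0.10}
POOL = [{"dialect": d, "id": f"{d}-{i}"} for d in TARGETS for i in range(10)]

if __name__ == "__main__":
    eval_set = stratified_sample(POOL, TARGETS, total=20)
    print(len(eval_set))  # 20
```

Because each stratum is rounded independently, the final size can differ from `total` by an item or two for awkward shares. The hard failure on a short stratum is deliberate: an eval set that quietly ships with too few Maghrebi items recreates the lumped-together problem the stratification was meant to fix.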

How to choose between them for your application

If you are building inside KSA for a government entity or a Saudi enterprise with in-Kingdom residency requirements: start with ALLaM via HUMAIN Chat surfaces or ALLaM-2-7B on Azure AI Foundry.

If you are building in Egypt with a general MSA + English use case: Karnak’s Apache 2.0 weights on Hugging Face are an open starting point. For Egyptian-colloquial workloads (call centers, social-media analytics, dialectal chatbots), pair Karnak with BelMasry or use Jais/Falcon as a base with curated Egyptian SFT layers.

If you are building an educational content app, Islamic-knowledge services, or legal/Sharia text comprehension: Fanar 2.0’s published benchmark profile and curated-data thesis make it a strong candidate.

If you are building a multi-dialect general Arabic application and want open weights under a relatively permissive license (the TII Falcon License, with its commercial-revenue threshold): Falcon Arabic + a domain-specific SFT layer is a credible path.

If you are building an enterprise app in UAE + Gulf and want a foundational open Arabic reference: Jais remains a frequently cited model.

In every case: layer your own labeling on top for your application domain. The base model matters, but the curated data specific to your application matters more.

References

[1] Middle East AI News, “HUMAIN Chat goes live powered by ALLaM 34B LLM” (2025). Supports HUMAIN ownership of ALLaM roadmap, ALLaM 34B powering HUMAIN Chat, real-time web search and multi-dialect speech features.

[2] ITIDA, “Egypt Launches Karnak: National AI Language Model at Ai Everything MEA 2026” (Feb 11, 2026). Supports Karnak launch date, AIC sponsor, companion applications list (SIA, BelMasry, AcQua, Torgoman, Loghat).

[3] Middle East AI News coverage referencing BelMasry as AIC’s NLP engines for Egyptian colloquial Arabic (Feb 2026). Supports BelMasry as a separate AIC product for Egyptian colloquial, distinct from Karnak.

[4] QCRI, “Fanar-2-27B-Instruct” model card on Hugging Face (Dec 2025). Supports Fanar 2.0 base model (Gemma-3-27B), 27B parameters, ~166B continual pretraining tokens, 75,000 H100 GPU-hours, ArabicMMLU 74.67%, MMLU 78.89%, Belebele 86.81%, GSM8K 93.70%, Apache 2.0 license, MSA + Gulf/Levantine/Egyptian dialect coverage, SFT 4M / DPO 280K.

[5] Bari et al., “ALLaM: Large Language Models for Arabic and English”, arXiv:2407.15390 (July 2024). Supports ALLaM architecture (autoregressive decoder-only with Arabic vocabulary expansion, bilingual pretraining), state-of-the-art results on MMLU Arabic, ACVA, Arabic Exams.

[6] Applied Innovation Center, “Karnak” model card on Hugging Face (2026). Supports Karnak base model (Qwen3-30B-A3B-Instruct-2507), ~40B parameter count after depth extension, Arabic-optimized tokenizer, 20,000-token safe context, Apache 2.0 license, multi-stage training pipeline.

[7] Middle East AI News, “Qatar announces Fanar 2.0 Arabic AI model” (Dec 9, 2025). Supports Fanar 2.0 announcement date and World Summit AI Doha venue.

[8] Microsoft, “Introducing SDAIA and Their Latest Arabic LLM on Azure AI Model Catalog”. Supports ALLaM-2-7B-instruct availability on Microsoft Azure AI Foundry.

[9] Microsoft Azure AI Foundry, “ALLaM-2-7b-instruct” model catalog page. Supports ALLaM-2-7B pretraining recipe (4T English tokens + 1.2T mixed Arabic/English) and the 500B-token Arabic dataset claim.

[10] Middle East AI News, “Qatar’s national AI platform’s powerful upgrade explained”. Supports Fanar 2.0’s “~8x fewer tokens than Fanar 1.0” framing and the benchmark deltas (ArabicMMLU +7.3, Belebele +3.5, MMLU +7.6, Belebele dialectal).

[11] Inception (G42), “G42 Sets New Benchmark for Arabic Large Language Models with the Release of JAIS 30B” (Nov 9, 2023). Supports Jais 30B release date and training token composition (126B Arabic + 251B English + 50B code).

[12] TII, “Falcon 3” announcement and Falcon Arabic page. Supports Falcon 3 release date (Dec 17, 2024), 14T training tokens, Falcon Arabic built on Falcon 3-7B with MSA + dialect data, TII Falcon License terms.

Limitations & disclaimer

Limitations of this analysis. This post reflects Annota8's reading of publicly available evidence as of its last-modified date. Vendor positioning, regulatory frameworks, benchmark numbers, and program scope can change without notice. Where numeric ranges are cited, those numbers are reproducible from the source linked in the post's References section — Annota8 has not independently re-run the benchmarks unless explicitly stated in the post.

Privacy & legal posture. Annota8 is an early-stage AI data operations company in soft launch. We do not currently hold SOC 2, ISO 27001, PDPL certification, or any other third-party security or privacy certification. We design with PDPL principles in mind and can sign a DPA modelled on the EU SCC template. Specific compliance posture for your engagement is available on request from [email protected].

Nothing in this post is legal, tax, or investment advice. Regulatory citations should be verified with counsel in your jurisdiction. Vendor names mentioned in this post are referenced as industry-landscape context only — Annota8 is not asserting a comparative product claim, a customer relationship, or any other affiliation with any platform named, unless that affiliation is explicitly stated.

Reach the team: [email protected] · annota8.ai