
Open-source vs proprietary Arabic LLMs in 2026: a practitioner decision framework

The 2026 landscape — what you are actually choosing between

The Arabic LLM market in 2026 is no longer a binary “GPT or nothing.” It is a populated shelf with two distinct halves.

Open-weight regional models — you download the weights, you host the inference, you can fine-tune.

Closed-API frontier models — you call the API, you cannot inspect the weights, and fine-tuning is only available through a vendor pipeline that keeps the resulting weights on the vendor’s infrastructure.

That is the shelf. The question is never “which is best.” It is “which is best for this workload, under these constraints, with this budget.”

Six decision dimensions

1. Cost — per-call API vs amortized hosting

The closed-API price is a clean number per million tokens. The open-weight price is a hosting bill (GPU instances, ops people, model-load time, idle capacity). At low monthly volume the closed API is almost always cheaper in total cost; at high monthly volume the open self-hosted route starts to win, depending on which sovereign region you are running in. Between those two zones you are in the negotiation zone — and the closed-API vendor knows it. (Crossover thresholds are workload-specific; run the numbers against your own traffic profile.)
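As a rough illustration of that crossover, here is a minimal sketch that assumes a flat per-million-token API price and a fixed monthly hosting bill. Every figure and function name is a hypothetical placeholder, not a vendor quote; substitute your own numbers.

```python
def monthly_api_cost(tokens_per_month: float, price_per_million: float) -> float:
    """Closed-API spend: volume times a flat per-million-token price."""
    return tokens_per_month / 1_000_000 * price_per_million


def breakeven_tokens(fixed_hosting_cost: float,
                     price_per_million: float,
                     marginal_cost_per_million: float = 0.0) -> float:
    """Monthly token volume above which self-hosting is cheaper.

    Solves fixed + marginal * v = api_price * v for v, with v in tokens.
    """
    saving_per_million = price_per_million - marginal_cost_per_million
    if saving_per_million <= 0:
        raise ValueError("self-hosting never wins at this marginal cost")
    return fixed_hosting_cost / saving_per_million * 1_000_000


# Hypothetical inputs: replace with your own quotes and traffic profile.
HOSTING_PER_MONTH = 12_000.0  # assumed GPU instances plus ops share
API_PRICE_PER_M = 6.0         # assumed blended input/output price

crossover = breakeven_tokens(HOSTING_PER_MONTH, API_PRICE_PER_M)
print(f"Self-hosting breaks even at about {crossover:,.0f} tokens per month")
```

The model deliberately ignores idle capacity, model-load time, and negotiated discounts, all of which move the threshold in practice.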

2. Latency — distance to MENA endpoints

Closed-API frontier models route through US or EU regions today. From a Riyadh or Cairo backend, expect meaningful network round-trip latency on every call, before the model has generated a single token. Open weights hosted in-region (Saudi National AI Cloud, G42 Core42, regional partners) cut that to a fraction of the cross-region path. For interactive voice, real-time translation, and customer-service IVR, users feel the difference.

3. Sovereignty — NDMO classification + CLOUD Act exposure

If your data is classified above the public/general-use tier under NDMO in KSA, or if you are subject to the cross-border transfer provisions of the PDPL Implementing Regulations in Saudi Arabia (or the equivalent rules in the UAE, Qatar, Oman, or Egypt), closed-API frontier models served from US or EU regions are usually disqualified by the data-residency requirement before you even discuss capability.[^8] Open weights hosted in-country are the path. CLOUD Act exposure compounds this: even data physically located in a non-US region can be subject to a US legal order if it is processed by a US-headquartered vendor.

4. Customization depth — prompt-only vs full fine-tune

Out of the box, closed models give you a prompt and a small system message. All three frontier vendors (OpenAI, Anthropic via Bedrock, Google via Vertex AI) now offer fine-tuning, but in every case the resulting weights stay on the vendor’s platform: you cannot deploy them elsewhere, and you do not own the weights file.[^6][^7] Open weights let you run full SFT and full RLHF on your own corpus, in addition to RAG retrieval over it, and you own the resulting weights. If your value comes from proprietary Arabic data that no public model has seen, only the open path lets you fully exploit that.

5. Dialect coverage — open vs closed strengths

Open Arabic models are stronger today on conversational KSA dialect (ALLaM) and on conversational Egyptian dialect (Karnak), because they were trained on intentionally dialect-stratified corpora. Closed frontier models are often stronger on MSA, particularly long-form formal writing, and Egyptian is the dialect they handle best, because it is the most-represented dialect on the public web they were trained on. For Gulf dialects beyond Saudi (Emirati, Qatari, Omani, Bahraini, Kuwaiti), both sides are weaker than the marketing suggests, and the outcome depends mostly on whether your team has done eval work on dialect-specific holdout sets.

6. Compliance + audit trail

Closed-API providers log every call on their side. For an internal-audit conversation that can be a feature (the vendor has the immutable record) or a liability (the vendor has the immutable record). Open self-hosted lets you control the audit trail entirely: you choose what is logged, how long it is retained, who can access it, and whether it is exportable for regulator review. For NDMO-classified workloads or ZATCA-relevant financial data, that level of control is usually required.

The decision matrix — workload × constraint priority

| Workload | Primary constraint | Recommended family |
| --- | --- | --- |
| Government / public-sector Arabic assistant | Sovereignty + audit trail | ALLaM v2 or Falcon Arabic, in-Kingdom hosted |
| Egyptian-dialect customer service / IVR | Dialect quality + latency | Karnak (fine-tuned) or Jais hosted in Cairo |
| Enterprise long-form document drafting (MSA) | Quality ceiling | Claude or GPT (closed API) |
| Internal company assistant on confidential data | Sovereignty + customization | Self-hosted open (ALLaM, Karnak, Jais, Falcon) + RAG |
| Public consumer chat with low margin | Cost at scale | Open self-hosted once volume crosses the workload-specific crossover |
| Multimodal Arabic OCR + reasoning | Capability ceiling | Gemini or Claude (closed) until open multimodal catches up |
| Regulated financial / clinical | Audit + sovereignty + dialect | Open self-hosted, fine-tuned on regulated corpus |
| Prototype + internal R&D | Speed to market | Closed API (Claude / GPT / Gemini), migrate later |

This is not a ranking of which model is “best.” It is a constraint-first map. The same company running a citizen-service portal and an internal R&D sandbox should use different families for the two — and most mature MENA deployments do.

When to mix — hybrid stacks

The most common production pattern I see in 2026 is not pure open or pure closed. It is mixed.

Pattern 1 — Open embeddings + closed inference. Embed your Arabic corpus with an open-weight Arabic embedding model (ALLaM-derived or a Karnak-derived embedding), store in your sovereign vector DB, retrieve relevant context, then send only the assembled prompt to a closed API for high-quality generation. The sovereign data never leaves your region in raw form; only the retrieved context (which you control) is exposed.
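The retrieval half of Pattern 1 can be sketched as follows. This is only an illustration: `embed` is a toy token-count stand-in for a real open-weight Arabic embedding model, the in-memory list stands in for a sovereign vector DB, and the corpus strings are invented English placeholders. No closed API is called; the sketch stops at the assembled prompt.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: bag-of-words token counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Return the k passages most similar to the query (runs in-region)."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, passages: List[str]) -> str:
    """Assemble the only text that would be sent to the closed API."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}"


# Hypothetical corpus; a real one would be Arabic and live in your vector DB.
CORPUS = [
    "refund requests are processed within five working days",
    "branch opening hours are nine to five",
    "loyalty points expire after twelve months",
]
QUERY = "how long do refund requests take"
prompt = build_prompt(QUERY, retrieve(QUERY, CORPUS, k=1))
print(prompt)
```

The retrieved passages do leave the region inside the prompt, so the step between `retrieve` and `build_prompt` is the natural place for whatever redaction or clearance check your classification requires.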

Pattern 2 — Closed for non-sovereign, open for sovereign. Route every request through a classification layer first: restricted-tier under NDMO? → open self-hosted in-region. Public-tier or unclassified? → closed API for quality. The dispatcher logic is 50 lines of code; the compliance benefit is enormous.
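A minimal sketch of such a dispatcher, assuming each request already carries a classification label. The label strings, `LLMRequest`, and `Route` are hypothetical names for illustration and do not reproduce the official NDMO taxonomy.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    OPEN_IN_REGION = "open-self-hosted-in-region"
    CLOSED_API = "closed-api"


# Labels allowed to leave the region. Illustrative strings only.
CLOSED_API_ALLOWED = frozenset({"public", "unclassified"})


@dataclass(frozen=True)
class LLMRequest:
    prompt: str
    data_tier: str  # label assigned by your upstream classification step


def dispatch(request: LLMRequest) -> Route:
    """Send a request to the closed API only if its tier is explicitly allowed.

    Any other label, including a missing or misspelled one, stays in-region.
    """
    tier = request.data_tier.strip().lower()
    if tier in CLOSED_API_ALLOWED:
        return Route.CLOSED_API
    return Route.OPEN_IN_REGION


print(dispatch(LLMRequest("Summarize this press release", "public")))
print(dispatch(LLMRequest("Summarize this case file", "restricted")))
```

The allow-list design is a choice made for this sketch: an unknown label defaults to the in-region route, so a labeling mistake costs some quality rather than a compliance incident.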

Pattern 3 — Open base + closed refinement. Generate a draft with an open Arabic model (fast, in-region, cheap) and refine the cases that need higher quality with a closed model. Cuts closed-API spend substantially while preserving quality on the cases that matter.
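One way to sketch the draft-then-refine flow is a quality gate between the two models. The scoring function, the 0.7 threshold, and all stub functions below are assumptions made so the example runs without any model or network access; in practice the gate would come from your own eval work.

```python
from typing import Callable, Tuple


def draft_then_refine(
    prompt: str,
    draft_fn: Callable[[str], str],
    refine_fn: Callable[[str, str], str],
    score_fn: Callable[[str, str], float],
    threshold: float = 0.7,
) -> Tuple[str, str]:
    """Return (answer, path). Only low-scoring drafts reach the closed model."""
    draft = draft_fn(prompt)
    if score_fn(prompt, draft) >= threshold:
        return draft, "open-only"
    return refine_fn(prompt, draft), "open+closed"


# Hypothetical stubs standing in for the open model, closed model, and scorer.
def stub_draft(prompt: str) -> str:
    return f"draft answer to: {prompt}"


def stub_refine(prompt: str, draft: str) -> str:
    return draft.replace("draft", "refined")


def stub_score(prompt: str, draft: str) -> float:
    # Assumed heuristic: treat long prompts as the hard cases.
    return 0.9 if len(prompt) < 40 else 0.4


print(draft_then_refine("short question", stub_draft, stub_refine, stub_score))
```

Closed-API spend then scales with the share of drafts that fall below the threshold, which is the lever to tune against your eval set.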

Each pattern requires real eval work — none of them survive a “ship it and hope” deployment. That is where curated labeling, dialect-stratified holdout sets, and red-team data come in.

Honest disclosure — where Annota8 fits

Annota8 does annotation work for both halves of this shelf. Open-weight Arabic models need native Arabic SFT pairs, RLHF preference data, dialect-tagged corpora, and red-team adversarial sets to be production-grade. Closed-API customers building on top of frontier models need RAG-corpus curation, eval-set construction, and dialect-stratified quality benchmarking. Both sides need the same underlying input — humans who actually speak the dialect, who understand the domain, who can label at production quality. We do not have a horse in the open-vs-closed race. We have a horse in the get-Arabic-AI-actually-working race.

What to do this quarter

If you are starting a new Arabic LLM project in 2026:

  1. Classify your data first. NDMO level, PDPL applicability, CLOUD Act exposure. The answers eliminate half the shelf before you compare capability.
  2. Build your eval set before you choose a model. Dialect-stratified, domain-specific, drawn from your real users. Without this you cannot tell which family wins for you — you are reading other people’s marketing.
  3. Prototype on the cheapest path. Closed API for the first two weeks, almost always. Validate the use case. Then re-architect for the long-term constraint set.
  4. Plan the hybrid early. Even if you start pure closed or pure open, leave the dispatcher hook in your code from day one. You will need it.
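Step 2 can be made concrete with a per-dialect score breakdown. The sketch below assumes each eval record is a (dialect, is_correct) pair; the records and dialect labels are invented for illustration and are not real results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_dialect(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Accuracy per dialect stratum from (dialect, is_correct) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for dialect, is_correct in records:
        totals[dialect] += 1
        hits[dialect] += int(is_correct)
    return {dialect: hits[dialect] / totals[dialect] for dialect in totals}


# Hypothetical graded outputs from one candidate model.
RECORDS = [
    ("saudi", True),
    ("saudi", False),
    ("egyptian", True),
    ("egyptian", True),
    ("emirati", False),
]

per_dialect = accuracy_by_dialect(RECORDS)
overall = sum(ok for _, ok in RECORDS) / len(RECORDS)
print(per_dialect, overall)
```

Reporting the strata separately matters because the aggregate number can look acceptable while one dialect fails outright.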