Open-weight Arabic embeddings in 2026 — what’s available + production tradeoffs
Why this is harder in Arabic than in English
In English, picking an embedding model in 2026 is a 30-minute decision. You look at the MTEB leaderboard, pick the top-3, run them against your retrieval set, and ship the best one. In Arabic, the same workflow produces models that look great on benchmarks and behave badly in production.
Three reasons:
- Most public Arabic eval sets are translated MMLU-style benchmarks. They use Modern Standard Arabic (MSA), clean text, no diacritics, no code-switching, no dialect. Your production traffic is dialect, code-switched, often unvocalized, sometimes Latinized (“arabizi”). The benchmark winner is often the production loser.
- The Arabic embedding ecosystem is bimodal. There are Arabic-specialist BERT-family models trained on Arabic corpora (AraBERT, CAMeLBERT, MARBERT, ARBERTv2), and there are multilingual embedders that cover Arabic as one of 100+ languages (multilingual-e5, BGE-M3, Cohere embed-multilingual v3). The Arabic-specialists win on classification and dialect tasks; the multilingual models win on retrieval at modern context lengths.
- Sovereignty is a real constraint in the Gulf. “Just call the OpenAI API” is not a deployment option for a Saudi government RAG system. Open-weight, self-hostable, in-Kingdom inference is a hard requirement for an increasing share of the buyer landscape — see in-Kingdom vs sovereign data residency myths.
The framework below walks through what is actually available and what each model is good for.
The Arabic-specialist BERT family (open weights, 512 tokens, classification-strong)
These are the four canonical Arabic-trained transformer encoders. Each was a research output from a MENA-affiliated lab and remains widely used in Arabic NLP papers.
- AraBERT (Antoun et al., AUB MIND Lab, 2020) [6]. The first credible Arabic-pretrained BERT. Trained on a curated Arabic news + Wikipedia corpus. Multiple variants (v0.1, v0.2, large, twitter). Strong baseline for MSA classification, named entity recognition, sentiment.
- CAMeLBERT (Inoue et al., NYU Abu Dhabi, 2021) [7]. The CAMeL Lab release; published with dialect-aware variants (MSA, DA dialectal, CA classical, mix). Particularly useful when you care about classical Arabic or about explicit register control.
- MARBERT (Abdul-Mageed et al., UBC-NLP, 2021) [8]. Trained on a massive Twitter Arabic corpus, with sister model ARBERT trained on MSA. Strong on dialect identification and social-media Arabic.
- ARBERTv2 (UBC-NLP, 2022) [9]. Successor to ARBERT with a larger corpus, modern tokenizer treatment, and improved morphology handling.
All four are open-weight, all four are roughly BERT-base or BERT-large in parameter count, all four cap out at 512 tokens of context. That last point is the production blocker. A 2026 RAG system commonly chunks at 1K–2K tokens; a 512-cap encoder forces you to over-chunk and lose document-level coherence.
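To make the over-chunking cost concrete, here is a minimal sketch that counts how many overlapping windows a document needs under a 512-token cap versus an 8192-token cap. The 3,000-token document length and the 64-token overlap are illustrative assumptions, not figures from any of the model cards, and the sketch ignores the few positions a real tokenizer reserves for special tokens.

```python
def chunk_spans(n_tokens, window, overlap=0):
    """Return (start, end) token offsets for overlapping windows over a document."""
    if window <= overlap:
        raise ValueError("window must be larger than overlap")
    spans = []
    start = 0
    step = window - overlap
    while start < n_tokens:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end == n_tokens:
            break  # the last window reached the end of the document
        start += step
    return spans


# Hypothetical 3,000-token document with a 64-token overlap between chunks.
short_cap = chunk_spans(3000, window=512, overlap=64)
long_cap = chunk_spans(3000, window=8192, overlap=64)
print(len(short_cap), len(long_cap))
```

Under these assumptions the 512-token encoder splits the document into several fragments that are embedded independently, while the 8192-token encoder sees it as a single passage.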
When the Arabic-specialists win in production: classification, dialect identification, sentiment, intent detection, NER, anywhere the input is naturally short and the task is supervised. They also remain the right base for fine-tuning a custom embedding for a specialized Arabic domain (legal, medical, Sharia) when you have the labeled pairs.
Open-weight multilingual embedders that cover Arabic well
This is where modern production Arabic RAG actually lives. These are general-purpose multilingual sentence/passage encoders, all open-weight, all self-hostable, and most with context windows materially larger than 512 tokens (multilingual-e5 is the exception).
- multilingual-e5 (Microsoft, Wang et al., 2024) [10]. Three sizes (small, base, large), all built on an XLM-RoBERTa backbone. 512-token context across the family; long texts get truncated. Strong out-of-the-box Arabic retrieval. Trained with weakly-supervised contrastive pre-training plus supervised fine-tuning. The “boring works” default in 2026 if your chunks fit inside 512 tokens.
- BGE-M3 (BAAI, 2024) [1]. Multilingual, multi-functional (supports dense, sparse, and multi-vector retrieval in one model), up to 8192 token context. Best in class for retrieval across MTEB multilingual subsets that include Arabic. Slightly heavier to operate than multilingual-e5 because the multi-functional output adds inference cost, but it is the legitimate long-context open-weight choice.
- JinaAI v3 multilingual embedding (Jina AI, 2024) [11]. Strong multilingual coverage, up to 8192 token context, task-conditioned embeddings via task LoRA adapters (you tell it whether you are embedding a query or a document, clustering, classification, etc.). Production-friendly licensing.
- Nomic embed (multilingual variant) [12]. Open-weight, 8192 token context, good general multilingual coverage. Less Arabic-specific tuning than BGE-M3 but credible as an alternative.
When the multilingual embedders win: anywhere you need long context (BGE-M3, JinaAI v3, Nomic embed all go to 8192 tokens), anywhere you need a single model serving Arabic + English + French (Maghrebi code-switching, Gulf code-switching), anywhere sovereignty requires self-hosting on in-Kingdom infrastructure.
Closed-API Arabic-capable embeddings
These are the credible commercial APIs in 2026 that handle Arabic well enough for production.
- OpenAI text-embedding-3-large / text-embedding-3-small [3]. Strong Arabic out of the box, dimensionality-reduction support via Matryoshka Representation Learning, 8191 token context. Closed weights, US inference, no fine-tune option for embeddings. Fast time-to-prototype but no sovereignty story.
- Cohere embed-multilingual v3 [2]. Best-in-class commercial multilingual coverage, 512 token cap (the one cap to watch), strong on Arabic, 100+ languages. The “Arabic just works” API choice.
- Voyage AI multilingual-2 [4]. Newer multilingual entrant with strong Arabic retrieval scores on internal evals. 32K context. Credible alternative to Cohere.
- Anthropic [5]. No public embedding API as of 2026. Anthropic’s docs explicitly redirect users to Voyage AI for embeddings. If you are building on Claude, you are pairing it with a third-party embedder.
When the closed APIs win: when sovereignty is not a constraint, when you do not need to fine-tune the embedder, when your team does not want to operate GPU inference, and when you accept that the embedder is now a vendor-locked dependency.
The decision dimensions that actually matter
Six dimensions, in the order they usually pin the choice:
1. Sovereignty
If the deployment must run on in-Kingdom infrastructure (KSA government, Saudi healthcare, sovereign cloud tenant for a regulated bank), closed APIs are out. You are picking from BGE-M3, JinaAI v3, Nomic embed, multilingual-e5, AraBERT, CAMeLBERT, MARBERT, or ARBERTv2. See the HUMAIN 2026 procurement practical read for what that buyer reality looks like.
2. Context length
If your chunks are larger than 512 tokens (which they should be for modern RAG with semantic chunking), the Arabic-specialist BERTs are awkward — and so is multilingual-e5, which also caps at 512 because of its XLM-RoBERTa backbone. Default to a long-context multilingual embedder: BGE-M3, JinaAI v3, Nomic embed, or Voyage AI multilingual-2. Cohere embed-multilingual v3 also caps at 512, which is the one thing to watch with that API.
3. Quality per workload
The relevant benchmarks for Arabic in 2026:
- Arabic STS-B (semantic textual similarity in Arabic).
- mr-tydi-ar (multilingual retrieval, Arabic subset) [13].
- AraSciQ retrieval (Arabic scientific question answering, where it exists).
- Internal eval set — the one you build yourself, on production-realistic Arabic text. This is the one that matters.
We see multilingual-e5-large and BGE-M3 trade wins across these benchmarks. The differences are usually inside benchmark noise. The differences on a properly built internal Arabic eval set are not.
Note: the context-window gap matters here. multilingual-e5-large is competitive on benchmark retrieval at short input lengths but is bounded by 512 tokens. BGE-M3 is the model you reach for when chunks exceed that.
4. Latency + cost
Closed-API embedding pricing in 2026 is published per million tokens (not per query): OpenAI text-embedding-3-large at roughly 0.13 USD per million tokens and Cohere embed-multilingual v3 at roughly 0.10 USD per million tokens at list. A “query” in production typically bundles a chunked passage plus the user query, so cost-per-million-queries depends on your average chunk size and how aggressively you batch. Self-hosted open-weight embeddings on commodity GPU inference land in a different regime: directional estimates put them in the tens of dollars per million queries at modest scale, and the comparison with API pricing tilts further toward self-hosting as volume grows. Treat the specific numbers as directional and re-derive them against current vendor pricing pages and your own throughput before procurement.
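The per-token to per-query conversion is simple arithmetic, sketched below. The list prices are the rough figures quoted above; the 1,000 tokens per query and the 1,000 USD monthly fixed cost for a self-hosted GPU are hypothetical placeholders to replace with your own measurements.

```python
def api_cost_per_million_queries(tokens_per_query, usd_per_million_tokens):
    """USD spent on one million queries at a per-million-token list price."""
    # 1e6 queries * tokens_per_query tokens, billed per 1e6 tokens:
    # the two factors of 1e6 cancel.
    return tokens_per_query * usd_per_million_tokens


def breakeven_queries_per_month(fixed_monthly_usd, tokens_per_query, usd_per_million_tokens):
    """Monthly query volume at which a fixed self-hosting bill equals the API bill."""
    usd_per_query = tokens_per_query * usd_per_million_tokens / 1_000_000
    return fixed_monthly_usd / usd_per_query


TOKENS_PER_QUERY = 1_000  # assumption: passage chunk plus user query
openai_cost = api_cost_per_million_queries(TOKENS_PER_QUERY, 0.13)
cohere_cost = api_cost_per_million_queries(TOKENS_PER_QUERY, 0.10)

# Assumption: 1,000 USD per month of fixed GPU cost, ignoring ops headcount.
breakeven = breakeven_queries_per_month(1_000, TOKENS_PER_QUERY, 0.13)
print(openai_cost, cohere_cost, round(breakeven))
```

The break-even volume moves linearly with both the fixed cost and the tokens per query, which is why the average chunk size has to be measured before any procurement comparison.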
5. Customization
Open-weight embedders can be fine-tuned on your domain (legal Arabic, medical Arabic, dialect-stratified retrieval pairs). Closed APIs cannot — or only via limited adapter APIs that do not reach the embedding layer. If your domain is materially different from generic Arabic (Sharia-compliant finance, Saudi labor law, GCC medical), fine-tuning is the lever that closes the gap.
6. Code-switching tolerance
For Maghrebi Arabic-French or Gulf Arabic-English production traffic, code-switching is constant. The Arabic-specialist BERTs degrade hard on code-switched input. The multilingual embedders (BGE-M3, JinaAI v3, multilingual-e5) handle it natively because they were trained on multilingual web data that includes code-switched text.
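Before weighing this dimension it helps to measure how much of your traffic is actually code-switched. The sketch below is a rough heuristic, not a language identifier: it flags a query as script-mixed when both Arabic-block and Latin-script characters make up a meaningful share of it. The 20% threshold is an arbitrary assumption, and arabizi (Arabic written entirely in Latin script) will not be caught by a script check.

```python
def script_profile(text):
    """Count characters in the Arabic Unicode block and Latin-script letters."""
    arabic = sum(1 for ch in text if "\u0600" <= ch <= "\u06FF")
    # Basic Latin through Latin Extended-B, so accented French letters count.
    latin = sum(1 for ch in text if ch.isalpha() and ch <= "\u024F")
    return arabic, latin


def is_script_mixed(text, min_share=0.2):
    """True when the minority script holds at least min_share of the counted characters."""
    arabic, latin = script_profile(text)
    total = arabic + latin
    if total == 0:
        return False
    return min(arabic, latin) / total >= min_share


queries = [
    "\u0623\u0631\u064a\u062f booking \u0641\u0646\u062f\u0642",  # Arabic with an English word
    "\u0623\u064a\u0646 \u0627\u0644\u0641\u0646\u062f\u0642",    # Arabic only
    "where is the hotel",                                          # English only
]
mixed_share = sum(is_script_mixed(q) for q in queries) / len(queries)
print(mixed_share)
```

Running this over a sample of production queries gives a first estimate of how heavily the eval set should be weighted toward code-switched input.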
2026 picks by workload
This is the operator answer — what we actually recommend on labeling-scope calls, depending on the workload.
| Workload | 2026 recommendation | Why |
|---|---|---|
| High-volume sovereign in-Kingdom RAG, long chunks | BGE-M3, self-hosted | Open weights, 8192-token context, in-region inference, fine-tunable |
| High-volume sovereign in-Kingdom RAG, short chunks | multilingual-e5-large, self-hosted | Open weights, strong Arabic retrieval, but capped at 512 tokens |
| Quick start, non-sovereign, low ops headcount | Cohere embed-multilingual v3 | Strong Arabic, one API call, no infra (512 token cap) |
| Tashkeel-sensitive (religious, classical, Qur’anic, vocalized) | AraBERT or fine-tuned MARBERT on Tashkeela | Multilingual embedders handle unvocalized; specialists fine-tuned on Tashkeela handle tashkeel |
| Code-switched Arabic-English-French (Maghrebi, Gulf) | BGE-M3 or JinaAI v3 | Multilingual training data exposes the model to code-switching natively; long context |
| Arabic classification, dialect ID, NER, intent | MARBERT, ARBERTv2, CAMeLBERT-DA | Specialist Arabic BERTs still win on supervised short-input tasks |
| Closed-API prototype on Claude or GPT base | OpenAI text-embedding-3-large or Voyage AI multilingual-2 | Both handle Arabic, both are fast time-to-prototype |
| Fine-tunable domain embedder (legal, medical, Sharia) | AraBERT or BGE-M3 as base | Both open-weight, both with labeled-pair fine-tune recipes |
Why benchmark-only choice misses production reality
The single most common mistake we see: an FM lab or enterprise team picks an embedding model by reading the MTEB Arabic subset leaderboard, deploys it, and ships an Arabic RAG system that retrieves badly on real customer queries.
The problem is not the leaderboard. The problem is that the leaderboard is built on:
- Translated benchmarks (MMLU translated to Arabic, English-origin STS pairs translated to MSA). Production Arabic is not translated text; it is natively-written dialect and code-switched social text.
- Clean Modern Standard Arabic. Production Arabic is dialect, social-media noise, mixed orthography (with and without hamza, with and without tashkeel), and code-switching.
- Single-sentence semantic similarity. Production retrieval is multi-paragraph passages against natural-language queries.
The fix is to build your own eval set. We cover what that looks like in the next section. For the broader pattern, see Arabic LLM benchmark and Arabic LLM commercial failure diagnosis.
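One cheap way to probe the mixed-orthography point is to generate orthographic variants of each eval query and check whether the embedder retrieves the same documents for all of them. The sketch below covers only two variations, tashkeel removal and alef-hamza folding, and the function names are illustrative rather than taken from any library.

```python
import re

# Fathatan through sukun (U+064B to U+0652) plus superscript alef (U+0670).
TASHKEEL = re.compile("[\u064B-\u0652\u0670]")

# Alef with hamza above, hamza below, and madda, all folded to bare alef.
ALEF_FOLD = str.maketrans({"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627"})


def strip_tashkeel(text):
    """Remove short-vowel and related diacritic marks."""
    return TASHKEEL.sub("", text)


def fold_alef_hamza(text):
    """Replace hamza-carrying alef forms with bare alef."""
    return text.translate(ALEF_FOLD)


def orthographic_variants(text):
    """Distinct spellings of the same query, for retrieval-stability checks."""
    stripped = strip_tashkeel(text)
    forms = {text, stripped, fold_alef_hamza(text), fold_alef_hamza(stripped)}
    return sorted(forms)


vocalized = "\u0623\u064e\u062d\u0645\u062f"  # a name written with hamza and a fatha
variants = orthographic_variants(vocalized)
print(len(variants))
```

If retrieval results differ materially across the variants of one query, that gap is invisible on clean MSA benchmarks and shows up only on an eval set built this way.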
What annotation work supports embedding eval
This is where the embedding choice intersects with the labeling work we do at Annota8. To run a credible Arabic embedder eval — and to fine-tune one when the eval points there — you need three labeled artifacts:
- Dialect-stratified query-document relevance pairs. Real production queries, in the dialect your users speak, paired with the documents that should and should not be retrieved. Stratified across MSA, Egyptian, Gulf, Levantine, Maghrebi at minimum. Without dialect stratification you cannot tell whether the embedder is failing on retrieval or on dialect coverage.
- Semantic similarity ground truth. Native-Arabic sentence pairs scored on a continuous similarity scale, by Arabic linguists who actually speak the dialect of the source text. This is what calibrates STS-style eval to your production reality.
- Contrast pairs. “These two passages look similar but mean different things”; “these two passages look different but mean the same thing.” Contrast pairs are how you find the embedding failures that single-sentence similarity scoring hides.
All three are work that we routinely scope for FM labs and enterprise teams building Arabic RAG. The architecture choice belongs to you; the labeling design follows from it.
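As a sketch of how the first artifact gets used, the snippet below computes recall@k separately per dialect from labeled relevance pairs. The record layout (`dialect`, `relevant`, `ranked`) and the toy document ids are assumptions for illustration; `ranked` stands in for whatever ordered results your embedder and index return.

```python
from collections import defaultdict


def recall_at_k_by_dialect(examples, k=10):
    """Mean recall@k per dialect over labeled query-document relevance pairs."""
    per_dialect = defaultdict(list)
    for ex in examples:
        relevant = set(ex["relevant"])
        if not relevant:
            continue  # skip queries with no labeled positives
        retrieved = set(ex["ranked"][:k])
        per_dialect[ex["dialect"]].append(len(retrieved & relevant) / len(relevant))
    return {dialect: sum(s) / len(s) for dialect, s in per_dialect.items()}


# Toy data: document ids are hypothetical.
examples = [
    {"dialect": "MSA", "relevant": {"d1"}, "ranked": ["d1", "d7", "d3"]},
    {"dialect": "Gulf", "relevant": {"d2", "d5"}, "ranked": ["d9", "d2", "d4"]},
    {"dialect": "Gulf", "relevant": {"d6"}, "ranked": ["d8", "d3", "d6"]},
]
scores = recall_at_k_by_dialect(examples, k=2)
print(scores)
```

A single pooled recall number would hide the gap between the MSA and Gulf rows here, which is the failure mode dialect stratification is meant to expose.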
For the deeper picture of how labeling for Arabic foundation models compounds, see MENA foundation models training data and the MENA FM-lab training-data lead persona.
How to decide for your application — a checklist
Walk through these in order. The first hard constraint usually pins the choice.
- Sovereignty constraint? → Open-weight only (BGE-M3, JinaAI v3, Nomic embed, multilingual-e5, AraBERT, CAMeLBERT, MARBERT, ARBERTv2).
- Chunks larger than 512 tokens? → Long-context multilingual embedder (BGE-M3, JinaAI v3, Nomic embed, or Voyage AI multilingual-2 if API is OK).
- Fine-tune ambition? → Open-weight (any of the above), with BGE-M3, multilingual-e5, or AraBERT as the most common bases.
- Code-switched production traffic? → Multilingual over Arabic-specialist.
- Tashkeel-sensitive content? → AraBERT or fine-tuned MARBERT on Tashkeela.
- Quick non-sovereign prototype? → Cohere embed-multilingual v3 first; OpenAI text-embedding-3-large second.
If none of these constraints binds, default to BGE-M3 self-hosted if you need long context, multilingual-e5-large self-hosted if your chunks comfortably fit inside 512 tokens. These are the boring 2026 answers that survive most production deployments.
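The checklist can be read as a sequence of filters over the models discussed in this article. The sketch below encodes it that way; the `Model` fields, the `shortlist` function, and the treatment of Voyage's "32K" as 32,000 tokens are illustrative assumptions, and an empty result means the constraints conflict and one of them has to give.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    open_weight: bool
    max_tokens: int
    arabic_specialist: bool


MODELS = [
    Model("BGE-M3", True, 8192, False),
    Model("JinaAI v3", True, 8192, False),
    Model("Nomic embed", True, 8192, False),
    Model("multilingual-e5", True, 512, False),
    Model("AraBERT", True, 512, True),
    Model("CAMeLBERT", True, 512, True),
    Model("MARBERT", True, 512, True),
    Model("ARBERTv2", True, 512, True),
    Model("Voyage AI multilingual-2", False, 32_000, False),  # "32K", approximated
    Model("OpenAI text-embedding-3-large", False, 8191, False),
    Model("Cohere embed-multilingual v3", False, 512, False),
]

# The article's picks for vocalized text (MARBERT assumes a Tashkeela fine-tune).
TASHKEEL_PICKS = {"AraBERT", "MARBERT"}


def shortlist(sovereign, max_chunk_tokens, fine_tune=False,
              code_switched=False, tashkeel_sensitive=False):
    """Apply the checklist constraints as filters and return surviving model names."""
    survivors = []
    for m in MODELS:
        if (sovereign or fine_tune) and not m.open_weight:
            continue  # closed APIs cannot be self-hosted or fine-tuned
        if max_chunk_tokens > m.max_tokens:
            continue  # chunk would be truncated
        if code_switched and m.arabic_specialist:
            continue  # specialists degrade on code-switched input
        if tashkeel_sensitive and m.name not in TASHKEEL_PICKS:
            continue
        survivors.append(m.name)
    return survivors


print(shortlist(sovereign=True, max_chunk_tokens=1500))
```

The shortlist is only a starting point: the surviving candidates still need to be compared on an internal eval set before one is deployed.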
References
[1] BAAI, “BGE-M3” model card: multilingual, multi-functional (dense, sparse, multi-vector retrieval), up to 8192 token context, 100+ languages. https://huggingface.co/BAAI/bge-m3
[2] Cohere, “embed-multilingual-v3.0” docs: 512 token max input, 1024 dimensions, 100+ languages. https://docs.cohere.com/docs/cohere-embed
[3] OpenAI text-embedding-3 (large/small): 8191 token context, Matryoshka Representation Learning for flexible dimensionality, released January 2024. Cross-referenced overview: https://www.pinecone.io/learn/openai-embeddings-v3/
[4] Voyage AI, “voyage-multilingual-2” announcement, June 10, 2024: 32K context multilingual embedding model. https://blog.voyageai.com/2024/06/10/voyage-multilingual-2-multilingual-embedding-model/
[5] Anthropic platform documentation: embeddings page redirects users to Voyage AI; no first-party embedding API as of 2026-05. https://platform.claude.com/docs/en/build-with-claude/embeddings
[6] Antoun, Baly, Hajj, “AraBERT: Transformer-based Model for Arabic Language Understanding,” OSACT 2020 (LREC). AUB MIND Lab. https://aclanthology.org/2020.osact-1.2/ and https://sites.aub.edu.lb/mindlab/
[7] Inoue, Alhafni, Baimukan, Bouamor, Habash, “The Interplay of Variant, Size, and Task Type in Arabic Pre-trained Language Models,” WANLP 2021. CAMeL Lab, NYU Abu Dhabi. https://aclanthology.org/2021.wanlp-1.10/
[8] Abdul-Mageed, Elmadany, Nagoudi, “ARBERT & MARBERT: Deep Bidirectional Transformers for Arabic,” ACL 2021. UBC-NLP. https://aclanthology.org/2021.acl-long.551/
[9] ARBERTv2 model card, UBC-NLP; released alongside the ORCA benchmark paper (Elmadany, Nagoudi, Abdul-Mageed, arXiv:2212.10758, December 2022). https://huggingface.co/UBC-NLP/ARBERTv2
[10] Wang, Yang, Huang, Yang, Majumder, Wei, “Multilingual E5 Text Embeddings: A Technical Report,” arXiv:2402.05672 (2024). XLM-RoBERTa backbone; the model card explicitly notes “Long texts will be truncated to at most 512 tokens.” https://huggingface.co/intfloat/multilingual-e5-large
[11] Sturua et al., “jina-embeddings-v3: Multilingual Embeddings With Task LoRA,” arXiv:2409.10173 (2024). 570M parameters, 32 languages, up to 8192 tokens, task-specific LoRA adapters. https://arxiv.org/abs/2409.10173
[12] Nussbaum, Morris, Duderstadt, Mulyar, “Nomic Embed: Training a Reproducible Long Context Text Embedder,” arXiv:2402.01613 (2024). 8192 token context, Apache 2.0. Multilingual variants exist as nomic-embed-text-v2-moe. https://arxiv.org/abs/2402.01613
[13] Zhang, Ma, Shi, Lin, “Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval,” arXiv:2108.08787. Arabic included as one of 11 languages. https://huggingface.co/datasets/castorini/mr-tydi