
Tamazight + Berber NLP for the Maghreb: an under-covered third language

A language family, not a single tongue

Tamazight is not one language. It is a branch of the Afro-Asiatic family, parallel to Semitic (which includes Arabic), with internal diversity comparable to the Romance languages — and treating it as a single corpus is the first mistake most NLP teams make.

The varieties that matter for commercial work in 2026 include Tashelhit, Central Atlas Tamazight, Tarifit, Kabyle and Tuareg (Tamasheq).

A model trained on Kabyle social media data, which is what most academic Berber NLP papers use, will not handle a Tashelhit voice query from Agadir or a Tarifit query from Nador. The varieties are mutually unintelligible in many cases. Vendors that pretend otherwise lose SNRT and regional broadcaster procurement competitions to teams that did the work.

The Tifinagh script — not Arabic, not Latin

The script question is the second place vendors get caught. Tamazight in 2026 is written in three different scripts (Tifinagh, Latin and Arabic) depending on the country, the institution and the speaker.

A vendor that handles only one script will fail half the market. A serious Tamazight pipeline needs script detection, script normalization, transliteration between all three, and the ability to produce output in whichever script the deployment context requires (Tifinagh for Moroccan public-sector, Latin for Algerian and diaspora media, Arabic-script for older religious or rural content).
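As a minimal sketch of the script-detection step, assuming that counting letters by Unicode character name is good enough for a first pass, the following uses only Python's standard `unicodedata` module. The function names `script_of` and `detect_script` are illustrative, not part of any existing library, and a production detector would also need to handle mixed-script and very short inputs.

```python
import unicodedata
from collections import Counter

# The three scripts discussed in this post; everything else is "OTHER".
SCRIPTS = ("TIFINAGH", "LATIN", "ARABIC")

def script_of(ch):
    """Return the script of one character, or None for non-letters."""
    if not ch.isalpha():
        return None
    # Unicode character names begin with the script name,
    # e.g. "TIFINAGH LETTER YA", "LATIN SMALL LETTER GAMMA".
    first_word = unicodedata.name(ch, "").split(" ", 1)[0]
    return first_word if first_word in SCRIPTS else "OTHER"

def detect_script(text):
    """Return (dominant_script, per_script_letter_counts) for a string."""
    counts = Counter(s for s in map(script_of, text) if s)
    if not counts:
        return "UNKNOWN", counts
    return counts.most_common(1)[0][0], counts

# The same word in three scripts, written as escapes so the example
# does not depend on local font support for Tifinagh.
samples = [
    "\u2d5c\u2d30\u2d4e\u2d30\u2d63\u2d49\u2d56\u2d5c",  # Tifinagh
    "Tamazi\u0263t",                                      # Latin
    "\u062a\u0645\u0627\u0632\u064a\u063a\u062a",        # Arabic script
]
for s in samples:
    print(detect_script(s)[0])
```

Returning the per-script counts alongside the dominant script lets a caller flag mixed-script input instead of silently forcing it into one bucket.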

This is a non-trivial engineering problem. The Tifinagh Unicode block is well-defined but still patchily supported in fonts and rendering libraries. The Latin Berber extensions overlap with extensions used in other African transcription systems, which causes encoding errors. Arabic-script Tamazight has no single standardized orthography — it varies by manuscript tradition.
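One concrete instance of the Latin-script encoding problem can be sketched as follows. The sketch assumes two common sources of mismatch: dotted letters such as ḍ arriving either precomposed or as a base letter plus a combining dot, and Greek gamma or epsilon typed in place of the Latin letters ɣ and ɛ. The `LOOKALIKES` table and the function name are hypothetical and deliberately incomplete; a real pipeline would need a reviewed mapping.

```python
import unicodedata

# Hypothetical, partial lookalike table: Greek letters sometimes typed
# in place of the Latin letters used in Latin-script Berber orthography.
LOOKALIKES = str.maketrans({
    "\u03b3": "\u0263",  # Greek small gamma   -> Latin small gamma
    "\u03b5": "\u025b",  # Greek small epsilon -> Latin small open e
})

def normalize_latin_berber(text):
    """NFC-compose combining sequences, then map known lookalikes."""
    composed = unicodedata.normalize("NFC", text)
    return composed.translate(LOOKALIKES)

# "d" + combining dot below becomes the single code point U+1E0D.
print(normalize_latin_berber("ad\u0323") == "a\u1e0d")
# Greek gamma is replaced by Latin gamma.
print(normalize_latin_berber("Tamazi\u03b3t") == "Tamazi\u0263t")
```

The lookalike mapping should only run on text already identified as Latin-script Tamazight, since applying it blindly would corrupt genuine Greek text.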

Official status: Morocco 2011, Algeria 2016

This is not an academic exercise. The legal status of Tamazight in 2026:

Morocco. The 2011 constitutional revision (Article 5) made Tamazight an official language of the state alongside Arabic[^1]. The 2019 organic law 26-16 set out the implementation timeline for integrating Tamazight into education, justice, public administration, media and signage, with implementation deadlines of 5 to 15 years specified in Article 31[^3]. Implementation is uneven, but it is in motion. Public-school Tamazight instruction is rolling out under the organic law. Government websites are required to provide Tamazight versions. SNRT (Société Nationale de Radiodiffusion et de Télévision) runs Tamazight TV (SNRT 8, launched March 2010), broadcasting in Tashelhit, Tarifit and Central Atlas Tamazight[^7].

Algeria. The February 2016 constitutional revision made Tamazight a national and official language, though Arabic continues to be designated the language of the state[^2]. The implementation pace has differed from Morocco’s; the public broadcaster EPTV (Établissement Public de Télévision) maintains Berber-language programming, primarily in Kabyle, and Tamazight is taught in many wilayat as an optional subject.

Libya. No constitutional official status. Tamazight (primarily Nafusi and Zuwara varieties) has been used in some local administration in the Nafusa region post-2011, but national-level recognition has not happened.

Tunisia, Mauritania, Mali, Niger, Egypt. Various community-level presence, no constitutional status (though Mali and Niger recognize Tuareg/Tamasheq as a national language with implementation support).

The procurement implication: any vendor pitching the Moroccan or Algerian public sector on an AI system that handles “national languages” but has no Tamazight story is presenting an incomplete bid. That is not a marketing observation — it is a constitutional one.

The standardization machinery: IRCAM, IPAC, university work

The institutional ecosystem behind Tamazight standardization matters because public-sector buyers look at it.

IRCAM (Institut Royal de la Culture Amazighe / Royal Institute of Amazigh Culture) — established in Rabat in 2001 by royal dahir (Dahir n° 1-01-299, signed 17 October 2001 by King Mohammed VI)[^4]. The central authority in Morocco for Tamazight standardization. Has produced the Neo-Tifinagh script encoding (Tifinagh Unicode block U+2D30–U+2D7F, included in Unicode 4.1 released March 2005)[^6], reference grammars, dictionaries, and the standard Tamazight that is taught in Moroccan public schools (which draws primarily from Central Tamazight with influences from Tashelhit and Tarifit)[^5]. IRCAM publishes ongoing lexicographic and corpus work and is a non-optional reference point for any serious Moroccan deployment.

HCA (Haut Commissariat à l’Amazighité / High Commission for Amazighity) and the Algerian Academy of the Amazigh Language (constitutionally created in 2016)[^2], plus University of Béjaïa and Mouloud Mammeri University of Tizi Ouzou, work on Kabyle corpus development, the standard Kabyle orthography, and educational materials.

Mohammed V University in Rabat and Cadi Ayyad University in Marrakech have NLP research groups producing Tamazight datasets and benchmarks. So does Université Mohammed Premier in Oujda for Tarifit work.

Notable academic resources and benchmarks:

These are all useful, all incomplete, and none of them is at the scale of what commercial Arabic NLP works with. The largest Tamazight datasets are at most low single-digit millions of tokens. Arabic models train on hundreds of billions. The gap is real.

What 2026 public-sector AI deployment actually demands

This is where the conversation shifts from academic to commercial.

Moroccan government digital services. The “Digital Morocco 2030” plan and the broader e-government push assume citizen-facing services in Arabic, French and Tamazight. A chatbot for Caisse Nationale de Sécurité Sociale, an IVR for a regional hospital, a public-procurement portal — any of these needs to handle Tamazight queries if the deployment is national. Vendors that show up with Arabic-only stacks lose to vendors who can demonstrate at least a credible Tamazight roadmap.

Algerian government digital services. Slower rollout but moving in the same direction. The Ministry of Post and Telecommunications and the Wilaya of Tizi Ouzou in particular have signaled Kabyle support requirements in tender language.

Broadcasting compliance. SNRT Tamazight TV[^7] and Algerian regional channels need ASR, subtitle generation, content moderation tooling and recommender systems that handle Tamazight content. Most commercial captioning vendors do not.

Education-system rollout. Moroccan and Algerian education ministries are deploying digital learning tools at scale, and Tamazight-language curriculum content is a hard requirement. This pulls in OCR for Tifinagh-script materials, TTS for accessibility, grammar-checking tooling.

Diaspora media and search. The Kabyle diaspora in France, Belgium and Canada is digitally active. Any media or social platform serving the Maghreb diaspora has Tamazight-language content to handle.

The point: the commercial demand is not hypothetical. The supply of qualified vendors is the bottleneck.

What Annota8 can and cannot do here — honestly

This is the part most vendor blog posts skip. I will not.

What we can do today. We have a linguistics-PhD-tier annotation workforce in Cairo working in Arabic, including specialists in Maghrebi Arabic (Moroccan Darija, Algerian Derja, Tunisian Derja, Libyan Arabic). We can handle code-switched Arabic-Berber content where the Arabic carries most of the signal, for example a Moroccan customer-service transcript that has occasional Tashelhit phrases. We can do script detection and script normalization across Arabic, Latin and Tifinagh. We can build evaluation harnesses for Tamazight model outputs using IRCAM’s published reference materials as ground truth.

What we cannot do today. We do not have native Tashelhit, Tarifit or Kabyle annotators in our Cairo workforce at scale. For projects that need primary Tamazight annotation — say, building a Kabyle ASR training set, or annotating Tashelhit medical-consultation transcripts — we need to either subcontract or build out a Morocco-based or Algeria-based workforce. That is a commitment, not a quick fix.

What we are exploring for 2026. We have started conversations with annotation partners in Rabat and Algiers and with linguistics faculty at Mohammed V University and University of Béjaïa about a Tamazight-specific workforce expansion. The economics are different from Cairo: pricing in Rabat runs roughly 1.5 to 2x Cairo, in MAD (pegged to a euro-dollar basket), while Algiers, priced in DZD, runs closer to Cairo. The qualified-linguist supply for each Tamazight variety is more geographically distributed than the Arabic linguist supply, concentrated separately around the Tashelhit-speaking, Tarifit-speaking and Central-Tamazight-speaking regions, so workforce-building here is a multi-city build, not a single-hub model. This is a real expansion, not a marketing claim.

What buyers should expect from us in the near term. Honesty about coverage. A clear statement of which Tamazight variety we can work on. A clear statement of which script we can produce. If the project demands native Kabyle speakers and we do not have them yet, we will say so rather than try to bluff it with second-language speakers. The cost of bluffing in this market is one bad delivery to SNRT or EPTV, and the door closes for years.

What to ask a vendor — a short checklist

If you are buying an Arabic / Maghreb NLP system in 2026 and Tamazight matters, ask:

  1. Which Tamazight varieties does your system support? (Tashelhit / Central / Tarifit / Kabyle / Tuareg / others?)
  2. Which scripts? (Tifinagh / Latin / Arabic — all three?)
  3. Where is your Tamazight annotation workforce based? (Cairo will not be a credible answer for primary Tamazight work.)
  4. What is your relationship to IRCAM or to the Algerian Tamazight academic community? (Have you used IRCAM corpora? Cited them? Reviewed your outputs against IRCAM grammar?)
  5. What is your performance gap between your strongest variety (usually Kabyle) and your weakest? Quantify it.
  6. Can you handle code-switching between Tamazight and Arabic and French in the same utterance? (Maghreb users do this constantly.)
  7. What is your TTS coverage? Most Tamazight TTS today sounds like a Tunisian Arabic speaker reading transliterated text — that is unacceptable for broadcast or for government services.

If a vendor cannot answer those seven questions with specificity, they are pitching aspirations, not capability.
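Question 5 asks for a number, so here is one hedged way a buyer could compute it for an ASR system. The sketch assumes word error rate (WER) as the metric and pools edits per variety over a set of (reference, hypothesis) transcript pairs. The function names and the placeholder tokens in the usage example are illustrative only and are not real evaluation data.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def wer_by_variety(samples):
    """samples maps variety name -> list of (reference, hypothesis) strings."""
    scores = {}
    for variety, pairs in samples.items():
        edits = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
        words = sum(len(r.split()) for r, _ in pairs)
        scores[variety] = edits / words if words else float("nan")
    return scores

def variety_gap(scores):
    """Return (best_variety, worst_variety, absolute WER gap)."""
    best = min(scores, key=scores.get)
    worst = max(scores, key=scores.get)
    return best, worst, scores[worst] - scores[best]

# Placeholder tokens stand in for real transcripts.
samples = {
    "kabyle": [("a b c d", "a b c d")],
    "tarifit": [("a b c d", "a x c")],
}
scores = wer_by_variety(samples)
print(scores, variety_gap(scores))
```

Pooling edits over all reference words in a variety, rather than averaging per-utterance WER, keeps short utterances from dominating the score.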
