Ten years inside the industry

Annota8’s founding team spent more than a decade running AI data operations for a production-grade computer-vision and audio-AI business. Over that decade we evaluated, ran pilots with, partnered with, or otherwise engaged the major annotation platforms and vendor networks in the global ecosystem — the names any data-ops practitioner in the field would recognise: Scale AI, Labelbox, V7, SuperAnnotate, Encord, Kognic, Snorkel, iMerit, Sama, Surge AI, Appen, Toloka, CloudFactory. Across those engagements we bought tools, negotiated MSAs, onboarded vendor workforces, calibrated gold sets, fought QA escalations, and approved invoices in the millions.

Every one of those platforms was built for somewhere else. Most were built for autonomous-vehicle bounding boxes in California, or for content moderation pipelines servicing US social platforms, or for European e-commerce catalog enrichment, or for Chinese factory-floor computer vision. The product roadmaps, the workforce supply chains, the data-residency defaults, the language coverage — all of it pointed somewhere far from our region.

When a MENA AI team showed up with Arabic NLP, dialect ASR, regional image content, or a sovereignty constraint tied to a local data-protection law, the standard response was: “Send us labelled examples and we’ll see what we can do.” The standard outcome: a bespoke project that costs five times the catalog price, takes twice as long to ship, and leaves the customer’s data-residency question unanswered.

After ten years of watching that pattern repeat, the founders decided to stop being customers and become operators.

The gap, in one paragraph

Arabic is not one language. It is a continuum. Modern Standard Arabic (MSA) sits on top for media, legal text, and religious context. Underneath, dozens of regional dialects shift by country, by city, by neighbourhood. Saudi Arabia alone has Hejazi in Jeddah and the western coast (with the classical /q/ pronounced as /g/, softer vowel formants, and lexical influence from centuries of Hajj pilgrim contact), Najdi in Riyadh and the central plateau (with the /j/ shifting to /y/ in several lexical contexts), Eastern Arabic in Dammam and the Eastern Province (closer in features to Bahraini and Qatari speech), and Southern varieties in Asir and Jizan (where Yemeni features bleed across the historical border). Cross one border into Egypt and Cairene and Saidi diverge enough that subtitles are routinely added on national broadcast. Cross into Morocco and Darija itself varies materially between Casablanca, Marrakesh, Fez, and Tangier, and code-switches with French in commercial registers. Iraq, Lebanon, Tunisia, Algeria, Sudan: same pattern, every time.

Global annotation platforms treat “Arabic” as a single option in a language dropdown. The downstream consequence: models trained on that data hear an Egyptian taxi driver and reply in MSA grammar that nobody actually speaks in daily life; hear a Hejazi voice and miss the intent entirely; transcribe a Moroccan call-centre conversation with a word error rate four to six times higher than the same model’s MSA benchmark. Workers contracted as “Arabic speakers” are then asked to QA dialect-stratified data they cannot natively handle. The quality bar collapses. The model degrades. The customer pays again.

Layer on top of the linguistic challenge the operational realities of the region:

None of that ships from a vendor headquartered in San Francisco, London, or Berlin. It is not a critique of those vendors — they are excellent at what they were built to do. It is a structural mismatch: a global platform optimised for a different region cannot be retrofitted into a MENA-native operation. The defaults are wrong all the way down.

Why Annota8, why now

The MENA region is in the middle of a generational AI buildout. National AI strategies in Saudi Arabia (SDAIA, the National Strategy for Data and AI, the Vision 2030 economic transformation), the UAE (the Artificial Intelligence Strategy 2031, G42’s regional model investments), Egypt (the National Council for AI’s roadmap), Qatar (QCRI’s Fanar program), and the GCC’s coordinated regional initiatives have all pushed Arabic-language AI from a research curiosity to a strategic priority. Foundation-model labs in the region (ALLaM from SDAIA, Fanar from QCRI, Jais from G42, Falcon from TII) need annotation infrastructure at scale to ship production-grade Arabic-capable systems.

That infrastructure has to be built here. It cannot be imported from a vendor in another region, retrofitted at the edges, and called sovereign. It has to be designed from the ground up with the region’s linguistic, operational, regulatory, and cultural realities as first-class concerns — not afterthoughts.

That is the gap Annota8 was founded to close.

Our mission

Annota8’s mission is to be an ecosystem enabler for MENA AI. We are building the regional annotation operation that gives every AI team in the Middle East, North Africa, and the broader Arabic-speaking world the same calibre of tooling, workforce, and operational depth that teams in San Francisco have taken for granted for the last decade, but designed for the realities of this region, in this language, with this culture, and served from inside the regulatory boundary the customer operates in.

Concretely, that means three commitments:

  1. Region-native by default, not retrofit. Arabic, both MSA and the major dialects, is first-class. Operational rhythms (prayer times, Ramadan, Hijri scheduling) are first-class. Data-residency defaults map to local sovereign regimes. The workforce is hired, trained, and paid inside the region.
  2. Tools in the hands of the ecosystem. We are not trying to be the only AI operation in the region — we are trying to be the operation that every other AI team in the region can build on top of. Local universities, sovereign foundation-model programs, banking AI teams, healthcare AI start-ups, government digital-transformation offices, telco AI labs, voice-agent companies — all should be able to ship faster because Annota8 exists.
  3. Cultural understanding, not just linguistic. A model that understands Arabic grammar but does not understand a Hajj-density crowd flow, the boundary conditions on AAOIFI-compliant financial language, the modesty registers in voice-assistant interactions, or the sect-level diversity in religious-text annotation is not a model the region can trust in production. The cultural layer matters. We are building for it.

Our vision

We want Annota8 to become the operational backbone of MENA AI and, from there, the region’s contribution to the global annotation industry. Future foundation models will need data labelled by people who live the languages and cultures the models are meant to serve. The MENA region has the demographic depth, the linguistic richness, the regulatory clarity, and now the strategic intent to be a major source of that data. We want Annota8 to be how that supply meets that demand: first for the region, then for the world.

If the last decade of the global annotation industry was built in California, the next decade has room for an operation built here. That is the company we are building.

Where we are today, honestly

Annota8 is in soft launch. We are a small, founder-led team with early engagements across academia, government-aligned innovation programs, and accelerator portfolios. We are not claiming a long customer list, a market-leading position, or a finished product. We are claiming a thesis, a team that has spent a decade earning the right to execute it, and a roadmap shaped by ten years of watching the same pattern repeat.

If you are building Arabic-language AI, MENA-region AI, or any AI system whose ground truth has to come from this part of the world — we want to be the conversation you have early, before the spreadsheets and the offshore-vendor workarounds and the QA escalations start.

Where to go next