Arabic-script OCR: handwritten, historical, and modern challenges in 2026

Why I keep writing about this

We work with banks in Riyadh and Cairo, healthcare networks in the GCC, and a few foundation-model labs that want their multimodal systems to read Arabic the way they read English. Almost every engagement starts with a customer saying “we already tried OCR — it didn’t work.” When we look at the pipeline, the OCR step is doing roughly what the literature predicts it can do — and the customer’s expectation, set by experience with English OCR, is roughly a decade ahead of where the Arabic stack actually sits.

That gap is the topic of this piece. Not “Arabic OCR is bad” — it isn’t, on the right inputs — but a calibrated read on which inputs the modern stack handles, which it doesn’t, and what annotation work moves the needle.

What makes Arabic script structurally harder

Latin OCR is a well-solved problem at this point because Latin script is convenient for computer vision: discrete letterforms separated by whitespace, a fixed left-to-right reading order, minimal contextual variation, and decades of digitised corpora at scale. Arabic offers almost none of those affordances.

Cursive script. Letters connect within a word. There is no whitespace between characters — the model has to learn the segmentation rather than treat it as a precondition. Even rule-based segmentation systems built in the 2000s spend most of their complexity here.

Contextual letter shapes. Each Arabic letter has up to four forms: isolated, initial, medial, and final [6]. The letter heh, for example, appears as ه, ه‍, ‍ه‍, ‍ه in those four positions. Twenty-eight base letters [7] times up to four forms means the OCR vocabulary is materially larger than Latin’s at the glyph level, before you count anything else.
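To make the glyph-versus-letter distinction concrete, here is a minimal Python sketch. It assumes the OCR output (or a legacy font encoding) emits Unicode presentation-form codepoints rather than base letters, which is not true of every engine, and it uses NFKC compatibility normalization to fold the four heh forms back to one letter. The names `HEH_FORMS` and `to_base_letters` are illustrative only.

```python
import unicodedata

# Positional presentation forms of heh (Arabic Presentation Forms-B block).
HEH_FORMS = {
    "isolated": "\uFEE9",
    "final": "\uFEEA",
    "initial": "\uFEEB",
    "medial": "\uFEEC",
}


def to_base_letters(text: str) -> str:
    """Fold positional presentation forms back to base letters via NFKC."""
    return unicodedata.normalize("NFKC", text)


for position, glyph in HEH_FORMS.items():
    base = to_base_letters(glyph)
    # Each of the four glyph codepoints maps to the single letter U+0647.
    print(f"{position:9s} U+{ord(glyph):04X} -> U+{ord(base):04X}")
```

Whether to fold at all is a pipeline decision: a recognizer may be trained on positional glyph classes, while downstream text matching almost always wants base letters.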

Ligatures. The lam-alif (لا) is the canonical example and is mandatory in most typesetting [8]. Classical typography and Quranic typesetting include hundreds of decorative ligatures, the so-called Naskh-Thuluth lattice in particular. Modern digital fonts simplify, but historical print and calligraphy do not.
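One practical consequence of ligatures is that a single glyph can stand for more than one letter, which matters when character-level labels are aligned against a transcription. The sketch below is an illustration under the assumption that a ligature arrives as a single presentation-form codepoint; `letters_in_glyph` is a hypothetical helper, not part of any OCR library.

```python
import unicodedata

# ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM: one codepoint, two letters.
LAM_ALIF_LIGATURE = "\uFEFB"


def letters_in_glyph(glyph: str) -> list[str]:
    """Return the base letters a (possibly ligated) glyph stands for."""
    return list(unicodedata.normalize("NFKC", glyph))


letters = letters_in_glyph(LAM_ALIF_LIGATURE)
print(len(LAM_ALIF_LIGATURE), "glyph ->", len(letters), "letters")
print([unicodedata.name(ch) for ch in letters])
```

Only the handful of ligatures that Unicode encodes can be expanded this way. The decorative ligatures of classical typography have no codepoints of their own, so they have to be handled in the annotation scheme instead.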

Tashkeel (diacritics). Eight diacritical marks (fatha, kasra, damma, sukun, shadda, fathatan, kasratan, dammatan) sit above or below the consonant skeleton [9]. Modern Arabic usually omits them and readers fill in by context. Religious, legal, classical, and pedagogical text preserves them. Different OCR pipelines make different choices about whether to preserve or strip diacritics, and the choice affects downstream NLP; neither approach is universally correct.
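As a rough illustration of the "strip" option, the following Python sketch removes only the eight marks named above, which occupy the contiguous range U+064B to U+0652. The function names are assumptions for this example. Other marks, such as the superscript alef at U+0670 or Quranic annotation signs, are deliberately left alone here and would need their own policy.

```python
import re

# fathatan, dammatan, kasratan, fatha, damma, kasra, shadda, sukun
TASHKEEL = re.compile("[\u064B-\u0652]")


def strip_tashkeel(text: str) -> str:
    """Remove the eight common diacritics, leaving the consonant skeleton."""
    return TASHKEEL.sub("", text)


def has_tashkeel(text: str) -> bool:
    """Report whether any of the eight marks is present."""
    return bool(TASHKEEL.search(text))


vocalized = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kataba, fully vocalized
print(has_tashkeel(vocalized), strip_tashkeel(vocalized))
```

A pipeline that preserves tashkeel can still call `strip_tashkeel` at match time, so the stored text keeps the marks while search and deduplication ignore them.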

Tatweel (kashida). The horizontal elongation character (ـ) extends letters for justification or aesthetic reasons. It is a real Unicode codepoint (U+0640) [10] and it breaks token-based matching: the same word may appear as كتاب or كــتــاب and still be the same word semantically. Most OCR pipelines need an explicit normalization step that either strips tatweel or preserves it consistently.
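A minimal sketch of the "strip" variant of that normalization step, in Python. The helper name `strip_tatweel` is an assumption for illustration. The strings are written with escapes so the codepoints are unambiguous.

```python
TATWEEL = "\u0640"  # ARABIC TATWEEL


def strip_tatweel(text: str) -> str:
    """Drop every tatweel so elongated and plain spellings compare equal."""
    return text.replace(TATWEEL, "")


plain = "\u0643\u062A\u0627\u0628"                            # kitab
stretched = "\u0643\u0640\u0640\u062A\u0640\u0640\u0627\u0628"  # kitab with tatweel

print(plain == stretched)                  # raw comparison fails
print(plain == strip_tatweel(stretched))   # comparison after normalization
```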

Right-to-left plus bidi handling. Arabic reads right-to-left. Embedded Latin tokens, numerals, and English brand names invert direction within the run. The Unicode Bidirectional Algorithm handles the rendering side, but OCR systems have to recover the logical order from a pixel grid where the visual order is mixed. Production failure mode: a transaction reference like REF-2026-001 in the middle of an Arabic narrative comes out reversed, hyphenated wrong, or attached to the adjacent Arabic token.
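To show why a token like REF-2026-001 is fragile, here is a small Python sketch that groups a logical-order string into runs by Unicode bidi class. It is not an implementation of the Unicode Bidirectional Algorithm, only a diagnostic. The three-way grouping and the function names are assumptions made for this example.

```python
import unicodedata
from itertools import groupby


def direction_class(ch: str) -> str:
    """Coarse direction of one character, from its Unicode bidi class."""
    bidi = unicodedata.bidirectional(ch)
    if bidi in ("R", "AL"):
        return "rtl"            # strong right-to-left (Arabic letters are AL)
    if bidi == "L":
        return "ltr"            # strong left-to-right (Latin letters)
    return "weak_or_neutral"    # digits, hyphens, spaces, punctuation


def direction_runs(logical_text: str) -> list[tuple[str, str]]:
    """Split text into maximal runs of characters sharing a direction class."""
    return [
        (cls, "".join(chars))
        for cls, chars in groupby(logical_text, key=direction_class)
    ]


# Arabic word, space, reference token, space, Arabic word (logical order).
sample = "\u062A\u0645 REF-2026-001 \u062F\u0641\u0639"
for cls, run in direction_runs(sample):
    print(cls, repr(run))
```

Only the letters "REF" carry a strong left-to-right direction. The digits and hyphens are weak or neutral, so their placement depends on the surrounding context, which is exactly where an OCR system reconstructing logical order from pixels tends to go wrong.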

Dialect orthography variants. The same word can be spelled multiple legitimate ways depending on writer convention: ا أ إ آ for alef, ي ى for final yeh, ة ه for taa-marbuta versus heh. Egyptian writers often write final yeh without its dots, so ى stands in for ي; Gulf writers tend to preserve the distinction; Maghreb writers introduce additional variants [11]. OCR can either preserve the writer’s choice (which downstream NLP often dislikes) or normalize (which loses signal). The right answer depends on the downstream task.
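One way to avoid choosing between the two options is to keep the writer's spelling and compute a folded key alongside it. The Python sketch below illustrates that idea. The folding table is one common but lossy convention, chosen here as an assumption, and the names `ORTHOGRAPHY_FOLD`, `fold_orthography`, and `with_match_key` are hypothetical.

```python
# Lossy folding table: collapses the variant pairs discussed above.
ORTHOGRAPHY_FOLD = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0649": "\u064A",  # alef maksura (dotless yeh) -> yeh
    "\u0629": "\u0647",  # taa marbuta -> heh
})


def fold_orthography(text: str) -> str:
    """Collapse spelling variants into one form for matching purposes."""
    return text.translate(ORTHOGRAPHY_FOLD)


def with_match_key(text: str) -> dict[str, str]:
    """Keep the writer's spelling and add a folded key for lookup."""
    return {"original": text, "match_key": fold_orthography(text)}


record = with_match_key("\u0623\u062D\u0645\u062F")  # Ahmad, written with hamza
print(record)
```

Storing both fields costs little and lets the downstream task decide which one it needs, which is the separation between OCR output and normalization that the rest of this piece argues for.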

Use case categories and where current systems land

Rather than quote a single accuracy number — meaningless without document context — I’ll segment by document class.

Modern printed Arabic, clean scan. Workable. The leading engines — Tesseract Arabic, Google Cloud Vision, AWS Textract, Mistral OCR, and GPT-4o Vision — all perform reasonably on clean modern Arabic with standard sans-serif fonts (Cairo, Tajawal, Almarai) at 300 DPI or better. Specific accuracy varies widely across published benchmarks depending on font, preprocessing, and engine. Failure modes are mostly tashkeel handling, embedded English tokens, and tables.

Modern printed Arabic, real-world scan. Accuracy degrades materially. Phone photos, skewed documents, low contrast, stamps, signatures, fold marks, and old photocopies push the same engines well below their clean-scan performance. For banking and government KYC pipelines, identity-field accuracy needs to be very high, which means a generic engine is usually insufficient and either post-OCR rule layers or domain fine-tuning is required.
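A post-OCR rule layer can be as simple as normalizing digits and rejecting values that cannot be valid. The Python sketch below is illustrative only: the ten-digit rule is a hypothetical stand-in for whatever the real identity field requires, and `ascii_digits` and `check_id_field` are names invented for this example.

```python
import re
import unicodedata


def ascii_digits(text: str) -> str:
    """Map any Unicode decimal digit (e.g. Arabic-Indic) to ASCII 0-9."""
    return "".join(
        str(unicodedata.decimal(ch)) if ch.isdecimal() else ch
        for ch in text
    )


# Hypothetical rule: the ID field must be exactly ten ASCII digits.
ID_PATTERN = re.compile(r"[0-9]{10}")


def check_id_field(raw: str) -> tuple[str, bool]:
    """Return the cleaned value and whether it passes the format rule."""
    cleaned = ascii_digits(raw).replace(" ", "")
    return cleaned, bool(ID_PATTERN.fullmatch(cleaned))


print(check_id_field("\u0661\u0660\u0662\u0663 \u0664\u0665\u0666\u0667\u0668\u0669"))
print(check_id_field("10234O6789"))  # letter O misread for zero fails the rule
```

A failed check does not fix the field, but it routes the document to human review instead of letting a bad identity value flow downstream.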

Modern handwritten Arabic. Challenging. Without domain fine-tuning, general-purpose systems struggle on Ruq’ah handwriting (the most common everyday MENA hand) [12]. Doctor’s prescriptions are especially hard: small lettering, dense abbreviations, mixed Arabic and Latin drug names, and idiosyncratic per-writer variation. Production-grade handwritten Arabic OCR almost always requires a domain-and-writer-style-specific fine-tune on labelled samples.

Historical manuscripts. Very hard, and the difficulty is not uniform across the historical corpus: it splits by script tradition, including Maghrebi [1], Kufic [2], Diwani [3], Thuluth [4], and Naskh [5].

Each of these wants its own training corpus, its own letter-form vocabulary, and often its own segmentation strategy. Treating “historical Arabic” as a single category is the fastest way to get a manuscript-digitization project to underperform.

Mixed-language documents. Common in MENA, brittle in practice. Moroccan and Tunisian official documents commonly mix Arabic and French. Older Ottoman and early-twentieth-century documents mixed Arabic with Turkish and Persian. Modern Gulf commercial documents mix Arabic and English. The bidi handling, the language-ID-per-token, and the font-class-per-token all have to be correct for downstream extraction to work. Most generic OCR engines treat the whole document as a single language.
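A first approximation of per-token tagging can be done from Unicode character names alone. The Python sketch below tags each whitespace-separated token by script. Note the limits: it identifies script, not language, so it cannot tell French from English, and the function names are assumptions for this illustration.

```python
import unicodedata


def script_of(token: str) -> str:
    """Classify a token as arabic, latin, mixed, or other by its letters."""
    scripts = set()
    for ch in token:
        if not ch.isalpha():
            continue  # digits and punctuation carry no script signal here
        name = unicodedata.name(ch, "")
        if name.startswith("ARABIC"):
            scripts.add("arabic")
        elif name.startswith("LATIN"):
            scripts.add("latin")
        else:
            scripts.add("other")
    if not scripts:
        return "other"
    return scripts.pop() if len(scripts) == 1 else "mixed"


def tag_tokens(line: str) -> list[tuple[str, str]]:
    """Tag every whitespace-separated token with its script class."""
    return [(token, script_of(token)) for token in line.split()]


sample = "\u0634\u0631\u0643\u0629 ABC soci\u00E9t\u00E9 2026"
print(tag_tokens(sample))
```

A token tagged "mixed" is usually a sign that OCR glued an Arabic word to an adjacent Latin one, which makes it a cheap flag for review.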

What annotation work actually supports better Arabic OCR

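When customers ask what kind of labelled data unlocks the next accuracy tier, the answer is rarely “more of the same.” It’s more granular: labels at several layers of the same page (line, word, character, ligature, tashkeel, script class, language ID) rather than a single transcription per image.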

This is more annotation per image than English OCR typically demands, which is part of why Arabic OCR training data costs more per page. It is also why off-the-shelf engines plateau where they plateau.

Useful datasets

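For Arabic OCR practitioners, the public corpora worth knowing about include KHATT [13], IFN/ENIT [14], MADCAT [15], the handwritten text-image and writer-identification database described in [16], and the OpenITI corpus [17].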

Each of these is necessary but not sufficient for a production pipeline. They get a model to “literature baseline.” Production performance requires customer-specific samples on top.

Practical recommendations for buyers

If you’re scoping an Arabic OCR project — internally or with a vendor — the questions that matter:

  1. What’s the document class mix? Modern print, real-world scan, modern handwritten, historical manuscript, mixed-language. Each is a different model.
  2. What’s the production character error rate (CER) target by field? Define accuracy thresholds per field: identity fields are typically held to a much stricter bar than narrative fields. Don’t accept aggregate accuracy numbers.
  3. Is tashkeel preserved or stripped? Choose deliberately — both are valid, neither is free.
  4. How is bidi handled? Ask for a worked example with embedded English and numerals.
  5. What annotation tier are you funding? Pixel-level versus word-level versus character-level versus ligature-aware all cost different amounts and unlock different ceilings.
  6. Who’s the linguist QA? Native speakers with formal training in Arabic linguistics catch edge cases that pure ML loops miss — particularly around tashkeel, dialect orthography, and historical scripts.
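The per-field target in question 2 can be checked with a few lines of Python. This is a minimal sketch that takes CER as edit distance divided by reference length. The field names and thresholds in the usage example are hypothetical, and `per_field_report` is a name invented for this illustration.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate relative to the reference transcription."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def per_field_report(pairs: dict, targets: dict) -> dict:
    """Map each field to (CER, whether it meets that field's target)."""
    report = {}
    for field, (ref, hyp) in pairs.items():
        score = cer(ref, hyp)
        report[field] = (score, score <= targets[field])
    return report


# Hypothetical fields and thresholds, for illustration only.
pairs = {
    "id_number": ("1023456789", "1023456189"),
    "narrative": ("payment received ok.", "payment recieved ok."),
}
targets = {"id_number": 0.0, "narrative": 0.15}
print(per_field_report(pairs, targets))
```

Averaged over both fields the numbers look acceptable, yet the identity field fails its target. That is the argument against aggregate accuracy in a single example.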

What’s coming

Three trends I expect to shape the next eighteen months. First, multimodal foundation models (GPT-4o Vision, the Mistral OCR line, and the open-weights candidates) are closing the gap on modern printed Arabic faster than the dedicated OCR vendors. Second, the historical-manuscript corner is being pulled forward by digital humanities funding (especially in the Gulf), which means more annotated Naskh, Maghrebi, and Diwani samples are entering the public sphere. Third, the dialect-orthography normalization problem is finally being treated as a separate stage in the pipeline rather than baked into the OCR model itself, which I think is correct.

The buyer side hasn’t caught up to the supplier side yet. Most procurement still asks for a single accuracy number and a single per-page price. The actual market is much more segmented than that and the work is much more annotation-heavy than the vendor decks suggest. We’ll keep writing about the segments.

How we help at Annota8

We’re a data-annotation operation, not an OCR vendor. The piece of the pipeline we own is the labelled training and evaluation data that makes a deployed Arabic OCR system land at the customer’s CER target. Our QA tier is built on PhD-level Arabic linguists in Cairo who handle tashkeel, dialect orthography, historical-script identification, and bidi edge cases natively. For modern printed and modern handwritten we typically source domain-matched samples and produce multi-layer annotation (line, word, character, ligature, tashkeel, script class, language ID). For historical manuscript work we partner with academic specialists in the relevant script tradition.

If you’re scoping an Arabic OCR project — banking, healthcare, legal, government, or manuscript digitization — that’s a 30-minute conversation we’d value.

References

  1. Maghrebi script. Wikipedia. https://en.wikipedia.org/wiki/Maghrebi_script

  2. Kufic. Wikipedia. https://en.wikipedia.org/wiki/Kufic

  3. Diwani. Wikipedia. https://en.wikipedia.org/wiki/Diwani

  4. Thuluth. Wikipedia. https://en.wikipedia.org/wiki/Thuluth

  5. Naskh (script). Wikipedia. https://en.wikipedia.org/wiki/Naskh_(script)

  6. Arabic alphabet. Wikipedia. https://en.wikipedia.org/wiki/Arabic_alphabet

  7. Arabic alphabet. Britannica. https://www.britannica.com/topic/Arabic-alphabet

  8. Developing OpenType Fonts for Arabic Script. Microsoft Learn. https://learn.microsoft.com/en-us/typography/script-development/arabic

  9. Arabic diacritics. Wikipedia. https://en.wikipedia.org/wiki/Arabic_diacritics

  10. U+0640 ARABIC TATWEEL. Unicode codepoints.net. https://codepoints.net/U+0640

  11. Egyptian Arabic Orthography. Lingualism. https://resources.lingualism.com/egyptian-arabic/egyptian-arabic-orthography/

  12. Ruq’ah script. Wikipedia. https://en.wikipedia.org/wiki/Ruq%CA%BFah_script

  13. KHATT: An open Arabic offline handwritten text database. ScienceDirect. https://www.sciencedirect.com/science/article/abs/pii/S0031320313003300

  14. IFN/ENIT-database of handwritten Arabic words. IEEE / ResearchGate. https://www.researchgate.net/publication/228904501_IFNENIT-database_of_handwritten_Arabic_words

  15. MADCAT. Linguistic Data Consortium. https://www.ldc.upenn.edu/content/madcat

  16. A Database for Arabic Handwritten Text Image Recognition and Writer Identification. IEEE / ResearchGate. https://www.researchgate.net/publication/261156375_A_Database_for_Arabic_Handwritten_Text_Image_Recognition_and_Writer_Identification

  17. OpenITI corpus. Zenodo. https://zenodo.org/records/10007820