Arabic-script OCR: handwritten, historical, and modern challenges in 2026
Why I keep writing about this
We work with banks in Riyadh and Cairo, healthcare networks in the GCC, and a few foundation-model labs that want their multimodal systems to read Arabic the way they read English. Almost every engagement starts with a customer saying “we already tried OCR — it didn’t work.” When we look at the pipeline, the OCR step is doing roughly what the literature predicts it can do — and the customer’s expectation, set by experience with English OCR, is roughly a decade ahead of where the Arabic stack actually sits.
That gap is the topic of this piece. Not “Arabic OCR is bad” — it isn’t, on the right inputs — but a calibrated read on which inputs the modern stack handles, which it doesn’t, and what annotation work moves the needle.
What makes Arabic script structurally harder
Latin OCR is a well-solved problem at this point because Latin script is convenient for computer vision: discrete letterforms separated by whitespace, a fixed left-to-right reading order, minimal contextual variation, and decades of digitised corpora at scale. Arabic offers almost none of those affordances.
Cursive script. Letters connect within a word. There is no whitespace between characters — the model has to learn the segmentation rather than treat it as a precondition. Even rule-based segmentation systems built in the 2000s spend most of their complexity here.
Contextual letter shapes. Each Arabic letter has up to four forms: isolated, initial, medial, and final [6]. The letter heh, for example, appears as ه, هـ, ـهـ, or ـه depending on position. Twenty-eight base letters [7] times up to four forms means the OCR vocabulary is materially larger than Latin’s at the glyph level, before you count anything else.
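One practical consequence for labelling: glyph-level labels can be folded back to base letters when a task only needs the letter identity. The sketch below assumes glyph labels are stored as Arabic Presentation Forms-B codepoints, which is an assumption for illustration and not something every annotation tool does. It relies only on Unicode NFKC normalization from the Python standard library.

```python
import unicodedata

# The four positional glyphs of heh in the Arabic Presentation Forms-B block.
HEH_FORMS = {
    "isolated": "\uFEE9",
    "final": "\uFEEA",
    "initial": "\uFEEB",
    "medial": "\uFEEC",
}


def base_letter(glyph: str) -> str:
    """Fold a positional presentation form back to its base letter via NFKC."""
    return unicodedata.normalize("NFKC", glyph)


for position, glyph in HEH_FORMS.items():
    base = base_letter(glyph)
    # Every positional form maps to the same base codepoint, U+0647 (heh).
    print(position, f"U+{ord(glyph):04X}", "->", f"U+{ord(base):04X}")
```

The fold is lossy by design: the positional information is exactly what a glyph-level model needs, so it is probably worth keeping both the glyph label and the base letter in the ground truth.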
Ligatures. The lam-alif (لا) is the canonical example and is mandatory in most typesetting [8]. Classical typography and Quranic typesetting include hundreds of decorative ligatures, the so-called Naskh-Thuluth lattice in particular. Modern digital fonts simplify, but historical print and calligraphy do not.
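To make the "one glyph, two letters" point concrete, here is a small sketch using the lam-alif presentation form (U+FEFB). It shows how a ligature label could be expanded to its underlying letter sequence with NFKC normalization. The helper name `ligature_letters` is made up for this example, and decorative ligatures with no Unicode codepoint would need a hand-built table instead.

```python
import unicodedata

LAM_ALEF_ISOLATED = "\uFEFB"  # a single ligature glyph in Presentation Forms-B


def ligature_letters(glyph: str) -> list[str]:
    """Return the Unicode names of the letters a ligature glyph stands for."""
    decomposed = unicodedata.normalize("NFKC", glyph)
    return [unicodedata.name(ch) for ch in decomposed]


print(ligature_letters(LAM_ALEF_ISOLATED))
```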
Tashkeel (diacritics). Eight diacritical marks (fatha, kasra, damma, sukun, shadda, fathatan, kasratan, dammatan) sit above or below the consonant skeleton [9]. Modern Arabic usually omits them and readers fill in by context. Religious, legal, classical, and pedagogical text preserves them. Different OCR pipelines make different choices about whether to preserve or strip diacritics, and the choice affects downstream NLP; neither approach is universally correct.
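A minimal sketch of the "strip" option, assuming only the eight marks named above need handling. They occupy the contiguous Unicode range U+064B to U+0652; Quranic annotation marks and other combining signs are deliberately out of scope here.

```python
import re

# U+064B..U+0652: fathatan, dammatan, kasratan, fatha, damma, kasra, shadda, sukun.
TASHKEEL = re.compile("[\u064B-\u0652]")


def strip_tashkeel(text: str) -> str:
    """Remove the eight common diacritics, leaving the consonant skeleton."""
    return TASHKEEL.sub("", text)


def has_tashkeel(text: str) -> bool:
    """Report whether a transcription carries any of the eight diacritics."""
    return TASHKEEL.search(text) is not None


diacritized = "\u0643\u0650\u062A\u064E\u0627\u0628\u064C"  # كِتَابٌ
print(strip_tashkeel(diacritized) == "\u0643\u062A\u0627\u0628")  # True
```

Keeping `has_tashkeel` next to the stripper makes it easy to route diacritized and undiacritized pages differently rather than stripping everything up front.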
Tatweel (kashida). The horizontal elongation character (ـ) extends letters for justification or aesthetic reasons. It is a real Unicode codepoint (U+0640) [10] and it breaks token-based matching: the same word may appear as كتاب or كــتــاب and be the same word semantically. Most OCR pipelines need an explicit normalization step that either strips tatweel or preserves it consistently.
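The matching failure is easy to reproduce. This sketch shows the "strip" variant of the normalization step, using the same word as the example above.

```python
TATWEEL = "\u0640"


def strip_tatweel(text: str) -> str:
    """Remove every tatweel so elongated and plain spellings compare equal."""
    return text.replace(TATWEEL, "")


plain = "\u0643\u062A\u0627\u0628"  # كتاب
stretched = "\u0643\u0640\u0640\u062A\u0640\u0640\u0627\u0628"  # كــتــاب

print(plain == stretched)                 # False
print(plain == strip_tatweel(stretched))  # True
```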
Right-to-left plus bidi handling. Arabic reads right-to-left. Embedded Latin tokens, numerals, and English brand names invert direction within the run. The Unicode Bidirectional Algorithm handles the rendering side, but OCR systems have to recover the logical order from a pixel grid where the visual order is mixed. Production failure mode: a transaction reference like REF-2026-001 in the middle of an Arabic narrative comes out reversed, hyphenated wrong, or attached to the adjacent Arabic token.
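One cheap guard against the reversed-reference failure is a post-OCR sanity check on fields whose format is known in advance. The sketch below assumes a reference format of `REF-YYYY-NNN`, taken from the example above; the pattern and the function name are illustrative, and a real pipeline would carry one pattern per field type.

```python
import re
from typing import Optional, Tuple

# Hypothetical reference format, modelled on REF-2026-001.
REFERENCE = re.compile(r"REF-[0-9]{4}-[0-9]{3}")
# The same token as it looks when emitted in visual (reversed) order.
REVERSED_REFERENCE = re.compile(r"[0-9]{3}-[0-9]{4}-FER")


def check_reference(ocr_text: str) -> Tuple[str, Optional[str]]:
    """Classify the reference token in OCR output as ok, reversed, or missing."""
    match = REFERENCE.search(ocr_text)
    if match:
        return "ok", match.group()
    match = REVERSED_REFERENCE.search(ocr_text)
    if match:
        # Reverse the characters to recover the logical order.
        return "reversed", match.group()[::-1]
    return "missing", None


print(check_reference("100-6202-FER"))  # ('reversed', 'REF-2026-001')
```

This only catches whole-token reversal. Wrong hyphenation or a token glued to its Arabic neighbour falls into the "missing" bucket, which is still a useful signal for routing a page to review.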
Dialect orthography variants. The same word can be spelled multiple legitimate ways depending on writer convention: ا أ إ آ for alef, ي ى for final yeh, ة ه for taa-marbuta versus heh. Egyptian writers often write final ي without its dots, as ى; Gulf writers tend to preserve the distinction; Maghreb writers introduce additional variants [11]. OCR can either preserve the writer’s choice (which downstream NLP often dislikes) or normalize (which loses signal). The right answer depends on the downstream task.
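A minimal sketch of the "normalize" option, with each fold behind its own switch so the downstream task decides how much signal to give up. The table and the default choices are assumptions for illustration, not a recommended standard.

```python
# Hypothetical fold tables; which ones to apply is a per-task decision.
ALEF_FOLD = {"\u0623": "\u0627", "\u0625": "\u0627", "\u0622": "\u0627"}  # أ إ آ -> ا
YEH_FOLD = {"\u0649": "\u064A"}          # ى -> ي
TAA_MARBUTA_FOLD = {"\u0629": "\u0647"}  # ة -> ه


def normalize_orthography(
    text: str, fold_yeh: bool = True, fold_taa_marbuta: bool = False
) -> str:
    """Fold alef variants, and optionally yeh and taa-marbuta variants."""
    table = dict(ALEF_FOLD)
    if fold_yeh:
        table.update(YEH_FOLD)
    if fold_taa_marbuta:
        table.update(TAA_MARBUTA_FOLD)
    return text.translate(str.maketrans(table))


word = "\u0625\u0644\u0649"  # إلى
print(normalize_orthography(word) == "\u0627\u0644\u064A")  # True
```

Storing the writer-faithful string alongside the normalized one keeps the original choice recoverable.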
Use case categories and where current systems land
Rather than quote a single accuracy number — meaningless without document context — I’ll segment by document class.
Modern printed Arabic, clean scan. Workable. The leading engines — Tesseract Arabic, Google Cloud Vision, AWS Textract, Mistral OCR, and GPT-4o Vision — all perform reasonably on clean modern Arabic with standard sans-serif fonts (Cairo, Tajawal, Almarai) at 300 DPI or better. Specific accuracy varies widely across published benchmarks depending on font, preprocessing, and engine. Failure modes are mostly tashkeel handling, embedded English tokens, and tables.
Modern printed Arabic, real-world scan. Accuracy degrades materially. Phone photos, skewed documents, low contrast, stamps, signatures, fold marks, and old photocopies push the same engines well below their clean-scan performance. For banking and government KYC pipelines, identity-field accuracy needs to be very high, which means a generic engine is usually insufficient and either post-OCR rule layers or domain fine-tuning is required.
Modern handwritten Arabic. Challenging. Without domain fine-tuning, general-purpose systems struggle on Riqa (Ruq’ah) handwriting, the most common everyday MENA hand. Doctor’s prescriptions are especially hard: small lettering, dense abbreviations, mixed Arabic and Latin drug names, and idiosyncratic per-writer variation. Production-grade handwritten Arabic OCR almost always requires a fine-tune on labelled samples specific to the domain and the writer style.
Historical manuscripts. Very hard, and the difficulty is not uniform across the historical corpus — it splits by script tradition.
- Naskh is the standard manuscript script for most classical Arabic literature and the easiest historical target because it most closely resembles modern print [5].
- Maghribi is the North African and Andalusian tradition, with letter forms and diacritic placement that differ enough from Mashreqi Naskh that a model trained on Naskh fails outright [1].
- Kufic is the early geometric script used in monumental inscriptions and early Quranic manuscripts: visually beautiful and structurally very different from modern script [2].
- Diwani is the cursive script developed for Ottoman court documents: dense, ornate, and rarely covered by generic training data [3].
- Thuluth is a large calligraphic script used in architectural inscriptions and decorative contexts [4].
- Riqa (the Ottoman Ruq’ah hand, related to but distinct from the medieval Riqa) is the simplified everyday hand that bridges historical and modern. Modern handwriting is closest to Riqa [12].
Each of these wants its own training corpus, its own letter-form vocabulary, and often its own segmentation strategy. Treating “historical Arabic” as a single category is the fastest way to get a manuscript-digitization project to underperform.
Mixed-language documents. Common in MENA, brittle in practice. Moroccan and Tunisian official documents commonly mix Arabic and French. Older Ottoman and early-twentieth-century documents mixed Arabic with Turkish and Persian. Modern Gulf commercial documents mix Arabic and English. The bidi handling, the language-ID-per-token, and the font-class-per-token all have to be correct for downstream extraction to work. Most generic OCR engines treat the whole document as a single language.
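As a rough illustration of language-ID-per-token, the sketch below tags each token by script using Unicode character names from the standard library. It is a script tagger, not a language identifier: it cannot tell French from English, or Arabic from Persian, and the tag names are assumptions chosen for this example.

```python
import unicodedata


def tag_token(token: str) -> str:
    """Tag a token as 'arabic', 'latin', 'numeral', 'mixed', or 'other'."""
    if token.isdecimal():
        # Covers both ASCII digits and Arabic-Indic digits.
        return "numeral"
    scripts = set()
    for ch in token:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and combining diacritics
        name = unicodedata.name(ch, "")
        if name.startswith("ARABIC"):
            scripts.add("arabic")
        elif name.startswith("LATIN"):
            scripts.add("latin")
        else:
            scripts.add("other")
    if len(scripts) > 1:
        return "mixed"
    return scripts.pop() if scripts else "other"


tokens = ["\u0641\u0627\u062A\u0648\u0631\u0629", "REF-2026-001", "2026"]
print([tag_token(t) for t in tokens])  # ['arabic', 'latin', 'numeral']
```

A "mixed" tag on a single token is often a sign that OCR glued an embedded Latin token to its Arabic neighbour.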
What annotation work actually supports better Arabic OCR
When customers ask what kind of labelled data unlocks the next accuracy tier, the answer is rarely “more of the same.” It’s more granular.
- Line-level bounding boxes for segmentation training.
- Word-level bounding boxes for word-segmentation models, with explicit handling of tatweel and embedded Latin tokens.
- Character-level bounding boxes for engines that benefit from glyph-level supervision (transformer-based OCR particularly).
- Ligature labels — explicit annotation of where a ligature begins and ends, with the underlying letter sequence preserved.
- Tashkeel ground truth — diacritized transcriptions of diacritized source images, so a model can be trained to either preserve or strip on demand.
- Script-style classification — labelling whether a passage is Naskh, Riqa, Diwani, Maghribi, Kufic, or Thuluth, so a downstream model can route to the right specialist.
- Per-token language and script ID — every token tagged Arabic, Latin, numeral, or other, so bidi reconstruction and language routing have ground truth.
- Verbatim transcription with normalization variants — the same image transcribed in writer-faithful form, in a normalized form, and in a tashkeel-stripped form. Each downstream task picks the variant it needs.
This is more annotation per image than English OCR typically demands, which is part of why Arabic OCR training data costs more per page. It is also why off-the-shelf engines plateau where they plateau.
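To show how these layers can sit together on one image, here is a minimal sketch of an annotation record. Every class and field name is hypothetical, not a published schema, and the consistency check assumes the tashkeel-stripped variant is derived from the writer-faithful one.

```python
import re
from dataclasses import dataclass, field

# U+064B..U+0652: the eight common diacritics.
TASHKEEL = re.compile("[\u064B-\u0652]")


@dataclass
class Box:
    x: int
    y: int
    width: int
    height: int


@dataclass
class TokenAnnotation:
    box: Box
    faithful: str    # writer-faithful transcription
    normalized: str  # orthography-normalized transcription
    stripped: str    # tashkeel-stripped transcription
    script: str      # "arabic", "latin", "numeral", or "other"


@dataclass
class LineAnnotation:
    box: Box
    script_style: str  # e.g. "naskh", "riqa", "maghribi"
    tokens: list[TokenAnnotation] = field(default_factory=list)


def stripped_variant_is_consistent(token: TokenAnnotation) -> bool:
    """QA check: stripping tashkeel from the faithful form gives the stripped form."""
    return TASHKEEL.sub("", token.faithful) == token.stripped
```

Character boxes and ligature spans would hang off `TokenAnnotation` in the same way. They are left out here to keep the sketch short.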
Useful datasets
For Arabic OCR practitioners, the public corpora worth knowing about:
- KHATT: a handwritten Arabic text database widely used as a handwriting recognition benchmark [13].
- IFN/ENIT: Tunisian handwritten town names; small but well-labelled [14].
- MADCAT: DARPA-funded handwritten Arabic document corpus, much larger than KHATT but harder to access [15].
- AHTID/MW: Arabic handwritten text image database focused on multi-writer variation [16].
- OpenITI: a major corpus of Islamicate texts in digital form, useful for historical training material and language modelling rather than image OCR directly [17].
Each of these is necessary but not sufficient for a production pipeline. They get a model to “literature baseline.” Production performance requires customer-specific samples on top.
Practical recommendations for buyers
If you’re scoping an Arabic OCR project — internally or with a vendor — the questions that matter:
- What’s the document class mix? Modern print, real-world scan, modern handwritten, historical manuscript, mixed-language. Each is a different model.
- What’s the production character error rate (CER) target by field? Define accuracy thresholds per field; identity fields are typically held to a much stricter bar than narrative fields. Don’t accept aggregate accuracy numbers.
- Is tashkeel preserved or stripped? Choose deliberately — both are valid, neither is free.
- How is bidi handled? Ask for a worked example with embedded English and numerals.
- What annotation tier are you funding? Line-level versus word-level versus character-level versus ligature-aware all cost different amounts and unlock different ceilings.
- Who’s the linguist QA? Native speakers with formal training in Arabic linguistics catch edge cases that pure ML loops miss — particularly around tashkeel, dialect orthography, and historical scripts.
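For the per-field error-rate question, a minimal sketch of how CER can be computed field by field: Levenshtein edit distance divided by the length of the reference string. The field names in the usage line are made up, and whether tashkeel and tatweel count as characters depends on which transcription variant you score against.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: insertions, deletions, and substitutions cost 1."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate relative to the reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def cer_by_field(reference_fields: dict, ocr_fields: dict) -> dict:
    """Score each field separately; a field missing from OCR counts as empty."""
    return {
        name: cer(ref, ocr_fields.get(name, ""))
        for name, ref in reference_fields.items()
    }


print(cer_by_field({"id": "1234", "name": "abcd"}, {"id": "1234", "name": "abxd"}))
# {'id': 0.0, 'name': 0.25}
```

Reporting the dictionary rather than its average is the point: one wrong digit in an identity field disappears inside an aggregate number.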
What’s coming
Three trends I expect to shape the next eighteen months. First, multimodal foundation models — GPT-4o Vision, the Mistral OCR line, and the open-weights candidates — are closing the gap on modern printed Arabic faster than the dedicated OCR vendors. Second, the historical-manuscript corner is being pulled forward by digital humanities funding (especially in the Gulf), which means more annotated Naskh, Maghribi, and Diwani samples are entering the public sphere. Third, the dialect-orthography normalization problem is finally being treated as a separate stage in the pipeline rather than baked into the OCR model itself, which I think is correct.
The buyer side hasn’t caught up to the supplier side yet. Most procurement still asks for a single accuracy number and a single per-page price. The actual market is much more segmented than that and the work is much more annotation-heavy than the vendor decks suggest. We’ll keep writing about the segments.
How we help at Annota8
We’re a data-annotation operation, not an OCR vendor. The piece of the pipeline we own is the labelled training and evaluation data that makes a deployed Arabic OCR system land at the customer’s CER target. Our QA tier is built on PhD-level Arabic linguists in Cairo who handle tashkeel, dialect orthography, historical-script identification, and bidi edge cases natively. For modern printed and modern handwritten we typically source domain-matched samples and produce multi-layer annotation (line, word, character, ligature, tashkeel, script class, language ID). For historical manuscript work we partner with academic specialists in the relevant script tradition.
If you’re scoping an Arabic OCR project — banking, healthcare, legal, government, or manuscript digitization — that’s a 30-minute conversation we’d value.
References
[1] Maghrebi script, Wikipedia. https://en.wikipedia.org/wiki/Maghrebi_script
[2] Kufic, Wikipedia. https://en.wikipedia.org/wiki/Kufic
[3] Diwani, Wikipedia. https://en.wikipedia.org/wiki/Diwani
[4] Thuluth, Wikipedia. https://en.wikipedia.org/wiki/Thuluth
[5] Naskh (script), Wikipedia. https://en.wikipedia.org/wiki/Naskh_(script)
[6] Arabic alphabet, Wikipedia. https://en.wikipedia.org/wiki/Arabic_alphabet
[7] Britannica, Arabic alphabet. https://www.britannica.com/topic/Arabic-alphabet
[8] Microsoft Learn, Developing OpenType Fonts for Arabic Script. https://learn.microsoft.com/en-us/typography/script-development/arabic
[9] Arabic diacritics, Wikipedia. https://en.wikipedia.org/wiki/Arabic_diacritics
[10] U+0640 ARABIC TATWEEL, Unicode codepoints.net. https://codepoints.net/U+0640
[11] Egyptian Arabic Orthography, Lingualism. https://resources.lingualism.com/egyptian-arabic/egyptian-arabic-orthography/
[12] Ruq’ah script, Wikipedia. https://en.wikipedia.org/wiki/Ruq%CA%BFah_script
[13] KHATT: An open Arabic offline handwritten text database, ScienceDirect. https://www.sciencedirect.com/science/article/abs/pii/S0031320313003300
[14] IFN/ENIT-database of handwritten Arabic words, IEEE / ResearchGate. https://www.researchgate.net/publication/228904501_IFNENIT-database_of_handwritten_Arabic_words
[15] MADCAT, Linguistic Data Consortium. https://www.ldc.upenn.edu/content/madcat
[16] A Database for Arabic Handwritten Text Image Recognition and Writer Identification, IEEE / ResearchGate. https://www.researchgate.net/publication/261156375_A_Database_for_Arabic_Handwritten_Text_Image_Recognition_and_Writer_Identification
[17] OpenITI corpus, Zenodo. https://zenodo.org/records/10007820