Medical imaging + Arabic clinical NLP — annotation realities
Medical imaging annotation realities
Modality: radiology (X-ray, CT, MRI, ultrasound)
Annotation tasks:
- Pathology detection (binary or multi-class)
- Pathology localisation (bounding box or segmentation)
- Anatomical landmark identification
- Severity scoring
- Differential diagnosis
Annotator profile: board-certified radiologist preferred for production; senior radiology resident acceptable for some labelling with consultant review.
QA pattern: dual-annotator independent labelling on 10-30% of items; disagreement adjudication by senior radiologist; calibration on gold-standard set monthly.
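The dual-annotator QA pattern above can be sketched in a few lines of Python. This is a minimal illustration, not a production QA tool: the function names, the 20% default fraction, and the fixed seed are assumptions chosen for the example, and Cohen's kappa is used here as one common agreement statistic rather than one the source prescribes.

```python
import random
from collections import Counter


def sample_for_dual_annotation(item_ids, fraction=0.2, seed=0):
    """Pick a reproducible random subset of items for a second, independent read."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    items = list(item_ids)
    rng = random.Random(seed)
    k = min(len(items), max(1, round(len(items) * fraction)))
    return sorted(rng.sample(items, k))


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def disagreements(item_ids, labels_a, labels_b):
    """Items to route to the senior radiologist for adjudication."""
    return [i for i, a, b in zip(item_ids, labels_a, labels_b) if a != b]
```

In use, the sampled subset goes to a second reader, the disagreement list goes to the adjudicator, and kappa is tracked over time alongside the monthly gold-standard calibration.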
Modality: pathology (whole-slide imaging, WSI)
Annotation tasks:
- Tumour classification (benign / malignant / dysplasia)
- Tissue type classification
- Cell counting + Ki-67 + mitotic index
- Gleason scoring (prostate)
- HER2 / ER / PR scoring (breast)
- Margin assessment
Annotator profile: board-certified pathologist required for production work; resident pathologist acceptable for tissue type classification with consultant review.
QA pattern: multi-annotator on disagreement-prone categories (Gleason intermediate scores, HER2 borderline)[^1]; regular calibration; pathology AI eval against external reference standards (BACH, CAMELYON, etc.)[^2].
Modality: ophthalmology (fundus photography, OCT)
Annotation tasks:
- Diabetic retinopathy grading
- Glaucoma detection + cup-disc ratio
- AMD (age-related macular degeneration) detection
- Retinal vessel segmentation
Annotator profile: ophthalmologist or trained ophthalmic technician with consultant review.
Modality: dental (panoramic X-ray, CBCT)
Annotation tasks:
- Tooth identification + numbering
- Caries detection
- Periapical lesion detection
- Cyst / tumour detection
Annotator profile: dental specialist or experienced general dentist.
Cross-modality: medical device + procedure
Other imaging-adjacent tasks:
- Endoscopy + colonoscopy lesion detection
- ECG / EEG signal annotation
- Surgical workflow step annotation
- Surgical instrument identification
Arabic clinical NLP annotation realities
Document types
| Type | Language profile |
|---|---|
| Radiology reports | Mostly English in MENA hospitals; sometimes Arabic in Egyptian / North African private |
| Discharge summaries | Mixed Arabic + English; institution-dependent |
| Physician notes (handwritten) | Often Arabic with English medical terms |
| Prescription | Arabic + Latin pharmaceutical names |
| Patient history | Often Arabic dialect |
| Lab results | English with Arabic patient names |
| Consent forms | Arabic with English medical terms |
Annotation tasks
Entity extraction:
- Symptom mentions
- Diagnosis mentions
- Medication mentions
- Anatomical references
- Lab values
- Vital signs
- Procedure references
Relation extraction:
- Symptom-diagnosis relations
- Medication-condition relations
- Procedure-outcome relations
Concept normalisation:
- ICD-10 mapping from Arabic clinical text[^3]
- SNOMED-CT mapping[^3]
- LOINC mapping[^3]
- RxNorm mapping for medications[^4] (in GCC contexts, complement with WHO ATC and national formulary codes where RxNorm coverage of regional brand names is partial)
Privacy:
- PHI detection + de-identification
- Patient identifier extraction (for de-id)
- Provider identifier handling
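One way to represent the entity, relation, and normalisation tasks listed above is a single record per note. The sketch below is an assumed schema for illustration only: the class names, label strings, and the example sentence are hypothetical, and the ICD-10 category shown is attached by hand rather than produced by any mapping tool.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class EntitySpan:
    start: int                         # character offset, inclusive
    end: int                           # character offset, exclusive
    label: str                         # e.g. "DIAGNOSIS", "MEDICATION"
    code_system: Optional[str] = None  # e.g. "ICD-10", "SNOMED-CT"
    code: Optional[str] = None


@dataclass(frozen=True)
class Relation:
    head: EntitySpan
    tail: EntitySpan
    label: str                         # e.g. "MEDICATION_FOR_CONDITION"


@dataclass
class AnnotatedNote:
    text: str
    entities: List[EntitySpan] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

    def add_entity(self, surface, label, code_system=None, code=None):
        """Locate the first occurrence of `surface` and record it as a span."""
        start = self.text.find(surface)
        if start < 0:
            raise ValueError(f"surface form not found: {surface!r}")
        span = EntitySpan(start, start + len(surface), label, code_system, code)
        self.entities.append(span)
        return span

    def surface(self, span):
        """Return the text covered by a span."""
        return self.text[span.start:span.end]


# Illustrative sentence: "The patient has type 2 diabetes and takes metformin 500 mg"
note = AnnotatedNote("المريض يعاني من السكري من النوع الثاني ويتناول metformin 500 mg")
dx = note.add_entity("السكري من النوع الثاني", "DIAGNOSIS", "ICD-10", "E11")
med = note.add_entity("metformin", "MEDICATION")
note.relations.append(Relation(med, dx, "MEDICATION_FOR_CONDITION"))
```

Offsets here are Python code-point offsets over the raw mixed-script string. Storing them against the unmodified text keeps spans valid regardless of how a viewer renders right-to-left and left-to-right runs.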
Annotator profile for clinical NLP
For production:
- Arabic-fluent physician or medical student (preferred)
- Clinical pharmacy graduate for pharmacy-heavy work
- Medical translator with NLP training (acceptable with consultant review)
For QA:
- Board-certified Arabic-fluent physician + clinical informatics SME
Cross-script medication handling
Pharmaceuticals appear in MENA prescriptions as[^5]:
- Brand name in Latin script: “Augmentin 625mg”
- Generic name in Latin script: “amoxicillin/clavulanate 625mg”
- Brand name in Arabic transliteration: “أوغمنتين”
- Generic name in Arabic: “أموكسيسيلين كلافولانات”
- Mixed: “أوغمنتين 625 ملغ”
- Handwritten: physician shorthand + abbreviation
A reliable medication extraction pipeline handles all of these + normalises to RxNorm[^4].
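As a rough sketch of how the typed variants above might be folded into one canonical form, the Python below normalises the surface string and looks it up in a tiny lexicon. Everything named here is an assumption for illustration: the `LEXICON` entries, the function names, and the `mg` / `ملغ` dose pattern. The canonical key is a generic ingredient string; the actual lookup of an RxNorm identifier is deliberately left out, and handwritten input would need an OCR or transcription step first.

```python
import re
import unicodedata

_ARABIC_DIGITS = str.maketrans("٠١٢٣٤٥٦٧٨٩", "0123456789")
_DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")  # harakat and tatweel
_DOSE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:mg|ملغ)")


def normalise_surface(text):
    """Fold script-level variation so lexicon lookups are stable."""
    text = unicodedata.normalize("NFKC", text)
    text = _DIACRITICS.sub("", text)
    text = re.sub("[أإآ]", "ا", text)       # unify alef variants
    text = text.translate(_ARABIC_DIGITS)   # Arabic-Indic digits to ASCII
    return text.casefold().strip()


# Hypothetical lexicon: surface form -> generic ingredient key.
_RAW_LEXICON = {
    "Augmentin": "amoxicillin/clavulanate",
    "amoxicillin/clavulanate": "amoxicillin/clavulanate",
    "أوغمنتين": "amoxicillin/clavulanate",
    "أموكسيسيلين كلافولانات": "amoxicillin/clavulanate",
}
LEXICON = {normalise_surface(k): v for k, v in _RAW_LEXICON.items()}


def extract_medication(mention):
    """Split a mention into a canonical ingredient and a dose in mg."""
    norm = normalise_surface(mention)
    dose = _DOSE.search(norm)
    name = _DOSE.sub("", norm).strip()
    return {
        "ingredient": LEXICON.get(name),  # None when the name is unknown
        "dose_mg": float(dose.group(1)) if dose else None,
    }
```

A dictionary lookup like this only covers exact variants. Real transliterations vary in spelling, so a production pipeline would likely add fuzzy matching or a learned model on top, with unknown names routed to a pharmacist reviewer.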
Privacy + compliance realities
PDPL health data residency
Saudi PDPL classifies health data as a sensitive/restricted category[^6]:
- In practice, health-data processing is typically kept in-Kingdom; cross-border transfer requires PDPL Article 29 routes (adequacy decision, SCCs, BCRs, or Certificate of Accreditation) plus a risk assessment under the Aug-2024 Personal Data Transfer Outside the Kingdom Regulations[^7]
- Cross-border transfer requires an SDAIA-recognised lawful basis (per-transfer SDAIA pre-approval is not always required, but SDAIA risk assessment is mandatory for continuous/large-scale sensitive-data transfers)[^7]
- Sub-processor must flow PDPL obligations down (controllers must contractually bind processors to equivalent safeguards, and processors must bind sub-processors equivalently)[^8]
- Data subject rights workflows required (rights to be informed, access, correction, destruction under PDPL Chapter 4)[^9]
- DPO designation required where the controller’s core activity is processing sensitive (incl. health) data, per PDPL Executive Regulations Article 32[^10]
HIPAA BAA for US-bound workloads
For MENA hospitals serving US patients or doing US clinical research collaboration:
- Business Associate Agreement (BAA) required under 45 CFR 164.502(e) / 164.504(e) for any business associate that creates, receives, maintains, or transmits PHI[^11]
- HIPAA controls flow down to sub-processors (subcontractors of business associates are directly subject to HIPAA under HITECH 2009 + Omnibus Rule 2013 + 45 CFR 164.308(b), with equivalent BAA terms downstream)[^12]
- Many US health systems contractually require US-only PHI processing even though HIPAA itself is silent on data residency[^13]
De-identification before annotation (or during)
Three patterns:
- Pre-annotation de-identification: the customer de-identifies data before sending it, so annotators see only de-identified content. Simplest from a compliance standpoint, but the effort falls on the customer.
- In-pipeline de-identification: the annotation platform de-identifies data as part of intake, so annotators see only the de-identified version. More effort for the vendor, more convenient for the customer.
- No de-identification: identifiable data is annotated by a cleared workforce under a BAA and appropriate controls. This is the highest-sensitivity workflow.
Most engagements use pattern 1 or 2.
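For the in-pipeline pattern, the intake step could look roughly like the rule-based pass below. This is a sketch under loud assumptions: the three identifier formats (an `MRN` prefix followed by digits, a Saudi-style mobile number, a 10-digit ID starting with 1 or 2) are illustrative guesses, not a validated PHI specification, and the function name `deidentify` is hypothetical. Names, addresses, and dates need an NER model and human spot checks, which regular expressions cannot replace.

```python
import re

# Order matters: more specific patterns run before the generic 10-digit ID.
PATTERNS = [
    ("MRN", re.compile(r"\bMRN[:\s#]*\d{5,10}\b", re.IGNORECASE)),
    ("PHONE", re.compile(r"(?<!\d)(?:\+966|00966|0)5\d{8}\b")),
    ("NATIONAL_ID", re.compile(r"\b[12]\d{9}\b")),
]


def deidentify(text):
    """Replace matched identifiers with [TAG] placeholders.

    Returns the redacted text and a per-tag count for the audit trail.
    Assumes ASCII digits; Arabic-Indic digits would need mapping first.
    """
    counts = {}
    for tag, pattern in PATTERNS:
        text, n = pattern.subn(f"[{tag}]", text)
        counts[tag] = n
    return text, counts
```

Keeping the per-tag counts makes it possible to audit how much was redacted per document without retaining the identifiers themselves.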
Common pitfalls
Pitfall 1: Crowd-sourced clinical annotation
“Medical-trained annotators” without board-certified QA produce clinically unsafe output. Senior physician review is non-negotiable.
Pitfall 2: Cross-script medication confusion
Medication extraction pipelines that don’t handle Arabic script, Latin script, transliteration, and handwriting miss a large share of medication mentions in MENA data, which shows up as high false-negative rates.
Pitfall 3: Translation as a substitute for native Arabic clinical NLP
Translating Arabic clinical text to English, then doing English NLP, loses information (dialect, register, code-switching, regional terms)[^5]. Native Arabic clinical NLP is required for serious work.
Pitfall 4: Ignoring de-identification
Letting annotators process identifiable PHI without proper controls and a BAA exposes both the customer and the vendor to regulatory action. A de-identification workflow is a standard part of MENA medical annotation.
Pitfall 5: Cross-border without lawful basis
US-hosted annotation of KSA patient data without sovereign tenancy + PDPL alignment is direct regulatory exposure for the customer.
Pitfall 6: Underestimating SME cost
“We’ll use Arabic-speaking annotators with general medical training” produces low-quality output. Board-certified Arabic-fluent physicians cost more for a reason.
Where Annota8 fits
Annota8 is being designed for MENA medical AI workloads. Capability targets, scoped per engagement:
- Board-certified clinician SME tier (radiology, pathology, ophthalmology, dentistry)
- Arabic-fluent physician clinical NLP layer
- ICD-10 / SNOMED-CT / LOINC / RxNorm mapping from Arabic clinical text
- Cross-script medication extraction (Arabic + Latin + transliteration + handwritten)
- PDPL-aware design — in-Kingdom processing for KSA patient data
- HIPAA BAA path for US-bound workloads — BAA execution is engagement-specific
- De-identification workflow — pre-annotation or in-pipeline
- Sovereign tenancy patterns for hospitals + research centres
- On-premise option for sensitive workloads where the customer requires it
Annota8 is in early-stage operations and does not hold formal medical-data compliance certifications today[^14]; we engage on a controls-mapping basis with the customer’s compliance team.