The IAA crisis in Arabic AI eval — why standard kappa breaks
A quick refresher — and why it works most of the time
Three statistics dominate inter-annotator agreement (IAA) reporting:
- Cohen’s κ: agreement between exactly two annotators on categorical labels, corrected for chance agreement.[1]
- Fleiss’ κ: generalization to a fixed number N of ratings per item, where the N annotators need not be the same individuals from one item to the next.[2]
- Krippendorff’s α: handles missing data, ordinal/interval/ratio scales, and any number of annotators. The most general of the three.[3]
All three share a structural assumption: disagreement is noise around a single ground truth.[4] The chance-agreement correction subtracts what random labeling would produce, leaving what we hope is signal: the rate at which annotators converge on the right answer.
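To make the chance correction concrete, here is a minimal sketch of Cohen’s κ in plain Python, computed as (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected from each annotator’s own label rates. The label lists are hypothetical toy data, and returning 1.0 in the degenerate single-label case is a convention chosen for this sketch (κ is formally undefined there).

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where the two labels match.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal rates per category.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # undefined case (one shared label); convention for this sketch
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy data: 8 of 10 items agree, both annotators use each label half the time.
annotator_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]
annotator_b = ["pos", "pos", "neg", "neg", "pos", "neg", "neg", "pos", "pos", "neg"]
kappa_ab = cohen_kappa(annotator_a, annotator_b)
```

Raw agreement here is 0.8, but half of that is what two coin-flipping annotators with the same label rates would reach, so κ comes out near 0.6.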
This works beautifully when the task is clean — segmenting a tumor boundary on a CT scan, classifying a transaction as “consumer goods” vs “industrial equipment”, matching a name against an OFAC sanctions list. The categories are well-defined; the chance baseline is predictable; reasonable annotators converge.
It also works when the task has high cardinality but stable definitions — bounding boxes on cars, lane markings, pedestrians. We use Cohen’s κ as the default health check on most of our queues. It is the right tool when its assumptions hold.
Where it breaks on Arabic — four diagnostic patterns
Pattern 1 — Dialect identification (gradient, not categorical)
Ask three Egyptian annotators to label whether a recording is Cairene, Alexandrian, or Saidi. The phonetic boundary between Cairene and Alexandrian is not a wall; it is a gradient that moves with the speaker’s age, education, time spent in each city, and the formality of the recording. A 38-year-old speaker who grew up in Alexandria but has lived in Cairo for 14 years sits on the boundary by construction.
What κ sees (Fleiss’ κ across the three annotators, or Cohen’s κ on each pair): three annotators “disagree.” What is actually happening: three annotators are reporting different points on a continuous distribution, and the question “is this Cairene or Alexandrian” has no categorical answer to begin with.[5] Per-annotator dialect comfort makes it worse: a Cairene annotator may hear residual Alexandrian features that a Saidi annotator does not.
In our experience, this pattern produces κ values that the standard playbook reads as “annotators are unreliable, retrain or replace.” The annotators are correct. The metric is wrong.
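A small sketch of how boundary speakers pull the headline number down, using Fleiss’ κ for three annotators. The recordings and labels are hypothetical ("C" Cairene, "A" Alexandrian, "S" Saidi), and the function is a plain-Python illustration rather than a production implementation.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' kappa; ratings is a list of per-item label lists of equal length."""
    n_items, n_raters = len(ratings), len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")
    counts = [Counter(r) for r in ratings]
    # Per-item agreement: share of annotator pairs that chose the same label.
    p_items = [
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the pooled category proportions.
    totals = Counter()
    for cnt in counts:
        totals.update(cnt)
    p_e = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    if p_e == 1.0:
        return 1.0  # undefined case; convention for this sketch
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical recordings: clear-cut speakers vs. speakers on the Cairo/Alexandria boundary.
clear_items = [["C", "C", "C"], ["A", "A", "A"], ["S", "S", "S"], ["C", "C", "C"]]
boundary_items = [["C", "A", "C"], ["A", "C", "A"], ["C", "A", "A"], ["A", "C", "C"]]

kappa_clear = fleiss_kappa(clear_items)
kappa_mixed = fleiss_kappa(clear_items + boundary_items)
```

On the clear-cut items alone κ is perfect; adding the boundary speakers drops it below 0.5 even though nobody mislabelled a clear case.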
Pattern 2 — Sentiment with cultural context
In illustrative internal sentiment evaluations across Egyptian, KSA Najdi, and Levantine annotators working on the same tweets with the same guidelines, we see a consistent shape: cross-dialect κ between Egyptian and Levantine annotators sits noticeably higher than κ between Egyptian and KSA Najdi annotators, and the latter typically falls in the range a standard reading would flag as concerning.
The breakdown shows the pattern: Egyptian annotators read sarcasm aggressively (the default complaint register, as covered in the dialect-sentiment piece). KSA Najdi annotators read the same sarcasm as polite restraint and label it neutral. Levantine annotators read humor where Egyptians read frustration.[6] The disagreement is systematic by demographic, not random. A single aggregate κ averages these systematic shifts together and produces a meaningless number.
Pattern 3 — Quranic recitation correctness (Tajweed)
Tajweed, the set of rules for correct Quranic recitation, is governed by multiple recognized schools (the qira’at) with codified, principled differences in vowel length, assimilation, and articulation. An annotator from the Hafs tradition and an annotator from the Warsh tradition will disagree on specific recitation segments and both will be correct under their respective schools.[7]
Standard κ treats this as noise. It is not noise. It is principled disagreement with a clear taxonomy. Forcing the metric anyway pushes operators toward the wrong remedy — adjudicate, retrain, replace — when the right remedy is to record which school each annotator follows and report agreement within and across schools separately.
Pattern 4 — Religious sensitivity grading
In our operational experience, a multi-confessional pool (Sunni, Shia, Coptic, Maronite annotators) grading whether a piece of content is “religiously sensitive” or “appropriate” routinely produces depressed κ on the contested items. A Sunni annotator and a Shia annotator may legitimately disagree about whether referring to a particular historical figure with a specific honorific is neutral or charged. A Coptic and a Maronite annotator may disagree about whether a depiction of a saint crosses a line.
These are not labeling errors. They are theological positions. The metric, asked to summarize them, reports “low quality.” The work is fine; the lens is wrong.
The symptoms that follow
When the metric is wrong but the team trusts it:
- Artificially low κ headlines. Customer-facing dashboards show depressed aggregate κ on dialect or sentiment work; the customer reads “your annotation quality is poor” and demands a remediation plan. The plan does not fix the problem because the problem is not annotation quality.
- False drift signals. Week-over-week κ slips by several points — the team flags drift, opens an investigation, pulls a senior linguist off active work. The underlying cause is a demographic shift in the active annotator pool, not a quality issue.
- Expensive over-adjudication. Items below the κ threshold get routed to senior review. On principled-disagreement tasks, that means routing the majority of items, burning adjudication budget on disputes that have no resolution.
- Annotator attrition. Annotators told they are “below threshold” on tasks where they are demonstrably correct quit. Good Egyptian linguists and Tajweed-trained annotators are scarce; losing them to a bad metric is an unforced error.
The better metric stack
We do not abandon κ. We layer.
Disagreement-decomposed κ
Decompose total disagreement into systematic (annotator A leans positive, annotator B leans neutral, every time) and random (independent noise around a true label). A standard κ blends them; a decomposed view surfaces the systematic component as its own signal, a hint that the task may need stratification or a guideline clarification rather than retraining. (This is a methodology family rather than a single canonical statistic: weighted κ for ordinal data, Krippendorff’s α partitioned by metric, hierarchical extensions.)[8]
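One simple way to make the split concrete for a pair of annotators, offered as an illustrative choice rather than the canonical decomposition: the total variation distance between the two annotators’ label rates is the minimum disagreement any pair with those rates must show, so it can serve as the systematic part, and whatever observed disagreement is left over is the residual. The data below is hypothetical.

```python
from collections import Counter


def decompose_disagreement(labels_a, labels_b):
    """Split pairwise disagreement into a marginal-shift part and a residual."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a != b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = set(freq_a) | set(freq_b)
    # Total variation distance between the two marginal label distributions:
    # a lower bound on disagreement forced by the annotators' differing label rates.
    systematic = 0.5 * sum(abs(freq_a[c] - freq_b[c]) for c in categories) / n
    return {
        "observed": observed,
        "systematic": systematic,
        "residual": observed - systematic,
    }


# Hypothetical pair 1: A leans positive, B leans neutral on the same three items.
lean_a = ["pos"] * 6 + ["neu"] * 4
lean_b = ["pos"] * 3 + ["neu"] * 7
leaning = decompose_disagreement(lean_a, lean_b)

# Hypothetical pair 2: identical label rates, but the labels never line up.
noisy_a = ["pos", "neu"] * 5
noisy_b = ["neu", "pos"] * 5
noisy = decompose_disagreement(noisy_a, noisy_b)
```

In the first pair all of the disagreement is explained by the shift in label rates, which points at stratification or a guideline question. In the second pair none of it is, which points at the items or the annotators.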
Per-demographic-stratum κ
Compute κ within each demographic stratum separately, then report a panel: Egyptian-Cairene annotators vs each other, KSA Riyadh annotators vs each other, Levantine annotators vs each other. A within-stratum κ of 0.78 with a cross-stratum κ of 0.49 is a clean diagnosis — annotators within their own cultural context are reliable; the disagreement is across cultural contexts and is principled.
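A minimal sketch of such a panel: pairwise Cohen’s κ, averaged separately for pairs inside each stratum and for pairs that cross strata. The annotator IDs, stratum names, and sentiment labels are hypothetical, and averaging pairwise κ is one reasonable aggregation among several.

```python
from collections import Counter, defaultdict
from itertools import combinations
from statistics import mean


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators on the same items."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1 - p_e)


def stratified_kappa(labels, stratum):
    """labels: {annotator: [label per item]}; stratum: {annotator: stratum name}."""
    pairs = defaultdict(list)
    for x, y in combinations(sorted(labels), 2):
        # Same-stratum pairs go under the stratum name, the rest under "cross-stratum".
        key = stratum[x] if stratum[x] == stratum[y] else "cross-stratum"
        pairs[key].append(cohen_kappa(labels[x], labels[y]))
    return {key: mean(values) for key, values in pairs.items()}


# Hypothetical data: the first two tweets are sarcastic complaints.
labels = {
    "eg1": ["neg", "neg", "neg", "neg", "pos", "pos", "neu", "neu"],
    "eg2": ["neg", "neg", "neg", "neg", "pos", "pos", "neu", "pos"],
    "nj1": ["neu", "neu", "neg", "neg", "pos", "pos", "neu", "neu"],
    "nj2": ["neu", "neu", "neu", "neg", "pos", "pos", "neu", "neu"],
}
stratum = {"eg1": "Egyptian", "eg2": "Egyptian", "nj1": "Najdi", "nj2": "Najdi"}
panel = stratified_kappa(labels, stratum)
```

With this toy data both within-stratum values land near 0.8 while the cross-stratum value is below 0.5, the same shape as the diagnosis described above.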
Item-level confidence with selective adjudication
Instead of a blanket κ threshold over the whole batch, score each item’s annotator-level disagreement and route only the contested items to adjudication. In our experience, the per-item view on a typical batch separates cleanly into a large head of unanimous items, a middle band of partial agreement, and a small tail of meaningful disagreement — and the tail is where senior time should go, not the whole batch the threshold rule would otherwise pull in.
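A sketch of per-item routing, assuming label entropy as the disagreement score. Both the score and the 0.9-bit cut-off are hypothetical choices for illustration; any per-item disagreement measure could be substituted.

```python
from collections import Counter
from math import log2


def item_entropy(item_labels):
    """Shannon entropy (bits) of the labels given to one item; 0 means unanimous."""
    n = len(item_labels)
    return -sum((c / n) * log2(c / n) for c in Counter(item_labels).values())


def triage(batch, contested_at=0.9):
    """Bucket items as unanimous / partial / contested; contested_at is a hypothetical cut-off."""
    buckets = {"unanimous": [], "partial": [], "contested": []}
    for item_id, item_labels in batch.items():
        h = item_entropy(item_labels)
        if h == 0:
            buckets["unanimous"].append(item_id)
        elif h < contested_at:
            buckets["partial"].append(item_id)
        else:
            buckets["contested"].append(item_id)
    return buckets


# Hypothetical batch with five annotators per item.
batch = {
    "t1": ["neg", "neg", "neg", "neg", "neg"],  # unanimous
    "t2": ["neg", "neg", "neg", "neg", "neu"],  # 4-1 split
    "t3": ["neg", "neg", "neg", "neu", "neu"],  # 3-2 split
    "t4": ["neg", "neg", "neu", "neu", "pos"],  # 2-2-1 split
}
buckets = triage(batch)
adjudication_queue = buckets["contested"]
```

Only the items in `adjudication_queue` go to senior review; the unanimous head and the partial-agreement band stay out of the adjudication budget.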
Bayesian rater models — Dawid-Skene, MACE
When the task is dominated by principled disagreement, the right framing is to estimate each annotator’s reliability and bias as latent parameters and infer the consensus label jointly. The Dawid-Skene EM model[9] and MACE (Multi-Annotator Competence Estimation)[10] are the standard reference implementations. Dawid-Skene gives a per-annotator confusion matrix; MACE gives a per-annotator competence parameter; both yield item-level posteriors and an annotator-reliability score that does not collapse principled disagreement into “low quality.”
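For readers who have not seen the model, here is a compact sketch of Dawid-Skene style EM in plain Python. It is a simplified maximum-likelihood version: the additive smoothing constant, the fixed iteration count, and the toy votes are assumptions of this sketch, not part of the original paper or of any particular library.

```python
def dawid_skene(votes, classes, n_iter=50, smoothing=0.01):
    """votes: {item: {annotator: label}}. Returns (posteriors, confusion).

    posteriors[item][k]         ~ P(true class is k | votes)
    confusion[annotator][k][l]  ~ P(annotator says l | true class is k)
    """
    items = list(votes)
    annotators = sorted({a for v in votes.values() for a in v})
    # Initialise posteriors with each item's vote shares (a soft majority vote).
    post = {
        i: {k: sum(l == k for l in votes[i].values()) / len(votes[i]) for k in classes}
        for i in items
    }
    confusion = {}
    for _ in range(n_iter):
        # M-step: class priors and per-annotator confusion matrices.
        prior = {k: sum(post[i][k] for i in items) / len(items) for k in classes}
        for a in annotators:
            confusion[a] = {}
            for k in classes:
                row = {l: smoothing for l in classes}  # smoothing avoids zero rows
                for i in items:
                    if a in votes[i]:
                        row[votes[i][a]] += post[i][k]
                total = sum(row.values())
                confusion[a][k] = {l: row[l] / total for l in classes}
        # E-step: item posteriors given the current priors and confusion matrices.
        for i in items:
            score = {}
            for k in classes:
                p = prior[k]
                for a, l in votes[i].items():
                    p *= confusion[a][k][l]
                score[k] = p
            z = sum(score.values())
            post[i] = {k: score[k] / z for k in classes}
    return post, confusion


# Hypothetical votes: annotator "c" labels everything "ok", including items a and b flag.
votes = {
    "i1": {"a": "ok", "b": "ok", "c": "ok"},
    "i2": {"a": "ok", "b": "ok", "c": "ok"},
    "i3": {"a": "sens", "b": "sens", "c": "ok"},
    "i4": {"a": "sens", "b": "sens", "c": "ok"},
    "i5": {"a": "ok", "b": "ok", "c": "ok"},
    "i6": {"a": "sens", "b": "sens", "c": "ok"},
}
posteriors, confusion = dawid_skene(votes, classes=["ok", "sens"])
```

The useful output for this discussion is `confusion["c"]`: it describes annotator c’s consistent pattern on "sens" items directly, instead of folding that pattern into a single agreement score.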
These models are tractable at the batch sizes we run in production today — the reason they are not the default is not compute; it is that they are harder to explain on a customer dashboard than a single κ number.
Soft labels
For inherently fuzzy categories — dialect on a gradient, sentiment intensity, religious sensitivity by audience — stop forcing hard categorical labels. Collect probability distributions per category (an annotator says “70% Cairene, 25% Alexandrian, 5% Saidi”) and evaluate models against the distribution, not a single mode. Soft labels carry the uncertainty that the task actually contains; hard labels throw it away and then complain that κ is low.
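A minimal sketch of evaluating against a distribution instead of a single mode. Averaging the annotators’ distributions and scoring with KL divergence are assumptions of this sketch (cross-entropy or Jensen-Shannon distance would work the same way), and all numbers are hypothetical.

```python
from math import log


def aggregate_soft_labels(annotations, classes):
    """Average per-annotator probability distributions into one target distribution."""
    n = len(annotations)
    return {c: sum(a.get(c, 0.0) for a in annotations) / n for c in classes}


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted) in nats; 0 when the prediction matches the target."""
    return sum(
        p * log(p / max(predicted.get(c, 0.0), eps))
        for c, p in target.items()
        if p > 0
    )


classes = ["Cairene", "Alexandrian", "Saidi"]
# Hypothetical soft labels from two annotators for one recording.
annotations = [
    {"Cairene": 0.70, "Alexandrian": 0.25, "Saidi": 0.05},
    {"Cairene": 0.50, "Alexandrian": 0.45, "Saidi": 0.05},
]
target = aggregate_soft_labels(annotations, classes)

# Two hypothetical model outputs with the same top-1 label.
hedged_model = {"Cairene": 0.60, "Alexandrian": 0.35, "Saidi": 0.05}
overconfident_model = {"Cairene": 0.98, "Alexandrian": 0.01, "Saidi": 0.01}

hedged_loss = kl_divergence(target, hedged_model)
overconfident_loss = kl_divergence(target, overconfident_model)
```

Both models would score identically under hard-label accuracy, since both pick Cairene. The distributional score separates them: the overconfident model is penalized for ignoring the Alexandrian mass the annotators reported.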
How Annota8 routes between them
Our default by task class:
- Bounding boxes, sanctions matches, billing classification, named entity recognition on news, OCR confidence: Cohen’s κ or Fleiss’ κ as the default. Threshold gates work. Aggregate dashboards are honest.
- Dialect identification, sentiment with cultural context: demographic-stratified κ. Within-stratum + cross-stratum reported separately. The customer sees both.
- Tajweed correctness, religious sensitivity, principled-disagreement labeling: Bayesian rater model (Dawid-Skene or MACE depending on label structure). Per-annotator reliability surfaced. Soft labels collected upstream.
- Any task showing systematic demographic disagreement: disagreement-decomposed κ as a diagnostic, escalated to stratified or Bayesian reporting if the systematic component dominates.
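The routing defaults above can be written down as a small lookup. This is only a sketch of the logic as described in the list: the task-class identifiers, the method names, and the 0.5 dominance threshold are hypothetical labels invented for illustration, not Annota8’s actual configuration.

```python
# Hypothetical identifiers; the mapping mirrors the defaults listed above.
DEFAULT_METHOD = {
    "bounding_boxes": "cohen_or_fleiss_kappa",
    "sanctions_matching": "cohen_or_fleiss_kappa",
    "billing_classification": "cohen_or_fleiss_kappa",
    "news_ner": "cohen_or_fleiss_kappa",
    "ocr_confidence": "cohen_or_fleiss_kappa",
    "dialect_identification": "stratified_kappa",
    "cultural_sentiment": "stratified_kappa",
    "tajweed_correctness": "bayesian_rater_model",
    "religious_sensitivity": "bayesian_rater_model",
}


def pick_iaa_method(task_class, systematic_share=None, dominance_threshold=0.5):
    """Return the default IAA method, escalating when systematic disagreement dominates.

    systematic_share: optional fraction of observed disagreement attributed to
    systematic (per-annotator or per-stratum) shifts by a decomposition diagnostic.
    dominance_threshold: hypothetical cut-off for "the systematic component dominates".
    """
    if task_class not in DEFAULT_METHOD:
        raise KeyError(f"no default IAA method registered for {task_class!r}")
    method = DEFAULT_METHOD[task_class]
    escalate = systematic_share is not None and systematic_share > dominance_threshold
    if method == "cohen_or_fleiss_kappa" and escalate:
        return "stratified_or_bayesian"
    return method


default_choice = pick_iaa_method("news_ner")
escalated_choice = pick_iaa_method("news_ner", systematic_share=0.7)
```

The point of keeping the escalation inside the lookup is that a task on the simple-κ default can still be moved to stratified or Bayesian reporting when the diagnostic says its disagreement is mostly systematic.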
The customer dashboard surfaces per-category and per-stratum IAA, not just aggregate. The success-metrics dashboard carries the panel; the quarterly business review walks through it explicitly.
This complicates customer reporting — that is the honest cost. A single κ headline is easier to ship to a procurement deck. A panel of stratified statistics needs a 90-second explanation. We pay that cost when the task warrants it. On unambiguous tasks we keep the simple number, because the simple number is correct.
What we will not do
We will not report a single aggregate κ on a task where we know the disagreement is principled. We will not adjudicate principled disagreement into a fake consensus to make a metric look healthier. We will not optimize annotator selection against a metric that punishes correct theological or dialectal positions. And we will not let a customer’s procurement template override the right metric for the task — we will explain the panel and ship the right numbers.
The IAA stack matters because everything downstream rests on it. The worker-reliability score, the calibration set, the golden batch, the dispute-resolution workflow — all of it inherits whatever the IAA metric calls “quality.” If the metric is wrong on Arabic tasks, every downstream control is wrong with it.
Closing read
Inter-annotator agreement on Arabic AI tasks is not a “tighten the guidelines” problem. It is a metric-selection problem. The teams that get Arabic eval right over the next two years will not be the ones with the highest κ on dashboards; they will be the ones honest enough to use κ where it works, stratify where stratification matters, and switch to Bayesian rater models where disagreement is principled. The rest is theater — a single number that flatters procurement and lies about the work.
References
Footnotes
[1] Cohen, J. (1960). “A Coefficient of Agreement for Nominal Scales.” Educational and Psychological Measurement, 20(1), 37-46. Reference summary: https://en.wikipedia.org/wiki/Cohen%27s_kappa.
[2] Fleiss, J.L. (1971). “Measuring Nominal Scale Agreement Among Many Raters.” Psychological Bulletin. Reference summary: https://en.wikipedia.org/wiki/Fleiss%27_kappa. Strictly, Fleiss’ kappa at N=2 reduces to Scott’s pi (1955) rather than Cohen’s kappa; the “generalization to N annotators” framing is the standard shorthand used in the literature.
[3] Krippendorff, K. Canonical references summarized at https://en.wikipedia.org/wiki/Krippendorff%27s_alpha and the SAGE Encyclopedia of Communication Research Methods entry on intercoder reliability techniques.
[4] Fornaciari, T., Uma, A., et al. (2021). “Beyond Black & White: Leveraging Annotator Disagreement via Soft-Label Multi-Task Learning.” NAACL. https://aclanthology.org/2021.naacl-main.204.pdf. See also Aroyo & Welty (2015) on the “Crowd Truth” framework and Plank (2022) position papers reframing human label variation as signal rather than noise.
[5] Keleg, A. et al. (2024). “Estimating the Level of Dialectness Predicts Interannotator Agreement in Multi-dialect Arabic Datasets.” https://arxiv.org/pdf/2405.11282. The paper finds that “samples with higher dialectness scores are harder to label” and that IAA varies substantially across dialect groups.
[6] Keleg et al. (2024), as above: “certain religious expressions used in the Levant to express positive sentiment are not considered to have sentiment in other Arab regions like the Gulf countries. Arabic-speaking annotators are harsher in labeling hate speech and less capable of identifying sarcasm when annotating samples written in dialects that the annotators do not speak.”
[7] On the canonical qira’at and the codified differences between Hafs an Asim and Warsh an Nafi (madd, imalah, hamzah pronunciation, certain consonants), see overview at https://en.wikipedia.org/wiki/Warsh and accessible scholarly summaries at the al-Walid Academy and Mubarak Academy. The extrapolation from qira’at scholarship to annotation-pipeline IAA is the author’s analytical contribution rather than a cited empirical result.
[8] Decomposition of disagreement into systematic (bias) and random components is standard in measurement theory; see weighted kappa for ordinal data (Cohen, 1968), Krippendorff’s alpha partitioned by metric, and the clinical biostatistics literature on bias-prevalence-adjusted kappa. Reference: https://en.wikipedia.org/wiki/Cohen%27s_kappa (weighted kappa section).
[9] Dawid, A.P. & Skene, A.M. (1979). “Maximum Likelihood Estimation of Observer Error-Rates Using the EM Algorithm.” Journal of the Royal Statistical Society, Series C (Applied Statistics), 28(1): 20-28. https://ideas.repec.org/a/bla/jorssc/v28y1979i1p20-28.html. Bayesian and hierarchical extensions are common in current practice.
[10] Hovy, D., Berg-Kirkpatrick, T., Vaswani, A., Hovy, E. (2013). “Learning Whom to Trust with MACE.” NAACL 2013. Paper: https://www.semanticscholar.org/paper/Learning-Whom-to-Trust-with-MACE-Hovy-Berg-Kirkpatrick/624a5c97be5d3ec63d48c34db25726008e5d92a4. Reference implementation: https://github.com/dirkhovy/MACE.