RLHF preference data for Arabic LLMs — building data that actually aligns

What RLHF preference data actually is

For each prompt:

  1. Model generates 2 (or N) candidate responses
  2. Human annotator ranks them: which is better?
  3. A reward model learns to predict the preference signal
  4. The base model is fine-tuned with reinforcement learning to maximise the score the reward model assigns to its responses[^1]
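Step 3 is often implemented with a Bradley-Terry pairwise loss, which penalises the reward model whenever it scores the rejected response above the chosen one. The post does not name a specific loss, so treat this as a minimal sketch under that assumption; `bradley_terry_loss` and its argument names are illustrative.

```python
import math


def bradley_terry_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Assumes the reward model emits one scalar score per response.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # The loss shrinks as the reward model separates the pair in the right direction.
    for chosen, rejected in [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]:
        print(chosen, rejected, round(bradley_terry_loss(chosen, rejected), 4))
```

When both responses get the same score the loss equals log 2, which is the "no information" baseline for a pairwise comparison.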

This loop produces models that respond the way humans want — at least the humans who did the ranking.

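Modern variants automate or modify parts of this: DPO trains directly on preference pairs without a separate reward model, while Constitutional AI and RLAIF replace some human rankings with model-generated ones. The core dependency on high-quality preference data remains.[^2]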
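Whichever variant consumes the data, a ranking over N candidates is commonly flattened into (prompt, chosen, rejected) pairs before training. The sketch below shows one way to do that; the `PreferencePair` record and the `ranking_to_pairs` helper are hypothetical names, not a format prescribed by any particular framework.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str    # the response the annotator ranked higher
    rejected: str  # the response the annotator ranked lower


def ranking_to_pairs(prompt: str, ranked_responses: list[str]) -> list[PreferencePair]:
    """Expand a best-first ranking into every implied pairwise preference."""
    # combinations() preserves input order, so the first element of each
    # tuple is always the higher-ranked response.
    return [
        PreferencePair(prompt, better, worse)
        for better, worse in combinations(ranked_responses, 2)
    ]


if __name__ == "__main__":
    pairs = ranking_to_pairs("prompt text", ["best", "middle", "worst"])
    for pair in pairs:
        print(pair.chosen, ">", pair.rejected)
```

A ranking of N responses yields N(N-1)/2 pairs, so a three-way ranking contributes three training comparisons.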

Why translated English preference data fails for Arabic

Problem 1: Cultural alignment is implicit

English preference data, even when translated to Arabic, encodes English-speaking cultural norms:

A model trained on translated English preferences will sound culturally American in Arabic.

Problem 2: Religious sensitivity is mis-calibrated

Islamic religious sensitivity has specific characteristics:

Western-trained annotators rarely calibrate these correctly. The result: a model that gives religiously inappropriate responses.

Problem 3: Family + gender appropriateness differs

Arabic cultural appropriateness around family + gender includes:

Problem 4: Regional political context

The MENA political context includes:

Models trained without explicit MENA political calibration produce responses that offend buyers + users.

Problem 5: Register + dialect appropriateness

Formal MSA vs informal dialect appropriateness differs by context:

A model that always responds in MSA feels stiff to a dialect-speaking user. A model that always responds in dialect feels inappropriate in formal contexts.

What good Arabic RLHF preference data looks like

Component 1: Native Arabic prompts + responses

Don’t translate. Write prompts natively in Arabic and generate responses in Arabic. The cultural signal is in the language.

Component 2: Annotator calibration

Annotators trained on:

Calibration is anchored by worked examples and ongoing inter-annotator-agreement tracking; specific volumes are scoped per engagement.
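Inter-annotator agreement can be tracked with a chance-corrected statistic such as Cohen's kappa. The post does not say which statistic is used, so this is a sketch under that assumption: two annotators label the same items with "A" or "B" (which response won), and `cohens_kappa` is an illustrative helper.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Fraction of items where the two annotators picked the same label.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    annotator_1 = ["A", "A", "B", "B"]
    annotator_2 = ["A", "A", "B", "A"]
    print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```

Tracking this per annotator pair over time makes calibration drift visible: raw agreement can look high simply because one response style wins most comparisons, and kappa corrects for that.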

Component 3: Multi-annotator agreement on hard cases

For culturally-loaded prompts, use 3-5 annotators per item and track their agreement. Adjudicate disagreements via a senior annotator and a cultural domain expert (for religious topics, a religious scholar where relevant).
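One way to operationalise this is a majority vote with an agreement threshold, where items that fall below the threshold go to adjudication instead of receiving an automatic label. The 0.8 threshold and the `aggregate_votes` helper are assumptions chosen for illustration, not values taken from the post.

```python
from collections import Counter


def aggregate_votes(votes: list[str], min_agreement: float = 0.8) -> dict:
    """Combine per-annotator votes for one item and flag weak consensus."""
    if not votes:
        raise ValueError("at least one vote is required")
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    accepted = agreement >= min_agreement
    return {
        "label": label if accepted else None,  # no label until adjudicated
        "agreement": agreement,
        "needs_adjudication": not accepted,
    }


if __name__ == "__main__":
    print(aggregate_votes(["A", "A", "A", "A", "B"]))  # accepted as "A"
    print(aggregate_votes(["A", "B", "A"]))            # routed to adjudication
```

Leaving the label empty for contested items keeps unresolved disagreements out of the training set until an adjudicator has reviewed them.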

Component 4: Adversarial / red-team subset

Explicit subset of prompts designed to test model alignment failures:

This subset catches alignment failures before deployment; sizing is set per engagement based on the buyer’s risk profile.

Component 5: Dialect-aware response evaluation

For each prompt, the appropriate response register may differ:

Annotators must evaluate response appropriateness on register match, not just content quality.
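A simple way to make register visible in the labels is to record it as its own field rather than folding it into a single quality score. The sketch below assumes a hypothetical rubric in which a register mismatch outweighs any difference in content quality; that priority rule, the 1-5 scale, and the names `ResponseRating` and `prefer` are illustrative assumptions, not the post's stated method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResponseRating:
    content_quality: int   # assumed 1-5 scale, higher is better
    register_match: bool   # does the register (MSA vs dialect) fit the prompt?


def prefer(rating_a: ResponseRating, rating_b: ResponseRating) -> str:
    """Return "A", "B", or "tie", comparing register match before content."""
    key_a = (rating_a.register_match, rating_a.content_quality)
    key_b = (rating_b.register_match, rating_b.content_quality)
    if key_a == key_b:
        return "tie"
    # Tuples compare element by element, so register match is decided first.
    return "A" if key_a > key_b else "B"


if __name__ == "__main__":
    stiff_but_accurate = ResponseRating(content_quality=5, register_match=False)
    natural_and_decent = ResponseRating(content_quality=3, register_match=True)
    print(prefer(stiff_but_accurate, natural_and_decent))  # B
```

Storing the two dimensions separately also lets a team relax the priority rule later, for example by weighting them, without re-annotating.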

Component 6: Multi-cultural calibration where customer base is MENA-wide

A KSA-only buyer’s RLHF data should calibrate to KSA cultural norms. A pan-MENA buyer (foundation-model lab serving the region) needs:

This is meaningfully more expensive but produces a model that works across MENA.

Common pitfalls

Pitfall 1: Crowd-sourcing without cultural calibration

Recruiting “Arabic-speaking annotators” without explicit cultural calibration produces inconsistent preferences, and the model learns that inconsistency.

Pitfall 2: Single-annotator preference labelling

For culturally-loaded prompts, single-annotator labels embed that annotator’s biases. Multi-annotator + adjudication is non-negotiable for serious work.

Pitfall 3: Ignoring religious sensitivity

Models that produce religiously inappropriate responses cause brand damage + customer churn + regulatory exposure (KSA + UAE both have content laws).[^3]

Pitfall 4: One-size-fits-all MSA responses

A model that responds in MSA to dialect-speaking customers feels robotic. Dialect register matching matters.

Pitfall 5: No adversarial subset

Without explicit adversarial prompts, alignment failures only surface in production. By then, users see them.

Pitfall 6: Treating RLHF as one-time

Cultural + political context evolves. A model aligned in 2024 may produce inappropriate responses to 2026 events. Ongoing RLHF iteration is part of responsible deployment.

Where Annota8 fits

Annota8 builds Arabic RLHF preference data with all six components:

See Solutions: foundation-model labs for engagement structures.

FAQ

Can I use translated ShareGPT preferences for Arabic alignment?

Not reliably. Translated preferences encode English cultural norms. The model will sound culturally American in Arabic. Native Arabic preference annotation is required for serious alignment work.

How many preference rankings do I need?

Volume is scoped per engagement against the buyer's alignment target. Volume bands depend on base-model size, domain coverage, and how much of the data is dialect-stratified or culturally-calibrated.

What's the cost difference between native Arabic RLHF annotation and crowd-sourced annotation?

Native Arabic PhD-calibrated preference annotation is materially more expensive than commodity crowd-sourcing, and the alignment quality gap is the reason buyers pay for it. Translated and crowd-only pipelines produce brand-damaging misaligned Arabic models. Exact ratios are scoped per engagement.

Is Annota8 designed for DPO + Constitutional AI + RLAIF?

Yes. The data shape is the same across DPO, Constitutional AI, and RLAIF.[^4] The exact pipeline is scoped per engagement; we do not pre-promise turnkey delivery for any specific RL training framework.

Can Annota8 bring in religious scholar consultation for Islamic content?

On a per-engagement basis, yes. For prompts touching Islamic religious topics we can scope a religiously-qualified annotator panel and bring in Shari'ah-scholar consultation, including AAOIFI standards expertise where the content touches Islamic finance.[^5] Volume, sensitivity, and timeline are agreed in writing per engagement; Annota8 does not issue religious rulings.