How annotation is priced in 2026: a transparent buyer’s guide

Why pricing is opaque

I know this because I bought before I sold. I was a V7, Kognic, and Scale AI customer before founding Annota8 [1], and I noticed a pattern: the proposal starts with a clean per-unit price, then grows through invoices for “rework,” “complex tasks,” “SME hours,” “extra QA review,” “API integration,” and “data residency.” The final bill is 2x-4x the initial quote. That is not because the vendor lied, but because per-unit pricing alone cannot represent real complexity.

In 2026 I’m building Annota8 with a different buyer conversation: here is my math, here are my drivers, show me your use case and I’ll show you the cost breakdown before you sign. This post is that same breakdown — open, no specific quotes, so you can evaluate any vendor’s proposal yourself.

Driver 1: workforce tier

The first and largest cost is the human doing the labeling. There is no single “annotator” — there are at least five tiers, each with a different wage market.

| Tier | Who | Hourly wage range (USD, global) | Suited for |
| --- | --- | --- | --- |
| Junior | Bachelor’s degree, 0–2 years annotation experience, calibrated on the platform | $4-8 | Simple bbox, image classification, tagging |
| Senior | Bachelor’s + 2+ years annotation experience, fluent on the platform | $8-15 | Semantic segmentation, standard Arabic NER, QA review |
| Domain specialist | University degree + domain context | $15-30 | Legal NER, financial NER, technical content labeling |
| PhD linguist | Linguistics + guideline design | $35-60 | Guideline design, IAA calibration, corpus audit |
| Practitioner SME | Radiologist, pathologist, lawyer, pharmacist | $80-250 | Sensitive medical/legal/financial labeling, final adjudication |

These ranges reflect general industry observation, not a single published wage survey; treat them as ballpark and pressure-test against your local labor market. Important: these are wages, not prices. The vendor price adds QA overhead, management, technology, and margin on top. A healthy multiplier from wage to per-hour price is typically 2-3x, and higher for hard-to-staff SME tiers.
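To make the wage-versus-price distinction concrete, here is a minimal sketch that applies the 2-3x multiplier to the wage table. The function name and the dictionary layout are illustrative assumptions, and applying the same 2-3x to the SME tier is a simplification, since that tier usually carries a higher multiplier.

```python
# Wage ranges (USD/hour) copied from the tier table above.
TIER_WAGES = {
    "junior": (4, 8),
    "senior": (8, 15),
    "domain_specialist": (15, 30),
    "phd_linguist": (35, 60),
    "practitioner_sme": (80, 250),
}


def hourly_price_range(wage_low, wage_high, mult_low=2.0, mult_high=3.0):
    """Return the (low, high) per-hour vendor price implied by a wage range.

    Assumption: the 2-3x wage-to-price multiplier is applied uniformly.
    """
    return wage_low * mult_low, wage_high * mult_high


for tier, (wage_low, wage_high) in TIER_WAGES.items():
    low, high = hourly_price_range(wage_low, wage_high)
    print(f"{tier}: ${low:.0f}-${high:.0f}/hour")
```

If a vendor's quoted hourly rate falls well outside the band this produces for the tier they claim to staff, that gap is worth a direct question.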

Question to ask any vendor: “What is the ratio of junior to senior to specialist annotators on my team?” If they cannot answer with numbers, the team is undefined and quality is unpredictable.

Driver 2: QA overhead

Raw annotation is the cheap part. The expensive part is proving it is correct.

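The typical structure of serious QA in 2026 combines sampled review, PhD-level calibration, gold-standard units injected into the work stream, and an escalation queue for disputed items.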

All of this adds cost. Quick math: a task with 1000 produced units + 10% review + 3% PhD calibration + 5% gold-standard = 1180 actual paid units (+18% over raw annotation). This is before escalation and rework costs.
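The quick math above can be sketched as a small function so you can plug in a vendor's stated rates. The parameter names are assumptions made for illustration; the default rates are the ones from the example.

```python
def paid_units(produced, review_rate=0.10, calibration_rate=0.03, gold_rate=0.05):
    """Units actually paid for once QA layers are added to raw production.

    Each rate is a fraction of produced units. Escalation and rework
    are deliberately excluded, as in the example above.
    """
    return produced * (1 + review_rate + calibration_rate + gold_rate)


total = paid_units(1000)
overhead = total / 1000 - 1
print(f"{total:.0f} paid units, {overhead:.0%} overhead")
```

Changing a single rate shows how sensitive the bill is: moving review from 10% to 25% of units lifts the overhead from 18% to 33%.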

Question for the vendor: “Describe the specific QA structure for my workload. How many reviewers per annotator? What is the review rate? How many gold-standard units injected per week? Who adjudicates escalation queues?”

Driver 3: baseline throughput by task type

This is the biggest pricing trap: buyers assume unit-per-hour throughputs that are unrealistically high. The 2026 reality, as observed in commercial annotation work:

| Task type | Baseline throughput (junior-senior) | Notes |
| --- | --- | --- |
| Image classification (single category) | 400-800 / hour | The simplest |
| Bbox (5 categories, medium difficulty) | 80-200 / hour | Depends on category density per frame |
| Semantic segmentation (multi-category) | 8-25 / hour | The slowest in computer vision |
| Instance segmentation | 5-15 / hour | Slower than semantic for dense instances |
| Arabic NER (standard text) | 1000-2500 tokens / hour | Depends on entity density |
| Intent classification (narrow domain) | 150-300 / hour | Depends on ontology complexity |
| RLHF preference pair | 8-25 pairs / hour | Depends on response length and rubric depth |
| SFT response (open-domain scenario) | 3-8 / hour | Depends on average response length |
| Audio transcription + diarization (Arabic MSA) | 0.3-0.6x realtime | 1 hour of audio = 1.7-3.3 work-hours |
| Audio transcription + diarization (Gulf dialect) | 0.2-0.4x realtime | 1 hour of audio = 2.5-5 work-hours |
| Audio transcription + diarization (Egyptian code-switched) | 0.15-0.3x realtime | 1 hour of audio = 3.3-6.7 work-hours |

Ranges are wide because difficulty varies and no single public benchmark covers all of these task types; numbers come from operational experience and should be re-measured on your own data. Any vendor quoting “200 bbox/hour” for every workload, without seeing the data first, is guessing. A smarter buyer question: “Show me 50 units of my actual data and give me a measured throughput estimate after timing with two trained annotators.” That is a transparency test.
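As a rough sketch of that transparency test, the snippet below turns a timed pilot into a units-per-hour figure and converts a realtime factor into work-hours per audio hour. The function names and the pilot timings are made up for illustration; only the conversion logic comes from the table.

```python
def measured_throughput(units_done, minutes_spent):
    """Units per hour for one annotator, from a timed pilot."""
    return units_done / (minutes_spent / 60)


def work_hours_per_audio_hour(realtime_factor):
    """Convert a realtime factor (e.g. 0.3x) into work-hours per audio hour."""
    return 1 / realtime_factor


# Hypothetical pilot: two trained annotators each label the same 50 units.
# Timings (in minutes) are invented for this example.
pilot = {"annotator_a": (50, 24), "annotator_b": (50, 31)}
rates = {name: measured_throughput(u, m) for name, (u, m) in pilot.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.0f} units/hour")
print(f"spread to quote: {min(rates.values()):.0f}-{max(rates.values()):.0f} units/hour")

# Gulf dialect transcription at 0.2-0.4x realtime
print(f"{work_hours_per_audio_hour(0.4):.1f}-{work_hours_per_audio_hour(0.2):.1f} work-hours per audio hour")
```

Quoting the spread between the two annotators, rather than a single average, keeps the estimate honest about how much the data itself varies.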

Driver 4: modality and difficulty multipliers

The baseline throughputs above assume “medium” difficulty. Reality multiplies:

| Factor | Multiplier |
| --- | --- |
| High entity/instance density (2x average) | 1.3-1.8x time |
| Complex ontology (50+ categories) | 1.5-2.5x time |
| Class boundary ambiguity | 1.4-2.0x time |
| Multi-lingual text in a single unit | 1.3-1.7x time |
| High technical content (medical, legal, financial) | 1.5-2.5x annotator wage |
| Sensitive content (harmful content moderation, PHI) | 1.3-1.6x wage + wellness guideline |
| Precise time coding (frame-by-frame) | 1.8-3.0x time |

Ask the vendor to break the numbers down: “What is throughput for X high-difficulty vs Y medium-difficulty?” If the answer is the same, the price does not reflect actual workload. You will renegotiate later — or quality will fall.
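A minimal sketch of how these multipliers move the numbers: time multipliers divide throughput, wage multipliers raise the hourly wage. Treating several multipliers as compounding multiplicatively is an assumption on my part; real workloads may overlap less cleanly. The function names are illustrative.

```python
from math import prod


def effective_throughput(baseline_per_hour, time_multipliers):
    """Baseline throughput divided by every applicable time multiplier."""
    return baseline_per_hour / prod(time_multipliers)


def effective_wage(base_wage, wage_multipliers):
    """Base hourly wage scaled by every applicable wage multiplier."""
    return base_wage * prod(wage_multipliers)


# Arabic NER at 1500 tokens/hour with high entity density (1.4x time)
print(round(effective_throughput(1500, [1.4])))  # 1071

# Same task with a complex ontology on top (assumed 2.0x time, compounding)
print(round(effective_throughput(1500, [1.4, 2.0])))  # 536
```

Running a vendor's "medium" and "high" difficulty quotes through the same function shows quickly whether the quoted price tracks the workload or ignores it.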

Driver 5: IAA target premium

Inter-annotator agreement (IAA), usually measured by Cohen’s kappa or Krippendorff’s alpha, is the real quality target [2].

| IAA target | Interpretation | Cost premium over baseline |
| --- | --- | --- |
| kappa 0.61-0.80 | Substantial agreement (Landis & Koch); valid for most production models | Baseline to +25% |
| kappa 0.81-1.00 | Almost perfect agreement (Landis & Koch); required for medical, legal, financial workloads | +40-80% |
| kappa > 0.90 | Required for peer-reviewed research and the most sensitive workloads | +100-200% |

Note: the bands above use the Landis & Koch (1977) interpretation scale [3]: 0.00-0.20 slight, 0.21-0.40 fair, 0.41-0.60 moderate, 0.61-0.80 substantial, 0.81-1.00 almost perfect. Production NLP/CV systems typically need kappa in at least the “substantial” band to be useful at all, so the first row of the table starts at 0.61. The cost-premium percentages reflect operational practice rather than a single published benchmark.

The premium comes from: replicated annotation (n=3 instead of n=1), longer calibration loops, more SME adjudication, more guideline iterations. Do not demand kappa > 0.8 if your workload does not need it; you would be paying a 40-80% premium for no reason.
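For readers who want to check a vendor's reported agreement themselves, here is a minimal sketch of Cohen's kappa for two annotators, mapped onto the Landis & Koch bands. It covers only the two-annotator, categorical-label case; the sample labels are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's own label distribution.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


def landis_koch_band(kappa):
    """Map a kappa value to its Landis & Koch (1977) interpretation."""
    if kappa < 0:
        return "poor"
    for upper, name in [(0.20, "slight"), (0.40, "fair"),
                        (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return name
    return "almost perfect"


# Invented NER labels from two annotators on the same eight tokens
a = ["PER", "ORG", "LOC", "PER", "O", "O", "ORG", "PER"]
b = ["PER", "ORG", "LOC", "ORG", "O", "O", "ORG", "PER"]
k = cohens_kappa(a, b)
print(f"kappa = {k:.2f} ({landis_koch_band(k)})")
```

Raw percent agreement on this sample is 87.5%, but kappa comes out lower because it discounts the agreement expected by chance. That gap is why a contract should name the metric, not just "agreement."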

Driver 6: deployment premium

The deployment model fundamentally changes the math:

| Deployment model | Premium over multi-tenant SaaS |
| --- | --- |
| Multi-tenant SaaS | Baseline |
| SaaS in customer VPC | +50-150% |
| Sovereign tenant (vendor-managed, customer account) | +80-200% |
| On-premise air-gapped | +200-500% (+ one-time costs) |

The premium comes from: higher per-annotator infrastructure cost, business travel, specialized security, contract management overhead. Saudi Arabia’s PDPL (Royal Decree M/19, in force since 14 September 2023, with the one-year grace period ending 14 September 2024) imposes cross-border transfer conditions and additional safeguards on sensitive categories [4]; depending on category and transfer mechanism, sovereign tenancy is often the practical compliance path. For general workloads, sovereign tenancy is wasted spend.

Driver 7: Arabic and dialect premium

Arabic is not “English with a different alphabet.” It is a language family with real throughput differences.

| Language / dialect | Premium over American English |
| --- | --- |
| American English | Baseline |
| British / Australian English | 0% |
| European French | +10-15% |
| Arabic MSA (Fusha) | +10-20% |
| Gulf Arabic | +25-40% |
| Egyptian Arabic | +20-35% |
| Levantine / Maghrebi Arabic | +30-50% |
| Code-switched content (Arabic ↔ English) | +25-45% |
| Script-switched content (Arabic in Latin letters, Arabizi) | +35-60% |

Reasons for the premium: absence of orthographic standardization, context differences (labeling “bank” requires distinguishing financial institution vs riverbank), shortage of qualified labor at the senior tier, tooling gap (most annotation platforms were designed for English first, with Arabic added later). These premium ranges reflect operational experience, not a single published wage survey.

How to build a unified unit-price model

To evaluate any proposal, build it yourself:

Step 1: Set the weighted hourly rate for the workforce mix your workload needs. Example: 60% senior ($12) + 30% specialist ($22) + 10% PhD ($45) = (0.6 × $12) + (0.3 × $22) + (0.1 × $45) = $7.20 + $6.60 + $4.50 = $18.30/hour weighted.

Step 2: Determine units-per-hour throughput by taking the baseline from the tables above and dividing by the applicable time multipliers. Example: standard Arabic NER = 1500 tokens/hour. With a 1.4x time multiplier for high entity density, 1500 / 1.4 ≈ 1071 tokens/hour effective.

Step 3: Calculate raw unit price. $18.30 / 1071 = $0.0171 / token.

Step 4: Add QA overhead. +18% (typical) = $0.0202 / token.

Step 5: Add IAA premium. kappa 0.8 = +50% = $0.0302 / token.

Step 6: Add deployment premium. Sovereign tenant = +120% = $0.0665 / token.

Step 7: Add vendor margin. 30-50% typical = $0.086 – $0.100 / token.

Result: for the described workload, the expected price is roughly $0.086 to $0.10 / token. Any proposal materially below that needs explanation (an uncounted multiplier, thin QA). Any proposal materially above it needs explanation too (margin, an exaggerated sovereign premium). The exact numbers depend heavily on your mix: running the same math with a 70/25/5 senior-skewed mix lands at ~$16.15/hr weighted and a ~12% lower per-token price.
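The seven steps chain together multiplicatively, so they fit in one short function. This is a sketch under the assumptions of the worked example, not a pricing engine: the function and parameter names are mine, and every input should be replaced with your own measured values.

```python
def weighted_hourly_rate(mix):
    """mix: list of (share, hourly_rate) pairs; shares must sum to 1."""
    total_share = sum(share for share, _ in mix)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("workforce shares must sum to 1")
    return sum(share * rate for share, rate in mix)


def price_per_unit(mix, baseline_throughput, time_multiplier,
                   qa_overhead, iaa_premium, deployment_premium, margin):
    hourly = weighted_hourly_rate(mix)                   # Step 1
    throughput = baseline_throughput / time_multiplier   # Step 2
    price = hourly / throughput                          # Step 3
    price *= 1 + qa_overhead                             # Step 4
    price *= 1 + iaa_premium                             # Step 5
    price *= 1 + deployment_premium                      # Step 6
    return price * (1 + margin)                          # Step 7


# Inputs from the worked example: 60/30/10 mix, Arabic NER, sovereign tenant
mix = [(0.60, 12.0), (0.30, 22.0), (0.10, 45.0)]
workload = dict(baseline_throughput=1500, time_multiplier=1.4,
                qa_overhead=0.18, iaa_premium=0.50, deployment_premium=1.20)

low = price_per_unit(mix, margin=0.30, **workload)
high = price_per_unit(mix, margin=0.50, **workload)
print(f"${low:.3f} to ${high:.3f} per token")  # $0.086 to $0.100 per token
```

Because every step is a multiplication, any single driver can be swapped (a different deployment model, a lower IAA target) to see its isolated effect on the final per-token price.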

Common pitfalls in annotation contracts

Pitfall 1: “Rework” ambiguity. The contract prices per-unit, but every “unsatisfactory” unit is redone at an additional fee. Solution: negotiate an included rework rate in the price (5-10% typical).

Pitfall 2: Unbounded “SME hours.” Adjudication is billed separately at $150-300/hour. Set a cap.

Pitfall 3: API integration and training trips. May be billed separately. Demand all-inclusive.

Pitfall 4: Contract termination costs. Some vendors charge data export or deletion fees. Verify in the contract.

Pitfall 5: Unwritten review rate. “We review everything” means zero commitment. Demand a specific percentage in the contract.

Pitfall 6: Opaque task router. Which task goes to which annotator drives quality. Demand a description of the task router.

What this means for the buyer

I have not quoted specific Annota8 prices in this post. If you want a number for your workload, show me the data, the workload, and deployment requirements, and I will give you the driver breakdown above applied to your specific case — with the same transparency. Browse pricing for general ranges.

References

  1. V7 Labs (official site); Kognic (official site); Scale AI (official site). Confirms that the three vendors named in the buyer-side experience anecdote exist as commercial annotation/data-services providers.
  2. Krippendorff, K., “Computing Krippendorff’s Alpha-Reliability” (Annenberg School / University of Pennsylvania repository); Artstein, R. & Poesio, M., “Inter-Coder Agreement for Computational Linguistics,” Computational Linguistics (MIT Press, 2008). Confirms Cohen’s kappa and Krippendorff’s alpha as the standard inter-annotator agreement metrics used in NLP and computational linguistics.
  3. Landis, J.R. & Koch, G.G., “The Measurement of Observer Agreement for Categorical Data,” Biometrics, Vol. 33, No. 1 (1977), pp. 159–174; reproduced in the AHRQ / NCBI Bookshelf reliability appendix (Table B). Source of the kappa interpretation thresholds (0.00–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, 0.81–1.00 almost perfect).
  4. Morgan Lewis, “Saudi Arabia Personal Data Protection Law: Transition Period Ends September 14” (Sep 2024); DLA Piper Data Protection Laws of the World, Saudi Arabia; IAPP, “Saudi PDPL’s first anniversary.” Confirms PDPL was issued under Royal Decree M/19 (16 Sep 2021), amended 27 Mar 2023, entered into force 14 Sep 2023, with the one-year transition / grace period ending 14 Sep 2024; sources also describe the cross-border transfer framework (adequacy, SCCs, BCRs, certificates of accreditation) and additional safeguards on sensitive personal data categories.