Most coverage of Meta's roughly $15 billion stake in Scale AI in mid-2025 has treated it primarily as an acquihire: one more founder joining a superintelligence team whose recruits reportedly earn packages in excess of $1 billion each to chase artificial general intelligence (AGI).
That’s accurate but incomplete. The deal marks a more profound strategic shift: Meta is securing control over a supply chain that now determines which labs can ship frontier models on schedule and which get stuck.
This isn’t the supply chain of silicon dominated by NVIDIA and TSMC, or the supply chain of electricity powering hyperscale data centers. It’s an emerging layer of infrastructure built to coordinate and aggregate human expert judgment: the ability to evaluate model outputs, design feedback rubrics and score reasoning chains across specialized domains.
Table of Contents
- Judgment as the New Constraint
- The 'White-Collarization' of Labeling
- Moving Beyond Simple Tasks
- What Expert Labeling Actually Looks Like
- Adoption by Legacy Firms, New Startup Entrants
- How Regulation Is Reframing the Market
- Tensions in the Judgment Supply Chain
- Strategic Implications: Temporary Moats, Real Stakes
Judgment as the New Constraint
For a decade, AI economics has assumed data abundance; the internet seemed infinite. That assumption is fraying. Research from Epoch AI projects that if dataset growth continues at current rates, models could exhaust the global supply of publicly available human-generated text between 2026 and 2032, perhaps earlier if overtraining accelerates. Ilya Sutskever’s 2024 declaration at the Conference on Neural Information Processing Systems that pretraining "will end" captured the growing recognition that the web’s training value is largely exhausted. Retraining on recycled datasets or an LLM-saturated internet is already yielding diminishing returns.
As high-quality text data runs out, the constraint is shifting from compute capacity to something more challenging to scale: expert human judgment. The response has split in two directions:
- Synthetic data, where models generate new training material. The approach has delivered real gains, though research shows it still requires domain-specific expertise to keep outputs realistic and to avoid inheriting the generating model’s limitations.
- Expert-generated labeling, where domain experts evaluate and guide model behavior. Human feedback is costly and slow, but it remains the only reliable way to teach models to make domain-specific trade-offs, interpret ambiguity and apply context the way an expert would; a minimal sketch of what one such feedback record can look like follows this list.
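To make that second path concrete, here is a minimal sketch of the kind of pairwise preference record an expert reviewer might produce for reward-model training. The schema and field names are illustrative assumptions, not any vendor's actual format.

```python
from dataclasses import dataclass

@dataclass
class ExpertPreference:
    """One pairwise judgment from a domain expert (hypothetical schema)."""
    prompt: str               # the task shown to the model
    response_chosen: str      # the output the expert preferred
    response_rejected: str    # the output the expert rejected
    rationale: str            # free-text justification, useful for audits
    reviewer_domain: str      # e.g., "cardiology" or "contract law"
    confidence: float = 1.0   # reviewer's self-reported certainty, 0 to 1

# The kind of record a licensed physician might file during a review shift.
record = ExpertPreference(
    prompt="65-year-old with chest pain and elevated troponin: next step?",
    response_chosen="Treat as suspected NSTEMI; obtain serial ECGs and a cardiology consult ...",
    response_rejected="Likely musculoskeletal pain; discharge with NSAIDs.",
    rationale="The rejected answer ignores the troponin result entirely.",
    reviewer_domain="cardiology",
    confidence=0.9,
)
```

Collections of records like this are what preference-tuning pipelines typically consume, which is why the credentials behind each judgment matter so much.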
While data center and energy buildouts have commanded attention and investment, the models running inside that infrastructure remain perpetually hungry for new, high-fidelity feedback. The bottleneck is moving toward a different resource entirely.
Related Article: Are AI Models Running Out of Training Data?
The 'White-Collarization' of Labeling
A new class of companies has emerged to address that constraint. Unlike earlier annotation platforms that relied on low-wage contractors to tag images or transcribe audio, these firms recruit credentialed professionals to design evaluation rubrics, score complex reasoning chains and generate the preference data that shapes frontier-model behavior.
Companies such as Outlier, Mercor and Invisible now hire PhDs and practitioners at rates ranging from $50 to $120 per hour. Scale AI, founded in 2016 as an image-annotation platform, evolved into the largest independent provider of this work. Reported revenue topped $1.4 billion by 2024, with clients including OpenAI, Anthropic and nearly every frontier lab. Meta’s stake can be seen as infrastructure insurance, a hedge against dependence on external feedback pipelines.
The economics are straightforward: if you’re training a medical-reasoning model, feedback from licensed physicians is far more valuable (and far more defensible) than scraped blog posts or generic rater input. As models improve, the bar for helpful feedback rises. A system that performs at an undergraduate level needs graduate-level correction to advance. The result is a self-reinforcing market for progressively more specialized evaluation, with each generation of labelers effectively training models to emulate their own expertise.
Moving Beyond Simple Tasks
The work itself has evolved beyond simple ranking. Early reinforcement-learning tasks asked raters to pick the better of two responses. Today, evaluators score outputs across multiple axes and design rubrics that define “good” reasoning within their domains. Techniques such as Step-DPO, which apply preference optimization to individual reasoning steps rather than only to the final answer, require experts who can pinpoint exactly where a model’s logic goes wrong.
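As a rough illustration of the idea (a sketch of a DPO-style step-level objective, not anyone's reference implementation), the function below scores a single reasoning step: it compares an expert-approved next step against a flawed one, conditioned on the same prompt and the same prefix of accepted steps, using log-probabilities from the trainable policy and a frozen reference model.

```python
import math

def step_dpo_loss(logp_win_policy: float, logp_win_ref: float,
                  logp_lose_policy: float, logp_lose_ref: float,
                  beta: float = 0.1) -> float:
    """DPO-style loss on one reasoning step (illustrative, not reference code).

    Each argument is the log-probability of the expert-approved ("win") or
    flawed ("lose") next step under the trainable policy or the frozen
    reference model, given the same prompt and prefix of accepted steps.
    """
    margin = beta * ((logp_win_policy - logp_win_ref)
                     - (logp_lose_policy - logp_lose_ref))
    # Negative log-sigmoid of the margin: near zero when the policy already
    # prefers the approved step, large when it prefers the flawed one.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Toy numbers: the policy mildly prefers the expert-approved step.
print(round(step_dpo_loss(-2.1, -2.4, -3.0, -2.8), 4))  # ≈ 0.67
```

The expert's contribution sits upstream of this arithmetic: someone has to decide which intermediate step counts as the "win," and that is precisely the judgment being bought by the hour.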
This deeper visibility is giving some of these labeling firms their own read on the capabilities of frontier models. For example:
- Mercor has built the AI Productivity Index (APEX), a benchmark that evaluates models on “economically valuable knowledge work” in fields like investment banking, law and medicine.
- Labelbox has released an Evaluation Studio offering continuous, real-time insights into model behavior across edge cases.
These bottom-up analyses offer an alternative view of model performance to traditional benchmarking methods such as crowdsourced evaluations.
What Expert Labeling Actually Looks Like
Feedback from experts is taking many forms:
- Medicine: Clinicians review a model’s proposed differential diagnoses, flag reasoning gaps (“missed comorbidity with elevated troponin”) and annotate each decision point for plausibility and safety.
- Law: Attorneys compare model-generated contract clauses against statutory templates, scoring them on enforceability and risk exposure; their annotations feed a reward model for “legal soundness.”
- Finance: Former analysts audit synthetic financial reports, labeling inconsistent cash-flow logic or unrealistic assumptions in Monte Carlo simulations.
- STEM education: PhD graders evaluate step-by-step physics proofs, verifying that intermediate derivations obey conservation laws, not just checking final answers.
These micro-tasks are what “human feedback” now means in practice: not tagging cats, but assessing reasoning. The specialized nature of this work is reshaping the industry's structure.
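As a hedged sketch of what one of these micro-tasks might produce downstream (field names are hypothetical, not any lab's actual schema), an evaluation typically pairs rubric scores with step-level verdicts rather than a single thumbs-up:

```python
# Illustrative step-level evaluation of a model-generated physics proof.
# Field names are assumptions for the sake of the example.
annotation = {
    "task_id": "physics-proof-0042",
    "domain": "STEM education",
    "rubric_scores": {        # 1-5 on each axis the rubric defines
        "correctness": 2,
        "completeness": 3,
        "clarity": 4,
    },
    "step_verdicts": [
        {"step": 1, "ok": True,  "note": "Energy terms set up correctly."},
        {"step": 2, "ok": False, "note": "Drops the friction term, so energy conservation is misapplied."},
        {"step": 3, "ok": False, "note": "Final answer inherits the step-2 error."},
    ],
    "overall_preference": "rejected",  # fed to the reward model as a negative example
}
```

The structure matters as much as the verdict: the per-step notes are what let step-level training techniques target the exact point where reasoning breaks down.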
Related Article: The Benchmark Trap: Why AI’s Favorite Metrics Might Be Misleading Us
Adoption by Legacy Firms, New Startup Entrants
Existing service firms and new startups are following the opportunity. Cognizant and Accenture have launched human-in-the-loop divisions. Other changes include:
- Surge AI, previously bootstrapped, is exploring a $1 billion raise to scale infrastructure and compliance capacity.
- Handshake is leveraging its university network to channel graduates into evaluation work.
- Invisible, once a general workforce platform, has pivoted entirely to AI evaluation and now manages tens of thousands of expert raters.
- Newer entrants such as Sepal AI and Micro1 are also rapidly finding traction with frontier labs.
The shift is visible in workforce composition. In September 2025, xAI laid off about 500 workers from its general annotation team (roughly one-third of its 1,500-person unit) while announcing a “strategic pivot” to prioritize specialist AI tutors over generalist roles. The labor market behind AI is beginning to resemble professional services: specialized, margin-rich and increasingly global.
How Regulation Is Reframing the Market
Rapid professionalization is now meeting a new force: regulation and its incentives. As the EU AI Act’s general-purpose model rules continue to roll out through 2026, provenance and auditability are becoming design requirements rather than paperwork. Labs deploying in Europe must document where training and feedback data originate, how evaluators are vetted and what safeguards are in place to prevent bias or leakage. High-risk AI systems face strict obligations around risk assessment and the use of high-quality datasets to minimize discriminatory outcomes.
Compliance favors incumbents such as Scale AI, which already maintain audit trails and demographic tracking, while smaller vendors face steep costs. Yet the same rules create space for regional alternatives as governments and local firms position themselves as compliant, residency-guaranteed options.
Regulation itself is functioning like a temporary moat. Spending $100 per hour on auditable human judgment is cheaper than a compliance failure that halts deployment. The metric is shifting from cost per label to reliability per label. Vendor selection increasingly starts with governance capability, not price.
Tensions in the Judgment Supply Chain
Even as the feedback economy stabilizes, several tensions remain unresolved.
Security Exposure
Thousands of contractors handle pre-release data and evaluation rubrics that reveal model capabilities and safety boundaries. In mid-2025, Business Insider reported that confidential Scale AI project documents had been left publicly accessible, prompting security audits; weeks later the company cut roughly 14% of its staff. The incident was a reminder to labs that operational trust is as important a differentiator as accuracy.
Liability
Expert labeling introduces new risks. When medical annotation is outsourced to certified physicians, diagnostic errors create potential malpractice exposure. Financial and legal experts reviewing model outputs may inadvertently share trade secrets or proprietary methods. Mercor’s CEO acknowledged the challenge at TechCrunch Disrupt 2025: while contractors are told not to upload documents from former workplaces, “there are things that happen” given the scale of operations. Case law defining responsibility boundaries is still emerging.
Labor Stability
The shift from $3-per-hour crowdwork to $80-per-hour contracts has improved pay but not security. Most evaluators remain freelancers without benefits or career paths. As labs celebrate “human feedback” as essential infrastructure, they have yet to decide what a sustainable labor model looks like. The instability that characterized earlier annotation work — short contracts, no job security, what some critics have called “AI sweatshops” — echoes even at higher wage levels.
Quality Consistency
Even expert-level annotation faces systematic challenges. Data inconsistency and labeling errors often stem not from individual mistakes but from flawed pipelines — problems that persist regardless of credentials. Ensuring quality across thousands of expert evaluators remains an operational hurdle.
Geographic Skew
Expert labor clusters in regions with high concentrations of credentialed professionals, meaning the judgments guiding frontier models reflect a narrow cultural band. A legal-reasoning model trained mostly on US attorneys embeds US legal logic; customer-service models tuned by native English speakers optimize for linguistic norms that may not transfer globally. Companies recruit for diversity, but economics and operational complexity still create clustering, opening space for regional competitors to position themselves as sovereign-AI alternatives aligned with local contexts and regulatory frameworks.
Related Article: No Moat, No Problem? Predictions on Future AI Competition
Strategic Implications: Temporary Moats, Real Stakes
Every serious AI deployment now depends on human feedback. The strategic question is whether to own that loop or rent it, and what that choice reveals about where a company believes defensibility lies.
- Owning the feedback loop creates a supply-side moat
- Vendor partnerships create a distribution moat
- Regulatory mastery forms a compliance moat
Each offers protection for a time, none permanently. Procurement metrics are evolving accordingly: leading buyers link payments to tangible model-quality gains (such as reduced hallucinations and faster fine-tuning), aligning incentives with outcomes.
Compute is abundant. Judgment isn’t. And now, that scarcity defines advantage. The decade ahead is likely to hinge less on chip races and more on data provenance, evaluation quality and the institutional knowledge embedded in how organizations apply human judgment.