Reward Is All You Need: Four Futures of Reinforcement Learning

December 14, 2025 · 10 min read

Written by Ava Poole, Raina Maiga, Sam Shridhar

Reward Is All You Need: Four Futures of Reinforcement Learning

When the TAM is 'all of human labor,' it may be time to polish the old crystal ball. An analysis of the RL economy, its seven emerging verticals, and four possible futures.

Fighting the Label “Commodity”

When facing uncharted territory, we often analogize technological revolutions to familiar past events: accusations of today’s “AI bubble” are cast in the image of the DotCom bust and the browser battles of the late ’90s are back better than ever with Atlas, Comet, and Dia stepping up to the plate.

Regarding reinforcement learning, the better historical parallel may be the semiconductor wars of the 1970s and 80s. Intel, Texas Instruments, and Motorola raced to miniaturize transistors, birthing Moore’s Law and catalyzing the PC revolution. Now, labs and RLaaS startups race to build interaction loops, scalable environments, and reward systems to architect infrastructure that catalyzes AI’s replacement of white collar labor.

Shifting from Annotation to Agency

In the mid-2010s, companies like Scale AI pioneered third-party data labeling for vision tasks. Most buyers were enterprises that needed structured data for narrow use cases, an example being bounding boxes for self-driving. The transformer revolution (2018-2020) shifted attention toward large-scale supervised pretraining, creating immense demand for high-quality static text. However, as models became more capable, they needed less raw data and more difficult data that involved task structure, tool use, and real-world incentives.

This shift birthed the RL economy.

Where supervised learning is extractive (scraping what already exists), RL is generative: agents explore, interact, and receive feedback in simulated or real environments. This loop of environment, action, reward, policy update is now foundational to advancing LLM behavior, especially in reasoning and tool use.

Here are the verticals emerging in RL, as denoted in our market map.

Environments: Virtual or hybrid simulation layers (gyms, sandboxes) where agents practice decision making and receive feedback loops before deployment.
RLaaS (Reinforcement Learning-as-a-Service): Managed platforms providing tools to train, fine-tune, and evaluate agents using customized reward functions.
Data Marketplaces: Platforms matching human evaluators or datasets to model developers for high-quality labeling, reward modeling, or evaluation data.
Orchestration: Infrastructure that manages multi-model workflows, reward tuning, and distributed RL experiments across environments.
Multi-Agent RL: Frameworks and companies enabling collaborative or competitive learning between multiple agents to improve emergent behaviors.
Physical RL: Real-world systems where RL agents learn from physical interactions, often in robotics, drones, hardware development, or industrial automation.
RL-Enabled Rollups: Consolidators applying automation across acquired companies, accruing operational efficiency by following the private equity rollup playbook with big players here being General Catalyst and Thrive Capital.

Reinforcement Learning Market Map

Now, there exist four main permutations of how the RL race unfolds:

Permutation 1: The Lab Oligopoly

Winning Vertical: Data Marketplaces and RLaaS

Outcomes: RL startups and investors get squeezed out. Enterprises ultimately suffer due to loss of strategic autonomy.

In this dominant trajectory, a handful of hyperscaler labs (OpenAI, Anthropic, DeepMind) become the exclusive buyers of RL services. Labs are increasingly outsourcing reinforcement learning tasks not just for operational leverage, perhaps for financial opacity and scalability. By routing labor intensive evaluation and reward modeling work through external vendors, they effectively keep costs off the books, avoiding direct headcount increases and capitalizing on the VC backed burn of these intermediaries.

In OpenAI’s case, Mercor is subsidizing the real cost of reinforcement loop construction including hiring, training, and managing a global pool of task-specific contractors. Economics wise, Mercor is venture funded and absorbing margin pressure in the hope of future productization. This enables OpenAI to outsource cognition at a discount, because Mercor is underpricing RLaaS to chase growth and signal exclusivity to labs and investors.

Mercor began as a labor sourcing marketplace and evolved into a rewards and eval infra player after OpenAI became its primary customer. Similar to Microsoft bundling Internet Explorer in the ’90s, labs may eventually absorb these players altogether.

In the long run, if RL infra consolidates, labs may absorb these vendors outright, shifting from outsourcing to in-house stacks once vendor pricing no longer beats internal costs. Startups orbit these labs as vendors or contractors. Further, a margin death spiral could ensue. Labs define the reward modeling spec, and vendors underbid each other to implement it faster and cheaper. The more interchangeable the task becomes, the more brutal the race to the bottom.

Permutation 1 will be a net-negative for RL companies and investors, as well as enterprises. While some startups may briefly thrive on hyperscaler contracts, the market will come to face churn and pricing pressure. Labs treat vendors as modular, outsourcing when it’s cheap and acquihiring when it’s not. This creates weak pricing power, high concentration risk, and little room for margin expansion. VCs may see early paper returns, but true defensibility is minimal unless the vendor becomes essential infra.

Enterprises also lose out in this scenario. Labs own the most sophisticated RL infra. This makes it harder for regular companies (banks, logistics firms, retail chains) to build agentic workflows without relying on APIs or black-box evals. The result is a two-tier AI economy of labs with superhuman agents and enterprises as glorified data layers, consumer conduits, and glorified retrieval bots.

Permutation 2: Fragmentation and Specialization

Winning Vertical: Environments and Physical RL

Outcomes: Scattered for RL startups and investors. Wildly beneficial for enterprises.

Continuing with the silicon war analogy, in this permutation, RL infra splinters into dozens of domain-specific or modality-specific specialists, much like how ARM carved out a niche in mobile by optimizing for low power rather than compute power.

RL is naturally modular. Different players optimize for:

Synthetic environments and task engines (Plato, Hud.so, Phinity)
Multi-agent QA and benchmark realism (Verita, Good Start Labs, General Intuition)
Data relabeling and eval platforms (Handshake AI, Isidor, Sepal AI)
Reward modeling (Kaizen, Forge)

Much of this is also talent arbitrage: startups rebrand RLaaS consulting as scalable infrastructure. Top researchers become the product, embedding directly into agent training loops.

Low margins persist unless productization occurs. Vertical winners in finance, healthcare, or insurance (where benchmarks are well-defined) can reflect SaaS business models, but most others remain services-heavy. RLaaS providers resemble specialized defense contractors or regulatory compliance firms: high-touch, slow scale, but durable.

Fragmentation enables more founders to build durable businesses with $100M-$500M runrates, especially in verticals where eval loops are complex. However, margins are thin. Many vendors operate like AI consultancies with a shiny veneer. Investors may be disappointed unless a platform emerges (perhaps a “Figma of Rewards”). Enterprises benefit significantly. This market creates choice, customization, and cost flexibility.

Permutation 3: The Interface Layer, Pivot to Enterprises and Products

Winning Vertical: RLaaS and Orchestration

Outcomes: Best case for RL startups and investors. Positive for enterprises with increased autonomy.

The most optimistic trajectory sees RL generalize beyond labs into mainstream enterprise use, similar to how the browser unlocked global access to the web. In this view, RL becomes a core interface layer for interacting with LLMs across finance, healthcare, legal ops, and logistics.

Enterprises begin as “unsophisticated buyers,” as they were in the Scale AI era. With ROI pressure and competitive urgency, they seek internal workflows with verifiable reward signals. Players like Forge and Applied Compute step in to provide eval loops, environment generation, and reward shaping for enterprise agents.

The total addressable market of RL companies increases by 10-100x as RL serves thousands of medium-scale enterprise buyers instead of a handful of labs. The demand curve flattens, reducing platform risk.

This is the best scenario for product-oriented RL startups and their investors. It allows for horizontal platform plays like Stripe for evals or Datadog for reward observability, plus deep vertical SaaS like RL for supply chain automation or finance workflows. Companies move from “RLaaS as consulting” to “RLaaS as platform.” VCs win big if infra compounds usage, data, and switching costs.

This is the ideal future for enterprise buyers. Enterprises gain agency, own their reward functions, define their environments, and integrate RL into internal KPIs. This leads to smarter internal agents, better workflow automation, and reduced reliance on labs. Over time, RL becomes part of the default dev stack.

Permutation 4: Synthetic Commoditization

Winning Vertical: Orchestration and Multi-Agent RL

Outcomes: RL startups race to the bottom. Enterprises suffer due to lack of customization and oversight.

Labs could shift to fully synthetic training loops via self-play, toolchains, or “agents training agents.” Human labor and third-party RL vendors become unnecessary.

Third-party RL companies get squeezed out. If labs use synthetic environments to train agents internally, there is no reason to pay Mercor, Surge, or Forge. Only open-source platforms or devtools with mass adoption survive. VCs likely lose money unless backing infra with compounding developer utility.

Enterprises would also ultimately suffer. They’d get better APIs and pretrained agents that can do more from labs, but lose sovereignty. They don’t control the environment, the reward model, or the agent’s incentives. This creates explainability and compliance risk. Enterprises are stuck in inference-only mode, unable to fine-tune agents for internal workflows.

As synthetic data replaces humans, orchestration and agent-to-agent training infra will become the key differentiators.

”History Doesn’t Repeat Itself, But Often Rhymes…”

With semiconductors, value eventually concentrated in compute infrastructure: it may have taken 50 years, but NVIDIA is now more valuable than Apple. The power didn’t stay with the device, it migrated to the picks and shovels. That same shift may now repeat in AI: labs like OpenAI, Anthropic, and DeepMind are today’s equivalent of early PC makers; their demand curves are pulling entire ecosystems around them. But in RL, the “silicon” is not compute. It’s human data, environment realism, and reward infrastructure.

More pieces to come on RL. Next up, Maslow’s Hierarchy of Needs and the RL stack.

This piece was written by Ava Poole, Raina Maiga, and Sam Shridhar, with contributions based on research from Anant Gupta, Arrth Mittal, Trevor Phillips, and Chelsea Zhang. Cornell Venture Capital has provided cutting edge research to VCs since 2010.