The Second Half of AI: A New Era of Utility, Evaluation, and Reinforcement Learning
Artificial intelligence has reached what can be called its halftime. The first half was an era of rapid progress in novel models and training methods, enabling AI systems to achieve superhuman performance on classic benchmarks such as chess, Go, SAT exams, and even competitive programming contests. We witnessed landmark milestones such as Deep Blue, AlphaGo, GPT-4, and the impressive “o-series” of models. These breakthroughs came from fundamental innovations in AI methods: search algorithms, deep reinforcement learning (RL), model scaling, and sophisticated reasoning.
But what’s different now? The game has shifted dramatically with the arrival of a profound new insight:
Reinforcement Learning Finally Generalizes
In three simple words: RL finally works. More precisely, RL finally generalizes. This statement captures a breakthrough that reverberates throughout AI research and applications.
Historically, RL methods excelled at narrow, isolated tasks but failed to transfer across domains or scale to complex real-world problems. However, after decades of research and cumulative milestones, a unified recipe has emerged. This recipe combines:
- Massive language pretraining (priors): Leveraging vast text corpora to instill broad knowledge and reasoning skills.
- Scale: Harnessing large-scale models and data to tackle complexity.
- Reasoning as action inside an RL loop: Treating reasoning steps as decision-making actions within reinforcement learning, enabling sequential, goal-directed problem solving.
This powerful combination now tackles a surprisingly wide range of RL tasks in fields that once seemed distinct — software engineering, creative writing, mouse-and-keyboard manipulation, long-form question answering, and even high-level math competitions. Tasks once thought to require specialized separate algorithms are now solved within this general RL framework.
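To make “reasoning as action” concrete, here is a minimal, self-contained Python sketch of an episode loop in which the policy can interleave unrewarded “think” steps with a scored final answer. Every name here (ToyEnv, rollout, scripted_policy) is illustrative rather than from any real library, and the scripted policy stands in for what would be a pretrained language-model policy.

```python
# Minimal sketch of "reasoning as action" inside an RL episode loop.
# All names are illustrative; the scripted policy stands in for a pretrained LM.
import random

class ToyEnv:
    """Toy task: answer the sum of two digits. Reward 1.0 only for a
    correct final answer; reasoning steps earn no reward."""
    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def reset(self):
        self.a, self.b = self.rng.randint(1, 9), self.rng.randint(1, 9)
        return f"What is {self.a} + {self.b}?"

    def step(self, action):
        # "think: ..." actions change no external state and yield no reward;
        # only the final "answer:..." action is scored and ends the episode.
        if action.startswith("think:"):
            return None, 0.0, False
        answer = int(action.split(":", 1)[1])
        return None, float(answer == self.a + self.b), True

def rollout(env, policy):
    """One episode: the policy may interleave 'think' steps before answering."""
    obs = env.reset()
    trajectory, done, total = [], False, 0.0
    while not done:
        action = policy(obs, trajectory)
        _, reward, done = env.step(action)
        trajectory.append((action, reward))
        total += reward
    return trajectory, total

def scripted_policy(obs, trajectory):
    # Stand-in for an LM policy: emit one reasoning step, then the answer.
    if not trajectory:
        return "think: parse the two operands and add them"
    a, b = [int(ch) for ch in obs if ch.isdigit()]
    return f"answer:{a + b}"

traj, ret = rollout(ToyEnv(), scripted_policy)
print(ret)  # 1.0
```

Note the design choice this sketch illustrates: reasoning actions carry no immediate reward, so any learning signal reaches them only through the final outcome, which is exactly how outcome-reward RL shapes reasoning behavior.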
The Shift: From Problem Solving to Problem Definition
While the first half celebrated climbing benchmark scores, the second half shifts the focus to defining the right problems: those that truly matter in the real world.
The core challenge emerges from a realization: benchmarks, as traditionally constructed, no longer capture real utility effectively. Ever-harder benchmarks will be solved swiftly by the standard RL recipe, yielding diminishing returns in meaningful value.
Thus scoring well on artificial tasks no longer guarantees real-world success or impact. The central question becomes:
“What should we train AI to do, and how should progress be measured?”
Evaluation Becomes the New Centerpiece
The second half is the era of environment design and evaluation innovation. Instead of focusing solely on training methods or models, the AI community must invent evaluation setups that reflect the complexities and properties of real-world tasks. This involves:
- Building evaluation environments closer to reality, including data that are not independent and identically distributed (non-IID), sequential tasks requiring memory and persistence, and human interaction in the loop.
- Prioritizing metrics that emphasize utility, robustness, and applicability rather than raw benchmark scores.
- Developing novel benchmarks involving human feedback, realistic customer interactions, and long-horizon task completion, reflecting practical challenges.
- Embracing multi-agent and social intelligence elements where AI must understand and operate within social contexts, recognizing intentions, beliefs, and strategies of other agents.
This evolution demands creativity in designing tasks that align AI progress with tangible economic and societal value. Practical examples include improving diagnostic accuracy in healthcare, optimizing resource allocation in supply chains, or enhancing customer service interactions — all of which can be quantified with meaningful, utility-linked metrics.
The Role of Priors, Environments, and Algorithms
In this new phase, the classic RL trio of environment, algorithm, and priors must harmonize to drive real-world success.
- Much progress in RL has historically been about algorithms, often tested in oversimplified or static environments.
- However, algorithms tend to overfit to their training environments if priors and environment complexity are lacking.
- Priors, often derived from large-scale language models, are now recognized as crucial for generalization and transfer.
- Consequently, much attention must shift to environment design — creating rich, realistic settings that challenge AI along dimensions aligned with utility, not just difficulty.
Implications for Research and Startups
The second half creates a compelling new playbook for AI researchers and startups alike:
- Researchers must prioritize inventing new evaluation tasks and benchmarks tied to real utility, breaking beyond static exam-style problems toward human-centric, dynamic, sequential challenges.
- Progress will come from applying and augmenting the general RL recipe with novel ideas suited for these environments.
- Startups face a marketplace where model innovation alone is insufficient. Instead, competitive advantage lies in identifying high-utility problems, applying the RL recipe effectively, and creating products with a direct impact.
- AI models are increasingly commoditized, with early competitive edges quickly matched by open-source versions and competitors. Therefore, startups must excel in problem framing, environment construction, and integration with human workflows.
- A mindset akin to product management — focusing on end-to-end impact and utility — becomes essential for both academic and industrial AI.
From Benchmark Hillclimbing to Utility-Centered Progress
The historical game of AI can be summarized as a loop:
- Develop novel algorithms and models to climb harder benchmarks.
- Create more challenging benchmarks, and continue iterating.
In the second half, this game reverses its priorities:
- Invent new evaluation setups prioritizing utility and real-world complexity.
- Use the proven RL recipe to solve these new tasks or develop necessary augmentations.
- Iterate with a focus on impact and value, not just score improvements.
While more challenging and unfamiliar, this game promises the real economic and societal benefits that first-half accomplishments lacked.
Conclusion: Welcome to AI’s Second Half
The second half of AI is an exciting leap forward. The breakthrough that RL generalizes opens the door to powerful, unified problem-solving systems. But success now hinges on defining meaningful problems, designing realistic evaluation environments, and focusing on utility in applications.
This moment calls for researchers, engineers, and entrepreneurs to rethink what it means to succeed with AI. It’s no longer enough to make AI “smarter” by benchmark scores; the imperative is to make AI genuinely useful to humanity in complex, dynamic, and socially rich contexts.
Those who understand and embrace this new era — shifting from “solving benchmarks” to “solving utility problems” — will lead the most impactful breakthroughs and innovations in the years ahead.
Welcome to the second half of AI: where the true game changes.