Can AI Learn on the Job? RLVR, OPSD, and Dreaming for the Next‑Gen Training Paradigm

The article examines Dwarkesh Patel’s view that future AI must move beyond one‑off pre‑training to continual, on‑the‑job learning, discussing Reinforcement Learning with Verifiable Rewards (RLVR), the need for "grindable" tasks, and emerging approaches like on‑policy self‑distillation (OPSD) and "dreaming" to write real‑world experience back into model weights.

Machine Heart

Dwarkesh Patel, a prominent AI podcast host, asks what the next generation of AI training will look like. He highlights a current trend, RLVR (Reinforcement Learning with Verifiable Rewards), which trains agents through repeated trial and error on tasks whose outcomes can be verified automatically.

RLVR works well for code‑fixing or math problems because the tasks are both verifiable and "grindable"—they can be duplicated thousands of times, run in parallel, and reset easily. However, Patel argues that verifiability alone is insufficient; a task must also be scalable for massive rollout.
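To make the two properties concrete, here is a minimal sketch of an RLVR-style loop on a toy task. Everything in it is an illustrative assumption rather than anything Patel or the article specifies: the addition task, the three candidate strategies, the `verify` function, and the plain REINFORCE update all stand in for a real model, a real test suite, and a real RL algorithm. The point is only that the reward comes from an automatic check and that each episode is a fresh, cheap instance.

```python
import math
import random

# Hypothetical "policy": a preference score per candidate strategy.
STRATEGIES = {
    "add": lambda a, b: a + b,
    "subtract": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
}

def softmax(prefs):
    """Turn preference scores into a probability distribution."""
    m = max(prefs.values())
    exps = {k: math.exp(v - m) for k, v in prefs.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}

def verify(problem, answer):
    """Verifiable reward: an automatic check, no human judgment needed."""
    a, b = problem
    return 1.0 if answer == a + b else 0.0

def train(steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    prefs = {name: 0.0 for name in STRATEGIES}
    for _ in range(steps):
        # "Grindable": a new problem is free to generate, so every
        # episode starts from a clean reset.
        problem = (rng.randint(1, 9), rng.randint(1, 9))
        probs = softmax(prefs)
        names = list(probs)
        choice = rng.choices(names, weights=[probs[n] for n in names])[0]
        reward = verify(problem, STRATEGIES[choice](*problem))
        # REINFORCE: d log pi(choice) / d pref[k] = 1[k == choice] - probs[k]
        for k in prefs:
            grad = (1.0 if k == choice else 0.0) - probs[k]
            prefs[k] += lr * reward * grad
    return softmax(prefs)
```

Calling `train()` returns the final strategy probabilities, and in this toy setting the policy should end up strongly preferring `"add"`. The loop only works because `verify` is instant and problems can be regenerated without limit, which is exactly what the real-world tasks discussed next lack.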

He points out that many real‑world tasks—such as completing a purchase on Amazon, running a legal case, or launching a startup—have slow feedback, many variables, and environments that cannot be reset or cloned at scale. These are akin to reset‑free, non‑stationary environments in reinforcement learning, making them unsuitable for straightforward RLVR training.

Patel then asks whether agents trained in verifiable, grindable environments can truly generalize to such complex domains. Optimists claim that with enough diverse RLVR experiences, agents will acquire general planning and error‑correction abilities that transfer to entrepreneurship, politics, law, and research. Patel remains skeptical because valuable real‑world knowledge often emerges from ambiguous, non‑repeatable signals that cannot be captured by simply grinding through practice problems at scale.

The discussion pivots to the concept of "learning back to the weights." Current large models excel at in‑context learning but only retain knowledge within a session’s context window; after the conversation ends, the learning is lost. Patel argues that true continual learning should distill useful experience from deployment back into model parameters.

One proposed mechanism is on‑policy self‑distillation (OPSD). In OPSD, a model that has accumulated experience in a long session acts as a teacher; a base model is trained to mimic the teacher’s decisions without needing an external reward signal. Unlike standard supervised fine‑tuning, which merely repeats observed tokens, OPSD extracts dense supervision from token‑level teacher‑student probability differences, compressing scarce real‑task insights into precise weight updates.
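The idea of dense, token-level supervision can be illustrated with a toy sketch. The assumptions here are substantial and are not taken from the article: the "model" is a bigram table of logits over a three-token vocabulary, the hypothetical `TEACHER` table stands in for the same model conditioned on its session experience, and the per-token loss is the reverse KL divergence KL(student || teacher), which is one common choice for on-policy distillation but may not be what any particular OPSD method uses. The student samples its own sequences, and at every position it visits it is nudged toward the teacher's full next-token distribution.

```python
import math
import random

VOCAB = 3  # toy vocabulary: token ids 0, 1, 2

# Hypothetical teacher: next-token probabilities given the previous token.
TEACHER = [
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
    [0.8, 0.1, 0.1],
]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_and_grad(student_logits, teacher_probs):
    """Return KL(student || teacher) and its gradient w.r.t. the logits."""
    p = softmax(student_logits)
    log_ratio = [math.log(pi / qi) for pi, qi in zip(p, teacher_probs)]
    kl = sum(pi * r for pi, r in zip(p, log_ratio))
    # For p = softmax(z): dKL/dz_j = p_j * (log(p_j / q_j) - KL)
    grad = [pi * (r - kl) for pi, r in zip(p, log_ratio)]
    return kl, grad

def opsd_sketch(teacher, steps=500, seq_len=8, lr=0.5, seed=0):
    rng = random.Random(seed)
    student = [[0.0] * VOCAB for _ in range(VOCAB)]  # student starts uniform
    for _ in range(steps):
        prev = 0  # every sampled sequence starts from token 0
        for _ in range(seq_len):
            probs = softmax(student[prev])
            _, grad = kl_and_grad(student[prev], teacher[prev])
            # Dense signal: every vocabulary entry at this position is updated.
            student[prev] = [z - lr * g for z, g in zip(student[prev], grad)]
            # On-policy: the next position comes from the student's own sample.
            prev = rng.choices(range(VOCAB), weights=probs)[0]
    return student
```

After `opsd_sketch(TEACHER)`, the student's next-token distributions should closely match the teacher's at every state it visits. The contrast with supervised fine-tuning is that no reward and no single "correct" token is needed: the gradient at each position carries information about the whole distribution.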

Patel also introduces "dreaming," where an AI builds its own simulated environment from observations of the real world and practices strategies within that simulator. This resembles model‑based reinforcement learning and echoes ideas from Richard Sutton’s work, notably the essay "Welcome to the Era of Experience" by David Silver and Richard Sutton.
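A rough way to picture dreaming is the Dyna-style pattern from model-based reinforcement learning: record a limited amount of real experience as a model, then do most of the learning by replaying that model. The sketch below is an assumption-laden toy, not a description of how Patel or anyone else implements dreaming. The five-state corridor, the lookup-table "simulator", and the tabular Q-learning update are all hypothetical stand-ins for a learned world model and a large policy.

```python
import random

N_STATES = 5          # corridor cells 0..4; reaching cell 4 pays a reward
ACTIONS = (-1, +1)    # step left or step right

def real_step(state, action):
    """The real environment, which the agent can only sample sparingly."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward

def observe(n_steps, rng):
    """Collect real experience and store it as a simple learned model."""
    model = {}
    state = 0
    for _ in range(n_steps):
        action = rng.choice(ACTIONS)
        nxt, reward = real_step(state, action)
        model[(state, action)] = (nxt, reward)
        state = 0 if nxt == N_STATES - 1 else nxt  # episode ends at the goal
    return model

def dream(model, n_updates, rng, gamma=0.9, lr=0.5):
    """Practice inside the model only: no further real-world interaction."""
    q = {key: 0.0 for key in model}
    keys = sorted(model)
    for _ in range(n_updates):
        s, a = rng.choice(keys)
        nxt, reward = model[(s, a)]
        terminal = nxt == N_STATES - 1
        best_next = 0.0 if terminal else max(q.get((nxt, b), 0.0) for b in ACTIONS)
        q[(s, a)] += lr * (reward + gamma * best_next - q[(s, a)])
    return q

def greedy_policy(q):
    """Best action per non-terminal state according to the dreamed values."""
    return {s: max(ACTIONS, key=lambda a: q.get((s, a), 0.0))
            for s in range(N_STATES - 1)}
```

A typical use would be `model = observe(500, rng)` followed by `greedy_policy(dream(model, 5000, rng))`, which in this toy corridor should choose "step right" in every cell. The real interaction budget is fixed by `observe`, while `dream` can run as many updates as compute allows, which is the appeal of the approach for environments that cannot be reset or cloned.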

He envisions a future fourth scaling axis—test‑time training or dreaming—added to the traditional axes of pre‑training, reinforcement learning, and inference‑time compute. In a 2027‑2028 scenario, the pipeline would first use RLVR to produce a competent agent, then deploy it to a real task for a week, collect user feedback (thumbs up/down or written evaluation), and finally distill the gained experience back into the base model via OPSD, dreaming, or a yet‑unknown technique.

If this loop succeeds, AI would no longer be limited by the initial set of verifiable tasks. It would first master code, math, web, and tool use through RLVR, then acquire organization‑level, business‑process, and collaborative skills from real deployments, progressively expanding into adjacent domains.

Patel concludes that the next major shift in AI training is moving from pre‑deployment training to post‑deployment continual learning, from static human‑generated data to experience gathered in the wild, and from temporary in‑context adaptation to permanent weight‑based capability growth.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
