How We Built a Self‑Evolving AI System Without Reward Functions
The Oxford study shows that a large language model can self-evolve through a four-step deploy-validate-filter-inherit loop with no handcrafted reward function, achieving dramatic performance gains on Blocksworld, Rovers, and Sokoban; the authors also prove the loop is theoretically equivalent to REINFORCE.
Reward Function Dilemma and a New Path
Most researchers assume that AI progress requires larger architectures or carefully engineered reinforcement‑learning pipelines, yet designing a perfect reward function for real‑world tasks is notoriously difficult.
The Oxford team proposes a breakthrough: let the environment itself serve as the most impartial judge and let a large language model (LLM) iterate autonomously through a natural four‑step loop.
Four‑Step Natural Loop
Natural Deployment: the model attempts to solve problems in a real environment.
Objective Verification: external tools check the correctness of each generated plan.
Survival of the Fittest: only verified solutions are retained.
Generational Inheritance: the successful experiences are merged to train the next generation (a minimal code sketch of the full loop follows this list).
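To make the loop concrete, here is a minimal Python sketch of one way it could be implemented. The three helper callables (plan generator, validator wrapper, fine-tuning step) are placeholders rather than the authors' code; the generation count and sampling temperature follow the values reported later in the article.

```python
from typing import Callable, List, Tuple

def evolve(
    model,
    tasks: List[str],
    generate_plan: Callable,   # (model, task, temperature) -> plan; placeholder
    verify: Callable,          # (task, plan) -> bool, e.g. wrapping a VAL-style checker
    finetune: Callable,        # (model, survivors) -> next-generation model (e.g. LoRA SFT)
    n_generations: int = 5,    # the article reports results up to a fifth generation
):
    """Deploy -> verify -> filter -> inherit, repeated for several generations."""
    for _ in range(n_generations):
        survivors: List[Tuple[str, str]] = []
        # 1. Natural deployment: attempt every task in the environment.
        for task in tasks:
            plan = generate_plan(model, task, temperature=0.6)
            # 2. Objective verification: the external validator is the judge.
            if verify(task, plan):
                # 3. Survival of the fittest: keep only verified plans.
                survivors.append((task, plan))
        # 4. Generational inheritance: fine-tune on the surviving trajectories.
        model = finetune(model, survivors)
    return model
```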
Rigorous Experiments on Classic Planning Domains
The researchers evaluated the approach on three benchmark planning problems—Blocksworld (block rearrangement), Rovers (Mars rover tasks), and Sokoban (box‑pushing). For each domain they generated 1,000 instances and used Qwen3 4B Thinking 2507 as the base model.
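For concreteness, the setup can be summarized as a small configuration object; the field names are hypothetical, while the values come from the article.

```python
# Hypothetical configuration mirroring the experimental setup described above;
# field names are illustrative, values are those reported in the article.
EXPERIMENT = {
    "domains": ["blocksworld", "rovers", "sokoban"],
    "instances_per_domain": 1000,
    "base_model": "Qwen3-4B-Thinking-2507",   # as named in the article
    "sampling_temperature": 0.6,
    "plan_validator": "VAL",
}
```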
Technical Implementation of the Loop
Deployment Generation: the model generates planning proposals with chain-of-thought reasoning at temperature 0.6.
External Verification: the standard planning-competition validator (VAL) checks whether each proposed plan is valid.
Data Filtering: only valid proposals are kept, and the best solution per task is selected (see the sketch after this list).
Supervised Fine-Tuning: LoRA fine-tuning on the successful cases accumulated across all generations produces the next-generation model.
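The filtering step could look something like the following. Treating "best" as the shortest valid plan is an assumption for illustration (the article only says the best solution per task is selected), and the record fields are hypothetical.

```python
def filter_trajectories(candidates: list[dict]) -> list[dict]:
    """Keep only validator-approved plans, retaining one best plan per task."""
    best: dict[str, dict] = {}
    for cand in candidates:
        if not cand["valid"]:          # drop anything the validator rejected
            continue
        tid = cand["task_id"]
        # Assumption: "best" means the shortest valid plan seen so far.
        if tid not in best or len(cand["plan"]) < len(best[tid]["plan"]):
            best[tid] = cand
    return list(best.values())
```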
Although the pipeline appears simple, it yields striking improvements.
Revolutionary Data Efficiency
Filtering proved crucial: the filtered version achieved strong performance with just 356 high‑quality trajectories, whereas the unfiltered version required 4,017 trajectories and performed worse.
Quantitative Gains Across Domains
Blocksworld performance increased by 196%.
Rovers performance increased by 401%.
Sokoban performance increased by 196%.
Beyond raw scores, the capability qualitatively transformed: the baseline model could plan up to 20 steps in Blocksworld, while the fifth‑generation model handled 35‑step plans, demonstrating emergent complex reasoning without handcrafted curricula.
Mathematical Proofs of Equivalence
The team proved three core propositions:
Proposition 1 shows that the gradient direction of supervised fine‑tuning (SFT) on valid trajectories matches the REINFORCE gradient for binary rewards, differing only by a positive scalar.
Proposition 2 demonstrates that SFT with multi‑generation data is equivalent to REINFORCE with importance sampling, converting off‑policy data into an on‑policy expectation.
Proposition 3 follows from the first two, establishing a formal mathematical equivalence between the iterative deployment loop and traditional reinforcement learning.
Consequently, environment feedback automatically defines the most direct reward signal, removing the need for manually engineered scoring rules.
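For readers who want the gist of the argument, here is a rough sketch of Propositions 1 and 2 in standard REINFORCE notation rather than the paper's exact symbols: π_θ is the model's policy, τ a sampled trajectory, and r(τ) ∈ {0, 1} records whether the external validator accepted it.

```latex
% Sketch only (standard REINFORCE notation, not the paper's exact statement).
\[
\nabla_\theta J(\theta)
  \;=\; \mathbb{E}_{\tau \sim \pi_\theta}\!\bigl[\, r(\tau)\, \nabla_\theta \log \pi_\theta(\tau) \,\bigr]
\]
% SFT maximizes the log-likelihood of the surviving (valid) trajectories:
\[
\nabla_\theta \mathcal{L}_{\mathrm{SFT}}(\theta)
  \;=\; \mathbb{E}_{\tau \sim \pi_\theta \,\mid\, r(\tau)=1}\!\bigl[\, \nabla_\theta \log \pi_\theta(\tau) \,\bigr]
  \;=\; \frac{1}{p_\theta}\, \nabla_\theta J(\theta),
\qquad p_\theta = \Pr_{\tau \sim \pi_\theta}\!\bigl[r(\tau)=1\bigr] > 0,
\]
% so the two gradients differ only by the positive scalar 1/p_\theta (Proposition 1).
% For trajectories drawn from an earlier generation \pi_{\theta_{\mathrm{old}}}, the
% importance weight \pi_\theta(\tau)/\pi_{\theta_{\mathrm{old}}}(\tau) turns the
% off-policy expectation back into the on-policy one above (Proposition 2).
```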
Two‑Sided Nature of Natural Evolution
Advantages are clear: lower design barriers, higher adaptability, and reduced bias from handcrafted rewards. However, because the reward is implicit, models may optimize unintended objectives, and small feedback biases can amplify over many iterations.
An interesting observation: unlike RL-fine-tuned models, which tend to inflate their inference token counts, the iteratively deployed models keep average token usage stable (roughly 2,000 tokens) across generations, so the capability gains come without additional inference cost.
Conclusion and Safety Outlook
The authors emphasize that they are “discovering, not inventing,” noting that each deployment initiates a self‑evolution process. The findings suggest that iterative deployment can replace traditional RL pipelines for tasks with hard‑to‑define rewards but easy verification.
Nevertheless, the implicit reward function raises safety concerns: user preferences and platform mechanisms may become hidden training signals, potentially diverging from original alignment goals. The team calls for monitoring and intervention mechanisms to prevent models from drifting.
Paper: https://arxiv.org/abs/2512.24940
AI Engineering
Focused on cutting-edge AI products and technologies and on sharing practical experience across large models, MLOps/LLMOps, AI application development, and AI infrastructure.
