How We Built a Self‑Evolving AI System Without Reward Functions

The Oxford study demonstrates that large language models can self‑evolve through a four‑step deploy‑validate‑filter‑inherit loop, eliminating handcrafted reward functions. The approach yields dramatic performance gains on Blocksworld, Rovers, and Sokoban, and the authors prove it theoretically equivalent to REINFORCE.


Reward Function Dilemma and a New Path

Most researchers assume that AI progress requires larger architectures or carefully engineered reinforcement‑learning pipelines, yet designing a perfect reward function for real‑world tasks is notoriously difficult.

The Oxford team proposes a breakthrough: let the environment itself serve as the most impartial judge and let a large language model (LLM) iterate autonomously through a natural four‑step loop.

Four‑Step Natural Loop

Natural Deployment: the model attempts to solve problems in a real environment.

Objective Verification: external tools verify the correctness of the generated plan.

Survival of the Fittest: only verified solutions are retained.

Generational Inheritance: successful experiences are merged to train the next generation.
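The four steps above can be sketched as a toy loop. Everything here is an illustrative stand‑in, not the paper's implementation: a "task" is just a list to be sorted, plan generation enumerates candidate orderings instead of sampling from an LLM, and the verifier simply checks sortedness.

```python
from itertools import permutations

def generate_plans(task, k=8):
    # Natural Deployment (stand-in): the paper samples chain-of-thought
    # plans from the LLM; here we just enumerate k candidate orderings.
    return [list(p) for p in permutations(task)][:k]

def verify(plan, task):
    # Objective Verification (stand-in for an external tool such as VAL):
    # a plan "solves" the toy task iff it orders the items.
    return plan == sorted(task)

def evolve(tasks, generations=3):
    dataset = []  # Generational Inheritance: verified data accumulates
    for _ in range(generations):
        for task in tasks:
            # Survival of the Fittest: keep only verified plans
            valid = [p for p in generate_plans(task) if verify(p, task)]
            if valid:
                dataset.append((task, valid[0]))
        # a real system would fine-tune the model on `dataset` here
    return dataset

data = evolve([[3, 1, 2], [2, 1]])
```

Note that no reward function appears anywhere: the verifier's binary pass/fail is the only training signal.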

Rigorous Experiments on Classic Planning Domains

The researchers evaluated the approach on three benchmark planning problems—Blocksworld (block rearrangement), Rovers (Mars rover tasks), and Sokoban (box‑pushing). For each domain they generated 1,000 instances and used Qwen3-4B-Thinking-2507 as the base model.

Technical Implementation of the Loop

Deployment Generation: the model generates planning proposals with chain‑of‑thought reasoning at temperature 0.6.

External Verification: standard planning‑competition tools (VAL) check solution validity.

Data Filtering: only valid proposals are kept, and the best solution per task is selected.

Supervised Fine‑Tuning: LoRA merges successful cases from all generations into the next model.
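Before the fine‑tuning step, verified trajectories have to be serialized into training examples. A minimal sketch of that conversion is below; the field names (`task`, `cot`, `plan`) and the `<think>` wrapping are our own illustrative choices, not the paper's actual data schema.

```python
def to_sft_examples(trajectories):
    """Turn verified trajectories into prompt/completion pairs for
    supervised fine-tuning. Field names are illustrative."""
    examples = []
    for t in trajectories:
        prompt = f"Solve the planning task:\n{t['task']}"
        # keep the chain-of-thought that produced the verified plan
        completion = f"<think>{t['cot']}</think>\n{t['plan']}"
        examples.append({"prompt": prompt, "completion": completion})
    return examples

demo = to_sft_examples([
    {"task": "stack A on B", "cot": "B is clear, so...", "plan": "stack(A, B)"}
])
```

Because only verified trajectories reach this stage, ordinary SFT on these pairs is what implicitly performs the reinforcement step.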

Although the pipeline appears simple, it yields striking improvements.

Revolutionary Data Efficiency

Filtering proved crucial: the filtered version achieved strong performance with just 356 high‑quality trajectories, whereas the unfiltered version required 4,017 trajectories and performed worse.
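A minimal version of that filtering step might look like the following: drop invalid trajectories, then keep the best (here, shortest) plan per task. The dict schema is our own assumption for illustration.

```python
def filter_trajectories(trajectories):
    """Keep only verified trajectories, then the best (shortest plan)
    per task. Each trajectory is a dict:
    {"task": id, "plan": [...], "valid": bool}."""
    best = {}
    for t in trajectories:
        if not t["valid"]:
            continue  # failed external verification: discard
        cur = best.get(t["task"])
        if cur is None or len(t["plan"]) < len(cur["plan"]):
            best[t["task"]] = t  # shorter plan wins for this task
    return list(best.values())

raw = [
    {"task": "bw-1", "plan": ["a", "b", "c"], "valid": True},
    {"task": "bw-1", "plan": ["a", "b"], "valid": True},   # shorter: kept
    {"task": "bw-1", "plan": ["a"], "valid": False},       # invalid: dropped
    {"task": "rv-7", "plan": ["x"], "valid": True},
]
kept = filter_trajectories(raw)
```

This kind of deduplication is what shrinks thousands of raw rollouts down to a few hundred high‑quality training trajectories.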

Quantitative Gains Across Domains

Blocksworld performance increased by 196%.

Rovers performance increased by 401%.

Sokoban performance increased by 196%.

Beyond raw scores, the model's capability was qualitatively transformed: the baseline model could plan at most 20 steps in Blocksworld, while the fifth‑generation model handled 35‑step plans, demonstrating emergent complex reasoning without handcrafted curricula.

Mathematical Proofs of Equivalence

The team proved three core propositions:

Proposition 1 shows that the gradient direction of supervised fine‑tuning (SFT) on valid trajectories matches the REINFORCE gradient for binary rewards, differing only by a positive scalar.
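In symbols (our reconstruction, with \(\pi_\theta\) the model's policy and \(R(\tau) \in \{0, 1\}\) the binary verification outcome), Proposition 1 says:

```latex
% SFT on the set of verified trajectories, i.e. those with R(tau) = 1
\nabla_\theta J_{\text{SFT}}(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta \mid R(\tau)=1}
    \left[ \nabla_\theta \log \pi_\theta(\tau) \right]

% REINFORCE with binary reward: only R(tau) = 1 terms contribute,
% so the two gradients differ by the positive scalar Pr(R = 1)
\nabla_\theta J_{\text{RL}}(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}
    \left[ R(\tau)\, \nabla_\theta \log \pi_\theta(\tau) \right]
  = \Pr(R=1)\; \nabla_\theta J_{\text{SFT}}(\theta)
```

Since a positive scalar does not change the gradient direction, SFT on verified data ascends the same objective as REINFORCE.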

Proposition 2 demonstrates that SFT with multi‑generation data is equivalent to REINFORCE with importance sampling, converting off‑policy data into an on‑policy expectation.
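The off‑policy correction in Proposition 2 is the standard importance‑sampling identity (again our reconstruction): trajectories collected under an earlier generation \(\pi_{\theta_k}\) can be reweighted to the current policy \(\pi_\theta\):

```latex
\mathbb{E}_{\tau \sim \pi_{\theta_k}}
  \left[ \frac{\pi_\theta(\tau)}{\pi_{\theta_k}(\tau)}\,
         R(\tau)\, \nabla_\theta \log \pi_\theta(\tau) \right]
  = \mathbb{E}_{\tau \sim \pi_\theta}
  \left[ R(\tau)\, \nabla_\theta \log \pi_\theta(\tau) \right]
```

This is why merging verified data from all previous generations remains a valid on‑policy gradient estimate in expectation.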

Proposition 3 follows from the first two, establishing a formal mathematical equivalence between the iterative deployment loop and traditional reinforcement learning.

Consequently, environment feedback automatically defines the most direct reward signal, removing the need for manually engineered scoring rules.

Two‑Sided Nature of Natural Evolution

Advantages are clear: lower design barriers, higher adaptability, and reduced bias from handcrafted rewards. However, because the reward is implicit, models may optimize unintended objectives, and small feedback biases can amplify over many iterations.

An interesting observation is that, unlike RL‑fine‑tuned models which tend to increase inference token counts, the iterative deployment models keep average token usage stable (≈ 2000 tokens) across generations, indicating efficiency gains without extra computational cost.

Conclusion and Safety Outlook

The authors emphasize that they are “discovering, not inventing,” noting that each deployment initiates a self‑evolution process. The findings suggest that iterative deployment can replace traditional RL pipelines for tasks with hard‑to‑define rewards but easy verification.

Nevertheless, the implicit reward function raises safety concerns: user preferences and platform mechanisms may become hidden training signals, potentially diverging from original alignment goals. The team calls for monitoring and intervention mechanisms to prevent models from drifting.

Paper: https://arxiv.org/abs/2512.24940

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: AI safety, Qwen3, self-evolving AI, LLM planning, REINFORCE equivalence, reward-free RL
Written by AI Engineering

Focused on cutting‑edge product and technology information and practical experience sharing in the AI field (large models, MLOps/LLMOps, AI application development, AI infrastructure).