Can LLMs Self‑Improve After Deployment? Inside Microsoft’s Online Experiential Learning
Microsoft’s Online Experiential Learning (OEL) framework lets large language models continue to self‑evolve after deployment: it extracts experience from user interactions and consolidates it into model parameters, requiring no human labels, reward models, or server‑side environment access, and it shows scalable gains across tasks and model sizes.
Why Online Experiential Learning?
The traditional offline training paradigm for large models relies on human‑annotated data for SFT or simulated environments for RLHF/RL, producing static models that cannot learn from real‑world user interactions once deployed. OEL opens an "open‑world" loop where experience gathered in the wild is turned into knowledge and fed back into the model, creating a virtuous "deploy‑and‑train" cycle.
Core Challenges
No reward signal: Real environments usually return only textual feedback (e.g., "you hit a wall") rather than a scalar reward.
No environment access: The server cannot directly query the user‑side game or interaction scene.
Catastrophic forgetting: Continuous learning can degrade the model’s general abilities.
OEL Methodology: Two‑Stage Closed Loop
The OEL framework consists of an Extraction‑Consolidation loop.
Stage 1: Experience Extraction
During multi‑turn interactions the model collects trajectories containing environment text feedback and model actions. A knowledge extractor (typically the model itself) converts these trajectories into structured "experience knowledge". The key is cumulative extraction: each new trajectory is processed with reference to previously accumulated knowledge, enabling progressive integration.
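The cumulative extraction step can be sketched as a fold over trajectories, where each new trajectory is summarized against the knowledge accumulated so far. The `extractor` callable and the toy implementation below are illustrative assumptions standing in for the LLM-as-extractor, not the paper's actual interface:

```python
from typing import Callable, List

def extract_cumulatively(
    trajectories: List[str],
    extractor: Callable[[str, str], str],
    initial_knowledge: str = "",
) -> str:
    """Fold each trajectory into the running experience knowledge.

    `extractor(knowledge, trajectory)` returns updated knowledge; in
    practice it would be a prompt to the model itself.
    """
    knowledge = initial_knowledge
    for traj in trajectories:
        # Each trajectory is processed *with reference to* prior
        # knowledge, so insights accumulate rather than being
        # re-derived from scratch.
        knowledge = extractor(knowledge, traj)
    return knowledge

# Toy extractor (assumption): one "lesson" per trajectory, no duplicates.
def toy_extractor(knowledge: str, trajectory: str) -> str:
    lesson = f"lesson from: {trajectory}"
    if lesson in knowledge:
        return knowledge
    return (knowledge + "\n" + lesson).strip()
```

The fold structure is what matters: the extractor always sees the current knowledge state, so repeated mistakes collapse into a single insight instead of bloating the knowledge base.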
Stage 2: Knowledge Consolidation
The extracted knowledge is internalized into model parameters via On‑Policy Context Distillation. The process involves:
1. Constructing partial rollout prefixes from interaction trajectories.
2. Generating responses with a student model.
3. Using a frozen teacher model (the initial model) together with the experience knowledge to produce a reference distribution.
4. Optimizing the student model by minimizing the reverse KL divergence to the teacher, thereby aligning the student’s behavior with the knowledge‑enhanced teacher.
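The reverse-KL objective in the final step can be illustrated with a small NumPy sketch. It computes D_KL(student ∥ teacher) from per-position logits; the array shapes and the sum-over-vocab, mean-over-positions reduction are illustrative assumptions, not the paper's exact loss:

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    # Subtract the max for numerical stability before exponentiating.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def reverse_kl(student_logits: np.ndarray, teacher_logits: np.ndarray) -> float:
    """D_KL(student || teacher): expectation under the *student's* own
    distribution, which is what makes the distillation on-policy in spirit.

    Shapes assumed: (num_positions, vocab_size).
    """
    p_s = softmax(student_logits)
    log_p_s = np.log(p_s + 1e-12)
    log_p_t = np.log(softmax(teacher_logits) + 1e-12)
    # Sum over the vocabulary, average over sequence positions.
    return float(np.mean(np.sum(p_s * (log_p_s - log_p_t), axis=-1)))
```

Because the expectation is taken under the student's distribution, the gradient concentrates on tokens the student actually produces, which is why the on-policy variant stays closer to the student's own behavior than off-policy distillation on fixed teacher outputs.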
Online Learning Loop
Deploy the consolidated model → collect higher‑quality trajectories → extract richer knowledge → next round of consolidation.
This alternating process forms a self‑reinforcing data flywheel.
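The loop above can be sketched as a skeleton with injected stand-ins for deployment, extraction, and consolidation; all three toy callables are hypothetical placeholders for illustration, not the paper's components:

```python
from typing import Callable, List, Tuple

def oel_loop(
    model,
    rounds: int,
    deploy_and_collect: Callable,  # model -> trajectories
    extract: Callable,             # (knowledge, trajectories) -> knowledge
    consolidate: Callable,         # (model, knowledge) -> model
) -> Tuple[object, str]:
    """One extraction/consolidation cycle per round; knowledge carries over,
    and each consolidated model collects the next round's trajectories."""
    knowledge = ""
    for _ in range(rounds):
        trajectories = deploy_and_collect(model)
        knowledge = extract(knowledge, trajectories)
        model = consolidate(model, knowledge)
    return model, knowledge

# Toy stand-ins (assumptions): the "model" is just an integer skill level.
def toy_deploy(model: int) -> List[str]:
    return [f"trajectory at skill {model}"]

def toy_extract(knowledge: str, trajectories: List[str]) -> str:
    return (knowledge + " | " + "; ".join(trajectories)).strip(" |")

def toy_consolidate(model: int, knowledge: str) -> int:
    # Pretend each consolidation round raises the model's skill by one.
    return model + 1
```

The flywheel effect lives in the data dependency: the trajectories gathered in round *n* come from the model consolidated in round *n − 1*, so better models feed better experience back into extraction.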
Experimental Validation: From Games to Real Ability
Experiments were conducted on two TextArena games: Frozen Lake (grid navigation) and Sokoban (box‑pushing). Models must solve the tasks using only textual feedback without explicit rule descriptions.
1. Continuous Online Improvement
Figure 4 shows that the extraction phase steadily raises success rates as experience knowledge accumulates, while the consolidation phase not only preserves performance but also surpasses the original baseline, providing a higher starting point for subsequent rounds.
2. Efficiency and Accuracy Gains
Figure 5 reveals that internalizing knowledge shortens response length to about 70% of the original, indicating that the model learns to think more efficiently, converting trial‑and‑error exploration into direct knowledge‑driven reasoning.
3. Mitigating Catastrophic Forgetting
Figure 6 compares On‑Policy (used by OEL) with Off‑Policy context distillation. On‑Policy maintains higher in‑distribution success and preserves out‑of‑distribution evaluation accuracy, whereas Off‑Policy suffers noticeable forgetting.
4. Scaling Laws and Knowledge Quality
Table 1 demonstrates that structured knowledge extraction outperforms raw trajectories, which contain noisy, unproductive exploration. Table 2 highlights that strategy consistency matters more than raw model capacity: a 1.7B model using knowledge extracted from its own trajectories outperforms a 4B model using knowledge from a stronger teacher, emphasizing the need for distribution‑matched knowledge.
Figure 7 shows consistent gains across Qwen‑3 models of 1.7B, 4B, and 8B parameters, with larger models benefiting more from successive OEL rounds, confirming that knowledge can be accumulated over time.
Key Insights and Implications
Reward‑Free Learning Feasibility: Pure textual feedback suffices for continual learning without reward models or human annotation.
Knowledge Extraction > Raw Data: Transforming interactions into transferable insights is more effective than memorizing trajectories.
Strategy Consistency Principle: The knowledge source must match the learner’s policy distribution; otherwise, stronger teachers can degrade performance.
Efficiency Meets Capability: Internalized experience not only improves accuracy but also reduces inference length, embodying the "learn more, think faster" effect.
https://arxiv.org/pdf/2603.16856
Online Experiential Learning for Language Models
Microsoft Research
https://aka.ms/GeneralAI
