Can LLMs Self‑Improve After Deployment? Inside Microsoft’s Online Experiential Learning

Microsoft’s Online Experiential Learning (OEL) framework lets large language models keep improving after deployment: experience is extracted from user interactions and consolidated into model parameters, with no need for human labels, reward models, or server‑side environment access, and the gains scale across tasks and model sizes.

Why Online Experiential Learning?

The traditional offline training paradigm for large models relies on human‑annotated data for supervised fine‑tuning (SFT) or simulated environments for RLHF/RL, producing static models that cannot learn from real‑world user interactions once deployed. OEL opens an "open‑world" loop in which experience gathered in the wild is turned into knowledge and fed back into the model, creating a virtuous "deploy‑and‑train" cycle.

Core Challenges

No reward signal: Real environments usually return only textual feedback (e.g., "you hit a wall") rather than a scalar reward.

No environment access: The server cannot directly query the user‑side game or interaction scene.

Catastrophic forgetting: Continuous learning can degrade the model’s general abilities.

OEL Methodology: Two‑Stage Closed Loop

The OEL framework consists of an Extraction‑Consolidation loop.

Stage 1: Experience Extraction

During multi‑turn interactions the model collects trajectories containing environment text feedback and model actions. A knowledge extractor (typically the model itself) converts these trajectories into structured "experience knowledge". The key is cumulative extraction: each new trajectory is processed with reference to previously accumulated knowledge, enabling progressive integration.
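A minimal sketch of cumulative extraction under these assumptions: `llm` is any text-generation helper exposing `generate(prompt) -> str` (in the paper's setup the extractor is typically the deployed model itself), and `EXTRACT_PROMPT` is an illustrative template, not the paper's actual prompt.

```python
# Cumulative experience extraction: each trajectory is folded into the
# previously accumulated knowledge rather than processed in isolation.

EXTRACT_PROMPT = """Existing experience knowledge:
{knowledge}

New trajectory (environment feedback and model actions):
{trajectory}

Update the experience knowledge: keep what still holds, revise anything the
new trajectory contradicts, and add new, generally useful strategies.
Return the full updated knowledge."""


def extract_experience(llm, trajectories, knowledge=""):
    """Fold each new trajectory into the accumulated experience knowledge."""
    for trajectory in trajectories:
        prompt = EXTRACT_PROMPT.format(
            knowledge=knowledge or "(none yet)", trajectory=trajectory
        )
        knowledge = llm.generate(prompt)  # progressive integration
    return knowledge
```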

Stage 2: Knowledge Consolidation

The extracted knowledge is internalized into model parameters via On‑Policy Context Distillation; a code sketch follows the steps below. The process involves:

Constructing partial rollout prefixes from interaction trajectories.

Generating responses with a student model.

Using a frozen teacher model (the initial model) together with the experience knowledge to produce a reference distribution.

Optimizing the student model by minimizing the reverse KL divergence to the teacher, thereby aligning the student’s behavior with the knowledge‑enhanced teacher.
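Here is a minimal PyTorch-style sketch of one consolidation step. It assumes HuggingFace-style causal LMs for `student` and `teacher`; the function name, the prompt layout (knowledge simply prepended to the teacher's context), and the hyperparameters are illustrative rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def on_policy_distill_step(student, teacher, prefix_ids, knowledge_ids,
                           max_new_tokens=128):
    """One consolidation step: reverse KL from the student to a frozen,
    knowledge-conditioned teacher, computed on the student's own rollout."""
    prefix_len = prefix_ids.shape[1]

    # 1) Partial-rollout prefix from an interaction trajectory; the student
    #    continues it on-policy with its current parameters.
    with torch.no_grad():
        rollout = student.generate(prefix_ids, max_new_tokens=max_new_tokens,
                                   do_sample=True)

    # 2) Student log-probs at the positions that predict the response tokens.
    student_logits = student(rollout).logits[:, prefix_len - 1:-1]
    log_p_student = F.log_softmax(student_logits, dim=-1)

    # 3) The frozen teacher (the initial model) scores the same response, but
    #    its context additionally contains the extracted experience knowledge.
    teacher_input = torch.cat([knowledge_ids, rollout], dim=1)
    with torch.no_grad():
        teacher_logits = teacher(teacher_input).logits[
            :, knowledge_ids.shape[1] + prefix_len - 1:-1]
    log_p_teacher = F.log_softmax(teacher_logits, dim=-1)

    # 4) Token-level reverse KL, KL(student || teacher), averaged over the
    #    response positions; minimizing it pulls the student toward the
    #    knowledge-enhanced teacher on the student's own samples.
    loss = (log_p_student.exp() * (log_p_student - log_p_teacher)).sum(-1).mean()
    loss.backward()
    return loss.item()
```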

Online Learning Loop

Deploy the consolidated model → collect higher‑quality trajectories → extract richer knowledge → next round of consolidation.

This alternating process forms a self‑reinforcing data flywheel.
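Putting the two stages together, one pass of the flywheel might look like the sketch below, reusing the helpers from the earlier sketches. `deploy_and_collect` and `make_batches` are hypothetical placeholders for user-side deployment and batch preparation, not functions named in the paper.

```python
# Deploy-and-train loop: each round's consolidated model is redeployed to
# gather better trajectories for the next round.

def oel_loop(student, teacher, extractor, optimizer, knowledge="", num_rounds=3):
    for _ in range(num_rounds):
        # Deploy the current (consolidated) model; a better model brings back
        # higher-quality trajectories.
        trajectories = deploy_and_collect(student)

        # Stage 1: cumulative experience extraction.
        knowledge = extract_experience(extractor, trajectories, knowledge)

        # Stage 2: on-policy context distillation against the frozen teacher.
        for prefix_ids, knowledge_ids in make_batches(trajectories, knowledge):
            optimizer.zero_grad()
            on_policy_distill_step(student, teacher, prefix_ids, knowledge_ids)
            optimizer.step()

    return student, knowledge
```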

Experimental Validation: From Games to Real Ability

Experiments were conducted on two TextArena games: Frozen Lake (grid navigation) and Sokoban (box‑pushing). Models must solve the tasks using only textual feedback without explicit rule descriptions.
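To make "only textual feedback" concrete, a single trajectory is just an alternating text exchange with no scalar reward anywhere; the roles and feedback strings below are invented for illustration and are not quoted from TextArena.

```python
# Illustrative shape of one Frozen Lake trajectory as pure text.
trajectory = [
    {"role": "assistant", "content": "I will move right."},
    {"role": "environment", "content": "You moved right onto frozen ice."},
    {"role": "assistant", "content": "I will move down."},
    {"role": "environment", "content": "You hit a wall and stayed in place."},
    {"role": "assistant", "content": "I will move right again."},
    {"role": "environment", "content": "You reached the goal."},
]
```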

1. Continuous Online Improvement

Figure 4 shows that the extraction phase steadily raises success rates as experience knowledge accumulates, while the consolidation phase not only preserves performance but also surpasses the original baseline, providing a higher starting point for subsequent rounds.

2. Efficiency and Accuracy Gains

Figure 5 reveals that internalizing knowledge shortens response length to about 70 % of the original, indicating that the model learns to think more efficiently, converting trial‑and‑error exploration into direct knowledge‑driven reasoning.

3. Mitigating Catastrophic Forgetting

Figure 6 compares On‑Policy (used by OEL) with Off‑Policy context distillation. On‑Policy maintains higher in‑distribution success and preserves out‑of‑distribution evaluation accuracy, whereas Off‑Policy suffers noticeable forgetting.
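For contrast, an off-policy variant might look like the following sketch. This is an assumed formulation, not necessarily the paper's exact baseline: the knowledge-conditioned teacher generates the target continuation and the student is pushed toward text it may never produce itself, which is the kind of distribution mismatch commonly associated with forgetting.

```python
def off_policy_distill_step(student, teacher, prefix_ids, knowledge_ids,
                            max_new_tokens=128):
    """Off-policy contrast: targets come from the teacher, not the student."""
    with torch.no_grad():
        teacher_prompt = torch.cat([knowledge_ids, prefix_ids], dim=1)
        teacher_out = teacher.generate(teacher_prompt,
                                       max_new_tokens=max_new_tokens,
                                       do_sample=True)
        target_ids = teacher_out[:, teacher_prompt.shape[1]:]

    # Standard cross-entropy toward the teacher-generated tokens.
    student_input = torch.cat([prefix_ids, target_ids], dim=1)
    logits = student(student_input).logits[:, prefix_ids.shape[1] - 1:-1]
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           target_ids.reshape(-1))
    loss.backward()
    return loss.item()
```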

4. Scaling Laws and Knowledge Quality

Table 1 demonstrates that structured knowledge extraction outperforms raw trajectories, which contain noisy, unproductive exploration. Table 2 highlights that strategy consistency matters more than raw model capacity: a 1.7 B model using knowledge extracted from its own trajectories outperforms a 4 B model using knowledge from a stronger teacher, emphasizing the need for distribution‑matched knowledge.

Figure 7 shows consistent gains across Qwen‑3 models of 1.7 B, 4 B, and 8 B parameters, with larger models benefiting more from successive OEL rounds, confirming that knowledge can be accumulated over time.

Key Insights and Implications

Reward‑Free Learning Feasibility: Pure textual feedback suffices for continual learning without reward models or human annotation.

Knowledge Extraction > Raw Data: Transforming interactions into transferable insights is more effective than memorizing trajectories.

Strategy Consistency Principle: The knowledge source must match the learner’s policy distribution; otherwise, stronger teachers can degrade performance.

Efficiency Meets Capability: Internalized experience not only improves accuracy but also reduces inference length, embodying the "learn more, think faster" effect.

Paper: Online Experiential Learning for Language Models (Microsoft Research)
https://arxiv.org/pdf/2603.16856
https://aka.ms/GeneralAI
Written by PaperAgent: daily updates analyzing cutting-edge AI research papers.