How On-Policy Context Distillation Enables LLMs to Retain Experience Forever

On-Policy Context Distillation (OPCD) compresses transient in-context knowledge into LLM parameters, allowing models to permanently retain problem-solving experience without ground-truth labels. The article walks through the OPCD framework, its training steps, teacher-student configurations, and experimental results on math, game, and system-prompt tasks, and highlights its advantages over traditional context distillation.


Introduction

Large language models (LLMs) excel at in‑context learning, but the knowledge they acquire is fleeting: once a conversation ends, the model forgets the experience. On-Policy Context Distillation (OPCD) addresses this limitation by compressing contextual knowledge into the model’s parameters, enabling permanent retention without requiring additional ground‑truth labels.

OPCD Framework

Core Idea

The key innovation of OPCD is that the student model learns from its own generation trajectory rather than from a teacher’s trajectory. This on‑policy approach forces the student to focus on high‑probability regions of the teacher’s distribution via reverse KL divergence.
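Concretely (the notation below is mine, since the article does not reproduce the paper's formulas), with query x, in-context knowledge c, student π_θ, and teacher π_T, the objective is the reverse KL between the context-free student and the context-conditioned teacher:

```latex
\min_{\theta}\;
D_{\mathrm{KL}}\!\left( \pi_{\theta}(\cdot \mid x) \,\|\, \pi_{T}(\cdot \mid c, x) \right)
\;=\;
\min_{\theta}\;
\mathbb{E}_{y \sim \pi_{\theta}(\cdot \mid x)}
\left[ \log \pi_{\theta}(y \mid x) \;-\; \log \pi_{T}(y \mid c, x) \right]
```

Because the expectation is taken over the student's own samples y, the objective is mode-seeking: the student is pushed toward responses the teacher rates highly, rather than spreading probability mass over everything the teacher could say.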

Technical Details

OPCD minimizes the reverse KL divergence between the student’s output (generated without context) and the teacher’s output (generated with full context). The training loop consists of three steps:

On-Policy Sampling: The student generates a complete response under no-context conditions.

Teacher Evaluation: The teacher evaluates each token's probability given the full context.

Reverse KL Alignment: The student updates its parameters to reduce the reverse KL divergence, effectively seeking the teacher's high-probability modes.

Algorithm pseudocode is illustrated in Figure 1.
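The three steps map directly onto a per-token distillation update. Below is a minimal PyTorch-style sketch of one such update (my own illustration, not the paper's released code), assuming Hugging Face-style causal LMs, a tokenizer shared by teacher and student, batch size 1, and no padding or attention masks.

```python
import torch
import torch.nn.functional as F

def opcd_step(student, teacher, query_ids, context_ids, optimizer, max_new_tokens=256):
    """One OPCD update (sketch): sample from the student without context,
    score with the teacher under full context, minimize per-token reverse KL."""
    # Step 1 -- on-policy sampling: the student answers the query WITHOUT context.
    with torch.no_grad():
        full_ids = student.generate(query_ids, max_new_tokens=max_new_tokens, do_sample=True)
    response_len = full_ids.shape[1] - query_ids.shape[1]

    # Step 2 -- teacher evaluation: same response, but WITH the context prepended.
    teacher_input = torch.cat([context_ids, full_ids], dim=1)
    with torch.no_grad():
        teacher_logits = teacher(teacher_input).logits
    # Logits at position i predict token i+1, so slice the positions that
    # predict the generated response tokens.
    teacher_logp = F.log_softmax(teacher_logits[:, -response_len - 1:-1, :], dim=-1)

    # Student log-probs for the same response tokens, still context-free.
    student_logits = student(full_ids).logits
    student_logp = F.log_softmax(student_logits[:, -response_len - 1:-1, :], dim=-1)

    # Step 3 -- reverse KL alignment: per-token KL(student || teacher),
    # averaged over the student's own sampled trajectory.
    loss = (student_logp.exp() * (student_logp - teacher_logp)).sum(-1).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```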

Teacher‑Student Configurations

Teacher-Student: Teacher and student are separate models (the teacher may be a larger model or a frozen copy). This default setup yields stable training.

Self-Distillation: Teacher and student share weights and are updated jointly, enabling a single model to improve itself.
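In code, the two configurations differ only in how the teacher is constructed before running `opcd_step`; the use of `copy.deepcopy` below is an illustrative assumption, not necessarily the paper's exact setup.

```python
import copy

# Teacher-Student: the teacher is a separate model (a frozen copy here, or a
# larger checkpoint loaded in its place) and is never updated.
teacher = copy.deepcopy(student).eval()
for p in teacher.parameters():
    p.requires_grad_(False)

# Self-Distillation: teacher and student are literally the same weights; the
# model scores its own context-conditioned distribution (under no_grad inside
# opcd_step) while its parameters are being updated.
teacher = student
```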

Experimental Applications

1. Experience Knowledge Distillation

The authors define a three‑stage pipeline: experience extraction, experience accumulation, and experience solidification using OPCD. No ground‑truth labels are required, making the process fully self‑supervised.
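A hedged sketch of what this pipeline might look like on top of `opcd_step` is shown below; the prompt wording, the helper names (`generate_text`, `build_experience_context`), and the choice to join all lessons into one context string are my assumptions, not details taken from the paper.

```python
def generate_text(model, tokenizer, prompt, max_new_tokens=256):
    """Sample a free-form completion from a Hugging Face-style causal LM."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    out = model.generate(ids, max_new_tokens=max_new_tokens, do_sample=True)
    return tokenizer.decode(out[0, ids.shape[1]:], skip_special_tokens=True)

def build_experience_context(model, tokenizer, problems):
    """Stages 1-2 (sketch): attempt each problem, write down a reusable lesson,
    and accumulate the lessons -- no ground-truth answers are consulted."""
    lessons = []
    for problem in problems:
        attempt = generate_text(model, tokenizer, problem)
        reflect = (f"Problem: {problem}\nYour attempt: {attempt}\n"
                   "Write one short, reusable lesson for solving problems like this.")
        lessons.append(generate_text(model, tokenizer, reflect))
    return "\n".join(lessons)

# Stage 3 (solidification): distill the accumulated experience into the weights.
# context_ids = tokenizer(build_experience_context(student, tokenizer, problems),
#                         return_tensors="pt").input_ids
# for problem in problems:
#     query_ids = tokenizer(problem, return_tensors="pt").input_ids
#     opcd_step(student, teacher, query_ids, context_ids, optimizer)
```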

Datasets used include DAPO‑Math‑17K (14K English math problems), Frozen Lake (3×3 navigation), and Sokoban (6×6 box‑pushing). Results (Tables 1‑2) show that OPCD improves test‑time accuracy and out‑of‑distribution (OOD) performance compared to off‑policy context distillation.

2. System Prompt Distillation

System prompts guide model behavior (e.g., medical QA, safety auditing), but long prompts increase inference cost. OPCD can internalize the behavior they specify directly into the weights. Experiments on MedMCQA (medical) and safety benchmarks demonstrate that OPCD preserves or improves performance while eliminating the need for lengthy prompts (Tables 3-4).
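System-prompt distillation reuses the same machinery: the long system prompt plays the role of the context given to the teacher, while the student answers the bare question. A small sketch, with a made-up prompt and a hypothetical question list:

```python
# Hypothetical long system prompt that the deployed model should internalize.
system_prompt = ("You are a careful medical assistant. Cite the relevant guideline, "
                 "consider contraindications, and refuse unsafe requests.")

for question in questions:  # e.g. MedMCQA-style queries, loaded elsewhere
    query_ids = tokenizer(question, return_tensors="pt").input_ids
    context_ids = tokenizer(system_prompt, return_tensors="pt").input_ids
    opcd_step(student, teacher, query_ids, context_ids, optimizer)

# After training, the student is served WITHOUT the system prompt, saving its
# token cost at inference time.
```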

Deep Analysis

5.1 Cross‑Size Distillation

Directly transferring a large model's experience to a smaller model harms the smaller model's performance. OPCD aligns the knowledge on-policy, allowing the small model to benefit without degradation (Figure 2).

5.2 Mitigating Catastrophic Forgetting

OPCD not only boosts in‑domain performance but also preserves OOD capabilities, as shown by superior accuracy on safety and medical test sets compared to traditional context distillation (Figure 3).

5.3 Teacher‑Student vs. Self‑Distillation

Teacher‑Student OPCD is more stable and yields higher performance than self‑distillation, which suffers from high‑variance learning signals due to a constantly changing teacher.

5.4 Knowledge vs. Raw Trace

Using structured experience knowledge rather than raw generation traces yields better performance on verification sets, indicating that distilled knowledge is more transferable and less noisy.

One‑Sentence Takeaway

OPCD enables large models to continuously learn from their own “battle experience,” distill patterns, and permanently internalize this wisdom, eliminating the need to start from scratch for each new task.
Paper: On-Policy Context Distillation for Language Models (https://arxiv.org/pdf/2602.12275)