How On-Policy Distillation Cuts LLM Training Cost by 90%
Thinking Machines Lab introduces On-Policy Distillation, a post‑training technique that matches reinforcement‑learning performance while reducing compute cost by up to tenfold, and demonstrates its effectiveness through extensive experiments on reasoning, personalization, and catastrophic‑forgetting mitigation.
Former OpenAI CTO Mira Murati's startup Thinking Machines Lab (TML) has announced a new research result following the release of its first product, Tinker: On-Policy Distillation, a post-training method that achieves reinforcement-learning-level performance at roughly one-tenth the cost.
The blog post is authored by former OpenAI researcher Kevin Lu, who helped launch GPT-4o mini and contributed to models such as o1-mini, o3, and GPT-5. Since its founding in February 2025, TML (valued at $12 billion) has released Tinker, started the Connectionism research blog, and published several technical posts.
Post‑training dilemma
In the post‑training stage of large language models, practitioners typically face two choices:
On-policy: the model learns by self-exploration, like playing chess without a coach.
Off-policy: the model imitates a fixed dataset generated by a strong teacher, like studying a grandmaster's games, which only show positions from the master's "comfort zone".
Both approaches have drawbacks: on‑policy suffers from low data efficiency and sparse rewards, while off‑policy can lead to distribution‑shift failures when the student encounters unseen situations.
Three stages of LLM training
Pre‑training : learns general language, reasoning, and world knowledge.
Mid‑training : acquires domain‑specific knowledge (code, medical records, internal documents).
Post‑training : aligns the model to follow instructions, solve math problems, or chat.
Small, fine‑tuned models often outperform large generic models in specific domains, offering privacy, easier updates, and lower inference cost.
On‑policy Distillation: the best of both worlds
TML proposes training a compact student model that receives dense, per‑token supervision from a powerful teacher model, eliminating the need for sparse RL rewards. The teacher provides the full token‑wise probability distribution (or sampled tokens), allowing the student to match the teacher’s output at every step.
"If you learn chess by playing yourself, you only get a single win/lose signal per game and never know which move mattered. Off‑policy distillation is like watching a grandmaster play, but the positions are ones a beginner would never see."
This dense supervision is implemented via a reverse KL divergence loss applied token‑by‑token, with a discount factor of zero so the student only optimizes the immediate next token.
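The loss itself is compact. Below is a minimal PyTorch sketch of the per-token reverse KL in both forms the post mentions: the full-distribution form (when the teacher exposes its entire token-wise distribution) and the sampled-token form (when only the chosen tokens are scored). The function names are illustrative, not TML's actual Tinker API.

```python
import torch
import torch.nn.functional as F

def reverse_kl_full(student_logits, teacher_logits):
    """Exact per-token reverse KL(student || teacher), shape [batch, seq].

    Usable when the teacher returns its full token-wise distribution.
    """
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    return (s_logp.exp() * (s_logp - t_logp)).sum(dim=-1)

def reverse_kl_sampled(student_logits, teacher_logits, sampled_tokens):
    """Single-sample estimate of the same quantity, shape [batch, seq].

    The estimate is valid because `sampled_tokens` were drawn from the
    student itself; with a discount factor of zero, each position is
    optimized independently of the tokens that follow it.
    """
    s_logp = F.log_softmax(student_logits, dim=-1)
    t_logp = F.log_softmax(teacher_logits, dim=-1)
    idx = sampled_tokens.unsqueeze(-1)
    return (s_logp.gather(-1, idx) - t_logp.gather(-1, idx)).squeeze(-1)
```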
Key advantages (a training-step sketch follows this list):
No need to wait for complete trajectories; short or partial sequences suffice.
Teacher queries are cheap (single forward pass), while the student does the heavy sampling work.
No separate reward or annotation model is required.
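Putting the pieces together, here is a hedged sketch of a single on-policy distillation step, assuming Hugging-Face-style `generate`/`logits` interfaces and reusing the `reverse_kl_sampled` helper (and imports) from the previous snippet. The real Tinker training loop will differ in its details.

```python
def distillation_step(student, teacher, prompt_ids, optimizer, max_new_tokens=256):
    """One on-policy distillation update (a sketch, not TML's actual code)."""
    # 1. The student samples its own (possibly partial) rollout;
    #    no gradients are needed for generation itself.
    with torch.no_grad():
        tokens = student.generate(prompt_ids, max_new_tokens=max_new_tokens)

    # 2. Re-run the student with gradients on the tokens it just produced.
    student_logits = student(tokens).logits

    # 3. A single cheap teacher forward pass grades every sampled token.
    with torch.no_grad():
        teacher_logits = teacher(tokens).logits

    # 4. Dense per-token loss: logits at position t predict token t+1.
    #    (In practice, prompt positions would be masked out of the loss.)
    loss = reverse_kl_sampled(
        student_logits[:, :-1], teacher_logits[:, :-1], tokens[:, 1:]
    ).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```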
Experimental validation
Experiments show that on-policy distillation can match RL performance with roughly 10× less compute. On the AIME'24 math-reasoning benchmark, a Qwen3-8B student distilled from a Qwen3-32B teacher reached 70% accuracy after only ~150 steps, whereas continued SFT would have required an estimated 160% more data and the RL baseline consumed ~17,920 GPU-hours.
Cost analysis indicates a 9× reduction when SFT data is already available, and up to a 30× reduction when new SFT data must be generated.
Personalization and catastrophic‑forgetting mitigation
On-policy distillation also helps retain previously learned knowledge while adapting to new tasks. TML demonstrated a two-step process (sketched below): first fine-tune on mixed internal-document and chat data, then perform on-policy distillation using an earlier version of the model as the teacher. This fully recovered instruction-following performance without losing the newly acquired domain knowledge.
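As a rough illustration, the recipe might look like the following skeleton. Every callable here (`supervised_finetune`, `on_policy_distill`) is a hypothetical placeholder standing in for a real trainer, not a published API.

```python
from copy import deepcopy

def recover_instruction_following(base_model, domain_plus_chat_data, chat_prompts,
                                  supervised_finetune, on_policy_distill):
    """Two-phase anti-forgetting recipe (placeholder trainer callables)."""
    # Keep a frozen copy of the pre-finetune checkpoint to act as the teacher.
    teacher = deepcopy(base_model)

    # Phase 1: fine-tune on mixed internal-document and chat data;
    # this injects domain knowledge but can erode chat behavior.
    student = supervised_finetune(base_model, domain_plus_chat_data)

    # Phase 2: on-policy distillation against the earlier checkpoint
    # restores instruction following without overwriting the new knowledge.
    return on_policy_distill(student=student, teacher=teacher, prompts=chat_prompts)
```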
"Essentially, the language model itself becomes the reward model; any open‑source instruction‑tuned model that can compute log‑probs can serve as the teacher."
Conclusion
TML’s work challenges the prevailing belief that expensive RL exploration is the only path to state‑of‑the‑art capabilities, showing that efficient distillation can inherit high‑level abilities with a fraction of the compute.