D-OPSD: On‑Policy Self‑Distillation Lets Few‑Step Diffusion Models Learn While Running
D-OPSD presents the first online self‑distillation framework for step‑distilled diffusion models, allowing them to be continuously fine‑tuned with only image‑text pairs, retain their fast few‑step sampling, and acquire new concepts, styles, or domain preferences without a reward model.
Core Problem
Step‑distilled diffusion models (e.g., Z‑Image‑Turbo) achieve high‑quality image generation in only a few sampling steps. When these models are further fine‑tuned, however, conventional supervised fine‑tuning (SFT) and offline reinforcement‑learning (RL) methods cause a severe train‑inference distribution shift: the states they train on drift away from the model's own few‑step rollout, and the model forgets its few‑step capability.
Limitations of Existing Paradigms
Vanilla SFT: supervises the model with the ground‑truth velocity derived from the target image, but the supervision comes from states that never appear in the model's own few‑step rollout, creating a mismatch between training and inference.
Offline RL (Diffusion‑DPO, PSO): introduces pairwise preference supervision, yet the optimized states are still not fully induced by the student's current distribution.
Online RL (ReFL, Flow‑GRPO): trains on model rollouts and better preserves few‑step behavior, but requires a reward model that most developers do not have.
D‑OPSD Design
D‑OPSD (On‑Policy Self‑Distillation) is the first online self‑distillation framework for few‑step diffusion models. It needs only image‑text pairs—no reward model or paired preference data. By keeping training and inference consistent and using the target image as a stronger contextual teacher signal, D‑OPSD balances concept learning, visual quality, prompt adherence, and retention of prior knowledge.
Key Insight: Contextual Learning from LLM/VLM
On‑policy distillation in large language models shows that a student can be trained on its own roll‑outs while a teacher, conditioned on richer context, provides stronger supervision. Diffusion models equipped with LLM/VLM encoders exhibit similar “in‑context learning”: when the target image is fed together with the text prompt, the model can generate variants that preserve the target concept or style without any extra training.
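This "in‑context" behavior is what makes the target image usable as a teacher signal. A minimal sketch of the idea, with hypothetical `encode_condition` and `model.sample` interfaces (the paper does not specify these names):

```python
def in_context_variant(model, encode_condition, prompt, target_image, num_steps=4):
    """Hypothetical interfaces throughout: encode the prompt *together with*
    the target image via the model's VLM-based condition encoder, then run
    the ordinary few-step sampler. No weights are updated; concept and style
    are preserved purely by the richer conditioning context."""
    cond = encode_condition(prompt, context_image=target_image)  # image + text
    return model.sample(cond, num_steps=num_steps)               # few-step generation
```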
Method Framework
For each image‑text pair, two conditions are encoded: the student is conditioned on the text prompt alone, while the teacher is conditioned on both the target image and the text. The student samples an on‑policy few‑step trajectory; at each state along it, the teacher predicts a stronger velocity field for the same state. The student's velocity prediction is aligned to the teacher's with a mean‑squared‑error loss, the student is updated, and the teacher is synchronously updated as an exponential moving average (EMA) of the student.
Because diffusion models predict continuous velocity fields rather than discrete token distributions, the alignment loss is an MSE on velocity predictions, analogous to KL‑based alignment in language models.
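The whole loop fits in a few lines. Below is a minimal PyTorch‑style sketch, assuming the student and teacher are flow‑matching models called as `model(x_t, t, cond)` and returning a velocity; the latent shape, Euler sampler, step schedule, and EMA decay are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

EMA_DECAY = 0.999   # assumed decay; the paper's value is not given here
NUM_STEPS = 4       # few-step sampling budget of the distilled student


@torch.no_grad()
def ema_update(teacher, student, decay=EMA_DECAY):
    """Teacher synchronously tracks the student via an exponential moving average."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.lerp_(p_s, 1.0 - decay)  # p_t = decay * p_t + (1 - decay) * p_s


def dopsd_step(student, teacher, optimizer, text_cond, image_text_cond, timesteps):
    """One D-OPSD update on a single image-text pair (hypothetical interfaces).

    `text_cond` conditions the student (text only); `image_text_cond`
    conditions the teacher (target image + text). Both models are assumed
    to predict a flow-matching velocity field v(x_t, t, cond).
    """
    # 1) On-policy rollout: the student generates its own few-step trajectory,
    #    and we record every state it actually visits.
    x = torch.randn(1, 4, 64, 64)            # initial noise; latent shape assumed
    states = []
    with torch.no_grad():
        for i in range(NUM_STEPS):
            t, t_next = timesteps[i], timesteps[i + 1]
            states.append((x.clone(), t))
            v = student(x, t, text_cond)
            x = x + (t_next - t) * v         # Euler step along the learned flow

    # 2) Self-distillation: on those same states, align the student's velocity
    #    with the image-conditioned teacher's stronger prediction.
    loss = torch.zeros(())
    for x_t, t in states:
        with torch.no_grad():
            v_teacher = teacher(x_t, t, image_text_cond)   # richer context
        v_student = student(x_t, t, text_cond)
        loss = loss + F.mse_loss(v_student, v_teacher)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # 3) The teacher is never trained directly; it only tracks the student.
    ema_update(teacher, student)
    return loss.item()
```

With this convention, `timesteps` would run from noise to image, e.g. `[0.0, 0.25, 0.5, 0.75, 1.0]` for a 4‑step schedule. Two details mirror the description above: the loss is computed only at states the student itself visited, and the teacher is never trained directly but only tracked by EMA, so its advantage over the student comes entirely from the richer image‑plus‑text conditioning.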
Why D‑OPSD Preserves Few‑Step Ability
Unlike SFT, D‑OPSD never forces the model to fit target‑image states that never appear in its own few‑step rollout. Optimization stays on the student’s actual rollout, dramatically reducing train‑inference mismatch and allowing the model to acquire new concepts while retaining its original fast sampling behavior.
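The contrast with SFT can be made concrete. A sketch, assuming the standard rectified‑flow interpolation (the paper's exact parameterization may differ):

```python
import torch

# SFT supervises at states built by noising the *target* image x1 (assumed
# rectified-flow convention: x_t = (1 - t) * x0 + t * x1, with ground-truth
# velocity v* = x1 - x0). Such x_t generally do not lie on the student's own
# few-step trajectory.
def sft_training_state(x1, t):
    x0 = torch.randn_like(x1)      # fresh noise
    x_t = (1 - t) * x0 + t * x1    # off-policy state derived from the target
    v_target = x1 - x0             # ground-truth velocity
    return x_t, v_target

# D-OPSD instead computes its loss only at the (x_t, t) pairs collected during
# the student's own rollout (the `states` list in the sketch above), so every
# supervised state is one the model actually reaches at inference time.
```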
Experimental Results
LoRA Customization (few‑shot concept learning)
With only a handful of image‑text pairs, D‑OPSD learns new concepts, maintains high visual quality, and generalizes to unseen prompts. Compared to baselines:
Baseline model: fails to understand new concepts.
SFT: learns concepts but suffers severe quality degradation (blur, artifacts).
PSO: improves quality but shows poor concept fidelity.
D‑OPSD: retains high quality while accurately reproducing target concepts and blending them naturally.
Full‑Model Fine‑Tuning (adapting to a new domain)
D‑OPSD adapts the model to a target domain (e.g., anime style) while preserving original domain knowledge and few‑step inference. Compared to:
SFT: over‑fits the target domain and forgets original knowledge.
PSO: retains some prior knowledge but adapts insufficiently.
D‑OPSD: excels in the target domain while keeping the original generation quality, learning the new without forgetting the old.
Future Research Directions
Richer teacher context: incorporate image‑editing or video generation models as guidance.
Additional training constraints: combine other objectives to further boost performance.
Multi‑expert online strategy distillation: train domain‑specific experts and distill them back into a single base model within the D‑OPSD framework.
Paper: https://arxiv.org/abs/2605.05204
Project page: https://vvvvvjdy.github.io/d-opsd/
Code repository: https://github.com/vvvvvjdy/D-OPSD
