How SPG’s Sandwich Gradient Boosts Diffusion Language Models Across Four Benchmarks
The SPG algorithm introduces a sandwiched policy gradient that uses computable lower and upper evidence bounds to align reinforcement learning for discrete diffusion language models, achieving faster convergence, higher peaks, and lower variance on four major reasoning benchmarks.
Background
Discrete diffusion language models (dLLMs) generate text by iteratively denoising corrupted (masked) sequences, offering parallel or semi-autoregressive decoding that improves latency and throughput over traditional autoregressive models. However, applying reward-based alignment or reinforcement learning (RL) to dLLMs has been difficult because the exact sequence likelihood is intractable, making standard policy-gradient methods inapplicable.
Practitioners have historically substituted the ELBO (Evidence Lower Bound) as a proxy for the likelihood, which raises the scores of good samples but fails to adequately penalize bad ones, leading to biased training.
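The direction of the bound is the crux. Writing A(y) for the advantage of a sampled response y, multiplying the inequality ELBO ≤ log-likelihood by a negative advantage flips it:

$$
A(y) < 0 \;\Longrightarrow\; A(y)\,\mathrm{ELBO}_\theta(y) \;\ge\; A(y)\,\log p_\theta(y),
$$

so for penalized samples the ELBO-weighted term sits above the true objective term, and maximizing it no longer guarantees that the true advantage-weighted log-likelihood improves.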
Method: Sandwich Policy Gradient (SPG)
The Meta team led by Yuandong Tian proposes the Sandwiched Policy Gradient (SPG), which sandwiches the intractable true likelihood between a computable lower bound (the ELBO) and a computable upper bound (the EUBO). Samples with positive advantage maximize the ELBO, while samples with negative advantage minimize the EUBO, yielding a tighter surrogate for the true objective.
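In schematic form (our notation, paraphrasing the description above rather than reproducing the paper's exact equations), the surrogate uses whichever bound keeps each advantage-weighted term below the true objective:

$$
J_{\text{SPG}}(\theta)
= \mathbb{E}_{y}\!\left[
\begin{cases}
A(y)\,\mathrm{ELBO}_\theta(y), & A(y) \ge 0\\
A(y)\,\mathrm{EUBO}_\theta(y), & A(y) < 0
\end{cases}
\right]
\;\le\;
\mathbb{E}_{y}\!\left[A(y)\,\log p_\theta(y)\right],
$$

since ELBO ≤ log-likelihood ≤ EUBO. Every term lower-bounds its counterpart in the true policy-gradient objective, so maximizing the surrogate raises the likelihood of positive-advantage samples while driving down an upper bound on the likelihood of negative-advantage ones.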
Key components include:
Rewriting the policy‑optimization objective as a relative‑advantage weighted log‑likelihood.
Introducing the "sandwich" replacement: maximize the ELBO for positive samples and minimize the EUBO for negative samples, yielding an optimizable lower bound.
Deriving a tractable form of the EUBO based on Rényi variational bounds, with both discrete and continuous-limit expressions (an illustrative bound of this family is sketched after this list).
Employing block-wise masking for Monte-Carlo estimation, which aligns the training distribution with the inference distribution: the sequence is split into equal-length blocks, one randomly chosen block is masked, the blocks before it remain clean, and the blocks after it are fully masked (see the masking sketch after this list).
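The paper's exact EUBO is not reproduced here; as an illustration of what an evidence upper bound in the Rényi family looks like (not necessarily the specific bound SPG uses), the classical order-2 (χ²) bound applies Jensen's inequality in the opposite direction from the ELBO:

$$
\log p_\theta(y)
= \log \mathbb{E}_{q(z\mid y)}\!\left[\frac{p_\theta(y,z)}{q(z\mid y)}\right]
\;\le\; \tfrac{1}{2}\,\log \mathbb{E}_{q(z\mid y)}\!\left[\left(\frac{p_\theta(y,z)}{q(z\mid y)}\right)^{\!2}\right],
$$

where, for a dLLM, z would be the intermediate masked sequences and q the forward masking process. SPG derives its own tractable discrete and continuous-limit forms of this kind of upper bound.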
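As one concrete reading of that masking scheme, here is a minimal PyTorch-style sketch; the function name, tensor shapes, and the uniformly sampled masking ratio inside the active block are assumptions for illustration, not the paper's code.

```python
import torch

def blockwise_mask(tokens, mask_id, block_size=32):
    """Block-wise masking for Monte-Carlo bound estimation (illustrative sketch).

    tokens: (B, L) integer token ids, with L assumed divisible by block_size.
    Returns the corrupted sequence and a boolean map of the positions masked
    inside the active block (where the per-token losses would be estimated).
    """
    B, L = tokens.shape
    n_blocks = L // block_size
    corrupted = tokens.clone()
    active = torch.zeros_like(tokens, dtype=torch.bool)
    for b in range(B):
        k = int(torch.randint(n_blocks, (1,)))        # randomly choose the active block
        start, end = k * block_size, (k + 1) * block_size
        ratio = torch.rand(1).item()                  # assumed: uniform masking ratio
        inblock = torch.rand(block_size) < ratio      # mask a random subset of the block
        corrupted[b, start:end][inblock] = mask_id
        active[b, start:end] = inblock
        corrupted[b, end:] = mask_id                  # all later blocks fully masked
        # blocks before the active one stay clean, matching the inference order
    return corrupted, active
```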
Experiments
SPG was evaluated by fine-tuning LLaDA-8B-Instruct with RL on four reasoning benchmarks: GSM8K, MATH500, Countdown, and Sudoku. Baselines included D1, WD1, and UniGRPO. All methods used the same block-wise semi-autoregressive decoding (block size 32, committing the two highest-confidence tokens per step) and comparable temperature settings.
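For reference, here is a minimal sketch of block-wise semi-autoregressive decoding with that configuration; the model interface, generation length, and greedy confidence rule are assumptions for illustration, not LLaDA's actual inference code.

```python
import torch

@torch.no_grad()
def blockwise_decode(model, prompt, mask_id, gen_len=128, block_size=32, tokens_per_step=2):
    """Semi-autoregressive block decoding (illustrative sketch).

    model(x) is assumed to return logits of shape (1, L, vocab) for sequence x.
    Blocks are filled left to right; inside the current block, the
    tokens_per_step most confident masked positions are committed per step.
    Assumes block_size is divisible by tokens_per_step.
    """
    x = torch.cat([prompt, torch.full((1, gen_len), mask_id, dtype=prompt.dtype)], dim=1)
    start = prompt.shape[1]
    for b0 in range(start, start + gen_len, block_size):
        while (x[0, b0:b0 + block_size] == mask_id).any():
            probs = model(x).softmax(dim=-1)             # (1, L, vocab)
            conf, pred = probs.max(dim=-1)               # per-position confidence and token
            conf[0, x[0] != mask_id] = -1.0              # ignore already-committed slots
            conf[0, :b0] = -1.0                          # ignore earlier blocks
            conf[0, b0 + block_size:] = -1.0             # ignore later, still-masked blocks
            top = conf[0].topk(tokens_per_step).indices  # most confident masked positions
            x[0, top] = pred[0, top]                     # commit those tokens
    return x
```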
Results show that SPG consistently outperforms the baselines across all four benchmarks: accuracy is higher, and the reward curves rise faster, plateau higher, and fluctuate less.
Component ablations show that removing the negative-advantage term degrades performance markedly; using only the ELBO helps but falls short of SPG; using only the EUBO is stronger yet less stable; mixing both bounds delivers the best trade-off.
Mask‑ablation experiments confirm that block‑wise masking yields significant gains over random masking, especially on the Countdown benchmark.
Conclusion
SPG addresses two long-standing obstacles to RL with dLLMs: the intractable likelihood and the mismatch between training and inference distributions. By sandwiching the true likelihood between the ELBO and the EUBO and aligning Monte-Carlo estimation with block-wise decoding, SPG trains faster, more stably, and to higher performance, achieving the best results among the compared methods on all four reasoning tasks.
Practitioners can adopt SPG by introducing an upper‑bound term when handling negative advantages or relative rewards, and by using block‑wise decoding and estimation to keep training and inference distributions aligned.
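At the loss level, that recipe can be sketched as follows: a PyTorch-style illustration assuming per-sample ELBO and EUBO estimates are already available; the function name and the mix knob blending the two bounds for negative-advantage samples echo the ablation above but are not the paper's exact formulation.

```python
import torch

def sandwiched_pg_loss(advantages, elbo, eubo, mix=1.0):
    """Sandwiched policy-gradient surrogate (illustrative sketch).

    advantages: (B,) relative advantages, e.g. reward minus the group mean.
    elbo:       (B,) per-sample evidence lower bound estimates (<= log-likelihood).
    eubo:       (B,) per-sample evidence upper bound estimates (>= log-likelihood).
    mix:        weight on the upper bound for negative-advantage samples
                (mix=1.0 is the pure sandwich; values in (0, 1) blend in the ELBO).
    """
    pos = advantages.clamp(min=0.0)              # positive part of the advantage
    neg = advantages.clamp(max=0.0)              # negative part of the advantage
    neg_bound = mix * eubo + (1.0 - mix) * elbo  # bound applied to penalized samples
    surrogate = pos * elbo + neg * neg_bound     # with mix=1, lower-bounds A * log p(y)
    return -surrogate.mean()                     # minimize the negative surrogate
```

Gradient descent on this loss pushes up the ELBO of rewarded samples and pushes down the (mixed) upper bound of penalized ones, which is the sandwich behaviour described above.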
