How SPG’s Sandwich Gradient Boosts Diffusion Language Models Across Four Benchmarks

The SPG algorithm introduces a sandwiched policy gradient that uses computable lower and upper evidence bounds to align discrete diffusion language models with reinforcement learning, achieving faster convergence, higher peak accuracy, and lower variance on four major reasoning benchmarks.

Data Party THU

Background

Discrete diffusion language models (dLLMs) generate text by iterative denoising: starting from a heavily masked (noised) sequence, they progressively recover tokens, enabling parallel or semi‑autoregressive decoding with better latency and throughput than traditional autoregressive models. However, applying reward‑based alignment or reinforcement learning (RL) to dLLMs has been difficult because the exact sequence likelihood is intractable, making standard policy‑gradient methods inapplicable as‑is.

Practitioners have historically substituted the ELBO (Evidence Lower Bound) as a proxy for the likelihood, which raises the scores of good samples but fails to adequately penalize bad ones, leading to biased training.

Method: Sandwich Policy Gradient (SPG)

Yuandong Tian's team at Meta proposes the Sandwiched Policy Gradient (SPG), which brackets the intractable true log‑likelihood between a computable lower bound (the ELBO) and a computable upper bound (the EUBO). Samples with positive advantages maximize the ELBO, while samples with negative advantages minimize the EUBO, yielding a tighter surrogate for the true objective.

Key components include:

Rewriting the policy‑optimization objective as a relative‑advantage weighted log‑likelihood.
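In generic policy‑gradient notation (our notation, not necessarily the paper's), the surrogate being rewritten is the familiar advantage‑weighted log‑likelihood, whose dependence on $\log \pi_\theta(y)$ is exactly what is intractable for dLLMs:

```latex
% Relative-advantage weighted surrogate (generic REINFORCE-style form).
% y ~ pi_old are sampled completions; A(y) is the relative advantage,
% e.g. the group-normalized reward A_i = (r_i - \bar{r}) / \sigma_r.
J(\theta) = \mathbb{E}_{y \sim \pi_{\mathrm{old}}}\!\big[\, A(y)\, \log \pi_\theta(y) \,\big]
% For a dLLM, log pi_theta(y) has no closed form, motivating the
% sandwich of computable bounds described next.
```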


Introducing the "sandwich" replacement: maximize the ELBO for positive samples and minimize the EUBO for negative samples, yielding an optimizable lower bound.
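A minimal sketch of this sandwich surrogate, assuming per‑sample ELBO and EUBO estimates are already computed (the function name and plain‑Python interface are illustrative, not the paper's implementation):

```python
def spg_loss(advantages, elbo, eubo):
    """Sandwiched policy-gradient surrogate (illustrative sketch).

    advantages: relative advantages per sampled completion
    elbo:       computable lower bounds on each sample's log-likelihood
    eubo:       computable upper bounds on each sample's log-likelihood

    Positive-advantage samples are weighted by the lower bound (pushing
    it up), negative-advantage samples by the upper bound (pushing it
    down), so the weighted sum lower-bounds the true objective.
    """
    total = 0.0
    for a, lb, ub in zip(advantages, elbo, eubo):
        if a >= 0:
            total += a * lb  # positive advantage: maximize the ELBO
        else:
            total += a * ub  # negative advantage: minimize the EUBO
    # Return a loss (negated objective) averaged over the batch.
    return -total / len(advantages)
```

In an actual training loop the ELBO/EUBO terms would be differentiable Monte‑Carlo estimates on tensors; plain floats are used here only to keep the logic visible.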


Deriving a tractable form of the EUBO based on Rényi variational bounds, with both discrete and continuous‑limit expressions.
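For background, one standard family of such bounds is the variational Rényi bound of Li and Turner (2016); the paper's exact EUBO may differ in form, but the mechanism by which a Rényi bound turns into an upper bound is this:

```latex
% Variational Renyi bound; q is any distribution over the latent z
% (for a dLLM, z plays the role of the masked/noised intermediate states).
\mathcal{L}_{\alpha}(q; x) \;=\; \frac{1}{1-\alpha}\,
  \log \mathbb{E}_{q(z)}\!\left[\left(\frac{p_\theta(x, z)}{q(z)}\right)^{\!1-\alpha}\right]
% L_alpha is non-increasing in alpha:
%   alpha -> 1 recovers the ELBO (a lower bound on log p_theta(x));
%   alpha <= 0 gives L_alpha >= log p_theta(x), i.e. an EUBO-style upper bound.
```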


Employing block‑wise masking for Monte‑Carlo estimation, which aligns the training distribution with the inference distribution. The sequence is split into equal‑length blocks; one block is randomly masked while preceding blocks remain clean and following blocks are fully masked.
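The block‑wise masking rule can be sketched as follows (a simplified token‑level illustration with an assumed `MASK` sentinel, not the paper's tensor implementation):

```python
import random

MASK = "<mask>"


def blockwise_mask(tokens, block_size, rate=None, rng=random):
    """Block-wise masking for Monte-Carlo estimation (illustrative sketch).

    One block is chosen uniformly at random and randomly masked at some
    noise level; blocks before it stay clean, blocks after it are fully
    masked -- matching the states seen during block-wise decoding.
    """
    n_blocks = (len(tokens) + block_size - 1) // block_size
    b = rng.randrange(n_blocks)      # block chosen for this MC sample
    if rate is None:
        rate = rng.random()          # random masking level within the block
    out = []
    for i, tok in enumerate(tokens):
        blk = i // block_size
        if blk < b:
            out.append(tok)          # preceding blocks remain clean
        elif blk == b:
            # chosen block: mask tokens independently at the sampled rate
            out.append(MASK if rng.random() < rate else tok)
        else:
            out.append(MASK)         # following blocks fully masked
    return out
```

Because the clean prefix / noisy block / fully masked suffix pattern is exactly what the model sees at inference, estimates computed under this masking match the decoding distribution.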


Experiments

SPG was evaluated by fine‑tuning LLaDA‑8B‑Instruct with RL on four reasoning benchmarks: GSM8K, MATH500, Countdown, and Sudoku. Baselines included D1, WD1, and UniGRPO. All methods used the same block‑wise semi‑autoregressive decoding (block size 32, two high‑confidence tokens per step) and comparable temperature settings.
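To make the decoding setting concrete, here is a sketch of the reveal order inside one block under confidence‑based parallel decoding (simplified: real decoders recompute confidences after every commit, whereas this illustration fixes them up front; the function name is ours):

```python
def decode_block(confidences, tokens_per_step=2):
    """Reveal schedule for the masked positions of one decoding block.

    confidences: per-position model confidence for the current block.
    At each step, the `tokens_per_step` most confident still-masked
    positions are committed (block size 32 and two tokens per step in
    the experiments described above).
    """
    remaining = set(range(len(confidences)))
    schedule = []
    while remaining:
        # pick the most confident still-masked positions
        step = sorted(remaining, key=lambda i: confidences[i],
                      reverse=True)[:tokens_per_step]
        schedule.append(sorted(step))
        remaining -= set(step)
    return schedule
```

A block therefore finishes in roughly `block_size / tokens_per_step` steps before decoding moves on to the next block.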

Results show that SPG consistently outperforms baselines across all metrics, achieving higher accuracy, faster reward rise, higher plateau, and lower variance.


Component ablations show that removing the negative‑advantage term degrades performance markedly. Using the ELBO alone improves on this but falls short of SPG; using the EUBO alone is stronger still but less stable. Mixing both bounds delivers the best trade‑off.


Mask‑ablation experiments confirm that block‑wise masking yields significant gains over random masking, especially on the Countdown benchmark.


Conclusion

SPG resolves two long‑standing challenges for RL with dLLMs: the intractable likelihood and the mismatch between training and inference distributions. By sandwiching the true likelihood between ELBO and EUBO and aligning Monte‑Carlo estimation with block‑wise decoding, SPG delivers faster, more stable, and higher‑performing training, leading to top‑rank results on all four evaluated reasoning tasks.

Practitioners can adopt SPG by introducing an upper‑bound term when handling negative advantages or relative rewards, and by using block‑wise decoding and estimation to keep training and inference distributions aligned.

Tags: benchmark, reinforcement learning, policy gradient, diffusion language model, EUBO, SPG
Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
