Data Party THU
Oct 31, 2025 · Artificial Intelligence
How SPG’s Sandwich Gradient Boosts Diffusion Language Models Across Four Benchmarks
SPG (Sandwiched Policy Gradient) is a reinforcement learning algorithm for discrete diffusion language models. It builds its policy gradient from computable lower and upper bounds on the evidence, and it reports faster convergence, higher peak scores, and lower variance on four reasoning benchmarks.
Diffusion Language Model · EUBO · Policy Gradient
