How d1 Boosts Reasoning in Diffusion LLMs with Reinforcement Learning
Researchers from UCLA and Meta AI introduce d1, a two‑stage post‑training framework that combines supervised fine‑tuning and a novel diffu‑GRPO reinforcement‑learning algorithm to enable efficient reasoning in masked diffusion large language models, achieving state‑of‑the‑art performance on multiple math and logic benchmarks.
Background
Large language models (LLMs) achieve strong reasoning when fine‑tuned with reinforcement learning (RL), but most work targets autoregressive (AR) models that generate tokens left‑to‑right. Diffusion LLMs (dLLMs) generate text by iterative denoising and can attend bidirectionally, providing a non‑AR alternative. Open‑source dLLMs such as LLaDA have not been combined with RL post‑training, leaving an open problem: how to apply RL efficiently to masked, non‑AR models.
Method: d1 Framework
The authors (UCLA and Meta AI) propose a two‑stage post‑training pipeline called d1:
Supervised fine‑tuning (SFT) on high‑quality reasoning trajectories.
An RL stage using a novel policy‑gradient algorithm, diffu‑GRPO, which extends GRPO with a one‑step log‑probability estimator tailored to masked dLLMs.
diffu‑GRPO adds random prompt‑mask regularization, increasing the number of gradient updates per batch and reducing the amount of online generation required, thus lowering computational cost.
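A minimal sketch of that regularization idea in PyTorch; MASK_ID and the masking rate here are illustrative placeholders rather than the paper's exact values:

```python
import torch

MASK_ID = 126336  # hypothetical id for the model's [MASK] token

def randomly_mask_prompt(prompt_ids: torch.Tensor, mask_prob: float = 0.15) -> torch.Tensor:
    """Return a copy of the prompt with each token independently masked.

    Each call yields a different corruption of the same prompt, so one batch
    of sampled completions can back several gradient steps instead of
    requiring fresh online generation for every update.
    """
    noise = torch.rand(prompt_ids.shape, device=prompt_ids.device)
    masked = prompt_ids.clone()
    masked[noise < mask_prob] = MASK_ID
    return masked
```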
Efficient Log‑Probability Estimation for dLLMs
In AR models the token‑wise log‑probability factorizes, allowing cheap sequence‑level likelihood computation. For dLLMs each token’s probability would normally require multiple calls to the masked predictor f_θ. The authors introduce two estimators:
Token‑level estimator: a single call to f_θ yields a one‑step estimate of each token’s log‑probability.
Sequence‑level estimator: a mean‑field factorization approximates the full sequence log‑probability as the sum of the independent per‑token log‑probabilities.
These estimators provide the likelihood ratios needed for GRPO’s token‑level advantage weighting and for the KL‑regularization term.
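A sketch of how both estimators might be computed, assuming f_theta is a callable mapping token ids to per‑position logits (the real LLaDA interface differs). One forward pass over the fully masked completion gives the token‑level values; summing them gives the mean‑field sequence‑level estimate:

```python
import torch
import torch.nn.functional as F

def one_step_logprobs(f_theta, prompt_ids, completion_ids, mask_id):
    """Estimate per-token and sequence log-probs with a single call to f_theta.

    The completion is fully masked, the model predicts all masked positions
    at once, and each sampled token's log-probability is read off the logits.
    """
    masked_completion = torch.full_like(completion_ids, mask_id)
    inputs = torch.cat([prompt_ids, masked_completion], dim=-1)
    logits = f_theta(inputs)                                   # (B, L, V)
    comp_logits = logits[:, prompt_ids.shape[-1]:, :]          # completion positions only
    logp = F.log_softmax(comp_logits, dim=-1)
    token_logp = logp.gather(-1, completion_ids.unsqueeze(-1)).squeeze(-1)  # token-level
    seq_logp = token_logp.sum(dim=-1)                          # mean-field sequence-level
    return token_logp, seq_logp
```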
diffu‑GRPO Loss
Using the estimators, the GRPO objective is extended to masked dLLMs. The resulting loss jointly optimizes a token‑wise advantage term and a sequence‑wise KL divergence term:
L(θ) = −E_{π_θ}[ A_t · log π_θ(t | x) ] + λ · KL_seq(π_θ ‖ π_{θ_old})

where A_t is the advantage computed from the reward signal and the KL term is evaluated with the sequence‑level estimator.
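A toy sketch of this simplified objective (the paper's full diffu‑GRPO loss builds on GRPO's clipped likelihood ratios; this only mirrors the equation above, with all tensor shapes assumed):

```python
import torch

def diffu_grpo_style_loss(token_logp, advantages, seq_logp, seq_logp_old, lam=0.1):
    """token_logp, advantages: (B, T); seq_logp, seq_logp_old: (B,)."""
    # Token-wise advantage-weighted term: -E[A_t * log pi_theta(t | x)].
    pg_term = -(advantages * token_logp).mean()
    # Sequence-wise KL penalty: a simple Monte Carlo estimate of
    # KL(pi_theta || pi_theta_old) built from the sequence-level estimator.
    kl_term = (seq_logp - seq_logp_old.detach()).mean()
    return pg_term + lam * kl_term
```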
Experiments
Four reasoning benchmarks are used: two mathematical (GSM8K and MATH500) and two logical (Countdown and Sudoku). The base model is LLaDA‑8B‑Instruct, and four variants are compared:
Baseline LLaDA‑8B‑Instruct.
LLaDA fine‑tuned with SFT only.
LLaDA fine‑tuned with diffu‑GRPO only.
d1‑LLaDA (SFT + diffu‑GRPO).
Across all tasks, d1‑LLaDA achieves higher zero‑shot accuracy than the baseline. diffu‑GRPO alone also outperforms SFT, and the combination yields the largest gains.
Key Results
diffu‑GRPO improves performance in all 12 experimental settings (four benchmarks × three generation lengths) compared with both the baseline and SFT.
d1‑LLaDA exceeds LLaDA + SFT in every setting, demonstrating a synergistic effect of the two stages.
At long generation lengths (512 tokens), the model exhibits self‑correction and backtracking behaviors not seen in shorter generations.
Conclusion
The d1 framework shows that a two‑stage post‑training pipeline—first SFT, then diffu‑GRPO—makes RL feasible for masked diffusion LLMs and substantially boosts their reasoning abilities. The proposed log‑probability estimators enable efficient policy‑gradient updates without the factorization available in AR models.
References
Paper: “d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning” (arXiv:2504.12216)
Project page: https://dllm-reasoning.github.io/
GitHub repository: https://github.com/dllm-reasoning/d1