Can Diffusion Chains Unlock More Creative Reasoning in Large Language Models?

Recent work from Westlake University's MAPLE Lab introduces the Diffusion Chain of Lateral Thought (DCoLT), which treats each intermediate denoising step of a diffusion language model as a reasoning step and optimizes the resulting non‑linear token generation with outcome‑based reinforcement learning, achieving state‑of‑the‑art performance on math and code tasks.

AI Frontier Lectures

Background

Large language models (LLMs) typically rely on linear chain‑of‑thought (CoT) prompting, which forces token generation in a fixed causal order. Human cognition, by contrast, often follows a non‑linear, divergent thinking process that jumps between concepts before integrating them into a coherent answer.

The Diffusion Chain of Lateral Thought (DCoLT)

The MAPLE Lab proposes a new inference paradigm called the Diffusion Chain of Lateral Thought (DCoLT). In diffusion language models, generation proceeds by reversing a diffusion process that gradually denoises a fully masked sequence. Each intermediate denoised state is treated as a distinct reasoning step, allowing the model to explore non‑linear generation paths.

[Figure: illustration of the diffusion denoising process]
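As a rough mental model, the unmasking trajectory can be sketched in a few lines of Python. This is a toy sketch only: the tokens, the per‑step budget, and the random reveal order are illustrative assumptions, since a real diffusion LM predicts tokens with a learned network and a trained policy chooses the order.

```python
import random

MASK = "_"

def toy_denoise(target, per_step=2, seed=0):
    """Toy sketch of iterative unmasking in a diffusion LM: start from a
    fully masked sequence and reveal a few positions per step. Each
    intermediate state plays the role of one DCoLT reasoning step.
    (Illustrative assumption: tokens are copied from `target`; a real
    model predicts them.)"""
    rng = random.Random(seed)
    state = [MASK] * len(target)
    masked = list(range(len(target)))
    trajectory = [state.copy()]
    while masked:
        # Reveal a random subset; a trained policy would choose the order.
        chosen = rng.sample(masked, min(per_step, len(masked)))
        for i in chosen:
            state[i] = target[i]
            masked.remove(i)
        trajectory.append(state.copy())
    return trajectory

for step in toy_denoise(list("3+4=7")):
    print("".join(step))
```

Each printed line is one intermediate denoised state; reading them top to bottom gives the "chain" that DCoLT optimizes.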

Methodology

Continuous‑time diffusion models (SEDD)

The model estimates a time‑dependent transition matrix and samples intermediate states via the instantaneous transition rate of the reverse process. By applying Euler integration, the probability of each step is computed and used as the policy distribution for reinforcement learning.

[Figures: continuous diffusion equation and instantaneous transition rate]
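Since the article's equations survive only as image captions, the following sketch restates the standard SEDD formulation from the original SEDD paper (the symbols $Q_t$, $p_t$, $\bar{Q}_t$ follow that formulation and are an assumption here, not recovered from this article):

```latex
% Forward process: a continuous-time Markov chain over token values
\frac{\mathrm{d} p_t}{\mathrm{d} t} = Q_t \, p_t

% Reverse-time (instantaneous) transition rate, built from probability ratios
\bar{Q}_t(y, x) = \frac{p_t(y)}{p_t(x)} \, Q_t(x, y), \qquad y \neq x

% Euler-discretized sampling step, usable as the RL policy distribution
p\left(x_{t-\Delta t} = y \mid x_t = x\right) \approx \delta_{xy} + \bar{Q}_t(y, x)\, \Delta t
```

The last line is what makes the RL view workable: the Euler step gives an explicit, differentiable probability for each sampled transition.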

Discrete‑time diffusion models (LLaDA)

Generation starts from a fully masked token sequence and iteratively unmasks tokens. The authors introduce an Unmask Policy Module (UPM) that scores each masked token and selects a subset using a Plackett‑Luce model. The selected tokens are then predicted in parallel, forming the second stage of each action.

[Figures: UPM scoring and Plackett‑Luce sampling]
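The Plackett‑Luce selection step has a convenient sampling form: perturbing each score with independent Gumbel(0, 1) noise and keeping the top‑k is equivalent to drawing k items without replacement from a Plackett‑Luce model over those scores. A minimal sketch (the function name and toy scores are assumptions, not taken from the paper's code):

```python
import math
import random

def plackett_luce_topk(scores, k, seed=None):
    """Sample k distinct indices from a Plackett-Luce model whose
    log-weights are `scores`, via the Gumbel-top-k trick."""
    rng = random.Random(seed)

    def gumbel():
        # Inverse-CDF sampling of a standard Gumbel variate.
        u = rng.random()
        while u == 0.0:  # guard the log domain
            u = rng.random()
        return -math.log(-math.log(u))

    perturbed = sorted(
        ((s + gumbel(), i) for i, s in enumerate(scores)), reverse=True
    )
    return [i for _, i in perturbed[:k]]

# e.g. pick 2 of 4 masked positions, one scored far above the rest
print(plackett_luce_topk([0.0, 0.0, 50.0, 0.0], k=2, seed=1))
```

Higher‑scored positions are unmasked earlier in expectation, which is exactly the behavior the UPM is trained to exploit.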

Reinforcement Learning Framework

The entire sequence of actions (mask‑selection and token‑prediction) is treated as a multi‑step decision process. A reward of 1 is assigned only if the final answer is correct, encouraging the model to discover diverse, non‑linear reasoning trajectories without any explicit supervision on intermediate steps.
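An outcome‑only reward of this kind plugs directly into a standard REINFORCE update. The toy sketch below uses a 3‑arm softmax policy as a stand‑in for the real sequence of unmask/predict actions (an assumption for brevity); reward is granted only when the "correct" trajectory is sampled, yet probability mass still shifts toward it:

```python
import math
import random

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def train(steps=2000, lr=0.1, seed=0):
    """REINFORCE with an outcome-only reward: reward 1 iff the sampled
    trajectory is the 'correct' one (index 2), with no supervision on
    intermediate steps -- mirroring the paper's sparse final-answer reward."""
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]  # policy parameters over 3 candidate trajectories
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(3), weights=probs)[0]
        reward = 1.0 if action == 2 else 0.0
        # Gradient of log softmax pi(action): one_hot(action) - probs.
        for i in range(3):
            grad = (1.0 if i == action else 0.0) - probs[i]
            logits[i] += lr * reward * grad
    return softmax(logits)

final = train()
```

After training, the rewarded trajectory dominates the policy, illustrating how a single terminal reward can still shape the entire multi‑step decision process.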

Experiments

The authors evaluate DCoLT on two representative diffusion language models: SEDD and LLaDA (the latter dubbed LLaDOU once the ordered‑unmasking module is added). On the GSM8K‑Aug math‑reasoning benchmark, SEDD‑based DCoLT reaches 57.0% accuracy, surpassing both standard CoT and Diffusion‑of‑Thought (DoT) baselines. On LLaDA, the LLaDOU model improves both mathematical‑reasoning accuracy and code‑generation pass rates, outperforming existing diffusion models. Visualizing the token‑generation order shows that early steps prioritize key numbers and operators while later steps fill in the surrounding text, confirming that the model reasons in a flexible, non‑sequential manner.

Conclusion

DCoLT demonstrates that treating diffusion denoising steps as reasoning actions and optimizing them with result‑based reinforcement learning can substantially boost the problem‑solving capabilities of diffusion language models. The approach bridges the gap between human‑like divergent thinking and machine‑generated text, opening new avenues for advanced LLM inference.

References

Paper: "Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models" – arXiv:2505.10446 https://arxiv.org/abs/2505.10446

GitHub repository: https://github.com/maple-research-lab/LLaDOU
