Can Diffusion Chains Unlock More Creative Reasoning in Large Language Models?
Recent work from Westlake University's MAPLE Lab introduces the Diffusion Chain of Lateral Thoughts (DCoLT), a paradigm that treats each intermediate denoising step of a diffusion language model as a reasoning step and uses result-based reinforcement learning to optimize non-linear token generation, achieving state-of-the-art results among diffusion language models on math and code tasks.
Background
Large language models (LLMs) typically rely on linear chain‑of‑thought (CoT) prompting, which forces token generation in a fixed causal order. Human cognition, by contrast, often follows a non‑linear, divergent thinking process that jumps between concepts before integrating them into a coherent answer.
Diffusion Chain of Lateral Thoughts (DCoLT)
The MAPLE Lab proposes a new inference paradigm called the Diffusion Chain of Lateral Thoughts (DCoLT). In diffusion language models, generation proceeds by reversing a diffusion process that gradually denoises a fully masked sequence. Each intermediate denoised state $x_t$ is treated as a distinct reasoning step, allowing the model to explore non-linear generation paths.
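To make the paradigm concrete, here is a minimal sketch of masked-diffusion inference, assuming a hypothetical `model` callable that maps a token sequence to per-position logits. The confidence-based unmasking heuristic shown here is a common baseline behavior; DCoLT replaces it with the learned policy described below.

```python
import torch

def masked_diffusion_generate(model, seq_len, num_steps, mask_id):
    """Sketch of masked-diffusion inference: start fully masked and
    reveal a few tokens per denoising step. Each intermediate x is
    one 'reasoning step' in DCoLT's view. `model` is a stand-in."""
    x = torch.full((seq_len,), mask_id, dtype=torch.long)
    per_step = max(1, seq_len // num_steps)
    for _ in range(num_steps):
        masked = (x == mask_id).nonzero(as_tuple=True)[0]
        if masked.numel() == 0:
            break
        logits = model(x)                      # (seq_len, vocab_size)
        probs = logits[masked].softmax(dim=-1)
        conf, tok = probs.max(dim=-1)          # best token per masked slot
        k = min(per_step, masked.numel())
        top = conf.topk(k).indices             # heuristic: most confident first
        x[masked[top]] = tok[top]
    return x
```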
Methodology
Continuous‑time diffusion models (SEDD)
The model estimates a time-dependent transition matrix $Q_t$ over the vocabulary and samples intermediate states $x_t$ along the reverse trajectory. By applying Euler integration, the probability of each denoising step is computed and used as the policy distribution for reinforcement learning.
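As a rough illustration (not the paper's exact discretization), the sketch below performs one Euler step: treating $Q_t$ as a transition-rate matrix, the next-state distribution per position is approximately $\mathrm{onehot}(x_t) + \Delta t\, Q_t(x_t, \cdot)$, and the log-probability of the sampled state serves as the policy log-probability for that step.

```python
import torch

def euler_step(q_t, x_t, dt):
    """One Euler step of continuous-time discrete diffusion (SEDD-style sketch).
    q_t: (vocab, vocab) transition-rate matrix at time t (rows sum to 0).
    x_t: (seq_len,) current token ids.
    Returns the next intermediate state and its step log-probability."""
    vocab = q_t.shape[0]
    p = torch.nn.functional.one_hot(x_t, vocab).float() + dt * q_t[x_t]
    p = p.clamp(min=0.0)                 # guard against negative mass
    p = p / p.sum(dim=-1, keepdim=True)  # renormalize after clamping
    dist = torch.distributions.Categorical(probs=p)
    x_next = dist.sample()               # sample the next intermediate state
    step_log_prob = dist.log_prob(x_next).sum()  # fed to the RL objective
    return x_next, step_log_prob
```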
Discrete‑time diffusion models (LLaDA)
Generation starts from a fully masked token sequence and iteratively unmasks tokens. The authors introduce an Unmask Policy Module (UPM) that assigns each masked position $i$ a score $s_i$ and selects a subset to unmask with a Plackett-Luce model, drawing positions sequentially without replacement with probability $\exp(s_i) / \sum_{j \in \text{remaining}} \exp(s_j)$. The selected tokens are then predicted in parallel, forming the second stage of each action.
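Below is a minimal sketch of Plackett-Luce subset selection, assuming per-position scores produced by the UPM; the function name and the way the selection log-probability feeds into training are illustrative assumptions.

```python
import torch

def plackett_luce_select(scores, k):
    """Sample k masked positions without replacement: at each draw,
    pick position i with probability softmax(scores) over the positions
    still remaining. Returns the picks and the total selection log-prob,
    which the RL objective can reinforce."""
    remaining = torch.ones_like(scores, dtype=torch.bool)
    chosen, log_prob = [], scores.new_zeros(())
    for _ in range(k):
        logits = scores.masked_fill(~remaining, float("-inf"))
        dist = torch.distributions.Categorical(logits=logits)
        idx = dist.sample()
        log_prob = log_prob + dist.log_prob(idx)
        remaining[idx] = False
        chosen.append(idx)
    return torch.stack(chosen), log_prob
```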
Reinforcement Learning Framework
The entire sequence of actions (mask‑selection and token‑prediction) is treated as a multi‑step decision process. A reward of 1 is assigned only if the final answer is correct, encouraging the model to discover diverse, non‑linear reasoning trajectories without any explicit supervision on intermediate steps.
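In its simplest form this is REINFORCE with a terminal reward; the sketch below sums the per-action log-probabilities (mask selection and token prediction) over the trajectory and scales by the outcome. The baseline term and any variance reduction the authors actually use are assumptions here.

```python
def outcome_reinforce_loss(action_log_probs, reward, baseline=0.0):
    """REINFORCE-style loss over a whole denoising trajectory.
    action_log_probs: scalar log-probs, one per action (both the
    Plackett-Luce selection and the parallel token prediction).
    reward: 1.0 if the final answer is correct, else 0.0."""
    total_log_prob = sum(action_log_probs)
    return -(reward - baseline) * total_log_prob
```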
Experiments
The authors evaluate DCoLT on two representative diffusion language models: SEDD and LLaDA (the LLaDA variant equipped with the unmask policy module is named LLaDOU). On the GSM8K-Aug math reasoning benchmark, the SEDD-based DCoLT reaches 57.0% accuracy, surpassing both standard CoT and Diffusion-of-Thought (DoT) baselines. On LLaDA, the LLaDOU model improves both mathematical reasoning accuracy and code-generation pass rates, outperforming existing diffusion language models. Visualizations of the token generation order show that early steps prioritize key numbers and operators while later steps fill in the surrounding text, confirming that the model reasons in a flexible, non-sequential manner.
Conclusion
DCoLT demonstrates that treating diffusion denoising steps as reasoning actions and optimizing them with result‑based reinforcement learning can substantially boost the problem‑solving capabilities of diffusion language models. The approach bridges the gap between human‑like divergent thinking and machine‑generated text, opening new avenues for advanced LLM inference.
References
Paper: "Reinforcing the Diffusion Chain of Lateral Thought with Diffusion Language Models" – arXiv:2505.10446 https://arxiv.org/abs/2505.10446
GitHub repository: https://github.com/maple-research-lab/LLaDOU
