How DPad Cuts Inference Time 61× While Boosting Accuracy in Diffusion LLMs
The article analyzes a recent Duke University paper that reveals a "scratchpad" mechanism in diffusion large language models, proposes the DPad method to prune redundant suffix tokens before decoding, and reports speed‑ups of up to 61.4× on GSM8K and 97.3× on HumanEval with unchanged or even improved accuracy.
Background
Diffusion large language models (dLLMs) use bidirectional attention over the entire future context, which enables global planning but incurs heavy computational redundancy because every decoder layer attends to all suffix tokens.
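For intuition only (the toy lengths below are invented, not taken from the paper), the extra per‑layer work can be seen by counting attended (query, key) pairs under a causal mask versus the full bidirectional mask a dLLM applies over prefix, current block, and suffix:

import numpy as np

prefix_len, block_len, suffix_len = 8, 2, 16         # invented toy lengths
total = prefix_len + block_len + suffix_len

# Causal mask: position i attends only to positions 0..i (lower triangle).
causal = np.tril(np.ones((total, total), dtype=bool))

# dLLM-style mask: every position attends to every position, including all masked suffix tokens.
bidirectional = np.ones((total, total), dtype=bool)

print(int(causal.sum()), int(bidirectional.sum()))   # attended pairs per layer: 351 vs. 676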
Scratchpad Mechanism
Analysis shows that suffix tokens act as a “scratchpad”: each Transformer block writes tentative drafts of future text into these tokens (write phase) and reads them back in deeper layers (read phase). This write‑read workflow enables global planning, but many of the stored tokens are redundant: attention scores decay with distance from the current block, and distant tokens with high scores are sparse.
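One way to see this decay empirically is to bucket suffix positions by distance from the current block and average the attention mass they receive; the sketch below is a diagnostic of my own (the (heads, queries, keys) layout of the attention tensor is an assumption), not part of the paper's method:

import numpy as np

def suffix_attention_by_distance(attn, suffix_start, num_bins=4):
    # attn: attention weights of one layer, shape (heads, query_len, key_len)
    # Average over heads and queries the mass placed on each suffix key position.
    per_suffix_token = attn[:, :, suffix_start:].mean(axis=(0, 1))
    # Split suffix positions into near-to-far distance buckets and average each bucket.
    buckets = np.array_split(per_suffix_token, num_bins)
    return [float(b.mean()) for b in buckets]

# On a typical layer the near buckets dominate, while the far buckets stay close to zero
# with only sparse spikes among distant tokens.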
Method – DPad (Diffusion Scratchpad)
DPad removes the ineffective suffix tokens before decoding, keeping only a small informative subset. The procedure consists of two simple steps:
Compute attention‑score statistics for each suffix token (e.g., average attention weight from the current block). Identify tokens whose scores are below a redundancy threshold and discard them.
Retain a local window of tokens nearest to the block currently being decoded (e.g., the nearest k suffix tokens) to preserve necessary context.
Pruning the suffix before decoding reduces the quadratic attention cost over the suffix to near‑linear in the number of retained tokens, while preserving the scratchpad’s write‑read capability.
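In rough terms (notation introduced here for illustration, not taken from the paper), with a prefix of length L_p, a current block of length B, and a suffix of length L_s of which only k ≪ L_s tokens are retained, the per‑layer attention cost scales roughly as

\[
O\bigl((L_p + B + L_s)^2\bigr) \;\longrightarrow\; O\bigl((L_p + B + k)^2\bigr),
\]

so the suffix‑dependent part of the cost is governed by the small retained count k rather than the full suffix length.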
Experimental Setup
The authors evaluated DPad on three representative dLLMs: LLaDA‑1.5, Dream‑Base, and an LLaDA‑Instruct variant. Benchmarks included GSM8K (1024‑token generation, 1‑shot) and HumanEval (2048‑token generation, 0‑shot). For each benchmark they measured:
Inference throughput (tokens per second) and the derived speed‑up factor over the vanilla model.
Strict‑match accuracy (exact format compliance) and flexible‑match accuracy.
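For reference, a minimal sketch of how such throughput and speed‑up figures are typically computed; generate_fn, dpad_generate, and vanilla_generate are hypothetical placeholders, not the paper's evaluation harness:

import time

def throughput(generate_fn, prompt, max_new_tokens):
    # Tokens generated per second of wall-clock time for a single call.
    start = time.perf_counter()
    tokens = generate_fn(prompt, max_new_tokens)
    return len(tokens) / (time.perf_counter() - start)

# Speed-up factor = throughput with DPad divided by vanilla throughput, e.g.:
# speedup = throughput(dpad_generate, prompt, 1024) / throughput(vanilla_generate, prompt, 1024)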
Results
When combined with existing inference optimizations (e.g., Flash‑Attention, kernel fusion), DPad achieved:
61.39× speed‑up on GSM8K with LLaDA‑1.5, with strict‑match accuracy improving from 71.2 % to 73.5 %.
97.32× speed‑up on HumanEval with Dream‑Base, while strict‑match accuracy increased from 68.4 % to 70.1 %.
Attention heatmaps show that after pruning the model reallocates attention to the remaining nearby suffix tokens, confirming the “self‑healing” behavior.
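The “self‑healing” effect has a simple mechanical reading: softmax renormalizes over whichever keys remain, so probability mass that previously went to pruned distant tokens shifts onto the retained nearby ones. A toy illustration with invented scores:

import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

scores = np.array([2.0, 1.5, 0.5, 0.2, 0.1])   # one query's scores over five suffix keys
keep = np.array([0, 1, 2])                      # indices that survive pruning

full = softmax(scores)                          # mass on the kept keys is ~0.85 before pruning
pruned = softmax(scores[keep])                  # after pruning the same keys hold all the mass
print(full[keep].sum(), pruned.sum())           # ~0.85 -> 1.0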
In‑Context Learning Analysis
Because DPad discards redundant future tokens, the model’s attention concentrates on the prompt (prefix) region. This leads to higher strict‑match scores, indicating better utilization of the provided examples and format cues.
Implementation Details
The pruning can be implemented as a preprocessing step before each decoding iteration:
import numpy as np

def dpad_prune(suffix_tokens, attn_scores, top_k=64, window=8):
    # 1. Rank suffix tokens by the average attention weight they receive
    scores = attn_scores.mean(axis=0)                        # shape: (suffix_len,)
    # 2. Keep the top-k highest-scoring tokens globally
    keep = set(int(i) for i in np.argsort(-scores)[:top_k])
    # 3. Always retain a local window of suffix tokens nearest the current block
    keep.update(range(min(window, len(suffix_tokens))))
    # 4. Return the retained tokens in their original order
    return [suffix_tokens[i] for i in sorted(keep)]

The pruning itself costs little more than a sort over the suffix scores, and its result can be cached for static prompts.
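A usage sketch with dummy inputs (the shapes, 8 heads over a 256‑token suffix, are assumptions for illustration):

import numpy as np

suffix_tokens = list(range(256))            # placeholder suffix token ids
attn_scores = np.random.rand(8, 256)        # per-head attention received by each suffix token
kept = dpad_prune(suffix_tokens, attn_scores, top_k=64, window=8)
print(len(kept))                            # at most top_k + window tokens survive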
Conclusion and Outlook
DPad demonstrates that lightweight pruning of suffix tokens before decoding can bridge the gap between full‑global‑attention dLLMs and efficient block‑wise variants, delivering orders‑of‑magnitude speed‑ups without sacrificing, and sometimes improving, accuracy. Future work may integrate DPad‑style sparsity into pre‑training or fine‑tuning pipelines to further reduce redundancy.