How DPad Cuts Inference Time 61× While Boosting Accuracy in Diffusion LLMs
The article analyzes a recent Duke University paper that reveals a "scratchpad" mechanism in diffusion large language models, proposes the DPad method to prune redundant suffix tokens before decoding, and reports speed‑ups of up to 61.4× on GSM8K and 97.3× on HumanEval with unchanged or even improved accuracy.
Background
Diffusion large language models (dLLMs) use bidirectional attention over the entire future context, which enables global planning but incurs heavy computational redundancy because every decoder layer attends to all suffix tokens.
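For intuition only (the toy lengths below are invented, not taken from the paper), the extra per‑layer work can be seen by counting attended (query, key) pairs under a causal mask versus the full bidirectional mask a dLLM applies over prefix, current block, and suffix:

import numpy as np

prefix_len, block_len, suffix_len = 8, 2, 16         # invented toy lengths
total = prefix_len + block_len + suffix_len

# Causal mask: position i attends only to positions 0..i (lower triangle).
causal = np.tril(np.ones((total, total), dtype=bool))

# dLLM-style mask: every position attends to every position, including all masked suffix tokens.
bidirectional = np.ones((total, total), dtype=bool)

print(int(causal.sum()), int(bidirectional.sum()))   # attended pairs per layer: 351 vs. 676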
Scratchpad Mechanism
Analysis shows that suffix tokens act as a “scratchpad”: each Transformer block writes tentative drafts of future text into these tokens (write phase) and reads them back in deeper layers (read phase). This write‑read workflow enables global planning, but many of the stored tokens are redundant: attention scores decay with distance from the current block, and distant tokens with high scores are sparse.
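One way to see this decay empirically is to bucket suffix positions by distance from the current block and average the attention mass they receive; the sketch below is a diagnostic of my own (the (heads, queries, keys) layout of the attention tensor is an assumption), not part of the paper's method:

import numpy as np

def suffix_attention_by_distance(attn, suffix_start, num_bins=4):
    # attn: attention weights of one layer, shape (heads, query_len, key_len)
    # Average over heads and queries the mass placed on each suffix key position.
    per_suffix_token = attn[:, :, suffix_start:].mean(axis=(0, 1))
    # Split suffix positions into near-to-far distance buckets and average each bucket.
    buckets = np.array_split(per_suffix_token, num_bins)
    return [float(b.mean()) for b in buckets]

# On a typical layer the near buckets dominate, while the far buckets stay close to zero
# with only sparse spikes among distant tokens.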
Method – DPad (Diffusion Scratchpad)
DPad removes the ineffective suffix tokens before decoding, keeping only a small informative subset. The procedure consists of two simple steps:
Compute attention‑score statistics for each suffix token (e.g., average attention weight from the current block). Identify tokens whose scores are below a redundancy threshold and discard them.
Retain a local window of tokens nearest to the block currently being decoded (e.g., the nearest k suffix tokens) to preserve necessary context.
Pruning the suffix before decoding reduces the quadratic attention cost over the suffix to near‑linear in the number of retained tokens, while preserving the scratchpad’s write‑read capability.
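In rough terms (notation introduced here for illustration, not taken from the paper), with a prefix of length L_p, a current block of length B, and a suffix of length L_s of which only k ≪ L_s tokens are retained, the per‑layer attention cost scales roughly as

\[
O\bigl((L_p + B + L_s)^2\bigr) \;\longrightarrow\; O\bigl((L_p + B + k)^2\bigr),
\]

so the suffix‑dependent part of the cost is governed by the small retained count k rather than the full suffix length.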
Experimental Setup
The authors evaluated DPad on three representative dLLMs: LLaDA‑1.5, Dream‑Base, and an LLaDA‑Instruct variant. Benchmarks included GSM8K (1024‑token generation, 1‑shot) and HumanEval (2048‑token generation, 0‑shot). For each benchmark they measured:
Inference throughput (tokens per second) and the derived speed‑up factor over the vanilla model.
Strict‑match accuracy (exact format compliance) and flexible‑match accuracy.
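For reference, a minimal sketch of how such throughput and speed‑up figures are typically computed; generate_fn, dpad_generate, and vanilla_generate are hypothetical placeholders, not the paper's evaluation harness:

import time

def throughput(generate_fn, prompt, max_new_tokens):
    # Tokens generated per second of wall-clock time for a single call.
    start = time.perf_counter()
    tokens = generate_fn(prompt, max_new_tokens)
    return len(tokens) / (time.perf_counter() - start)

# Speed-up factor = throughput with DPad divided by vanilla throughput, e.g.:
# speedup = throughput(dpad_generate, prompt, 1024) / throughput(vanilla_generate, prompt, 1024)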
Results
When combined with existing inference optimizations (e.g., Flash‑Attention, kernel fusion), DPad achieved:
61.39× speed‑up on GSM8K with LLaDA‑1.5, with strict‑match accuracy improving from 71.2 % to 73.5 %.
97.32× speed‑up on HumanEval with Dream‑Base, while strict‑match accuracy increased from 68.4 % to 70.1 %.
Attention heatmaps show that after pruning the model reallocates attention to the remaining nearby suffix tokens, confirming the “self‑healing” behavior.
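The “self‑healing” effect has a simple mechanical reading: softmax renormalizes over whichever keys remain, so probability mass that previously went to pruned distant tokens shifts onto the retained nearby ones. A toy illustration with invented scores:

import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

scores = np.array([2.0, 1.5, 0.5, 0.2, 0.1])   # one query's scores over five suffix keys
keep = np.array([0, 1, 2])                      # indices that survive pruning

full = softmax(scores)                          # mass on the kept keys is ~0.85 before pruning
pruned = softmax(scores[keep])                  # after pruning the same keys hold all the mass
print(full[keep].sum(), pruned.sum())           # ~0.85 -> 1.0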
In‑Context Learning Analysis
Because DPad discards redundant future tokens, the model’s attention concentrates on the prompt (prefix) region. This leads to higher strict‑match scores, indicating better utilization of the provided examples and format cues.
Implementation Details
The pruning can be implemented as a preprocessing step before each decoding iteration:
import numpy as np

def dpad_prune(suffix_tokens, attn_scores, top_k=64, window=8):
    # 1. Rank suffix tokens by the average attention weight they receive
    scores = attn_scores.mean(axis=0)                        # shape: (suffix_len,)
    # 2. Keep the top-k highest-scoring tokens globally
    keep = set(int(i) for i in np.argsort(-scores)[:top_k])
    # 3. Always retain a local window of suffix tokens nearest the current block
    keep.update(range(min(window, len(suffix_tokens))))
    # 4. Return the retained tokens in their original order
    return [suffix_tokens[i] for i in sorted(keep)]

The pruning itself costs little more than a sort over the suffix scores, and its result can be cached for static prompts.
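A usage sketch with dummy inputs (the shapes, 8 heads over a 256‑token suffix, are assumptions for illustration):

import numpy as np

suffix_tokens = list(range(256))            # placeholder suffix token ids
attn_scores = np.random.rand(8, 256)        # per-head attention received by each suffix token
kept = dpad_prune(suffix_tokens, attn_scores, top_k=64, window=8)
print(len(kept))                            # at most top_k + window tokens survive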
Conclusion and Outlook
DPad demonstrates that lightweight pruning of suffix tokens before decoding can bridge the gap between full‑global‑attention dLLMs and efficient block‑wise variants, delivering orders‑of‑magnitude speed‑ups without sacrificing, and sometimes improving, accuracy. Future work may integrate DPad‑style sparsity into pre‑training or fine‑tuning pipelines to further reduce redundancy.