Why Larger Blocks Hurt Diffusion Language Model Inference and How T* Solves It

This article analyzes a trade-off in masked diffusion language models: larger generation blocks increase parallelism but degrade reasoning. It shows how T*, a progressive block-scaling method built on trajectory-aware reinforcement learning, stabilizes training and improves accuracy across block sizes, with gains of up to about 15 percentage points on MATH500.

Machine Heart

Background and Challenge

Diffusion language models aim to increase parallelism by decoding several tokens of a block per forward pass. Larger blocks, however, weaken the conditioning information available to each token. This makes denoising decisions harder and often causes training to collapse when reinforcement learning is applied directly at a large block size.

Proposed Method: T*

T* (Progressive Block Scaling) reorders reinforcement learning into an easy-to-hard curriculum. It starts from a small-block diffusion model that already has reasoning ability, trains the denoising trajectory with TraceRL at a fixed block size, and doubles the block size after a prescribed number of updates. The typical schedule is B=4 → B=8 → B=16 → B=32.
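The doubling schedule can be sketched in a few lines. This is a minimal illustration, not the authors' training code: the function name and the `updates_per_stage` value are assumptions, since the article does not state how many updates each stage runs, and the TraceRL update itself is left out.

```python
from typing import Iterator, Tuple


def progressive_block_schedule(
    start_block: int = 4,
    max_block: int = 32,
    updates_per_stage: int = 100,  # hypothetical value; the article gives no number
) -> Iterator[Tuple[int, int]]:
    """Yield (update_index, block_size), doubling the block size after each stage."""
    block = start_block
    step = 0
    while block <= max_block:
        for _ in range(updates_per_stage):
            # A real trainer would run one RL update at this block size here.
            yield step, block
            step += 1
        block *= 2


# Distinct block sizes visited by the schedule, in order.
stages = sorted({b for _, b in progressive_block_schedule(updates_per_stage=2)})
print(stages)  # [4, 8, 16, 32]
```

Each stage would start from the weights produced by the previous one, so the model only has to adapt to a block twice as large rather than jump straight to B=32.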

Experimental Setup

Experiments were run on SDAR‑1.7B‑Chat and SDAR‑4B‑Chat models, evaluating MATH500, GSM8K and AIME24 with Pass@3. Baselines include the original SDAR checkpoint and a direct application of TraceRL at the same block size.
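The article does not say how Pass@3 is computed. A common choice is the standard unbiased pass@k estimator, sketched below under that assumption; with exactly three samples per problem it reduces to "at least one of the three samples is correct". The function name is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples drawn, c: samples that are correct, k: budget being scored.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=3, c=1, k=3))  # 1.0
print(pass_at_k(n=3, c=0, k=3))  # 0.0
```

The benchmark score would then be the mean of this value over all problems.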

Results on Accuracy and Stability

For the 4B model at B=8, T* raises MATH500 accuracy from 60.73 % to 76.00 %, a gain of 15.27 percentage points over the original checkpoint and 13.90 percentage points over direct TraceRL. Similar gains appear on GSM8K and AIME24. On the 1.7B model at B=32, T* reaches 59.00 % on MATH500 (vs. 54.20 % for the original checkpoint and 54.10 % for direct TraceRL) and improves GSM8K from 78.31 % to 82.00 %.
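The gains quoted above are absolute differences in accuracy (percentage points), not relative improvements. The short check below reproduces the arithmetic from the reported numbers and shows the relative gain alongside for comparison; the helper names are illustrative.

```python
def point_gain(before: float, after: float) -> float:
    """Absolute gain in percentage points."""
    return round(after - before, 2)


def relative_gain(before: float, after: float) -> float:
    """Gain relative to the starting accuracy, in percent."""
    return round(100.0 * (after - before) / before, 2)


# 4B model, B=8, MATH500: original checkpoint vs. T*.
print(point_gain(60.73, 76.00))  # 15.27
print(relative_gain(60.73, 76.00))

# 1.7B model, B=32: MATH500 and GSM8K, original checkpoint vs. T*.
print(point_gain(54.20, 59.00))
print(point_gain(78.31, 82.00))
```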

Parallelism Metrics

The paper measures Tokens‑per‑Forward (TPF) to quantify intra‑block parallelism. Autoregressive models have TPF = 1.0. T* increases TPF from 2.95 at B=8 to 3.38 at B=16 and 3.80 at B=32 for the 1.7B model, confirming higher parallelism without reverting to token‑by‑token generation.
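As a rough illustration, TPF can be read as the average number of tokens finalized per forward pass. The sketch below assumes that reading; the function name and the per-pass counts are made up for the example and are not taken from the paper.

```python
from typing import Sequence


def tokens_per_forward(tokens_finalized_per_pass: Sequence[int]) -> float:
    """Average number of tokens finalized per forward pass."""
    if not tokens_finalized_per_pass:
        raise ValueError("need at least one forward pass")
    return sum(tokens_finalized_per_pass) / len(tokens_finalized_per_pass)


# Autoregressive decoding finalizes exactly one token per pass.
print(tokens_per_forward([1, 1, 1, 1]))  # 1.0

# A hypothetical block of 8 tokens decoded in 3 passes.
print(tokens_per_forward([3, 3, 2]))
```

Under this reading, a TPF of 3.80 at B=32 would mean a 32-token block takes roughly eight to nine forward passes on average, not one.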

Denoising Order Analysis

LocalStrict evaluates how close the denoising order is to strict left‑to‑right (value = 1). T* records LocalStrict = 0.854, 0.804, 0.730 for B=8, 16, 32, indicating that the model does not collapse to a purely autoregressive schedule. Table 1 shows the relationship between denoising order, accuracy and TPF.
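The article does not give the formula for LocalStrict. To make the idea concrete, the sketch below uses a simple stand-in: the fraction of adjacent positions in a block where the left token is finalized no later than the right one. This proxy is an assumption for illustration only and may differ from the paper's definition; it shares only the property that a strict left-to-right order scores 1.

```python
from typing import Sequence


def left_to_right_agreement(finalize_step: Sequence[int]) -> float:
    """Hypothetical proxy, NOT the paper's LocalStrict definition.

    finalize_step[i] is the denoising step at which position i was finalized.
    Returns the share of adjacent pairs (i, i+1) finalized in left-to-right order.
    """
    if len(finalize_step) < 2:
        raise ValueError("need at least two positions")
    pairs = list(zip(finalize_step, finalize_step[1:]))
    in_order = sum(1 for left, right in pairs if left <= right)
    return in_order / len(pairs)


# Strict left-to-right decoding: every pair is in order.
print(left_to_right_agreement([0, 1, 2, 3]))  # 1.0

# Position 0 is finalized after positions 1 and 2: one of three pairs is out of order.
print(left_to_right_agreement([2, 0, 1, 3]))
```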

Insights on Token Scheduling

Figure 4 visualizes the first‑mask step for each token under TraceRL and T*. Both methods retain non‑monotonic updates, but T* learns a different scheduling policy that better fits the target block size. The authors suggest that reinforcement learning can directly reshape the internal token finalization order, offering a complementary direction to external search‑based inference frameworks.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
