Why WeDLM Outpaces AR Models: Diffusion Decoding Meets KV Cache for 10× Faster Inference
Tencent WeChat AI introduces WeDLM, a diffusion language model that works with standard causal attention and KV caching, achieving up to ten‑fold speedups over autoregressive models while maintaining or improving generation quality across math reasoning and open‑ended tasks.
Introduction
Autoregressive (AR) decoding is the dominant generation paradigm for large language models but suffers from token‑by‑token latency. Diffusion language models (diffusion LLMs) can recover multiple masked tokens in parallel, yet existing designs cannot surpass highly optimized AR inference engines such as vLLM because their bidirectional attention is incompatible with KV caching.
Core Insight
The key observation of WeDLM is that mask recovery does not require bidirectional attention; each masked position only needs access to all observed tokens. This can be achieved with standard causal attention, making the model fully compatible with industrial‑grade KV‑cache pipelines.
Technical Solutions
Topological Reordering
All observed tokens are moved to the physical front of the sequence while preserving their logical positions via RoPE (rotary) positional encodings. Under a causal mask, masked positions can attend to the complete context without violating causality.
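To make the mechanics concrete, here is a minimal sketch of topological reordering, assuming PyTorch tensors and a placeholder MASK_ID; the function name and the choice of -1 as the mask id are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of topological reordering (illustrative, not the official
# WeDLM code). Observed tokens are moved to the physical front of the
# sequence; every token keeps its original index as the RoPE position id.
import torch

MASK_ID = -1  # placeholder id for an unrecovered (masked) position (assumption)

def reorder_for_causal_decoding(token_ids: torch.Tensor):
    """token_ids: (seq_len,) with MASK_ID at unrecovered positions."""
    logical_pos = torch.arange(token_ids.size(0))
    observed = token_ids != MASK_ID
    # Physical order: all observed tokens first, then all masked slots.
    order = torch.cat([logical_pos[observed], logical_pos[~observed]])
    reordered_ids = token_ids[order]
    rope_positions = order  # logical positions fed to RoPE stay unchanged
    return reordered_ids, rope_positions

tokens = torch.tensor([11, MASK_ID, 13, MASK_ID, 15])
ids, pos = reorder_for_causal_decoding(tokens)
# ids -> [11, 13, 15, MASK, MASK]; pos -> [0, 2, 4, 1, 3]
# Under a plain lower-triangular (causal) mask, each masked slot now sits
# after every observed token, so it can attend to the full observed context
# while the observed prefix remains cacheable exactly like an AR prefix.
```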
Dual‑Stream Masking
Training uses two parallel streams: a clean “memory” stream that holds the original token sequence and a masked “prediction” stream that shares the same positional encodings. The prediction stream reads context from the memory stream, avoiding the propagation of noisy intermediate predictions.
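The sketch below shows one way such a dual‑stream batch could be assembled, under the assumption that the two streams are concatenated along the sequence axis and share RoPE positions; the fixed mask ratio, the MASK_ID value, and the omitted cross‑stream attention mask are simplifications for illustration, not the actual training code.

```python
# Illustrative dual-stream training batch (my reading of the description,
# not the released training code). The memory stream carries clean tokens;
# the prediction stream is a masked copy reusing the same RoPE positions,
# and the loss is taken only on its masked slots.
import torch

MASK_ID = 0          # assumed mask-token id
IGNORE_INDEX = -100  # standard ignore index for cross-entropy

def build_dual_stream_batch(clean_ids: torch.Tensor, mask_ratio: float = 0.3):
    seq_len = clean_ids.size(0)
    positions = torch.arange(seq_len)

    is_masked = torch.rand(seq_len) < mask_ratio
    pred_ids = clean_ids.clone()
    pred_ids[is_masked] = MASK_ID

    # Both streams share positional encodings; an attention mask (not shown)
    # lets the prediction stream read context from the clean memory stream
    # instead of from its own noisy intermediate predictions.
    input_ids = torch.cat([clean_ids, pred_ids])
    rope_positions = torch.cat([positions, positions])

    # Supervise only the masked positions of the prediction stream.
    labels = torch.full((2 * seq_len,), IGNORE_INDEX)
    labels[seq_len:][is_masked] = clean_ids[is_masked]
    return input_ids, rope_positions, labels
```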
Streaming Parallel Decoding
Distance Penalty: Prioritizes decoding of the leftmost masked tokens, encouraging the growth of a left‑to‑right prefix.
Instant Cache: Once a token is decoded it becomes immediately cacheable under causal attention, enabling reuse in subsequent steps.
Dynamic Sliding Window: Continuously slides a window of mask positions forward, eliminating block‑boundary waiting and keeping the decoder busy. A minimal scheduling sketch combining these three ideas follows below.
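The following sketch shows how a single scheduling step might combine the distance penalty, instant caching, and the sliding window; the linear penalty shape, the confidence threshold, and the rule of always committing the leftmost slot are assumptions for illustration, not the published algorithm, and the model call itself is stubbed out.

```python
# Sketch of one streaming decoding step (illustrative scheduling logic only;
# thresholds and penalty shape are assumptions, model calls are stubbed).
import torch

def select_tokens_to_commit(confidences: torch.Tensor,
                            window_offsets: torch.Tensor,
                            penalty: float = 0.1,
                            threshold: float = 0.9):
    """confidences: model confidence for each masked slot in the window.
    window_offsets: distance of each slot from the left edge of the window."""
    # Distance penalty: the further right a masked slot sits, the harder it
    # is to commit, biasing decoding toward growing a left-to-right prefix.
    scores = confidences - penalty * window_offsets.float()
    commit = scores >= threshold
    # Assumption: at least the leftmost slot is committed each step so the
    # cached prefix (instant cache) keeps advancing.
    commit[0] = True
    return commit

# Example: 6 masked slots in the current window.
conf = torch.tensor([0.97, 0.99, 0.85, 0.95, 0.99, 0.60])
offs = torch.arange(6)
print(select_tokens_to_commit(conf, offs))
# Committed tokens join the causal prefix and their KV entries are cached
# immediately; the window then slides right over fresh mask positions, so
# the decoder never stalls at a block boundary.
```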
Experimental Results
Generation Quality
WeDLM‑8B achieves an average score of 74.72 on the base benchmark, outperforming Qwen3‑8B (72.61) by 2.1 points. On GSM8K it improves by 4.2 points and on MATH by 2.8 points. The instruction‑tuned variant reaches 77.53, beating Qwen3‑8B‑Instruct (75.12) and SDAR‑8B‑Instruct (74.22).
Inference Speed
All speed comparisons use vLLM‑optimized AR baselines. In low‑entropy tasks (e.g., counting) WeDLM reaches 1673.3 tokens/s; in medium‑entropy math reasoning it reaches 745.2 tokens/s; and in high‑entropy open‑ended QA it reaches 197.8 tokens/s, representing up to a 10× acceleration over AR models.
Quick Start
Install the library with a single pip command from the GitHub repository:

pip install git+https://github.com/tencent/WeDLM.git

After installation, use the provided Python API to run inference. Example scripts and detailed usage instructions are available on the project’s GitHub page.
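The snippet below sketches what a minimal inference call might look like; the package name, class, method, and model identifier (wedlm, WeDLM.from_pretrained, generate, tencent/WeDLM-8B-Instruct) are hypothetical placeholders, so follow the repository’s example scripts for the real API.

```python
# Hypothetical usage sketch: all names and arguments below are assumptions
# for illustration; consult the GitHub examples for the actual interface.
from wedlm import WeDLM  # assumed package/module name

model = WeDLM.from_pretrained("tencent/WeDLM-8B-Instruct")  # assumed model id
output = model.generate(
    "A train travels 60 km in 40 minutes; what is its speed in km/h?",
    max_new_tokens=256,
)
print(output)
```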
Conclusion
WeDLM demonstrates that diffusion decoding can be made compatible with KV caching, enabling streaming parallel generation that consistently outperforms industrial AR inference in both speed and quality. The authors introduce “prefix cacheability” as a primary design goal for future efficient text‑generation models.
Project homepage: https://wedlm.github.io
GitHub repository: https://github.com/tencent/WeDLM
Model weights: https://huggingface.co/collections/tencent/wedlm