Why WeDLM Outpaces AR Models: Diffusion Decoding Meets KV Cache for 10× Faster Inference
Tencent WeChat AI introduces WeDLM, a diffusion language model that works with standard causal attention and KV caching, achieving up to ten‑fold speedups over autoregressive models while maintaining or improving generation quality across math reasoning and open‑ended tasks.
Introduction
Autoregressive (AR) decoding is the dominant generation paradigm for large language models but suffers from token‑by‑token latency. Diffusion language models (diffusion LLMs) can recover multiple masked tokens in parallel, yet existing designs cannot surpass highly optimized AR inference engines such as vLLM because their bidirectional attention is incompatible with KV caching.
Core Insight
The key observation of WeDLM is that mask recovery does not require bidirectional attention; each masked position only needs access to all observed tokens. This can be achieved with standard causal attention, making the model fully compatible with industrial‑grade KV‑cache pipelines.
Technical Solutions
Topological Reordering
All observed tokens are moved to the physical front of the sequence while preserving their logical positions via RoPE (rotary) positional encodings. Under a causal mask, masked positions can attend to the complete context without violating causality.
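To make the mechanics concrete, here is a minimal sketch of topological reordering, assuming PyTorch tensors and a placeholder MASK_ID; the function name and the choice of -1 as the mask id are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of topological reordering (illustrative, not the official
# WeDLM code). Observed tokens are moved to the physical front of the
# sequence; every token keeps its original index as the RoPE position id.
import torch

MASK_ID = -1  # placeholder id for an unrecovered (masked) position (assumption)

def reorder_for_causal_decoding(token_ids: torch.Tensor):
    """token_ids: (seq_len,) with MASK_ID at unrecovered positions."""
    logical_pos = torch.arange(token_ids.size(0))
    observed = token_ids != MASK_ID
    # Physical order: all observed tokens first, then all masked slots.
    order = torch.cat([logical_pos[observed], logical_pos[~observed]])
    reordered_ids = token_ids[order]
    rope_positions = order  # logical positions fed to RoPE stay unchanged
    return reordered_ids, rope_positions

tokens = torch.tensor([11, MASK_ID, 13, MASK_ID, 15])
ids, pos = reorder_for_causal_decoding(tokens)
# ids -> [11, 13, 15, MASK, MASK]; pos -> [0, 2, 4, 1, 3]
# Under a plain lower-triangular (causal) mask, each masked slot now sits
# after every observed token, so it can attend to the full observed context
# while the observed prefix remains cacheable exactly like an AR prefix.
```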
Dual‑Stream Masking
Training uses two parallel streams: a clean “memory” stream that holds the original token sequence and a masked “prediction” stream that shares the same positional encodings. The prediction stream reads context from the memory stream, avoiding the propagation of noisy intermediate predictions.
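The sketch below shows one way such a dual‑stream batch could be assembled, under the assumption that the two streams are concatenated along the sequence axis and share RoPE positions; the fixed mask ratio, the MASK_ID value, and the omitted cross‑stream attention mask are simplifications for illustration, not the actual training code.

```python
# Illustrative dual-stream training batch (my reading of the description,
# not the released training code). The memory stream carries clean tokens;
# the prediction stream is a masked copy reusing the same RoPE positions,
# and the loss is taken only on its masked slots.
import torch

MASK_ID = 0          # assumed mask-token id
IGNORE_INDEX = -100  # standard ignore index for cross-entropy

def build_dual_stream_batch(clean_ids: torch.Tensor, mask_ratio: float = 0.3):
    seq_len = clean_ids.size(0)
    positions = torch.arange(seq_len)

    is_masked = torch.rand(seq_len) < mask_ratio
    pred_ids = clean_ids.clone()
    pred_ids[is_masked] = MASK_ID

    # Both streams share positional encodings; an attention mask (not shown)
    # lets the prediction stream read context from the clean memory stream
    # instead of from its own noisy intermediate predictions.
    input_ids = torch.cat([clean_ids, pred_ids])
    rope_positions = torch.cat([positions, positions])

    # Supervise only the masked positions of the prediction stream.
    labels = torch.full((2 * seq_len,), IGNORE_INDEX)
    labels[seq_len:][is_masked] = clean_ids[is_masked]
    return input_ids, rope_positions, labels
```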
Streaming Parallel Decoding
Distance Penalty: Prioritizes decoding of the leftmost masked tokens, encouraging the growth of a left‑to‑right prefix.
Instant Cache: Once a token is decoded it becomes immediately cacheable under causal attention, enabling reuse in subsequent steps.
Dynamic Sliding Window: Continuously slides a window of mask positions forward, eliminating block‑boundary waiting and keeping the decoder busy. A minimal scheduling sketch combining these three ideas follows below.
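The following sketch shows how a single scheduling step might combine the distance penalty, instant caching, and the sliding window; the linear penalty shape, the confidence threshold, and the rule of always committing the leftmost slot are assumptions for illustration, not the published algorithm, and the model call itself is stubbed out.

```python
# Sketch of one streaming decoding step (illustrative scheduling logic only;
# thresholds and penalty shape are assumptions, model calls are stubbed).
import torch

def select_tokens_to_commit(confidences: torch.Tensor,
                            window_offsets: torch.Tensor,
                            penalty: float = 0.1,
                            threshold: float = 0.9):
    """confidences: model confidence for each masked slot in the window.
    window_offsets: distance of each slot from the left edge of the window."""
    # Distance penalty: the further right a masked slot sits, the harder it
    # is to commit, biasing decoding toward growing a left-to-right prefix.
    scores = confidences - penalty * window_offsets.float()
    commit = scores >= threshold
    # Assumption: at least the leftmost slot is committed each step so the
    # cached prefix (instant cache) keeps advancing.
    commit[0] = True
    return commit

# Example: 6 masked slots in the current window.
conf = torch.tensor([0.97, 0.99, 0.85, 0.95, 0.99, 0.60])
offs = torch.arange(6)
print(select_tokens_to_commit(conf, offs))
# Committed tokens join the causal prefix and their KV entries are cached
# immediately; the window then slides right over fresh mask positions, so
# the decoder never stalls at a block boundary.
```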
Experimental Results
Generation Quality
WeDLM‑8B achieves an average score of 74.72 on the base benchmark, outperforming Qwen3‑8B (72.61) by 2.1 points. On GSM8K it improves by 4.2 points and on MATH by 2.8 points. The instruction‑tuned variant reaches 77.53, beating Qwen3‑8B‑Instruct (75.12) and SDAR‑8B‑Instruct (74.22).
Inference Speed
All speed comparisons use vLLM‑optimized AR baselines. In low‑entropy tasks (e.g., counting) WeDLM reaches 1673.3 tokens/s; in medium‑entropy math reasoning it reaches 745.2 tokens/s; and in high‑entropy open‑ended QA it reaches 197.8 tokens/s, representing up to a 10× acceleration over AR models.
Quick Start
Install the library with a single pip command from the GitHub repository:

pip install git+https://github.com/tencent/WeDLM.git

After installation, use the provided Python API to run inference. Example scripts and detailed usage instructions are available on the project’s GitHub page.
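The snippet below sketches what a minimal inference call might look like; the package name, class, method, and model identifier (wedlm, WeDLM.from_pretrained, generate, tencent/WeDLM-8B-Instruct) are hypothetical placeholders, so follow the repository’s example scripts for the real API.

```python
# Hypothetical usage sketch: all names and arguments below are assumptions
# for illustration; consult the GitHub examples for the actual interface.
from wedlm import WeDLM  # assumed package/module name

model = WeDLM.from_pretrained("tencent/WeDLM-8B-Instruct")  # assumed model id
output = model.generate(
    "A train travels 60 km in 40 minutes; what is its speed in km/h?",
    max_new_tokens=256,
)
print(output)
```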
Conclusion
WeDLM demonstrates that diffusion decoding can be made compatible with KV caching, enabling streaming parallel generation that consistently outperforms industrial AR inference in both speed and quality. The authors introduce “prefix cacheability” as a primary design goal for future efficient text‑generation models.
Project homepage: https://wedlm.github.io
GitHub repository: https://github.com/tencent/WeDLM
Model weights: https://huggingface.co/collections/tencent/wedlm