LangFlow Shows Continuous Diffusion Can Match Discrete Models Through Better Training
LangFlow revisits continuous diffusion for language modeling and argues that earlier performance gaps stemmed from suboptimal training and evaluation rather than from the approach itself. With embedding‑space diffusion, a logNSR noise schedule, and a Gumbel‑based information schedule, it matches or exceeds discrete diffusion and autoregressive baselines on standard and zero‑shot benchmarks.
Motivation: Limits of Autoregressive Models
Autoregressive (AR) language models predict one token at a time, which leads to high inference latency (number of steps × per‑step latency) and limited controllability, since the prompt is just more context tokens with no privileged steering role. Moreover, AR architectures struggle with continuous modalities such as images, video, or robot actions, hindering unified multimodal modeling.
History of Diffusion Language Models
Diffusion models dominate image and video generation, prompting researchers to bring diffusion to text. Early continuous diffusion was believed to carry inherent disadvantages, while discrete diffusion (which starts from a fully masked or uniformly random token distribution) improved steadily, narrowing the perplexity gap to AR models (e.g., Masked Diffusion to within 10 PPL, Block Diffusion to within ~3 PPL).
Why Continuous Diffusion Previously Lagged
Early continuous diffusion models such as Plaid (1B parameters) performed only on par with a 100M‑parameter AR transformer, and Diffusion‑LM struggled to generate fluent sentences without conditioning. The community therefore shifted to discrete diffusion, assuming continuous diffusion was fundamentally flawed.
LangFlow: Revisiting Continuous Diffusion
The UIUC Liu Lab introduced LangFlow, a continuous flow‑matching language model. The authors argue that the poor early results were due not to architectural defects but to suboptimal training and evaluation strategies; after systematic optimizations, LangFlow matches discrete diffusion on standard benchmarks and even rivals AR models.
Embedding‑Space Diffusion
LangFlow runs diffusion directly on noisy token embeddings: the model receives a corrupted embedding, predicts the clean‑token distribution, and computes the denoising target analytically in closed form. This preserves the continuous nature of diffusion while keeping the network interface comparable to discrete diffusion (embedding in, token distribution out).
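A minimal sketch of that closed‑form step, assuming a linear flow‑matching interpolant x_t = (1 − t)·noise + t·x₁ with data at t = 1 (the paper's exact parameterization may differ, and all names here are illustrative):

```python
import torch.nn.functional as F

def denoising_target(logits, embed_table, x_t, t):
    """Closed-form denoising target from a predicted token distribution.

    logits:      (B, L, V) clean-token logits from the network
    embed_table: (V, D) token embedding matrix
    x_t:         (B, L, D) noisy embeddings at time t in [0, 1)
    """
    probs = F.softmax(logits, dim=-1)    # clean-token distribution
    x1_hat = probs @ embed_table         # expected clean embedding
    # Velocity toward the predicted clean embedding under the assumed
    # linear path x_t = (1 - t) * noise + t * x1.
    velocity = (x1_hat - x_t) / (1.0 - t + 1e-6)
    return x1_hat, velocity
```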
LogNSR Noise Schedule
Standard image‑style schedulers allocate most training steps to low‑noise regions where the cross‑entropy loss is already near zero, wasting roughly 80% of compute. LangFlow instead spaces steps linearly in the logarithm of the noise‑to‑signal ratio (logNSR), so each doubling of NSR receives equal weight; this smooths the loss curve and ensures the model gets sufficient training signal even in high‑noise regimes.
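As a sketch, a schedule linear in log(NSR) can be written as follows; the bounds and the variance‑preserving mapping are assumptions, not values from the paper:

```python
import numpy as np

def lognsr_schedule(num_steps, log_nsr_min=-6.0, log_nsr_max=6.0):
    """Steps spaced evenly in log(NSR): every doubling of the
    noise-to-signal ratio gets the same number of steps."""
    log_nsr = np.linspace(log_nsr_min, log_nsr_max, num_steps)
    nsr = np.exp(log_nsr)
    # Variance-preserving-style mapping: signal^2 + noise^2 = 1 and
    # NSR = noise / signal  =>  signal = 1 / sqrt(1 + NSR^2).
    signal = 1.0 / np.sqrt(1.0 + nsr**2)
    noise = nsr * signal
    return signal, noise
```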
Uniform Information Schedule & Gumbel Modeling
Observing that the CE loss curve quickly plateaus, the authors propose a uniform‑information schedule: sample training timesteps in proportion to the derivative of the information (i.e., the KL term) and discretize generation steps with the same weighting. Empirically this derivative follows a Gumbel distribution, and fitting it lets the schedule concentrate computation where information gain is largest.
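A sketch of Gumbel‑based sampling under these assumptions (loc/scale stand in for the fitted parameters, and the variance‑preserving mapping matches the schedule sketch above):

```python
import numpy as np

def sample_gumbel_noise_levels(batch_size, loc=0.0, scale=1.0, rng=None):
    """Draw log-NSR values from the fitted Gumbel so that training
    concentrates where the information derivative is largest."""
    rng = rng or np.random.default_rng()
    log_nsr = rng.gumbel(loc=loc, scale=scale, size=batch_size)
    nsr = np.exp(log_nsr)
    signal = 1.0 / np.sqrt(1.0 + nsr**2)
    noise = nsr * signal
    return signal, noise
```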
Training Objective
Instead of a direct mean‑squared error on the denoising target, LangFlow uses a Bregman divergence (specifically cross‑entropy, which is a Bregman divergence on the probability simplex) to align the discrete token probabilities with the continuous diffusion target, avoiding embedding collapse and mode collapse while keeping the comparison with discrete diffusion fair.
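Put together, a training step might look like the following sketch; the linear interpolant, the unweighted CE, and the `model(x_t, t)` interface are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def training_loss(model, embed_table, tokens, t):
    """Cross-entropy objective: a Bregman divergence aligning the
    predicted clean-token distribution with the true tokens."""
    x1 = embed_table[tokens]              # clean embeddings (B, L, D)
    noise = torch.randn_like(x1)
    x_t = (1.0 - t) * noise + t * x1      # assumed linear path
    logits = model(x_t, t)                # (B, L, V)
    # cross_entropy expects (B, V, L) logits against (B, L) targets.
    return F.cross_entropy(logits.transpose(1, 2), tokens)
```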
Evaluation Metrics
Two metrics are reported: perplexity (PPL, via the NLL), which measures likelihood on real data, and Generation PPL (Gen. PPL), which scores generated text with a strong reference model (GPT‑2‑Large). For continuous ODE generation, the authors derive a new variational upper bound tailored to the ODE path, yielding a reliable PPL estimate.
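A simplified sketch of the Gen. PPL computation, scoring samples with GPT‑2‑Large via Hugging Face transformers (batching and truncation details are simplified relative to the paper):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2-large")
ref = GPT2LMHeadModel.from_pretrained("gpt2-large").eval()

@torch.no_grad()
def generation_ppl(texts):
    """Token-weighted perplexity of generated samples under GPT-2-Large."""
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        ids = tok(text, return_tensors="pt").input_ids
        loss = ref(ids, labels=ids).loss  # mean NLL over predicted tokens
        n = ids.size(1) - 1               # number of predicted tokens
        total_nll += loss.item() * n
        total_tokens += n
    return float(torch.exp(torch.tensor(total_nll / total_tokens)))
```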
Experimental Results
LangFlow (130M parameters, 1M training steps) achieves the following:
- On LM1B, Generation PPL = 91.8, surpassing the strongest discrete DLM (Duo, 97.6) by nearly 6 points.
- Test‑set PPL = 31.7, nearly matching the SOTA discrete MDLM (31.0) and beating all uniform‑noise discrete models.
- On OpenWebText, LangFlow's PPL = 24.3, within about 1 point of MDLM (23.2).
- On seven zero‑shot transfer tasks, LangFlow exceeds the AR baseline on three and outperforms MDLM on four, with notable gains on PubMed (36.45 vs. 49.01) and arXiv (32.84 vs. 41.73).
- Ablations show that the Gumbel‑based schedule alone reduces Generation PPL by roughly 7×.
- Self‑conditioning, when enabled, further lowers Gen. PPL from 154.2 to 81.5 and PPL from 49.0 to 30.0.
These results constitute the first instance where a continuous diffusion language model closes the performance gap with discrete diffusion on standard language‑modeling benchmarks.
Implications
LangFlow demonstrates that continuous diffusion retains its native advantages—low‑latency decoding, fine‑grained instruction control, and seamless multimodal integration—while achieving parity with discrete approaches. The authors argue that future language‑model research should focus on combining the strengths of multiple architectures rather than forcing diffusion into an AR‑style paradigm.