Can Masked Diffusion Replace Autoregressive Models? Inside Omni-Diffusion

Omni-Diffusion introduces a masked discrete diffusion backbone for any-to-any multimodal tasks, replacing the traditional autoregressive paradigm with parallel token decoding, and demonstrates competitive performance on speech, vision understanding, and image generation tasks while offering significant inference speedups.

Background and Motivation

Autoregressive (AR) architectures dominate multimodal large language models (MLLMs) such as GPT‑4o and LLaVA, but their sequential token prediction limits parallelism and global context awareness. Masked Discrete Diffusion Models (MDMs) provide parallel decoding and flexible guidance, prompting researchers to explore a unified any‑to‑any model.

Method Overview

Omni-Diffusion converts all modalities into discrete tokens and learns their joint distribution via a masked diffusion process.

1. Modality Tokenization

Image: encoded by MAGVIT-v2 into discrete tokens drawn from an 8,192-entry codebook.

Speech: a SenseVoiceSmall front end extracts semantic features; the back end tokenizes speech with GLM-4-Voice's tokenizer, which uses a 16,384-entry codebook.

Text: uses the vocabulary of the underlying language model.
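
The summary above does not spell out how the three codebooks are combined into a single sequence. The sketch below shows one plausible scheme in which each modality's token IDs are offset into disjoint ranges of a shared vocabulary; the text vocabulary size, the offset layout, and the helper name are illustrative assumptions, not the paper's actual interface.

```python
# Illustrative sketch: merging per-modality token IDs into one shared vocabulary.
# Image and speech codebook sizes follow the article; TEXT_VOCAB and the offset
# layout are assumptions for illustration only.

TEXT_VOCAB = 32_000          # size of the language model's text vocabulary (assumed)
IMAGE_VOCAB = 8_192          # MAGVIT-v2 codebook size
SPEECH_VOCAB = 16_384        # GLM-4-Voice tokenizer codebook size

IMAGE_OFFSET = TEXT_VOCAB                    # image IDs live after text IDs
SPEECH_OFFSET = TEXT_VOCAB + IMAGE_VOCAB     # speech IDs live after image IDs


def to_shared_ids(text_ids, image_ids, speech_ids):
    """Map raw per-modality token IDs into disjoint ranges of one vocabulary."""
    shared = []
    shared += list(text_ids)                              # [0, TEXT_VOCAB)
    shared += [IMAGE_OFFSET + t for t in image_ids]       # [TEXT_VOCAB, TEXT_VOCAB + 8192)
    shared += [SPEECH_OFFSET + t for t in speech_ids]     # [..., ... + 16384)
    return shared


# Example: a short caption, a few image patch tokens, and a few speech tokens
print(to_shared_ids([12, 873, 5], [0, 4095], [100, 16383]))
```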

2. Masked Discrete Diffusion Architecture

The core model is a 7-billion-parameter Dream-7B diffusion backbone. Instead of predicting the next token, it predicts masked tokens, learning the joint distribution of multimodal token sequences.

During training, a random subset of tokens is masked and the model reconstructs the original sequence. The loss is the cross-entropy between the predicted and original tokens, enabling both unified understanding (multimodal input → text prediction) and generation (text input → image/speech token prediction).
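
To make the objective concrete, here is a minimal PyTorch-style sketch of one masked-diffusion training step: a masking ratio is sampled per example, that fraction of tokens is replaced with a [MASK] ID, and cross-entropy is computed only on the masked positions. The model wrapper, the mask ID, and the uniform ratio schedule are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # placeholder ID for the [MASK] token (assumed)

def masked_diffusion_step(model, tokens):
    """One training step: mask a random subset of tokens, reconstruct them.

    tokens: LongTensor of shape (batch, seq_len) holding shared-vocabulary IDs.
    model:  any module mapping (batch, seq_len) -> logits (batch, seq_len, vocab).
    """
    batch, seq_len = tokens.shape

    # Sample a masking ratio per example (uniform schedule is an assumption).
    ratio = torch.rand(batch, 1, device=tokens.device)
    mask = torch.rand(batch, seq_len, device=tokens.device) < ratio

    # Replace masked positions with the [MASK] token.
    corrupted = torch.where(mask, torch.full_like(tokens, MASK_ID), tokens)

    # Predict the original tokens at every position...
    logits = model(corrupted)

    # ...but compute cross-entropy only on the masked positions.
    loss = F.cross_entropy(
        logits[mask],   # (num_masked, vocab)
        tokens[mask],   # (num_masked,)
    )
    return loss
```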

3. Three‑Stage Progressive Training

Stage 1 – Visual-Language Pre-alignment: train on image captioning and text-to-image generation to align visual and textual spaces.

Stage 2 – Multimodal Joint Alignment: add speech data (ASR/TTS) to connect vision, language, and audio.

Stage 3 – Capability Enhancement: fine-tune on the self-built SDVI dataset to strengthen "speech-driven visual interaction".

The authors also propose an Attenuated Tail‑Pad Masking strategy that scales down the mask probability for padding tokens, preventing the model from over‑focusing on padding during variable‑length generation.
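
The article does not give the exact form of this attenuation, but the idea can be sketched as scaling the per-position masking probability down wherever a token is padding, so the reconstruction loss concentrates on real content. The attenuation factor and the padding-ID convention below are assumptions.

```python
import torch

def attenuated_mask_probs(tokens, base_ratio, pad_id, pad_scale=0.1):
    """Per-position masking probabilities with attenuated tail padding.

    tokens:     (batch, seq_len) token IDs, padded to a fixed length with pad_id.
    base_ratio: (batch, 1) masking ratio sampled for each example.
    pad_scale:  factor by which padding positions are down-weighted (assumed value).
    """
    probs = base_ratio.expand(-1, tokens.size(1)).clone()
    probs[tokens == pad_id] *= pad_scale   # padding is masked far less often
    return probs

# Usage: draw the training mask from these position-wise probabilities, e.g.
# probs = attenuated_mask_probs(tokens, ratio, PAD_ID)
# mask = torch.rand_like(probs) < probs
```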

4. Inference Optimizations

Position Penalty: early in decoding, suppress the probabilities of tokens near the sequence tail to avoid mirrored repetitions in images (a sketch of this idea follows the list below).

Special Token Pre-filling: insert a [begin-of-speech] marker at the 25% point of the sequence, guiding the model to follow the text semantics when generating speech.

Adaptive Token Length Allocation: dynamically adjust mask lengths for ASR/TTS based on the strong correlation between speech duration and text length, improving both quality and speed.
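
One way to read the position penalty is as an adjustment to the confidence scores that decide which masked positions get committed at each decoding step: tail positions are penalized early, so they are resolved late and mirrored repetitions are less likely. The sketch below follows that reading; the tail width, penalty magnitude, and linear fade-out are illustrative assumptions, not values from the paper.

```python
import torch

def penalized_confidence(logits, step, total_steps, tail_fraction=0.25, penalty=4.0):
    """Confidence scores for choosing which masked positions to decode next,
    with tail positions penalized early in decoding.

    logits:        (batch, seq_len, vocab) model outputs at the current step.
    step:          current decoding step (0-based).
    total_steps:   total number of diffusion steps.
    tail_fraction: portion of the sequence treated as the tail (assumed value).
    penalty:       amount subtracted from tail confidences (assumed value).
    """
    # Confidence of each position = probability of its most likely token.
    confidence = logits.softmax(dim=-1).max(dim=-1).values   # (batch, seq_len)

    seq_len = confidence.size(1)
    tail_start = int(seq_len * (1.0 - tail_fraction))

    # The penalty fades out linearly, so tail positions are decoded late
    # rather than never.
    strength = penalty * (1.0 - step / max(total_steps - 1, 1))
    confidence[:, tail_start:] -= strength
    return confidence
```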

Experiments and Results

Speech Tasks

Omni‑Diffusion achieves lower word error rates on LibriSpeech than the Any‑to‑Any baseline AnyGPT, demonstrating strong speech recognition and synthesis capabilities.

[Figure: Speech task performance comparison]

Vision Understanding and Generation

On Visual Question Answering benchmarks, Omni‑Diffusion matches or exceeds specialized visual LLMs and outperforms AnyGPT. In text‑to‑image generation, it attains competitive CLIP scores.

[Figure: Vision task performance comparison]

Sampling Efficiency

Thanks to parallel decoding, image generation quality remains high even when the diffusion steps are reduced from 256 to just 10.

[Figure: Image quality at different sampling steps]
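
A rough picture of why so few steps can suffice: a confidence-based parallel sampler commits many tokens per step, so the number of forward passes is decoupled from the sequence length. The sketch below is a generic masked-diffusion sampler under that assumption, not Omni-Diffusion's actual decoding loop.

```python
import torch

@torch.no_grad()
def parallel_sample(model, seq_len, vocab_size, steps=10, mask_id=0, device="cpu"):
    """Minimal confidence-based parallel sampler for a masked diffusion model.

    Starts from an all-[MASK] sequence and, at each of `steps` iterations,
    commits the highest-confidence predictions until nothing is masked.
    """
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long, device=device)

    for step in range(steps):
        logits = model(tokens)                          # (1, seq_len, vocab)
        probs = logits.softmax(dim=-1)
        confidence, prediction = probs.max(dim=-1)      # (1, seq_len) each

        still_masked = tokens == mask_id
        # Decode only enough positions to finish by the final step.
        remaining_steps = steps - step
        num_to_decode = (still_masked.sum().item() + remaining_steps - 1) // remaining_steps

        # Pick the most confident still-masked positions and commit them.
        confidence = confidence.masked_fill(~still_masked, float("-inf"))
        _, top_positions = confidence.topk(num_to_decode, dim=-1)
        tokens[0, top_positions[0]] = prediction[0, top_positions[0]]

    return tokens
```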

Real‑World Cases

In a “speech‑driven visual interaction” scenario, the model can answer complex visual questions (e.g., describing an elephant’s social behavior) based on spoken queries, and generate high‑fidelity images from both text and speech prompts.

[Figure: Speech-driven visual interaction example]
[Figure: Text-to-image and speech-to-image results]

Conclusion

Omni-Diffusion shows that autoregressive decoding is not the only path to full-modal intelligence. By leveraging the parallelism and unified modeling of masked discrete diffusion, it offers a versatile foundation for reading, visualizing, and speaking across modalities.

The code, training pipeline, and dataset details are openly available on GitHub, inviting further research into whether diffusion‑based backbones will eventually replace or complement autoregressive approaches.

Paper: https://arxiv.org/abs/2603.06577

Project page: https://omni-diffusion.github.io

GitHub repository: https://github.com/vita-mllm/omni-diffusion
