Can Masked Diffusion Replace Autoregressive Models? Inside Omni-Diffusion
Omni-Diffusion introduces a masked discrete diffusion backbone for any‑to‑any multimodal tasks. By replacing sequential autoregressive decoding with parallel token decoding, it reaches competitive performance on speech, vision, and image generation while offering significant inference speedups.
Background and Motivation
Autoregressive (AR) architectures dominate multimodal large language models (MLLMs) such as GPT‑4o and LLaVA, but their sequential token prediction limits parallelism and global context awareness. Masked Discrete Diffusion Models (MDMs) provide parallel decoding and flexible guidance, prompting researchers to explore a unified any‑to‑any model.
Method Overview
Omni-Diffusion converts all modalities into discrete tokens and learns their joint distribution via a masked diffusion process.
1. Modality Tokenization
Image: encoded by MAGVIT‑v2 into discrete tokens drawn from a codebook of 8,192 entries.
Speech: a front end extracts semantics with SenseVoiceSmall; the back end tokenizes speech with GLM‑4‑Voice’s tokenizer, whose codebook has 16,384 entries.
Text: uses the vocabulary of the underlying language model.
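Once every modality is discretized, a multimodal input is just one flat token sequence with modality segments set off by special boundary tokens. A minimal sketch of that assembly, where the boundary-token names and ids are illustrative placeholders rather than the paper's actual vocabulary:

```python
# Hypothetical special-token ids; the real model defines its own.
BOI, EOI = 200000, 200001        # begin/end of image tokens
BOA, EOA = 200002, 200003        # begin/end of audio (speech) tokens

def build_sequence(text_ids, image_ids=(), speech_ids=()):
    """Concatenate per-modality token ids into one discrete sequence."""
    seq = list(text_ids)
    if image_ids:
        seq += [BOI, *image_ids, EOI]   # wrap image tokens in boundaries
    if speech_ids:
        seq += [BOA, *speech_ids, EOA]  # wrap speech tokens in boundaries
    return seq

seq = build_sequence([1, 2, 3], image_ids=[10, 11], speech_ids=[20])
```

The diffusion backbone then operates on `seq` uniformly, with no per-modality branches.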
2. Masked Discrete Diffusion Architecture
The core model is the 7‑billion‑parameter Dream‑7B diffusion backbone. Instead of predicting the next token, it predicts masked tokens, learning the joint distribution of multimodal token sequences.
During training, a random subset of tokens is masked and the model reconstructs the original sequence. The loss is the cross‑entropy between the predicted and original tokens at the masked positions, enabling unified understanding (multimodal input → predict text) and generation (text input → predict image/speech tokens).
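In PyTorch-style code, this objective is a cross-entropy restricted to the masked positions. A minimal sketch of the loss (not the authors' implementation; shapes and values are toy examples):

```python
import torch
import torch.nn.functional as F

def masked_diffusion_loss(logits, tokens, mask):
    """Cross-entropy computed only where tokens were masked.

    logits: (batch, seq, vocab) predictions for every position
    tokens: (batch, seq) original token ids
    mask:   (batch, seq) bool, True at masked positions
    """
    # Boolean indexing flattens to (n_masked, vocab) and (n_masked,).
    return F.cross_entropy(logits[mask], tokens[mask])

# Toy shapes: batch 2, sequence 5, vocabulary 10.
torch.manual_seed(0)
logits = torch.randn(2, 5, 10)
tokens = torch.randint(0, 10, (2, 5))
mask = torch.tensor([[1, 0, 1, 0, 1],
                     [0, 1, 0, 1, 0]], dtype=torch.bool)
loss = masked_diffusion_loss(logits, tokens, mask)
```

Because unmasked positions are excluded, the model is only trained to fill in what it cannot see, which is what lets one network serve both understanding and generation directions.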
3. Three‑Stage Progressive Training
Stage 1 – Visual‑Language Pre‑alignment : train on image captioning and text‑to‑image to align visual and textual spaces.
Stage 2 – Multimodal Joint Alignment : add speech data (ASR/TTS) to connect vision, language, and audio.
Stage 3 – Capability Enhancement : fine‑tune on the self‑built SDVI dataset to strengthen “speech‑driven visual interaction”.
The authors also propose an Attenuated Tail‑Pad Masking strategy that scales down the mask probability for padding tokens, preventing the model from over‑focusing on padding during variable‑length generation.
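A sketch of that masking strategy, assuming a per-position Bernoulli schedule; the base probability and attenuation factor below are illustrative, not the paper's values:

```python
import torch

def attenuated_tail_pad_mask(seq_len, content_len, base_p=0.6, pad_scale=0.2):
    """Sample a training mask whose probability is scaled down on tail padding.

    content_len real tokens sit at the front; the remaining tail is padding
    and is masked with probability base_p * pad_scale instead of base_p.
    """
    p = torch.full((seq_len,), base_p)
    p[content_len:] *= pad_scale           # attenuate the padding tail
    return torch.bernoulli(p).bool(), p

mask, p = attenuated_tail_pad_mask(seq_len=16, content_len=10)
```

Padding positions thus contribute far fewer masked targets, so the reconstruction loss stays focused on real content during variable-length generation.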
4. Inference Optimizations
Position Penalty : suppress probabilities of tokens near the sequence tail early in decoding to avoid mirrored repetitions in images.
Special Token Pre‑filling : insert a [begin‑of‑speech] marker at 25 % of the sequence, guiding the model to follow text semantics when generating speech.
Adaptive Token Length Allocation : dynamically adjust mask lengths for ASR/TTS based on the strong correlation between speech duration and text length, improving both quality and speed.
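The third optimization can be sketched as a simple length heuristic: choose how many masked speech slots to decode from the length of the paired text. The ratio and clamping bounds below are illustrative placeholders, not the paper's calibration:

```python
def speech_token_budget(text, tokens_per_char=2.0, min_len=8, max_len=512):
    """Heuristic TTS mask-length allocation from text length.

    Assumes the speech-token count grows roughly linearly with the number
    of characters, and clamps to a sane range so decoding never degenerates.
    """
    n = round(len(text) * tokens_per_char)
    return max(min_len, min(max_len, n))

budget = speech_token_budget("Hello, world")   # 12 characters -> 24 tokens
```

Allocating close to the right number of slots up front avoids both truncated speech and long runs of wasted padding decode steps.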
Experiments and Results
Speech Tasks
Omni‑Diffusion achieves lower word error rates on LibriSpeech than the Any‑to‑Any baseline AnyGPT, demonstrating strong speech recognition and synthesis capabilities.
Vision Understanding and Generation
On Visual Question Answering benchmarks, Omni‑Diffusion matches or exceeds specialized visual LLMs and outperforms AnyGPT. In text‑to‑image generation, it attains competitive CLIP scores.
Sampling Efficiency
Thanks to parallel decoding, image generation quality remains high even when the diffusion steps are reduced from 256 to just 10.
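Few-step decoding of this kind typically relies on confidence-ranked parallel unmasking: each step commits the positions the model is most sure about. A generic sketch of that pattern (a common MDM sampler; the paper's exact schedule may differ):

```python
import torch

def parallel_decode(predict_fn, seq_len, steps=10, mask_id=-1):
    """Iteratively unmask the most confident positions in parallel.

    predict_fn maps a (seq_len,) token tensor (mask_id at unknown slots)
    to (seq_len, vocab) logits; each step commits ~seq_len // steps tokens.
    """
    tokens = torch.full((seq_len,), mask_id)
    masked = torch.ones(seq_len, dtype=torch.bool)
    per_step = max(1, seq_len // steps)
    for _ in range(steps):
        if not masked.any():
            break
        conf, pred = predict_fn(tokens).softmax(-1).max(-1)
        conf[~masked] = -1.0                       # ignore decided slots
        idx = conf.topk(min(per_step, int(masked.sum()))).indices
        tokens[idx] = pred[idx]                    # commit confident tokens
        masked[idx] = False
    return tokens

# Dummy predictor that is always certain the answer is token 3 (vocab 7).
dummy = lambda t: torch.nn.functional.one_hot(
    torch.full((20,), 3), num_classes=7).float() * 5.0
out = parallel_decode(dummy, seq_len=20, steps=10)
```

Cutting `steps` from 256 to 10 trades refinement passes for speed; the reported results suggest image quality degrades little under that trade.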
Real‑World Cases
In a “speech‑driven visual interaction” scenario, the model can answer complex visual questions (e.g., describing an elephant’s social behavior) based on spoken queries, and generate high‑fidelity images from both text and speech prompts.
Conclusion
Omni‑Diffusion proves that autoregressive decoding is not the only path to full‑modal intelligence. By leveraging the parallelism and unified modeling of masked discrete diffusion, it offers a versatile foundation for reading, visualizing, and speaking across modalities.
The code, training pipeline, and dataset details are openly available on GitHub, inviting further research into whether diffusion‑based backbones will eventually replace or complement autoregressive approaches.
Paper: https://arxiv.org/abs/2603.06577
Project page: https://omni-diffusion.github.io
GitHub repository: https://github.com/vita-mllm/omni-diffusion