May 11, 2025 · Artificial Intelligence

Unified Multimodal Understanding and Generation: A 30K‑Word Survey of Recent Advances

This comprehensive survey reviews the rapid progress of multimodal understanding and text‑to‑image generation models, categorises existing unified architectures into diffusion‑based, autoregressive, and hybrid paradigms, analyses their tokenisation strategies, datasets and benchmarks, and highlights current challenges and future research directions.

Autoregressive Modelsbenchmarksdatasets

0 likes · 64 min read

Unified Multimodal Understanding and Generation: A 30K‑Word Survey of Recent Advances