Unified Multimodal Understanding and Generation: A 30K‑Word Survey of Recent Advances
This comprehensive survey reviews the rapid progress of multimodal understanding and text‑to‑image generation models, categorises existing unified architectures into diffusion‑based, autoregressive, and hybrid paradigms, analyses their tokenisation strategies, datasets and benchmarks, and highlights current challenges and future research directions.
Introduction
Large language models (LLMs) such as LLaMA, PanGu, Qwen and GPT have dramatically expanded AI capabilities, and recent extensions like LLaVA, Qwen‑VL, InternVL and GPT‑4o demonstrate powerful multimodal understanding. Parallel advances in image generation—Stable Diffusion (SD) series, FLUX and others—have produced high‑quality images. While multimodal understanding has largely followed an autoregressive (AR) paradigm and image generation a diffusion paradigm, the community is increasingly interested in unified frameworks that can both understand and generate across modalities. This article systematically surveys the state‑of‑the‑art in unified multimodal models to guide future research.
Multimodal Understanding Models
Multimodal understanding models extend LLMs to process visual inputs. Early vision‑language understanding (VLU) models such as CLIP, ViLBERT, VisualBERT and UNITER relied on dual‑encoder or fusion‑encoder designs to align image and text embeddings. Recent work favours decoder‑only architectures that freeze or lightly fine‑tune a large LLM backbone and connect visual embeddings via lightweight adapters:
MiniGPT‑4 : projects CLIP image embeddings into Vicuna token space with a single learnable layer.
BLIP‑2 : uses a query transformer (Q‑Former) to bridge a frozen visual encoder with a frozen LLM (e.g., OPT or Flan‑T5).
Flamingo : employs gated cross‑attention to connect a frozen visual encoder with a Chinchilla decoder.
Newer models such as GPT‑4V, Gemini‑Ultra, Qwen‑VL, Qwen2‑VL, LLaVA‑1.5, LLaVA‑Next, InternVL and Ovis further improve multimodal reasoning, instruction following and fine‑grained visual perception.
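The adapter recipe is simple enough to sketch. Below is a minimal, hedged example of the MiniGPT‑4‑style linear projection; the dimensions and names are illustrative assumptions, not any model's actual configuration.

```python
# A single learnable projection maps frozen vision-encoder patch embeddings
# into the LLM's token-embedding space, where they are prepended to text
# embeddings. All dimensions below are assumptions for illustration.
import torch
import torch.nn as nn

class VisualAdapter(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)  # the only trainable part

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        # patch_embeds: (batch, num_patches, vision_dim) from a frozen encoder
        return self.proj(patch_embeds)              # (batch, num_patches, llm_dim)

adapter = VisualAdapter()
visual_tokens = adapter(torch.randn(2, 256, 1024))  # stand-in CLIP features
text_embeds = torch.randn(2, 32, 4096)              # stand-in text embeddings
llm_input = torch.cat([visual_tokens, text_embeds], dim=1)
```

BLIP‑2's Q‑Former and Flamingo's gated cross‑attention replace the single linear layer with richer bridging modules, but the contract is the same: visual features in, LLM‑compatible tokens out.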
Text‑to‑Image Generation Models (Diffusion Models)
Diffusion models (DMs) define generation as a forward noising Markov chain and a learned reverse denoising process. The forward process adds Gaussian noise to data x_0 at each timestep t:

q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\,x_{t-1}, \beta_t I)

The reverse process learns a conditional distribution p_\theta(x_{t-1} | x_t), parameterised by a neural network that predicts the mean (and optionally the variance) of the denoised sample. Training minimises the variational lower bound (VLB), in practice via the simplified objective of predicting the added noise \epsilon:

L_{simple} = \mathbb{E}_{t, x_0, \epsilon}\big[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\big]

Early diffusion models used U‑Net backbones (e.g., DDPM, DDIM). Later variants split into three lines of work (a training sketch follows the list):
Pixel‑level diffusion models (e.g., GLIDE, Imagen) that operate directly in pixel space but are computationally expensive.
Latent diffusion models (LDMs) that perform diffusion in the latent space of a pretrained VAE, dramatically reducing cost while preserving quality (e.g., Stable Diffusion, SD‑XL, UPainting).
Transformer‑based diffusion (DiT) treats images as patch sequences and injects timestep embeddings as conditioning. Recent extensions such as REPA, SD 3.0, and various multimodal diffusion models incorporate large language model (LLM) text encoders to improve alignment.
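To make the two equations above concrete, here is a minimal training sketch in PyTorch. The schedule, shapes and the eps_model signature are illustrative assumptions rather than any specific paper's settings.

```python
# Forward process q(x_t | x_{t-1}) applied in closed form, plus L_simple.
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # assumed linear schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # \bar{alpha}_t = prod_s (1 - beta_s)

def q_sample(x0, t, eps):
    # Iterating the Gaussian forward kernel collapses to one step:
    # x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps
    a = alphas_bar[t].view(-1, 1, 1, 1)          # x0 assumed (batch, C, H, W)
    return a.sqrt() * x0 + (1.0 - a).sqrt() * eps

def l_simple(eps_model, x0):
    t = torch.randint(0, T, (x0.shape[0],))      # random timestep per sample
    eps = torch.randn_like(x0)                   # the noise the model must predict
    x_t = q_sample(x0, t, eps)
    return F.mse_loss(eps_model(x_t, t), eps)    # eps_model signature is assumed
```

Latent diffusion runs exactly this loop with x_0 replaced by the VAE latents of an image, which is where the cost savings come from.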
Autoregressive Models
AR models factorise the joint distribution of a token sequence into a product of conditional probabilities, predicting each token given all previous ones. For multimodal data, visual inputs are tokenised and concatenated with text tokens, enabling a single decoder to model both modalities.
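A hedged sketch of this factorisation: once both modalities are mapped to ids in one shared vocabulary, training reduces to ordinary next‑token cross‑entropy, i.e. minimising \sum_t -\log p(x_t \mid x_{<t}). The dummy decoder and vocabulary size below are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def interleaved_ar_loss(decoder, image_ids, text_ids):
    # One token stream: image tokens followed by text tokens (or interleaved).
    seq = torch.cat([image_ids, text_ids], dim=1)    # (batch, L_img + L_text)
    logits = decoder(seq[:, :-1])                    # predict token t+1 from tokens <= t
    targets = seq[:, 1:]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))

vocab = 65536                                        # shared text+image vocab (assumed)
embed, head = torch.nn.Embedding(vocab, 64), torch.nn.Linear(64, vocab)
decoder = lambda ids: head(embed(ids))               # stand-in for a causal LLM
loss = interleaved_ar_loss(decoder,
                           torch.randint(0, vocab, (2, 256)),
                           torch.randint(0, vocab, (2, 32)))
```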
Tokenisation strategies fall into four categories:
Pixel‑based encoding : images are encoded into discrete tokens by VQ‑GAN, VQ‑VAE‑2 or similar autoencoders (earlier pixel‑level AR models such as PixelRNN, PixelCNN and PixelCNN++ modelled raw pixels directly). Models such as LWM, Chameleon, ANOLE, Emu3, SynerGen‑VL and UGen adopt this approach, often interleaving image and text tokens for joint generation; a minimal quantisation sketch follows this list.
Semantic encoding : pretrained vision encoders aligned with text (e.g., CLIP, EVA‑CLIP, SigLIP, UNIT) produce dense embeddings that serve as visual tokens. Examples include Emu, Emu2, LaViT, DreamLLM, VL‑GPT, MM‑Interleaved, PUMA, Mini‑Gemini and MetaMorph.
Learnable query encoding : a set of learnable query tokens extracts task‑specific visual features from a frozen encoder (e.g., SEED, MetaQueries). These queries are processed by a causal transformer and can be paired with diffusion decoders (e.g., SEED‑LLaMA, SEED‑X).
Hybrid encoding : combines pixel‑level and semantic tokens. Pseudo‑hybrid methods (Janus, Janus‑Pro, OmniMamba, Unifluid) train separate encoders for understanding and generation but activate only one at inference. Joint‑hybrid methods (MUSE‑VL, VARGPT, ILLUME+) fuse both token streams for simultaneous use.
AR models integrate naturally with large‑scale language‑modelling techniques and infrastructure, but pixel‑based tokenisation often yields long sequences with limited semantic alignment, while semantic encoders may lose fine‑grained visual detail.
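The quantisation step behind pixel‑based encoding is worth seeing once. Below is a minimal VQ‑style lookup in the spirit of VQ‑GAN/VQ‑VAE; the codebook size and dimensions are illustrative assumptions.

```python
import torch

def vq_tokenise(latents: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    # latents:  (batch, num_positions, dim) continuous encoder outputs
    # codebook: (vocab_size, dim) learned embedding table
    # Replace each latent with the index of its nearest codebook entry.
    dists = torch.cdist(latents,
                        codebook.unsqueeze(0).expand(latents.size(0), -1, -1))
    return dists.argmin(dim=-1)                  # (batch, num_positions) token ids

codebook = torch.randn(8192, 256)                # assumed 8192-entry codebook
tokens = vq_tokenise(torch.randn(2, 1024, 256), codebook)  # 32x32 grid -> 1024 ids
```

The resulting ids drop straight into the AR loss above, which is exactly why pixel‑based models pay with long sequences: a 32×32 latent grid already costs 1,024 tokens per image.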
Fused Autoregressive & Diffusion Models
Hybrid AR+diffusion frameworks generate text tokens autoregressively and image tokens via a diffusion process, combining the reasoning power of LLMs with the high‑fidelity synthesis of diffusion. Representative models include:
Transfusion : a unified transformer with modality‑specific sub‑layers handling discrete text tokens and continuous latent image vectors.
MonoFormer : shares modules between AR and diffusion tasks using attention masks.
LMFusion : freezes the LLM and injects a lightweight visual module for image generation.
Show‑o : uses a MAGVIT‑v2 discrete pixel encoder to produce image tokens compatible with the transformer decoder.
Pixel‑based fused models typically employ SD‑VAE latent vectors (e.g., Transfusion, MonoFormer, LMFusion), while hybrid fused models concatenate semantic and pixel tokens (e.g., JanusFlow). Challenges include increased computational cost, alignment between the two modalities, and the need for efficient training pipelines.
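The recipe can be summarised in one hedged sketch: a shared transformer trunk, cross‑entropy on discrete text tokens and a diffusion MSE on continuous image latents. Module sizes, the noising stand‑in and the unweighted loss sum are illustrative assumptions; attention masking (causal for text, bidirectional within an image) is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusedTrunk(nn.Module):
    """One trunk, two heads: AR logits for text, predicted noise for images."""
    def __init__(self, vocab=32000, dim=512, latent_dim=16):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, dim)
        self.image_in = nn.Linear(latent_dim, dim)        # continuous latents in
        layer = nn.TransformerEncoderLayer(dim, 8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=2)
        self.text_head = nn.Linear(dim, vocab)            # next-token logits
        self.image_head = nn.Linear(dim, latent_dim)      # predicted noise

    def forward(self, text_ids, noisy_latents):
        h = torch.cat([self.text_embed(text_ids),
                       self.image_in(noisy_latents)], dim=1)
        h = self.trunk(h)
        n_text = text_ids.size(1)
        return self.text_head(h[:, :n_text]), self.image_head(h[:, n_text:])

model = FusedTrunk()
text_ids = torch.randint(0, 32000, (2, 16))
latents, eps = torch.randn(2, 64, 16), torch.randn(2, 64, 16)
noisy = 0.7 * latents + 0.3 * eps                         # stand-in for q_sample
logits, eps_pred = model(text_ids[:, :-1], noisy)
loss = F.cross_entropy(logits.reshape(-1, 32000), text_ids[:, 1:].reshape(-1)) \
       + F.mse_loss(eps_pred, eps)                        # unweighted sum (assumed)
```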
Unified Multimodal Model Architectures
Unified models consist of three core components (a code skeleton follows the list below):
Modality‑specific encoders that project each input (text, image, video, audio) into a shared representation space.
Modality‑fusion backbone (often a transformer) that integrates the encoded signals and performs cross‑modal reasoning.
Modality‑specific decoders that generate outputs in the desired modality (text, image, video, etc.).
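An illustrative skeleton of this three‑part contract; all names and the toy usage are assumptions, not any published model's API.

```python
class UnifiedModel:
    """Modality-specific encoders -> shared fusion backbone -> modality-specific decoders."""
    def __init__(self, encoders, backbone, decoders):
        self.encoders = encoders   # modality name -> encoder into the shared space
        self.backbone = backbone   # transformer performing cross-modal reasoning
        self.decoders = decoders   # modality name -> decoder out of the shared space

    def generate(self, inputs, target):
        tokens = [tok for m, x in inputs.items() for tok in self.encoders[m](x)]
        fused = self.backbone(tokens)
        return self.decoders[target](fused)

# Toy stand-ins: split text into "tokens", fuse by joining, decode a placeholder.
model = UnifiedModel(
    encoders={"text": str.split},
    backbone=lambda toks: " ".join(toks),
    decoders={"image": lambda h: f"<image conditioned on: {h}>"},
)
print(model.generate({"text": "a cat on a mat"}, target="image"))
```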
Three high‑level architecture families are identified:
Diffusion‑based unified models (e.g., Dual Diffusion) that run parallel diffusion chains for text and image latent spaces and exchange cross‑modal conditioning at each timestep.
Autoregressive unified models that serialise all modalities into a single token stream and predict them sequentially.
Hybrid AR+diffusion models that combine the two generation strategies as described above.
Figure 5 in the original article illustrates these families and their sub‑categories (pixel‑based, semantic, learnable‑query, hybrid encodings).
Datasets for Unified Multimodal Models
Large‑scale, high‑quality multimodal datasets are essential. They are grouped into:
Understanding datasets (e.g., RedCaps, Wukong, LAION‑5B, COYO, DataComp, ShareGPT4V, CapsFusion‑120M, GRIT, SAM) that provide image‑text pairs for tasks like captioning, VQA and retrieval.
Text‑to‑image datasets (e.g., CC‑12M, LAION‑Aesthetics, Mario‑10M, AnyWord‑3M, JourneyDB, CosmicMan‑HQ 1.0, PixelProse, Megalith, PD12M) that focus on high‑quality image synthesis.
Image‑editing datasets (e.g., InstructPix2Pix, MagicBrush, HQ‑Edit, SEED‑Data‑Edit, UltraEdit, OmniEdit, AnyEdit) containing (source image, edit instruction, target image) triples.
Interleaved image‑text datasets (e.g., MMC4, OBELICS, CoMM) that embed images within long text documents to train models on mixed sequences.
Conditional generation datasets (e.g., LAION‑Face, MultiGen‑20M, Subjects200K, SynCD) that support tasks such as identity‑preserving generation, multi‑control generation, and subject‑driven synthesis.
Several pipelines are used to create synthetic data, including image‑based annotation (BLIP‑2, Grounding‑DINO, SAM) and video‑based pipelines (SAM2) to increase diversity and reduce duplication.
Benchmarks
Evaluation of unified multimodal models spans understanding, generation, and interleaved generation:
Understanding Benchmarks
Perception : image‑text retrieval (Flickr30k, COCO Captions), VQA (VQA, VQA‑v2, TextVQA), chart understanding (ChartQA), spatial reasoning (VSR).
Reasoning : CLEVR, GQA, OK‑VQA/A‑OK‑VQA, VCR, MathVista.
Image‑Generation Benchmarks
Standard metrics : FID, CLIPScore (a worked FID sketch follows this list).
Fine‑grained evaluation : GenEval (object count, colour control, relational composition), GenAI‑Bench (human‑rated prompts), HRS‑Bench (accuracy, robustness, fairness, bias), DPG‑Bench (dense prompts).
Editing benchmarks : MagicBrush, HQ‑Edit, I2EBench, EditVal, Emu‑Edit, HumanEdit, GEdit‑Bench.
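FID, the workhorse metric above, fits a Gaussian to Inception features of real and generated images and takes the Fréchet distance between the two. A minimal sketch (feature extraction stubbed out; in practice the activations come from Inception‑v3's pooling layer):

```python
import numpy as np
from scipy import linalg

def fid(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    # feats_*: (num_images, feature_dim) Inception activations
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)              # matrix square root
    if np.iscomplexobj(covmean):                 # numerical noise can leave tiny
        covmean = covmean.real                   # imaginary parts; drop them
    return float(((mu1 - mu2) ** 2).sum() + np.trace(s1 + s2 - 2.0 * covmean))

# Same distribution -> FID close to zero, up to estimation noise.
print(fid(np.random.randn(2048, 64), np.random.randn(2048, 64)))
```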
Interleaved Generation Benchmarks
InterleavedBench : measures text quality, visual fidelity and multimodal consistency.
ISG : scene‑graph annotation and multi‑level scoring.
OpenING, OpenLEAF, MMIE : open‑domain instruction following and multimodal storytelling.
Challenges and Opportunities
Key challenges include:
Long token sequences caused by high‑dimensional visual data; efficient tokenisation and compression are needed.
Cross‑modal attention scalability; sparse or hierarchical attention may alleviate bottlenecks.
Noisy or biased training data; robust filtering, de‑biasing and synthetic data generation are crucial.
Lack of unified evaluation protocols that jointly assess understanding, generation, editing and interleaved capabilities.
Most current unified models focus on image understanding and text‑to‑image generation, with editing and any‑to‑any modality still under‑explored. Addressing architecture design, training efficiency, dataset curation and comprehensive benchmarks offers rich research opportunities.
Conclusion
This survey provides a systematic overview of unified multimodal models that integrate visual‑language understanding with image generation. It categorises existing work into diffusion‑based, autoregressive, and hybrid paradigms, details tokenisation strategies, summarises representative datasets and benchmarks, and discusses open challenges. By consolidating these insights, the article aims to serve as a valuable resource for researchers advancing unified multimodal AI.
References
[1] Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities.