What Are Today’s Unified Generation-and-Understanding Multimodal Model Architectures?

This article surveys current unified generation-and-understanding multimodal large-model architectures, compares LLM-centric and LLM-plus-diffusion designs, extracts common insights, details large-scale training tricks from models like Emu3, Chameleon and Janus, and outlines open research directions for visual encoders.


LLM‑centric architectures

Emu3 – supports video, image, and text tasks; uses a pure autoregressive (AR) loss.

Chameleon – handles image and text tasks with a pure AR loss.

Show‑o – uses bidirectional attention with masked-token prediction for the image part (in the style of MaskGIT) and an AR loss for text.

Janus – adopts two visual encoders (ViT for understanding, VAE for generation) and uses pure AR loss for both text and image.
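
A minimal sketch of this dual-encoder routing (PyTorch assumed; all module names and sizes below are illustrative placeholders, not the actual Janus implementation): understanding inputs pass through a ViT-style patch encoder as continuous features, generation targets arrive as discrete VQ/VAE token ids, and both feed the same autoregressive backbone.

```python
import torch
import torch.nn as nn

class DualEncoderUnifiedLM(nn.Module):
    """Janus-style routing sketch: one AR backbone, two visual paths.
    Module choices and sizes are illustrative, not the real Janus code."""

    def __init__(self, d_model=512, vocab_size=32000, img_codebook=8192, n_layers=4):
        super().__init__()
        # Understanding path: ViT-style patch encoder -> continuous features projected into the LLM.
        self.und_patchify = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.und_proj = nn.Linear(d_model, d_model)
        # Generation path: discrete image tokens from a VQ/VAE tokenizer (only the ids appear here).
        self.img_embed = nn.Embedding(img_codebook, d_model)
        self.txt_embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)  # causal mask omitted for brevity
        self.txt_head = nn.Linear(d_model, vocab_size)    # next text-token prediction
        self.img_head = nn.Linear(d_model, img_codebook)  # next image-token prediction

    def forward(self, text_ids, image=None, image_token_ids=None):
        parts = [self.txt_embed(text_ids)]
        if image is not None:              # understanding: continuous ViT-style features
            feats = self.und_patchify(image).flatten(2).transpose(1, 2)  # (B, N_patches, d_model)
            parts.append(self.und_proj(feats))
        if image_token_ids is not None:    # generation: discrete VQ codes
            parts.append(self.img_embed(image_token_ids))
        h = self.backbone(torch.cat(parts, dim=1))
        return self.txt_head(h), self.img_head(h)
```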

LLM + diffusion architectures

Transfusion – AR loss for text, DDPM loss for images (a combined-loss sketch follows this list).

JanusFlow – AR loss for text, rectified-flow (RF) loss for images.
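
A rough sketch of how such a combined objective can look (PyTorch; the function name and the image-loss weight are hypothetical): cross-entropy AR loss on text tokens plus a DDPM-style noise-prediction loss on image latents, both computed from the same backbone's outputs.

```python
import torch
import torch.nn.functional as F

def transfusion_style_loss(txt_logits, txt_targets, noise_pred, noise, lambda_img=1.0):
    """Combined objective sketch: AR cross-entropy on text + DDPM noise-prediction MSE on images.
    txt_logits: (B, T, V) next-token logits; txt_targets: (B, T) token ids.
    noise_pred / noise: (B, C, H, W) predicted vs. true Gaussian noise on image latents."""
    ar_loss = F.cross_entropy(txt_logits.reshape(-1, txt_logits.size(-1)),
                              txt_targets.reshape(-1))
    ddpm_loss = F.mse_loss(noise_pred, noise)   # "simple" form of the DDPM objective
    return ar_loss + lambda_img * ddpm_loss

# Minimal usage with random tensors, just to show the shapes involved.
B, T, V = 2, 16, 1000
loss = transfusion_style_loss(
    torch.randn(B, T, V), torch.randint(0, V, (B, T)),
    torch.randn(B, 4, 32, 32), torch.randn(B, 4, 32, 32),
)
```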

Common observations

LLM‑centric designs scale more easily during training and inference, but current implementations are not yet at massive scale and their performance is modest.

When model size is comparable, adding diffusion improves visual generation compared to pure‑LLM pipelines, raising the question of whether LLMs need better adaptation for generation or simply larger scale.

Using separate visual encoders (ViT for understanding, VAE for generation) benefits both tasks, indicating that visual foundation models still split into distinct “understanding” and “generation” families.

Large‑scale multimodal training details

Chameleon is trained from scratch; the paper shares many stability tricks, such as monitoring the output norm as a stability indicator, applying QK‑Norm to mitigate softmax logit drift, and reordering the normalization layers to improve stability.
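
As an illustration of the QK‑Norm trick, the sketch below (simplified single-head attention in PyTorch, not Chameleon's actual code) applies LayerNorm to queries and keys before the dot product so the softmax logits cannot grow unbounded during long pre-training runs:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QKNormAttention(nn.Module):
    """Simplified single-head attention with QK-Norm (illustrative only).
    Normalizing queries and keys bounds the softmax logits, which helps
    prevent the logit drift that destabilizes large-scale training."""

    def __init__(self, d_model=512):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.q_norm = nn.LayerNorm(d_model)   # QK-Norm: normalize queries...
        self.k_norm = nn.LayerNorm(d_model)   # ...and keys before the dot product
        self.scale = d_model ** -0.5

    def forward(self, x):
        q = self.q_norm(self.q_proj(x))
        k = self.k_norm(self.k_proj(x))
        v = self.v_proj(x)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v
```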

Emu3 is also trained from scratch; the authors discuss pre‑training, post‑training, and DPO specifics.
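
The DPO stage, in its standard form, optimizes a preference loss between chosen and rejected generations against a frozen reference model; a minimal sketch (PyTorch, with per-sample sequence log-probabilities assumed precomputed; Emu3's exact formulation may differ) looks like:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO objective: push the policy to prefer the chosen sample
    relative to a frozen reference model. Inputs are sequence log-probabilities
    of shape (B,); beta controls the implicit KL penalty strength."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()
```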

The Janus series papers describe their training framework, duration, and the use of sequence packing to boost efficiency; however, it remains unclear whether the 1.3 B‑parameter findings hold at larger scales.
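
Sequence packing itself is straightforward: concatenate several short examples into one fixed-length training sequence so that little compute is wasted on padding. A greedy-packing sketch in plain Python (a hypothetical helper; real implementations also build block-diagonal attention masks and reset position ids at document boundaries) is shown below.

```python
def pack_sequences(examples, max_len, pad_id=0):
    """Greedy sequence packing sketch: concatenate token lists into buffers of
    at most max_len tokens, padding only the tail of each buffer."""
    packed, buffer = [], []
    for tokens in examples:
        if buffer and len(buffer) + len(tokens) > max_len:
            packed.append(buffer + [pad_id] * (max_len - len(buffer)))
            buffer = []
        buffer.extend(tokens[:max_len])   # truncate anything longer than max_len
    if buffer:
        packed.append(buffer + [pad_id] * (max_len - len(buffer)))
    return packed

# Example: three short "documents" packed into length-8 sequences.
print(pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_len=8))
```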

Future directions

How to design a unified visual encoder that supports both generation and understanding; some work, such as TiTok, makes progress but has not yet been demonstrated at massive multimodal scale.

Whether a single visual foundation model can handle both generation and understanding tasks, and what proxy tasks or loss functions are needed for such unification.

LLM‑based AR architectures provide good representation and compression but suffer from error accumulation in visual generation; diffusion models can alleviate this, suggesting that a lightweight diffusion head on top of LLM‑derived multimodal features may be a promising mid‑ to long‑term strategy.
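
To make that idea concrete, such a head can be a small MLP denoiser conditioned on the LLM's hidden state for a position; the sketch below (PyTorch; all dimensions and names are hypothetical) predicts the noise added to a continuous image token and would be trained with the usual noise-prediction MSE.

```python
import torch
import torch.nn as nn

class DiffusionHead(nn.Module):
    """Lightweight denoiser conditioned on an LLM hidden state (sketch only).
    Given a noisy continuous image token x_t, a timestep, and the LLM feature z
    for that position, it predicts the noise that was added."""

    def __init__(self, token_dim=16, llm_dim=512, hidden=256, n_steps=1000):
        super().__init__()
        self.t_embed = nn.Embedding(n_steps, hidden)
        self.net = nn.Sequential(
            nn.Linear(token_dim + llm_dim + hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, token_dim),   # predicted noise, same shape as x_t
        )

    def forward(self, x_t, t, llm_feat):
        cond = torch.cat([x_t, llm_feat, self.t_embed(t)], dim=-1)
        return self.net(cond)

# Shape check: a batch of 4 noisy tokens conditioned on 4 LLM features.
head = DiffusionHead()
eps_hat = head(torch.randn(4, 16), torch.randint(0, 1000, (4,)), torch.randn(4, 512))
```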

Tags: large language models, multimodal, diffusion, training techniques, visual encoder