Transfusion: A Single Model for Unified Image Generation and Understanding

Transfusion is a 7B‑parameter transformer that jointly trains language modeling and diffusion losses on mixed text‑image data, enabling seamless text generation, image generation, and image understanding within one model and outperforming prior multimodal approaches such as Chameleon across multiple benchmarks.

AIWalker

Feb 20, 2025

Overview

Transfusion is a from‑scratch transformer that jointly models discrete text tokens and continuous image vectors. It is trained on a mixed multimodal dataset with two objectives: next‑token language modeling and diffusion‑based image generation. The largest model has 7 B parameters and is trained on 2 T multimodal tokens (≈1 T text tokens and ≈1 T image tokens).

Model architecture

The core is a standard transformer that processes a sequence containing both tokenized text and image patches. Text tokens are embedded via a token‑embedding matrix and processed with causal self‑attention. Image patches are obtained by patchifying VAE‑encoded latent vectors (latent dimension 8), and the patches of one image attend to each other bidirectionally. Two patch encoders are explored: a simple linear layer and a U‑Net down/up block; the U‑Net provides an inductive bias that improves performance across model sizes. The attention mask is causal over the sequence as a whole, so every position sees all earlier text and images, while every image patch can additionally attend to all other patches in the same image.
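
The masking rule can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: the function name transfusion_mask and the image_spans argument (half-open index ranges marking where each image's patches sit in the sequence) are assumptions made for the example.

```python
import numpy as np

def transfusion_mask(seq_len, image_spans):
    """Return a boolean mask where mask[i, j] is True if position i may attend to j.

    image_spans: list of (start, end) half-open index ranges, one per image
    (a hypothetical representation chosen for this sketch).
    """
    # Start from a causal (lower-triangular) mask over the whole sequence.
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    # Inside each image, let every patch attend to every other patch.
    for start, end in image_spans:
        mask[start:end, start:end] = True
    return mask

# Example: 2 text tokens, a 3-patch image, then 1 text token.
mask = transfusion_mask(6, [(2, 5)])
print(mask.astype(int))
```

In the printed matrix the 3×3 block for the image is fully filled in, while text rows keep the usual lower-triangular pattern, so text before the image never sees it and text after the image sees all of it.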

Training objectives

Each training step receives a mixed sequence and computes both losses. The language‑modeling loss is the standard per‑token cross‑entropy -log P(y_i|y_{<i}), computed on text tokens. The diffusion loss follows the DDPM formulation: Gaussian noise is added to the image patches according to a predefined schedule, and the model is trained to predict the noise that was added. A balancing coefficient λ combines the two losses as L = L_LM + λ · L_DDPM, so a single set of parameters is optimized for both modalities.
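
A small NumPy sketch of the two objectives and how they might be combined. It assumes the common noise‑prediction (epsilon) parameterization of DDPM; the function names, the alpha_bar_t argument (the cumulative signal fraction at timestep t), and the value used for λ in the example are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def lm_loss(logits, targets):
    """Mean cross-entropy -log P(y_i | y_<i) over text positions."""
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def add_noise(x0, alpha_bar_t, rng):
    """DDPM forward process: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps
    return x_t, eps

def diffusion_loss(eps_pred, eps):
    """Mean squared error between the predicted and the true noise."""
    return np.mean((eps_pred - eps) ** 2)

def combined_loss(l_lm, l_ddpm, lam):
    """Weighted sum of the two objectives: L = L_LM + lam * L_DDPM."""
    return l_lm + lam * l_ddpm

# Toy usage with random stand-ins for the model outputs.
rng = np.random.default_rng(0)
logits = rng.standard_normal((5, 10))           # 5 text positions, vocabulary of 10
targets = np.array([1, 4, 2, 9, 0])
x0 = rng.standard_normal((4, 8))                # 4 image patches of dimension 8
x_t, eps = add_noise(x0, alpha_bar_t=0.5, rng=rng)
eps_pred = np.zeros_like(eps)                   # placeholder for the network's prediction
print(combined_loss(lm_loss(logits, targets), diffusion_loss(eps_pred, eps), lam=0.5))
```

In a real training step the text loss would be computed only on text positions and the diffusion loss only on image patches of the same sequence, with both gradients flowing into the shared transformer.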

Inference procedure

Decoding alternates between language‑modeling (LM) and diffusion modes. In LM mode, tokens are generated greedily or with temperature/top‑p sampling until a special BOI (beginning‑of‑image) token signals the start of an image. The model then switches to diffusion mode: it appends a block of pure‑noise patches to the sequence and iteratively denoises them for a fixed number of steps (e.g., 250 steps for a model trained with 1 000 timesteps). Classifier‑free guidance (CFG) can be applied, at the cost of roughly doubling computation, because each step then needs both a conditional and an unconditional forward pass. Once diffusion finishes, an EOI (end‑of‑image) token is appended and the model returns to LM mode, allowing arbitrary mixed text‑image outputs.
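
The mode switching can be sketched as a simple loop. Everything model‑related here is a stand‑in: next_token and denoise_step are hypothetical callables that would wrap the transformer, the token strings are placeholders, and the toy functions at the bottom exist only so the sketch runs end to end.

```python
import numpy as np

BOI, EOI, EOS = "<BOI>", "<EOI>", "<EOS>"  # placeholder strings for special tokens

def generate(next_token, denoise_step, n_patches, latent_dim,
             num_steps=250, max_len=32, seed=0):
    """Alternate between LM mode and diffusion mode.

    next_token(seq) -> str and denoise_step(seq, x, step) -> array are
    hypothetical callables standing in for the transformer.
    """
    rng = np.random.default_rng(seed)
    seq = []
    while len(seq) < max_len:
        tok = next_token(seq)              # LM mode: one token at a time
        seq.append(tok)
        if tok == BOI:
            # Diffusion mode: start from pure noise and refine it step by step.
            x = rng.standard_normal((n_patches, latent_dim))
            for step in range(num_steps):
                x = denoise_step(seq, x, step)
            seq.append(x)                  # the finished image latent
            seq.append(EOI)                # appended marker; back to LM mode
        elif tok == EOS:
            break
    return seq

# Toy stand-ins: emit one word, request an image, then stop.
def toy_next_token(seq):
    return {0: "a", 1: BOI}.get(len(seq), EOS)

def toy_denoise(seq, x, step):
    return 0.9 * x                         # not a real denoiser, just shrinks the noise

out = generate(toy_next_token, toy_denoise, n_patches=4, latent_dim=8, num_steps=3)
print([t if isinstance(t, str) else t.shape for t in out])
```

Because the image latent is appended to the same sequence, any text generated afterwards can condition on it, which is what makes interleaved outputs possible.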

Experimental setup

Training follows the Chameleon protocol, using identical data, compute budget, and overall architecture except for image handling. Text data are tokenized with the Llama 2 tokenizer and comprise a 2 T token corpus spanning multiple domains. Image data consist of 380 M licensed Shutterstock images resized to 256×256 and encoded by an 86 M‑parameter VAE (latent dimension 8). Five model sizes are trained to study scaling: 0.16 B, 0.37 B, 0.76 B, 1.4 B, and 7 B parameters.

Evaluation

Benchmarks include:

Text‑to‑text perplexity on Wikipedia and C4.

Text‑to‑image quality measured by MS‑COCO FID.

Image captioning quality measured by CIDEr on MS‑COCO.

The GenEval multimodal benchmark.

Baselines are Chameleon (which discretizes images) and Llama 2 for pure‑text tasks.

Results

Across all model sizes, Transfusion consistently outperforms Chameleon on the log‑FLOPs scaling curves. Notably, Transfusion reaches FID parity with Chameleon using about 34× less compute. On pure‑text benchmarks Transfusion matches or exceeds Chameleon, which suggests that competition from discrete image tokens harms Chameleon's text performance. A 7 B Transfusion model trained on 2 T tokens matches the image quality of state‑of‑the‑art diffusion models such as DALL‑E 2 and SDXL, while achieving text generation quality comparable to Llama 1.

Ablation studies

Patch size. Larger patches reduce inference cost but degrade performance, especially for text. Figure 10 shows the trade‑off for a 0.76 B model.

Encoder/decoder architecture. U‑Net encoders/decoders outperform linear layers even after accounting for the extra parameters (≈3.8 % of total). Figure 11 illustrates that U‑Net variants of smaller transformers can surpass a 7 B linear‑patch model on FID and CIDEr.

Noise schedule. Limiting diffusion noise to the first half of timesteps (t ≤ 500) improves image‑captioning CIDEr scores with less than 1 % impact on other metrics (Figure 12).

Comparison with Chameleon

Scaling curves (Figure 7) show that Transfusion scales better than Chameleon. The parity FLOP ratio (Figure 8), the ratio of the FLOPs the two models need to reach the same performance, indicates that Transfusion needs roughly one‑third of Chameleon's FLOPs or less on image benchmarks, with the largest gap on FID. Text‑only benchmarks (Figure 9) show that quantizing images into discrete tokens reduces Chameleon's text performance, likely because the two token types compete for model capacity.

Training details

Image latents are produced by a VAE with a CNN encoder/decoder, trained for 1 M steps. Each 256×256 image is encoded into a 32×32×8 latent tensor, so each 8‑dimensional latent position corresponds to an 8×8 pixel patch. For the VQ‑VAE variant used by Chameleon, the KL regularization term is replaced by a codebook commitment loss, with a codebook of 16 384 entries.
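
To make the shapes concrete, the sketch below splits a 32×32×8 latent into non‑overlapping k×k patches and reports how many sequence positions one image occupies. The patchify helper and the patch sizes in the loop are illustrative choices for this example, not the paper's implementation.

```python
import numpy as np

def patchify(latent, patch):
    """Split an (H, W, C) latent into non-overlapping patch x patch blocks.

    Returns an array of shape (num_patches, patch * patch * C).
    """
    h, w, c = latent.shape
    assert h % patch == 0 and w % patch == 0
    x = latent.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)           # bring the two patch axes together
    return x.reshape(-1, patch * patch * c)

latent = np.zeros((32, 32, 8))               # VAE output for one 256x256 image
for p in (1, 2, 4, 8):
    print(p, patchify(latent, p).shape)
# 1 (1024, 8)
# 2 (256, 32)
# 4 (64, 128)
# 8 (16, 512)
```

Doubling the patch side cuts the number of positions per image by four, which is why larger patches make training and inference cheaper while giving the model a coarser view of each image.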

Inference hyper‑parameters

Text generation uses greedy decoding; image generation runs 250 diffusion steps (the model is trained on 1 000 timesteps). A CFG weight of 5 is used in the controlled comparisons with Chameleon, while a weight of 3 is used in the ablation experiments.
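
As a rough illustration of these two settings, the sketch below shows one common convention for combining conditional and unconditional noise predictions under CFG, and one simple way to pick 250 of 1 000 training timesteps. Both helpers are assumptions made for this example; the paper's exact sampler and guidance formula may differ.

```python
import numpy as np

def cfg_combine(eps_cond, eps_uncond, weight):
    """Extrapolate from the unconditional prediction toward the conditional one.

    weight = 1 recovers the conditional prediction; larger values push further.
    """
    return eps_uncond + weight * (eps_cond - eps_uncond)

def strided_timesteps(num_train=1000, num_infer=250):
    """Evenly spaced subset of the training timesteps, noisiest first."""
    return np.linspace(num_train - 1, 0, num_infer).round().astype(int)

eps_c = np.array([1.0, 2.0])                 # stand-in conditional prediction
eps_u = np.array([0.0, 1.0])                 # stand-in unconditional prediction
print(cfg_combine(eps_c, eps_u, weight=3.0))
print(strided_timesteps()[:5], strided_timesteps()[-1])
```

Each guided step needs both eps_c and eps_u, that is, two forward passes through the model, which is where the roughly doubled cost of CFG comes from.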

Paper reference

Paper title: Transfusion: Predict the Next Token and Diffuse Images with One Multi‑Modal Model

ArXiv link: http://arxiv.org/pdf/2408.11039

Figure 1: Transfusion framework. Discrete (text) tokens are processed autoregressively, continuous (image) vectors are processed in parallel with diffusion loss

Figure 2: Images generated by the 7B Transfusion model trained on 2T multimodal tokens

Figure 4: Transfusion attention mask allowing image patches to see each other

Figure 7: Scaling curves showing Transfusion outperforms Chameleon

Figure 8: Parity FLOP ratio between Transfusion and Chameleon

Figure 9: Text benchmark performance compared to Llama 2

Figure 11: Linear vs. U‑Net encoder/decoder performance

Figure 12: 7B Transfusion compared with other scale‑matched models
