Transfusion: A Single Model for Unified Image Generation and Understanding

Transfusion trains a single transformer (up to 7B parameters) with both a language‑modeling loss and a diffusion loss on mixed text‑image data. The same model can generate text, generate images, and understand images, and it outperforms prior multimodal approaches such as Chameleon across multiple benchmarks.

AIWalker

Overview

Transfusion is a from‑scratch transformer that jointly models discrete text tokens and continuous image vectors. It is trained on a mixed multimodal dataset with two objectives: next‑token language modeling and diffusion‑based image generation. The largest model has 7 B parameters and is trained on 2 T multimodal tokens (≈1 T text tokens and ≈1 T image tokens).

Model architecture

The core is a standard transformer that processes a sequence containing both tokenized text and image patches. Text tokens are embedded via a token‑embedding matrix and processed with causal self‑attention. Image patches are obtained by patchifying VAE‑encoded latent vectors (latent dimension 8) and processed with unrestricted bidirectional attention. Two patch encoders are explored: a simple linear layer and a U‑Net down/up block; the U‑Net provides an inductive bias that improves performance, especially for larger models. The attention mask applies causal masking to text tokens while allowing every image patch to attend to all other patches in the same image.
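
The masking rule described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function name `transfusion_mask` and the per-position `modalities` / `image_ids` lists are hypothetical, and the sketch assumes that image patches, like text tokens, can also see everything that precedes them in the sequence.

```python
def transfusion_mask(modalities, image_ids):
    """Return mask[i][j] = True when position i may attend to position j.

    modalities: "text" or "image" for each sequence position.
    image_ids:  an integer image index for image positions, None for text.
    """
    n = len(modalities)
    # Start from a plain causal mask: every position sees itself and the past.
    mask = [[j <= i for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(n):
            same_image = (
                modalities[i] == "image"
                and modalities[j] == "image"
                and image_ids[i] == image_ids[j]
            )
            if same_image:
                mask[i][j] = True  # bidirectional attention inside one image
    return mask


# Toy sequence: two text tokens, one three-patch image, one text token.
modalities = ["text", "text", "image", "image", "image", "text"]
image_ids = [None, None, 0, 0, 0, None]
mask = transfusion_mask(modalities, image_ids)
```

In the toy sequence, the first image patch (position 2) can attend to the last patch of the same image (position 4), while the text token at position 1 still cannot look ahead into the image.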

Training objectives

Each training step receives a mixed sequence and computes both losses. The language‑modeling loss is the standard per‑token cross‑entropy, -log P(y_i | y_{<i}). The diffusion loss follows the DDPM formulation: Gaussian noise is added to image patches according to a predefined schedule, and the model is trained to predict the noise that was added. A balancing coefficient λ weights the diffusion loss when the two are summed, so a single parameter set is optimized for both modalities.
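
A toy version of the combined objective may make the pieces concrete. The sketch below assumes the common DDPM noise‑prediction parameterization and an additive combination of the form L_LM + λ · L_DDPM; the helper names, the `alpha_bar_t` value, and λ = 5.0 are illustrative choices, and the "model" is a stand‑in that always predicts zero noise.

```python
import math
import random


def lm_loss(correct_token_probs):
    """Mean cross-entropy, given the probability assigned to each correct next token."""
    return -sum(math.log(p) for p in correct_token_probs) / len(correct_token_probs)


def add_noise(x0, alpha_bar_t, rng):
    """Forward diffusion: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps, with eps ~ N(0, 1)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [
        math.sqrt(alpha_bar_t) * x + math.sqrt(1.0 - alpha_bar_t) * e
        for x, e in zip(x0, eps)
    ]
    return xt, eps


def diffusion_loss(eps_true, eps_pred):
    """Mean squared error between the added noise and the predicted noise."""
    return sum((a - b) ** 2 for a, b in zip(eps_true, eps_pred)) / len(eps_true)


def transfusion_loss(l_lm, l_ddpm, lam):
    """Combine both objectives; lam is the balancing coefficient (assumed additive form)."""
    return l_lm + lam * l_ddpm


rng = random.Random(0)
x0 = [0.5, -1.0, 0.25, 2.0]            # a made-up flattened image patch
xt, eps = add_noise(x0, alpha_bar_t=0.7, rng=rng)
eps_pred = [0.0] * len(eps)            # stand-in for the model's noise prediction
total = transfusion_loss(lm_loss([0.5, 0.25]), diffusion_loss(eps, eps_pred), lam=5.0)
```

In a real training step both terms would come from one forward pass over the mixed sequence, with the cross‑entropy computed on text positions and the noise‑prediction error on image positions.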

Inference procedure

Decoding alternates between language‑modeling (LM) and diffusion modes. In LM mode, tokens are generated greedily or with temperature/top‑p sampling until a special BOI (beginning‑of‑image) token signals the start of an image. The model then switches to diffusion mode: it initializes a latent of pure noise and iteratively denoises it for a fixed number of steps (e.g., 250 steps, although training uses 1 000 timesteps). Classifier‑free guidance (CFG) can be applied, at the cost of roughly doubling computation. Once diffusion finishes, an EOI (end‑of‑image) token is appended and the model returns to LM mode, which allows arbitrary interleaved text‑image outputs.
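
The mode switching can be sketched as a simple loop. Everything here is a hypothetical stand‑in: `ToyModel`, its `next_token` and `denoise_step` methods, and the token strings are invented for illustration, and the finished image is stored as a single sequence entry rather than as many patch positions.

```python
import random

BOI, EOI, EOS = "<BOI>", "<EOI>", "<EOS>"


class ToyModel:
    """Hypothetical stand-in; a real model would run a transformer forward pass."""

    def __init__(self, script):
        self.script = list(script)

    def next_token(self, sequence):
        # Replay a fixed script of tokens, then stop.
        return self.script.pop(0) if self.script else EOS

    def denoise_step(self, sequence, latent, t):
        # Pretend to remove a little noise at timestep t.
        return [0.9 * v for v in latent]


def generate(model, prompt, num_patches=4, steps=250, max_len=64, seed=0):
    rng = random.Random(seed)
    sequence = list(prompt)
    while len(sequence) < max_len:
        token = model.next_token(sequence)           # LM mode
        sequence.append(token)
        if token == EOS:
            break
        if token == BOI:                             # switch to diffusion mode
            latent = [rng.gauss(0.0, 1.0) for _ in range(num_patches)]
            for t in reversed(range(steps)):
                latent = model.denoise_step(sequence, latent, t)
            sequence.append(("image", latent))       # the denoised image
            sequence.append(EOI)                     # switch back to LM mode
    return sequence


out = generate(ToyModel(["a", "cat", BOI, "done"]), prompt=["draw"])
```

The loop keeps a single growing sequence, so text generated after the image can condition on it in the same way as on earlier text.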

Experimental setup

Training follows the Chameleon protocol, using identical data, compute budget, and overall architecture except for image handling. Text data are tokenized with the Llama 2 tokenizer and comprise a 2 T token corpus spanning multiple domains. Image data consist of 380 M licensed Shutterstock images resized to 256×256 and encoded by an 86 M‑parameter VAE (latent dimension 8). Five model sizes are trained to study scaling: 0.16 B, 0.37 B, 0.76 B, 1.4 B, and 7 B parameters.

Evaluation

Benchmarks include:

Text‑to‑text perplexity on Wikipedia and C4.

Text‑to‑image quality measured by MS‑COCO FID.

Image captioning quality measured by CIDEr on MS‑COCO.

The GenEval multimodal benchmark.

Baselines are Chameleon (which discretizes images) and Llama 2 for pure‑text tasks.

Results

Across all model sizes, Transfusion consistently outperforms Chameleon on log‑FLOPs scaling curves. Notably, Transfusion reaches FID parity with Chameleon using 34× less compute. On pure‑text benchmarks Transfusion matches or exceeds Chameleon, which suggests that competition from discrete image tokens harms Chameleon's text performance. A 7B Transfusion model trained on 2T tokens matches the image quality of established diffusion models such as DALL‑E 2 and SDXL, while achieving text generation quality comparable to Llama 1.

Ablation studies

Patch size. Larger patches reduce inference cost but degrade performance, especially for text. Figure 10 shows the trade‑off for a 0.76 B model.
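
The inference‑cost side of this trade‑off is simple arithmetic: fewer, larger patches mean fewer sequence positions per image. The sketch below assumes the 32×32 latent grid described under Training details; the particular patch sizes in the loop are illustrative choices, and `patches_per_image` is a hypothetical helper.

```python
def patches_per_image(latent_side, patch_side):
    """Number of sequence positions one image occupies for a square patch size."""
    if latent_side % patch_side != 0:
        raise ValueError("patch size must divide the latent grid evenly")
    per_side = latent_side // patch_side
    return per_side * per_side


LATENT_SIDE = 32  # 256x256 image encoded to a 32x32 latent grid

for patch_side in (1, 2, 4, 8):
    n = patches_per_image(LATENT_SIDE, patch_side)
    print(f"{patch_side}x{patch_side} latent patch -> {n} positions per image")
```

Going from 1×1 to 8×8 latent patches shrinks each image from 1 024 positions to 16, which is where the compute savings come from.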

Encoder/decoder architecture. U‑Net encoders/decoders outperform linear layers even after accounting for the extra parameters (≈3.8 % of total). Figure 11 illustrates that U‑Net variants of smaller transformers can surpass a 7 B linear‑patch model on FID and CIDEr.

Noise schedule. Limiting diffusion noise to the first half of timesteps (t ≤ 500) improves image‑captioning CIDEr scores with less than 1 % impact on other metrics (Figure 12).
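
One way to picture noise limiting is as a cap on the timestep drawn during training. The sketch below is an assumption‑laden illustration: it supposes timesteps are sampled uniformly and indexed from 1, and `sample_timestep` with its `max_t` argument is a hypothetical helper, not the authors' code.

```python
import random


def sample_timestep(rng, num_timesteps=1000, max_t=None):
    """Draw a diffusion timestep in [1, num_timesteps], optionally capped at max_t."""
    upper = num_timesteps if max_t is None else min(max_t, num_timesteps)
    return rng.randint(1, upper)


rng = random.Random(0)
capped = [sample_timestep(rng, max_t=500) for _ in range(1000)]
uncapped = [sample_timestep(rng) for _ in range(1000)]
```

With the cap in place the image is never noised past the halfway point of the schedule, so the model always sees a partially recognizable image.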

Comparison with Chameleon

Scaling curves (Figure 7) show Transfusion’s superior scaling law. The parity FLOP ratio (Figure 8), i.e. the fraction of Chameleon's FLOPs that Transfusion needs to reach the same performance, depends on the metric: it is roughly one‑third on some benchmarks and far smaller for FID (the 34× figure reported under Results). Text‑only benchmarks (Figure 9) reveal that quantizing image tokens in Chameleon reduces text performance, likely due to competition between token types.
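
A parity FLOP ratio of this kind can be estimated by fitting each scaling curve and asking how much compute each model needs to hit the same target. The sketch below assumes the metric is linear in log10(FLOPs); the function names are hypothetical and the data points are made up for illustration, not results from the paper.

```python
import math


def fit_log_linear(curve):
    """Least-squares fit of metric = slope * log10(flops) + intercept."""
    xs = [math.log10(flops) for flops, _ in curve]
    ys = [metric for _, metric in curve]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x


def flops_to_reach(curve, target):
    """FLOPs at which the fitted line crosses the target metric value."""
    slope, intercept = fit_log_linear(curve)
    return 10 ** ((target - intercept) / slope)


def parity_flop_ratio(curve_a, curve_b, target):
    """FLOPs model A needs, divided by FLOPs model B needs, for the same target."""
    return flops_to_reach(curve_a, target) / flops_to_reach(curve_b, target)


# Hypothetical (FLOPs, metric) points where lower is better; not data from the paper.
baseline = [(1e18, 10.0), (1e19, 5.0), (1e20, 0.0)]
candidate = [(flops / 3, metric) for flops, metric in baseline]
ratio = parity_flop_ratio(candidate, baseline, target=5.0)
```

A ratio below 1 means the first model reaches the target with less compute; in this made‑up example the candidate needs one third of the baseline's FLOPs.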

Training details

The image latent representation is produced by a VAE with a CNN encoder/decoder, trained for 1 M steps. Each 256×256 image is reduced to a 32×32×8 latent tensor, so each latent position corresponds to an 8×8 pixel patch. For the Chameleon baseline, a VQ‑VAE variant with a codebook of 16 384 entries is trained instead, with a codebook commitment loss in place of the VAE's KL regularization term.

Inference hyper‑parameters

Text generation uses greedy decoding; image generation runs 250 diffusion steps (the model is trained on 1 000 timesteps). A CFG weight of 5 is used in the controlled comparisons with Chameleon, while a weight of 3 is used in the other experiments.
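
The CFG weight enters at each denoising step, where a conditional and an unconditional noise prediction are blended. The sketch below uses one common convention, eps = eps_uncond + w · (eps_cond − eps_uncond); whether Transfusion uses exactly this form is an assumption, and both helper functions are hypothetical.

```python
def cfg_combine(eps_cond, eps_uncond, weight):
    """Blend conditional and unconditional noise predictions (one common CFG convention)."""
    return [u + weight * (c - u) for c, u in zip(eps_cond, eps_uncond)]


def forward_passes(steps, use_cfg):
    """CFG needs two predictions per step, which is why it roughly doubles compute."""
    return steps * (2 if use_cfg else 1)


guided = cfg_combine([1.0, 2.0], [0.0, 1.0], weight=3.0)
passes = forward_passes(250, use_cfg=True)
```

With a weight of 1 the blend reduces to the conditional prediction alone; larger weights push the sample further toward the conditioning text.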

Paper reference

Paper title: Transfusion: Predict the Next Token and Diffuse Images with One Multi‑Modal Model
ArXiv link: http://arxiv.org/pdf/2408.11039
Figure 1: Transfusion framework. Discrete (text) tokens are processed autoregressively, continuous (image) vectors are processed in parallel with diffusion loss
Figure 2: Images generated by the 7B Transfusion model trained on 2T multimodal tokens
Figure 3: Forward diffusion process
Figure 4: Transfusion attention mask allowing image patches to see each other
Figure 5: Evaluation methods
Figure 7: Scaling curves showing Transfusion outperforms Chameleon
Figure 8: Parity FLOP ratio between Transfusion and Chameleon
Figure 9: Text benchmark performance compared to Llama 2
Figure 10: Patch size trade‑off
Figure 11: Linear vs. U‑Net encoder/decoder performance
Figure 12: 7B Transfusion compared with other scale‑matched models