U‑ViT: How a ViT‑Based Diffusion Model Beats DiT and Redefines Image Generation

U‑ViT replaces the convolutional U‑Net backbone of diffusion models with a Vision Transformer. It treats the time step, the condition and the noisy image patches as tokens, adds long skip connections between shallow and deep layers, and optionally applies a lightweight 3×3 convolution before the output. Ablations and scaling studies back these choices. The model is comparable to U‑Net on unconditional generation and reports strong FID scores on class‑conditional (2.29 on ImageNet 256×256) and text‑to‑image (5.48 on MS‑COCO) generation.

AIWalker

U‑ViT: A ViT‑Based Diffusion Model

The recent popularity of Diffusion Transformers (DiT) has sparked interest in transformer‑based diffusion backbones. The paper "All are Worth Words: A ViT Backbone for Diffusion Models" (CVPR 2023) predates DiT and proposes U‑ViT, which replaces the convolutional U‑Net architecture with a Vision Transformer (ViT) while preserving the U‑shaped pattern of long skip connections.

U‑ViT treats every input (the time embedding, the conditioning information, and the noisy image patches) as tokens. It inserts long skip connections between shallow and deep transformer layers so that low‑level features reach the deep layers directly, and optionally adds a 3×3 convolution before the final output to suppress potential artifacts in images produced by transformers.

1.1 Using ViT for Diffusion

Diffusion models inject noise into data and learn to reverse this process. By segmenting the image into patches and embedding time and condition as additional tokens, U‑ViT formulates noise prediction as a token‑wise regression problem, leveraging the standard ViT design of stacked transformer blocks.
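The tokenization described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names (patchify, build_tokens), the random stand‑ins for the learned projection and for the time and condition embeddings, and the toy sizes are all assumptions made for the example.

```python
import numpy as np

def patchify(img, p):
    """Split an image (C, H, W) into flattened, non-overlapping p x p patches."""
    c, h, w = img.shape
    assert h % p == 0 and w % p == 0, "image size must be divisible by patch size"
    x = img.reshape(c, h // p, p, w // p, p)
    x = x.transpose(1, 3, 2, 4, 0)                 # (h/p, w/p, p, p, C)
    return x.reshape((h // p) * (w // p), p * p * c)

def build_tokens(img, time_emb, cond_emb, w_patch, p):
    """Concatenate [time token, condition tokens, patch tokens] along the sequence axis."""
    patch_tokens = patchify(img, p) @ w_patch      # linear patch embedding -> (L, D)
    return np.concatenate([time_emb[None, :], cond_emb, patch_tokens], axis=0)

rng = np.random.default_rng(0)
D, p = 16, 2                                       # hypothetical toy sizes
img = rng.normal(size=(3, 8, 8))                   # noisy image x_t
time_emb = rng.normal(size=D)                      # stand-in for an embedded time step
cond_emb = rng.normal(size=(1, D))                 # stand-in for a single class token
w_patch = rng.normal(size=(p * p * 3, D))          # stand-in for the learned projection
tokens = build_tokens(img, time_emb, cond_emb, w_patch, p)
print(tokens.shape)  # (18, 16): 1 time + 1 condition + 16 patch tokens
```

For text‑to‑image generation the single condition token would be replaced by a sequence of text embeddings; the concatenation step stays the same.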

1.2 Diffusion Model Primer

The forward process is a Markov chain that adds Gaussian noise according to a schedule βₜ. The reverse process is approximated by a neural network that predicts the added noise ϵ; the optimal mean of the reverse distribution can be expressed analytically (see the original diffusion papers).
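For reference, the quantities mentioned above are usually written as follows in the DDPM‑style formulation. The notation is the common convention and is assumed here rather than taken from this article: α_t = 1 − β_t, ᾱ_t is the running product of the α_i, and c denotes the condition.

```latex
\begin{align}
q(x_t \mid x_{t-1}) &= \mathcal{N}\!\left(x_t;\ \sqrt{\alpha_t}\,x_{t-1},\ \beta_t I\right),
  \qquad \alpha_t = 1 - \beta_t \\
q(x_t \mid x_0) &= \mathcal{N}\!\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I\right),
  \qquad \bar\alpha_t = \prod_{i=1}^{t} \alpha_i \\
\mu_t^{*}(x_t) &= \frac{1}{\sqrt{\alpha_t}}
  \left( x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\ \mathbb{E}[\epsilon \mid x_t] \right) \\
\min_{\theta}\ & \mathbb{E}_{t,\,x_0,\,c,\,\epsilon}
  \left\lVert \epsilon - \epsilon_\theta(x_t, t, c) \right\rVert_2^2
\end{align}
```

The last line is the noise‑prediction objective: the network ϵ_θ learns to estimate the conditional expectation of the noise that appears in the optimal mean.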

1.3 Concrete Design Choices

U‑ViT’s architecture (see Figure 1) includes:

Tokenization of time, condition, and image patches.

Long skip connections between early and late transformer layers.

An optional 3×3 convolution before the final image reconstruction.
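The wiring of the long skip connections can be sketched as a stack: outputs of the shallow blocks are pushed, and the deep blocks pop them in reverse order. The sketch below is an assumption‑laden toy, not the paper's code: the function u_vit_forward and its arguments are hypothetical, the "blocks" simply add 1 so the data flow is easy to trace, and the merge uses plain addition only to keep the trace readable.

```python
import numpy as np

def u_vit_forward(tokens, in_blocks, mid_block, out_blocks, merge):
    """Run tokens through a U-shaped stack with long skip connections."""
    assert len(in_blocks) == len(out_blocks)
    skips = []
    h = tokens
    for block in in_blocks:              # shallow half: remember each output
        h = block(h)
        skips.append(h)
    h = mid_block(h)
    for block in out_blocks:             # deep half: consume skips in reverse order
        h = block(merge(h, skips.pop()))
    return h

pairs = []                               # records which (main, skip) values get merged

def merge(h, s):
    pairs.append((float(h[0, 0]), float(s[0, 0])))
    return h + s                         # toy merge; stands in for a learned merge

add_one = lambda h: h + 1.0              # toy stand-in for a transformer block
out = u_vit_forward(np.zeros((4, 8)), [add_one] * 2, add_one, [add_one] * 2, merge)
print(pairs)      # [(3.0, 2.0), (6.0, 1.0)]
print(out[0, 0])  # 8.0
```

The recorded pairs show that the first deep block receives the output of the last shallow block, and the last deep block receives the output of the first shallow block.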

1.4 Ablation Studies

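Long‑skip connection variants – Five ways of combining the main branch and the skip branch were tested: concatenation followed by a linear projection, direct addition, linear projection of the skip followed by addition, addition followed by a linear projection, and no skip at all. Concatenation followed by a linear projection performed best, while plain addition brought no benefit over the no‑skip baseline. One explanation is that the residual connections inside each transformer block already carry the shallow features forward additively.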
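The five merge variants are easy to compare side by side. The snippet below is a sketch under stated assumptions: the names h_m (main branch) and h_s (skip branch), the random matrices standing in for learned weights, and the omission of bias terms are all choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
L, D = 6, 8                               # hypothetical sequence length and width
h_m = rng.normal(size=(L, D))             # main-branch tokens
h_s = rng.normal(size=(L, D))             # long-skip tokens from a shallow block
W_cat = rng.normal(size=(2 * D, D))       # stand-in for a learned projection (no bias)
W = rng.normal(size=(D, D))               # stand-in for a learned projection (no bias)

variants = {
    "concat_then_linear": lambda: np.concatenate([h_m, h_s], axis=-1) @ W_cat,
    "add": lambda: h_m + h_s,
    "linear_skip_then_add": lambda: h_m + h_s @ W,
    "add_then_linear": lambda: (h_m + h_s) @ W,
    "no_skip": lambda: h_m,
}
for name, fn in variants.items():
    print(name, fn().shape)               # every variant keeps the (L, D) shape
```

Concatenation followed by a linear projection is equivalent to projecting each branch with its own matrix and summing the results, so it can weight the two branches independently, which the addition‑based variants cannot.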

AdaLN (adaptive layer normalization) – Two ways of feeding the time step into the network were compared: treating the time embedding as a token, versus injecting it after the layer norm in each transformer block through a learned scale and shift, analogous to the adaptive group normalization used in U‑Net. Treating time as a token outperformed AdaLN.
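The two options can be contrasted in code. AdaLN is commonly written as y_s · LayerNorm(h) + y_b, with the scale y_s and shift y_b obtained by linear projections of the time embedding; the sketch below assumes that form. The function names and the random stand‑ins for learned weights are hypothetical.

```python
import numpy as np

def layer_norm(h, eps=1e-6):
    """Normalize each token over its feature dimension (no learned affine)."""
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

def ada_ln(h, t_emb, w_scale, w_shift):
    """Option 2: modulate normalized tokens with a time-dependent scale and shift."""
    y_s = t_emb @ w_scale                 # (D,), broadcast over all tokens
    y_b = t_emb @ w_shift                 # (D,), broadcast over all tokens
    return y_s * layer_norm(h) + y_b

def time_as_token(h, t_emb):
    """Option 1: prepend the time embedding as one more token in the sequence."""
    return np.concatenate([t_emb[None, :], h], axis=0)

rng = np.random.default_rng(0)
L, D = 5, 8                               # hypothetical toy sizes
h = rng.normal(size=(L, D))
t_emb = rng.normal(size=D)
w_scale, w_shift = rng.normal(size=(D, D)), rng.normal(size=(D, D))
print(ada_ln(h, t_emb, w_scale, w_shift).shape)  # (5, 8): sequence length unchanged
print(time_as_token(h, t_emb).shape)             # (6, 8): one extra token
```

With the token approach, attention decides how each patch uses the time information; with AdaLN, the same scale and shift are applied to every token.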

Extra convolution after the transformer – Adding a 3×3 convolution after the linear projection that maps token embeddings back to image patches gave a slight edge over inserting it before that projection or omitting it entirely.
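The best‑performing placement (project tokens to patches, reassemble the image, then convolve) might look like the sketch below. It is illustrative only: the helper names, the random weights, and the plain NumPy convolution are assumptions, and it expects that the time and condition tokens have already been dropped so that only patch tokens remain.

```python
import numpy as np

def unpatchify(patches, p, c, h, w):
    """Reassemble (L, p*p*c) flattened patches into an image of shape (c, h, w)."""
    x = patches.reshape(h // p, w // p, p, p, c)
    x = x.transpose(4, 0, 2, 1, 3)                 # (c, h/p, p, w/p, p)
    return x.reshape(c, h, w)

def conv3x3(img, kernel):
    """Zero-padded 3x3 convolution; img is (C_in, H, W), kernel is (C_out, C_in, 3, 3)."""
    _, h, w = img.shape
    padded = np.pad(img, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((kernel.shape[0], h, w))
    for dy in range(3):
        for dx in range(3):
            window = padded[:, dy:dy + h, dx:dx + w]
            out += np.einsum("oc,chw->ohw", kernel[:, :, dy, dx], window)
    return out

def output_head(patch_tokens, w_out, kernel, p, c, h, w):
    """Linear projection to pixel patches, then unpatchify, then the 3x3 convolution."""
    patches = patch_tokens @ w_out                 # (L, D) -> (L, p*p*c)
    return conv3x3(unpatchify(patches, p, c, h, w), kernel)

rng = np.random.default_rng(0)
D, p, c, size = 16, 2, 3, 8                        # hypothetical toy sizes
patch_tokens = rng.normal(size=((size // p) ** 2, D))
w_out = rng.normal(size=(D, p * p * c))            # stand-in for the learned projection
kernel = rng.normal(size=(c, c, 3, 3))             # stand-in for the learned kernel
print(output_head(patch_tokens, w_out, kernel, p, c, size, size).shape)  # (3, 8, 8)
```

Because the convolution runs on the reassembled image, it can smooth across patch borders, which the per‑patch linear projection alone cannot do.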

Patch embedding variants – A simple linear projection of patches to token embeddings beat a stacked 3×3‑conv + 1×1‑conv alternative.

Position‑encoding variants – The learnable 1‑D position embedding used in the original ViT outperformed a 2‑D sinusoidal scheme. With no position embedding at all, the model failed to generate meaningful images.
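To make the two schemes concrete, the sketch below builds a fixed 2‑D sinusoidal table next to a 1‑D table that would be learned during training. The construction of the 2‑D table (half of the channels for the row index, half for the column index) is one common recipe and is an assumption here, as are the function names and sizes; only patch tokens are covered.

```python
import numpy as np

def sincos_1d(positions, dim):
    """Sinusoidal encoding of integer positions into `dim` channels (dim must be even)."""
    freqs = 1.0 / (10000 ** (np.arange(dim // 2) / (dim // 2)))
    angles = positions[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)

def sincos_2d(grid_h, grid_w, dim):
    """2-D variant: half of the channels encode the row, half encode the column."""
    rows, cols = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    return np.concatenate(
        [sincos_1d(rows.ravel(), dim // 2), sincos_1d(cols.ravel(), dim // 2)], axis=1
    )

rng = np.random.default_rng(0)
L, D = 16, 8                                          # 4 x 4 patch grid, toy width
learnable_1d = rng.normal(scale=0.02, size=(L, D))    # would be a trained parameter
fixed_2d = sincos_2d(4, 4, D)                         # fixed, never trained
tokens = rng.normal(size=(L, D))
print((tokens + learnable_1d).shape, (tokens + fixed_2d).shape)  # (16, 8) (16, 8)
```

Either table is simply added to the token embeddings; the 1‑D table has no built‑in notion of rows and columns and has to learn the grid layout from data.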

Scaling depth, width, and patch size – Experiments on CIFAR‑10 showed that increasing depth from 9 to 13 layers improves performance, while going deeper brings no further gain. Increasing width (hidden size) helps up to 512 dimensions, after which the gains vanish. Reducing the patch size to 2 × 2 improves FID, while a further reduction to 1 × 1 offers no additional benefit. The authors attribute the importance of small patches to the low‑level nature of diffusion's noise‑prediction task.
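The cost side of the patch‑size trade‑off follows from simple arithmetic: halving the patch size quadruples the number of tokens, and self‑attention compares every token with every other token. The snippet below works this out for a 32×32 image such as CIFAR‑10; the helper name and the inclusion of a 4 × 4 patch size are illustrative assumptions, and the extra time and condition tokens are ignored.

```python
def num_patch_tokens(image_size, patch_size):
    """Number of patch tokens for a square image split into square patches."""
    assert image_size % patch_size == 0
    return (image_size // patch_size) ** 2

for p in (4, 2, 1):
    n = num_patch_tokens(32, p)
    print(f"patch {p}x{p}: {n} tokens, {n * n} attention pairs")
# patch 4x4: 64 tokens, 4096 attention pairs
# patch 2x2: 256 tokens, 65536 attention pairs
# patch 1x1: 1024 tokens, 1048576 attention pairs
```

Moving from 2 × 2 to 1 × 1 patches therefore multiplies the attention cost by 16 without any reported FID benefit.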

1.5 Training Details

U‑ViT was trained with AdamW. Batch sizes and iterations were:

CIFAR‑10 & CelebA 64×64: bs = 128, 500 K iterations.

ImageNet 64×64 & 256×256: bs = 1024, 300 K iterations.

ImageNet 512×512: bs = 1024, 500 K iterations.

MS‑COCO 256×256: bs = 256, 1 M iterations.
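To compare these budgets on one scale, the batch sizes and iteration counts listed above can be multiplied into a total number of training samples processed. The dictionary below only restates the figures from this section; the variable names are arbitrary.

```python
# (batch size, iterations) per dataset, as listed in this section
schedule = {
    "CIFAR-10 / CelebA 64x64": (128, 500_000),
    "ImageNet 64x64 / 256x256": (1024, 300_000),
    "ImageNet 512x512": (1024, 500_000),
    "MS-COCO 256x256": (256, 1_000_000),
}

samples_seen = {name: bs * iters for name, (bs, iters) in schedule.items()}
for name, n in samples_seen.items():
    print(f"{name}: {n / 1e6:.1f}M samples")
# CIFAR-10 / CelebA 64x64: 64.0M samples
# ImageNet 64x64 / 256x256: 307.2M samples
# ImageNet 512x512: 512.0M samples
# MS-COCO 256x256: 256.0M samples
```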

1.6 Experimental Results

Unconditional generation – On CIFAR‑10 and CelebA 64×64, U‑ViT matches U‑Net and surpasses GenViT.

Class‑conditional generation – On ImageNet 64×64, a 131 M‑parameter U‑ViT‑M achieves FID 5.85 (better than the 100 M‑parameter U‑Net‑IDDPM’s 6.92). Scaling to 287 M parameters (U‑ViT‑L) reduces FID to 4.26. In the latent‑space setting, U‑ViT attains FID 2.29 on ImageNet 256×256, outperforming prior diffusion models.

Text‑to‑image generation – Following Stable Diffusion, a CLIP text encoder turns each prompt into a sequence of embeddings that are fed to the model as tokens. On MS‑COCO 256×256, without any large external dataset, U‑ViT‑S reaches FID 5.95 and the deeper U‑ViT‑S (Deep) improves this to 5.48. Qualitative comparisons (Figure 11) show U‑ViT producing more accurate objects and better text‑image alignment than a U‑Net baseline.
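All results above are reported as FID, the Fréchet distance between two Gaussians fitted to feature statistics of real and generated images (lower is better). The sketch below computes that distance from given means and covariances. It is a minimal illustration: the function name is hypothetical, the toy statistics are made up, and in practice the features come from a pretrained Inception network, which is not reproduced here.

```python
import numpy as np

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma1 sigma2)^(1/2))."""
    diff = mu1 - mu2
    # The trace of the matrix square root equals the sum of square roots
    # of the eigenvalues of sigma1 @ sigma2.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * trace_sqrt)

d = 4                                          # toy feature dimension
eye = np.eye(d)
same = frechet_distance(np.zeros(d), eye, np.zeros(d), eye)
shifted = frechet_distance(np.zeros(d), eye, np.ones(d), eye)
print(same)     # approximately 0 for identical statistics
print(shifted)  # approximately 4: only the mean differs, by a vector of ones
```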

1.7 Conclusions

The study demonstrates that a pure ViT backbone can serve as a viable and often superior alternative to the traditional convolutional U‑Net for diffusion‑based image synthesis. Long skip connections are crucial, small patch sizes are beneficial for low‑level noise prediction, and many architectural tweaks (AdaLN, extra convolutions, position encodings) have measurable impact.

Figure: U‑ViT architecture diagram
Figure: Diffusion forward process
Figure: Optimal reverse mean formula
Figure: Conditioned noise prediction
Figure: Long skip ablation results
Figure: Extra convolution ablation
Figure: Patch embedding ablation
Figure: Position encoding ablation
Figure: Depth, width, and patch size impact
Figure: FID results on CIFAR‑10, CelebA, ImageNet
Figure: Different U‑ViT configurations
Figure: MS‑COCO FID comparison
Figure: Qualitative text‑to‑image samples