Feb 23, 2025 · Artificial Intelligence

U‑ViT: How a ViT‑Based Diffusion Model Beats DiT and Redefines Image Generation

U‑ViT replaces the convolutional U‑Net backbone of diffusion models with a Vision Transformer. It treats the time step, the condition, and the noisy image patches all as tokens, adds long skip connections between shallow and deep layers, and applies a lightweight 3×3 convolution before the output. Backed by extensive ablations and scaling studies, it reports state‑of‑the‑art FID scores on unconditional, class‑conditional, and text‑to‑image generation tasks.

AdaLN · FID · Long Skip Connections

