Unified Self‑Supervised Pretraining Boosts Image Generation and Understanding

The USP framework introduces masked latent modeling in the latent space of a VAE to pretrain ViT encoders. The pretrained weights transfer seamlessly to both image classification and diffusion‑based generation, dramatically accelerating diffusion training while preserving strong performance across benchmarks.


Paper Overview

USP (Unified Self‑Supervised Pretraining) proposes a masked latent modeling strategy in the latent space of a variational auto‑encoder (VAE) to pretrain Vision Transformer (ViT) encoders. After pretraining, the learned weights can be directly transferred to downstream tasks such as image classification, semantic segmentation, and diffusion‑based image generation.

Motivation

While the pretraining‑finetuning paradigm has succeeded in image recognition, its application to image generation remains limited. Existing approaches either operate in pixel space with high computational cost or are incompatible with diffusion models.

Method

USP performs masked latent modeling (MLM) on VAE embeddings: random patches are masked, the unmasked patches are fed to a ViT encoder, and a decoder reconstructs the masked patches using a simple MSE loss. During pretraining the VAE parameters are frozen, and only the ViT encoder is updated. The resulting encoder weights serve as a universal initialization for both understanding and generation tasks.
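A minimal PyTorch-style sketch of one pretraining step is shown below. The `vae`, `vit_encoder`, and `decoder` modules and their signatures are hypothetical and illustrative, not the authors' code:

```python
import torch
import torch.nn.functional as F

def usp_pretrain_step(vae, vit_encoder, decoder, images, mask_ratio=0.75):
    """One masked-latent-modeling step (illustrative sketch)."""
    # Encode images into the frozen VAE latent space; assume the result
    # is patchified into a (B, N, D) sequence of latent patches.
    with torch.no_grad():  # VAE parameters stay frozen
        latents = vae.encode(images)

    B, N, D = latents.shape
    num_masked = int(N * mask_ratio)

    # Randomly split patch indices into masked and visible sets per sample.
    perm = torch.rand(B, N, device=latents.device).argsort(dim=1)
    masked_idx, visible_idx = perm[:, :num_masked], perm[:, num_masked:]

    # Only the visible patches are fed to the ViT encoder.
    visible = torch.gather(latents, 1, visible_idx.unsqueeze(-1).expand(-1, -1, D))
    features = vit_encoder(visible, visible_idx)  # encoder also sees positions

    # The decoder predicts the masked patches; simple MSE reconstruction loss.
    pred = decoder(features, masked_idx)
    target = torch.gather(latents, 1, masked_idx.unsqueeze(-1).expand(-1, -1, D))
    return F.mse_loss(pred, target)
```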

Unified pretraining framework: Bridges image understanding and diffusion‑based generation within a single pretraining pipeline (see the weight‑transfer sketch after this list).

Decoupled pretraining and downstream tasks: Masked latent modeling eliminates label dependence and speeds up training.

Efficiency and universality: USP accelerates the training convergence of the DiT and SiT diffusion models by 11.7× and 46.6×, respectively, while maintaining strong representational power for classification and segmentation.
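As a rough illustration of how such a weight transfer can work, the sketch below copies every pretrained ViT tensor whose name and shape match the diffusion backbone; `dit` and `pretrained_vit` are hypothetical modules, not the authors' code:

```python
def init_from_usp(dit, pretrained_vit):
    """Copy matching ViT weights into a diffusion backbone (illustrative sketch)."""
    dit_state = dit.state_dict()
    # Keep only pretrained tensors whose name and shape match the target model;
    # diffusion-specific modules (e.g. timestep embedders) keep their own init.
    transferable = {
        k: v for k, v in pretrained_vit.state_dict().items()
        if k in dit_state and v.shape == dit_state[k].shape
    }
    dit_state.update(transferable)
    dit.load_state_dict(dit_state)
    return dit
```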

Experiments

Image generation: USP was evaluated on the Transformer‑based diffusion models DiT and SiT on ImageNet 256×256, with FID computed over 50,000 samples and no classifier‑free guidance (CFG). It consistently improves generation quality across model scales, achieving comparable FID scores with far fewer training steps (e.g., 400 K steps vs. 2.5 M steps for the baseline). Results are shown in Figure 1 and Tables 2‑4.
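For context, FID over a fixed sample budget is commonly computed with an off‑the‑shelf tool such as torch-fidelity; a generic sketch with placeholder paths, not the paper's exact evaluation pipeline:

```python
from torch_fidelity import calculate_metrics

# Generic FID computation between generated samples and a reference set;
# the directory paths are placeholders, not the paper's actual setup.
metrics = calculate_metrics(
    input1="samples/imagenet256_generated",  # e.g. 50,000 generated images
    input2="data/imagenet256_reference",     # reference images
    cuda=True,
    fid=True,
)
print(metrics["frechet_inception_distance"])
```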

Image understanding: Linear probing and fine‑tuning on ImageNet‑1k show that USP outperforms MAE on linear probing and matches MAE on fine‑tuning. On ADE20K segmentation, USP yields a 0.5 % mIoU gain over MAE. See Figures 2‑3 and Tables 5‑6.
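For reference, a linear probe freezes the pretrained encoder and trains only a linear classifier on its features; a minimal sketch with hypothetical module names and feature dimensions:

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Frozen pretrained encoder + trainable linear head (illustrative sketch)."""

    def __init__(self, vit_encoder, feat_dim=768, num_classes=1000):
        super().__init__()
        self.encoder = vit_encoder
        for p in self.encoder.parameters():
            p.requires_grad = False              # encoder stays fixed
        self.head = nn.Linear(feat_dim, num_classes)  # only trainable part

    def forward(self, images):
        with torch.no_grad():
            feats = self.encoder(images)         # pooled global features assumed
        return self.head(feats)
```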

Ablation studies: Comprehensive ablations on VAE usage, input resolution, and mask ratio confirm each component’s contribution to the overall performance gains.

Future Work

The authors suggest extending the self‑supervised pretraining approach to autoregressive generative models and further exploring its efficiency benefits.

[Figure: USP overall architecture]
Tags: Image Generation · Diffusion Models · Self-Supervised Learning · Pretraining · Vision Transformer
Written by Amap Tech

Official Amap technology account showcasing all of Amap's technical innovations.
