Unified Self‑Supervised Pretraining Boosts Image Generation and Understanding
The USP framework introduces masked latent modeling within a VAE space to pretrain ViT encoders, enabling seamless weight transfer to both image classification and diffusion‑based generation tasks, dramatically accelerating training while preserving strong performance across multiple benchmarks.
Paper Overview
USP (Unified Self‑Supervised Pretraining) proposes a masked latent modeling strategy in the latent space of a variational auto‑encoder (VAE) to pretrain Vision Transformer (ViT) encoders. After pretraining, the learned weights can be directly transferred to downstream tasks such as image classification, semantic segmentation, and diffusion‑based image generation.
Motivation
While the pretraining‑finetuning paradigm has succeeded in image recognition, its application to image generation remains limited. Existing approaches either operate in pixel space with high computational cost or are incompatible with diffusion models.
Method
USP performs masked latent modeling on VAE embeddings: random latent patches are masked, the visible patches are fed to a ViT encoder, and a decoder reconstructs the masked patches under a simple MSE loss. During pretraining the VAE parameters are frozen and only the ViT encoder is updated. The resulting encoder weights serve as a universal initialization for both understanding and generation tasks (a minimal sketch follows the list below).
Unified pretraining framework: Bridges image understanding and diffusion‑based generation within a single pretraining pipeline.
Decoupled pretraining and downstream tasks: Masked latent modeling eliminates label dependence and speeds up training.
Efficiency and universality: USP accelerates training convergence for the DiT and SiT diffusion models by 11.7× and 46.6×, respectively, while maintaining strong representational power for classification and segmentation.
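To make the pipeline concrete, here is a minimal PyTorch sketch of one pretraining step, assuming an MAE-style design with learnable mask tokens in a light decoder. The latent channel count, patch size, model dimensions, and class names are illustrative assumptions rather than the paper's exact configuration, and positional embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLatentModel(nn.Module):
    """Mask VAE-latent patches, encode the visible ones with a ViT-style
    encoder, and reconstruct the masked ones under an MSE loss."""
    def __init__(self, latent_ch=4, patch=2, dim=256, depth=4, mask_ratio=0.75):
        super().__init__()
        self.patch, self.mask_ratio = patch, mask_ratio
        self.embed = nn.Conv2d(latent_ch, dim, patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), 1)
        self.head = nn.Linear(dim, latent_ch * patch * patch)

    def forward(self, z):
        # z: (B, C, H, W) latents produced by a *frozen* VAE encoder.
        p = self.patch
        target = F.unfold(z, p, stride=p).transpose(1, 2)      # (B, N, C*p*p)
        tokens = self.embed(z).flatten(2).transpose(1, 2)      # (B, N, D)
        B, N, D = tokens.shape
        keep = int(N * (1 - self.mask_ratio))
        order = torch.rand(B, N, device=z.device).argsort(1)   # random mask per sample
        vis, msk = order[:, :keep], order[:, keep:]
        visible = tokens.gather(1, vis.unsqueeze(-1).expand(-1, -1, D))
        enc = self.encoder(visible)                            # encode visible patches only
        # Re-insert learnable mask tokens at the masked positions, then decode.
        full = self.mask_token.expand(B, N, D).clone()
        full.scatter_(1, vis.unsqueeze(-1).expand(-1, -1, D), enc)
        pred = self.head(self.decoder(full))                   # (B, N, C*p*p)
        idx = msk.unsqueeze(-1).expand(-1, -1, pred.size(-1))
        return F.mse_loss(pred.gather(1, idx), target.gather(1, idx))  # masked patches only
```

In practice, z would come from a frozen VAE encoder applied to training images, so only the modules above receive gradients.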
Experiments
Image generation: USP was evaluated on the Transformer‑based diffusion models DiT and SiT on ImageNet at 256×256 resolution (FID over 50,000 samples, without classifier‑free guidance). It consistently improves generation quality across model scales, reaching FID comparable to the baselines with far fewer training steps (e.g., 400 K vs. 2.5 M steps). Results are shown in Figure 1 and Tables 2‑4.
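The transfer itself amounts to a state-dict load. Below is a hypothetical sketch of initializing a diffusion transformer from the pretrained encoder; the checkpoint layout and the "encoder." key prefix are assumptions, not the paper's released format.

```python
import torch

def transfer_encoder_weights(dit_model, ckpt_path):
    """Initialize a diffusion transformer from a USP-pretrained ViT encoder.
    Assumes the checkpoint is a plain state dict with 'encoder.'-prefixed keys."""
    state = torch.load(ckpt_path, map_location="cpu")
    # Keep only encoder weights; the reconstruction decoder/head is discarded.
    encoder_sd = {k.removeprefix("encoder."): v
                  for k, v in state.items() if k.startswith("encoder.")}
    # strict=False: diffusion-specific modules (timestep/label embedders,
    # final layer) remain randomly initialized.
    missing, unexpected = dit_model.load_state_dict(encoder_sd, strict=False)
    print(f"{len(missing)} tensors left at random init, {len(unexpected)} skipped")
```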
Image understanding: Linear probing and fine‑tuning on ImageNet‑1k show that USP outperforms MAE on linear probing and matches it after fine‑tuning. On ADE20K semantic segmentation, USP yields a 0.5 mIoU gain over MAE. See Figures 2‑3 and Tables 5‑6.
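For reference, the linear-probe protocol freezes the pretrained encoder and trains only a linear classifier on pooled features. A minimal sketch, under an assumed encoder interface that returns patch tokens:

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Frozen pretrained encoder + trainable linear classifier."""
    def __init__(self, encoder, feat_dim=768, num_classes=1000):
        super().__init__()
        self.encoder = encoder.eval()
        for param in self.encoder.parameters():
            param.requires_grad = False       # encoder stays frozen
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        with torch.no_grad():
            feats = self.encoder(x)           # assumed output: (B, N, feat_dim) tokens
        return self.head(feats.mean(dim=1))   # mean-pool tokens, then classify
```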
Ablation studies: Comprehensive ablations on VAE usage, input resolution, and mask ratio confirm each component’s contribution to the overall performance gains.
Future Work
The authors suggest extending the unsupervised pretraining approach to autoregressive generative models and further exploring its efficiency benefits.