Unified Self‑Supervised Pretraining Accelerates Image Generation and Improves Understanding

The USP framework introduces masked latent modeling within a VAE space to pre‑train ViT encoders, enabling seamless weight transfer to image classification, semantic segmentation, and diffusion‑based generation tasks, dramatically speeding up DiT and SiT models while preserving strong visual representations.

Amap Tech

Conference Context

ICCV (International Conference on Computer Vision), rated an A‑class conference by the China Computer Federation (CCF), will be held in Hawaii from October 19 to 25. The conference received 11,239 submissions this year and accepted 2,698 papers (a 24% acceptance rate). Five papers from the Amap (Gaode) technology team were accepted.

Paper Overview

Title: USP: Unified Self‑Supervised Pretraining for Image Generation and Understanding

Link: https://arxiv.org/pdf/2503.06132

Code: https://github.com/AMAP-ML/USP

Unified Self‑Supervised Pretraining (USP)

USP performs masked latent modeling (MLM) in the latent space of a variational auto‑encoder (VAE). After pre‑training, the weights of the ViT encoder can be directly transferred to downstream tasks such as image classification, semantic segmentation, and diffusion‑based image generation. The method accelerates DiT‑XL and SiT‑XL training by 11.7× and 46.6× respectively, while maintaining high performance on understanding tasks.

USP overall architecture
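As a sketch of the pretraining objective described above, the following PyTorch snippet implements masked latent modeling on precomputed VAE latents. It is an illustrative toy (tiny dimensions, a generic Transformer encoder, hypothetical class and argument names), not the released USP code; the frozen VAE encoding is assumed to happen upstream, so the model here consumes latents directly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedLatentModel(nn.Module):
    """Toy masked latent modeling: patchify VAE latents, mask most patches,
    encode only the visible ones, and regress the masked patches with MSE.
    All sizes are illustrative, not the paper's configuration."""

    def __init__(self, latent_ch=4, patch=2, dim=64, mask_ratio=0.75):
        super().__init__()
        self.patch = patch
        self.mask_ratio = mask_ratio
        # Patch embedding over the latent grid (stands in for PatchConv).
        self.patchify = nn.Conv2d(latent_ch, dim, patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, dim_feedforward=128,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        # Linear decoder predicting raw latent patch values.
        self.decoder = nn.Linear(dim, latent_ch * patch * patch)

    def forward(self, z):
        # z: VAE latent of shape (B, C, H, W), produced by a frozen VAE.
        tokens = self.patchify(z).flatten(2).transpose(1, 2)  # (B, N, dim)
        B, N, D = tokens.shape
        n_keep = int(N * (1 - self.mask_ratio))
        # Random per-sample permutation; keep the first n_keep patches.
        keep = torch.rand(B, N).argsort(dim=1)[:, :n_keep]
        idx = keep.unsqueeze(-1).expand(-1, -1, D)
        visible = torch.gather(tokens, 1, idx)
        encoded = self.encoder(visible)
        # Fill all positions with the mask token, then restore encoded ones.
        full = self.mask_token.expand(B, N, D).clone()
        full = full.scatter(1, idx, encoded)
        pred = self.decoder(full)                             # (B, N, C*p*p)
        # Ground-truth latent patches via unfold.
        target = F.unfold(z, kernel_size=self.patch,
                          stride=self.patch).transpose(1, 2)  # (B, N, C*p*p)
        # MSE only on masked positions (mask = 1 where the patch was hidden).
        mask = torch.ones(B, N).scatter(1, keep, 0.0)
        per_patch = ((pred - target) ** 2).mean(dim=-1)
        return (per_patch * mask).sum() / mask.sum()
```

In this sketch only the encoder (and the small decoder head) receives gradients, mirroring the setup where the VAE stays frozen throughout pretraining.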

Key Design Principles

Unified Pretraining Framework: USP bridges image understanding and diffusion‑based generation by pre‑training a ViT encoder on masked VAE latents, allowing weight initialization for downstream tasks.

Decoupled Pretraining and Downstream Tasks: Masked latent modeling eliminates label dependence and speeds up training.

Efficiency and Generality: USP speeds up DiT and SiT training (600K DiT steps and 150K SiT steps match the performance of 7M‑step baselines) without extra computational overhead, and combines orthogonally with other acceleration methods.

Research Background

The pretraining‑finetuning paradigm has succeeded in image recognition but remains under‑explored for image generation. Prior works such as iGPT use pixel‑space autoregressive pretraining, which is computationally expensive and incompatible with diffusion models. REPA aligns diffusion models with pretrained visual encoders (e.g., DINOv2) but suffers from high GPU cost and additional teacher‑network overhead.

Key open questions:

Is pretraining beneficial and necessary for diffusion models?

Can a single pretraining method serve both generation and understanding tasks?

Can the traditional pretraining‑finetuning pipeline be applied to generative models?

Observations

P1: Neural networks exhibit robustness to noise, allowing pretrained visual models to retain classification accuracy on noisy inputs.

P2: Diffusion models can learn discriminative features useful for downstream tasks.

P3: ViT architectures adapt well to both recognition and generation when appropriately modified.

P4: VAE latent spaces preserve rich visual information, supporting high‑quality feature learning.

USP Architecture Details

The pipeline encodes input images into the VAE latent space, partitions the latents into patches with a PatchConv layer, randomly masks a subset of patches, feeds the unmasked patches to a ViT encoder, and reconstructs the masked patches with a lightweight decoder under a simple MSE loss. During pretraining the VAE parameters are frozen; only the ViT encoder is trained. After pretraining, the encoder weights initialize downstream classifiers (via the class token) and diffusion models (DiT/SiT) with the following adaptations:

Restore trainable bias (β) and scale (γ) in AdaLN‑Zero layers to align with ViT weights.

Upsample positional encodings from the 224×224 pretraining resolution to the 256×256 generation resolution via bicubic interpolation.

Remove the class token for generative models.
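The positional-encoding adaptation in the list above amounts to resizing the ViT's learned positional-embedding grid. Assuming a 16×16 patch size, 224×224 pretraining gives a 14×14 grid and 256×256 generation needs a 16×16 grid; the following is a hedged sketch (the function name and shapes are illustrative, not taken from the paper's code):

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos, old_grid=14, new_grid=16):
    """Bicubically resize ViT positional embeddings between grid sizes.

    pos: (1, old_grid * old_grid, dim), with no class token included
    (USP removes the class token for the generative models anyway).
    """
    dim = pos.shape[-1]
    # (1, N, dim) -> (1, dim, old_grid, old_grid) for spatial interpolation.
    pos = pos.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    pos = F.interpolate(pos, size=(new_grid, new_grid),
                        mode="bicubic", align_corners=False)
    # Back to the token layout expected by the Transformer.
    return pos.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)
```

Because bicubic interpolation preserves constants and varies smoothly, nearby positions in the new grid inherit embeddings close to their old-grid neighbors.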

Experimental Results

Image Generation

USP was evaluated on two transformer‑based diffusion models, DiT and SiT, on ImageNet at 256×256 resolution (FID computed over 50K samples, without classifier‑free guidance). Across all model scales, USP consistently improved generation quality and reduced the number of training steps required. For example, DiT‑XL reached a comparable FID in 400K steps versus 7M steps for the baseline. The tables and figures below illustrate these gains.

Generation results
DiT comparison
SiT comparison

USP also demonstrated orthogonal benefits when combined with REPA or VAVAE, achieving superior FID scores with fewer training steps.

Image Understanding

On ImageNet‑1k, USP outperformed MAE in linear probing and matched MAE in fine‑tuning for classification. On ADE20K, USP improved single‑scale mIoU by 0.5% over MAE.

Classification results
Segmentation results
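For context on the linear-probing protocol behind the ImageNet‑1k comparison above, here is a generic sketch of one training step: the pretrained encoder is frozen and only a linear classifier on top of its features is optimized. This is the standard recipe, not necessarily the paper's exact setup (augmentations, feature pooling, and hyperparameters are omitted).

```python
import torch
import torch.nn as nn

def linear_probe_step(encoder, head, opt, x, y):
    """One linear-probing step: frozen encoder, trainable linear head.

    encoder: pretrained feature extractor (kept frozen)
    head:    nn.Linear mapping features to class logits
    opt:     optimizer over head.parameters() only
    """
    encoder.eval()
    with torch.no_grad():            # no gradients into the frozen backbone
        feats = encoder(x)
    logits = head(feats)
    loss = nn.functional.cross_entropy(logits, y)
    opt.zero_grad()
    loss.backward()                  # updates flow into the head only
    opt.step()
    return loss.item()
```

Because only the head is trained, linear-probe accuracy directly measures how linearly separable the frozen pretrained features are.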

Ablation Studies

Comprehensive ablations examined the impact of VAE components, input resolution, and mask ratios, confirming the importance of each design choice. Detailed results are provided in the original paper.

Future Directions

The authors suggest extending the unsupervised pretraining approach to autoregressive generative models and exploring further efficiency gains without additional GPU memory or training overhead.

Tags: Image Generation · Diffusion Models · VAE · Self-Supervised Learning · Pretraining · ViT · Image Understanding
Written by Amap Tech, the official Amap technology account showcasing Amap's technical innovations.