How LaVin-DiT Revolutionizes Vision Generation with ST‑VAE and Joint Diffusion Transformer

The LaVin-DiT paper introduces a large‑scale vision diffusion transformer that combines a spatiotemporal variational auto‑encoder, a joint diffusion transformer with full‑sequence joint attention, and 3D rotary position encoding to enable unified, efficient generation across diverse visual tasks such as segmentation and video prediction.


Paper Information

Title: LaVin-DiT: Large Vision Diffusion Transformer

Authors: Zhaoqing Wang, Xiaobo Xia, Runnan Chen, Dongdong Yu, Changhu Wang, Mingming Gong, Tongliang Liu

Project page: https://derrickwang005.github.io/LaVin-DiT/

Key Innovations

Spatiotemporal Variational Auto‑Encoder (ST‑VAE): encodes images and videos into a continuous latent space while preserving spatiotemporal features, achieving a 4×8×8 compression ratio and reducing computational cost.

Joint Diffusion Transformer (J‑DiT): extends the Diffusion Transformer with full‑sequence joint attention, enabling parallel denoising of conditional and noisy target sequences.

Contextual Learning: uses input‑target pairs as task context to guide the diffusion transformer toward task‑specific latent representations.

3D Rotary Position Encoding (3D RoPE): treats visual data as a continuous 3‑D sequence and provides accurate spatiotemporal position embeddings (a minimal sketch follows this list).
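
One common way to realize 3D RoPE is to split each attention head's channel dimension into three groups and rotate them by the temporal, height, and width coordinates respectively. The sketch below follows that recipe; the even three‑way split, the base frequency, and all function names are assumptions for illustration, not the paper's implementation.

```python
# Minimal 3D RoPE sketch: one third of each head's channels is rotated by the
# frame index, one third by the row index, and one third by the column index.
import torch

def rope_1d(x, pos, base=10000.0):
    """Standard 1D rotary embedding applied to x along its last dimension.

    x:   (..., seq, dim) with dim even
    pos: (seq,) integer positions along one axis
    """
    dim = x.shape[-1]
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = pos[:, None].float() * freqs[None, :]          # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                     # channel pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x, t_idx, h_idx, w_idx):
    """3D RoPE: rotate one channel group per (t, h, w) axis.

    x: (batch, heads, seq, head_dim), head_dim divisible by 6
    """
    d = x.shape[-1] // 3
    xt, xh, xw = x[..., :d], x[..., d:2 * d], x[..., 2 * d:]
    return torch.cat([rope_1d(xt, t_idx), rope_1d(xh, h_idx), rope_1d(xw, w_idx)], dim=-1)

# Usage: per-token coordinates for a 2-frame, 4x4 latent grid flattened to a sequence.
T, H, W, heads, head_dim = 2, 4, 4, 8, 48
t_idx, h_idx, w_idx = torch.meshgrid(torch.arange(T), torch.arange(H), torch.arange(W), indexing="ij")
q = torch.randn(1, heads, T * H * W, head_dim)
q_rot = rope_3d(q, t_idx.flatten(), h_idx.flatten(), w_idx.flatten())
```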

Method Overview

The framework formulates vision generation as a conditional generation problem. A query (image or video) and a set of input‑target pairs defining a task are encoded by ST‑VAE into latent codes. These codes are concatenated with a noisy version of the target latent and fed to J‑DiT, which denoises the combined sequence to produce a clean latent representation. The ST‑VAE decoder then reconstructs the pixel‑level output.
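
The sketch below restates this pipeline in code, assuming hypothetical `st_vae` and `j_dit` modules with simple encode/decode and sampling interfaces; the tensor layout, the concatenation along the sequence dimension, and the method names are illustrative rather than the paper's actual API.

```python
# Sketch of conditional generation with in-context task pairs, assuming a
# hypothetical ST-VAE (encode/decode) and J-DiT (sample) interface.
import torch

def generate(st_vae, j_dit, task_pairs, query):
    """task_pairs: list of (input, target) examples defining the task.
    query: the new image or video for which a target should be generated."""
    # Encode each input-target pair and the query into the shared latent space,
    # then concatenate them along the (assumed) sequence dimension.
    context = [torch.cat([st_vae.encode(x), st_vae.encode(y)], dim=1)
               for x, y in task_pairs]
    context = torch.cat(context + [st_vae.encode(query)], dim=1)

    # J-DiT denoises a noisy target latent conditioned on the clean context.
    noisy_target = torch.randn_like(st_vae.encode(query))
    clean_latent = j_dit.sample(noisy_target, context)

    # The ST-VAE decoder maps the clean latent back to pixel space.
    return st_vae.decode(clean_latent)
```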

Framework diagram

ST‑VAE Details

ST‑VAE uses causal 3‑D convolutions for encoding and 3‑D deconvolutions for decoding. It consists of an encoder, a decoder, and a latent regularization layer arranged in four symmetric stages with alternating 2× down‑sampling and up‑sampling. The first two stages down‑sample both spatial and temporal dimensions; the last stage down‑samples only spatially, achieving a 4×8×8 compression. A KL‑regularization term enforces a Gaussian latent distribution.
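
The block below sketches one way to build such a causal 3‑D convolution with strided down‑sampling: all temporal padding is placed on the past side so a frame's encoding never depends on future frames. Channel counts, kernel size, and the stage composition are illustrative, not the paper's configuration.

```python
# Causal 3D convolution sketch with past-only temporal padding.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel=3, stride=(1, 1, 1)):
        super().__init__()
        self.kt = kernel
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=kernel, stride=stride, padding=0)

    def forward(self, x):                      # x: (B, C, T, H, W)
        pt = self.kt - 1                       # all temporal padding on the past side
        ps = self.kt // 2                      # symmetric spatial padding
        x = F.pad(x, (ps, ps, ps, ps, pt, 0))  # pad order: (W, W, H, H, T_past, T_future)
        return self.conv(x)

# A stage that halves time and space, as in the first encoder stages ...
down_thw = CausalConv3d(64, 128, kernel=3, stride=(2, 2, 2))
# ... and a stage that halves only the spatial dimensions, as in the last one.
down_hw = CausalConv3d(128, 256, kernel=3, stride=(1, 2, 2))

x = torch.randn(1, 64, 8, 64, 64)              # (B, C, T, H, W)
print(down_hw(down_thw(x)).shape)              # torch.Size([1, 256, 4, 16, 16])
```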

Training proceeds in two phases: (1) pre‑training on image data only, and (2) joint training on images and videos. The loss combines mean‑squared error, perceptual loss, and adversarial loss. Temporal padding at the start of convolutions prevents future information leakage, and the first video frame is processed separately to preserve temporal independence.
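
A minimal sketch of such a combined objective is given below as a plain weighted sum of the reconstruction, perceptual, adversarial, and KL terms; the weights, the perceptual network, and the discriminator interface are placeholders rather than the paper's actual settings.

```python
# Sketch of the ST-VAE training loss: MSE + perceptual + adversarial + KL.
import torch

def vae_loss(x, x_rec, mu, logvar, perceptual_net, discriminator,
             w_mse=1.0, w_perc=0.1, w_adv=0.05, w_kl=1e-6):
    mse = torch.mean((x_rec - x) ** 2)                            # pixel reconstruction
    perc = perceptual_net(x_rec, x).mean()                        # perceptual distance
    adv = -discriminator(x_rec).mean()                            # generator tries to fool D
    kl = -0.5 * torch.mean(1 + logvar - mu ** 2 - logvar.exp())   # Gaussian prior (KL term)
    return w_mse * mse + w_perc * perc + w_adv * adv + w_kl * kl
```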

Joint Diffusion Transformer (J‑DiT)

J‑DiT builds on the Diffusion Transformer (DiT) with the following modifications:

Separate 2×2 patch embeddings for clean conditional latents and noisy target latents.

Adaptive RMS normalization (AdaRN) that modulates each representation independently.

Grouped‑query attention replaces standard multi‑head attention for memory‑efficient computation (sketched together with QK‑Norm after this list).

QK‑Norm is applied before the query‑key dot product to control attention entropy.

Sandwich (three‑stage) normalization follows each attention and feed‑forward block.
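
The sketch below combines two of these modifications, grouped‑query attention and QK‑Norm, in a single attention module; the head counts, the use of RMS normalization for QK‑Norm, and all names are assumptions for illustration rather than the paper's exact configuration (PyTorch ≥ 2.4 is assumed for nn.RMSNorm).

```python
# Grouped-query attention with QK-Norm: a few key/value heads shared across
# groups of query heads, with queries and keys normalized before the dot product.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GQAWithQKNorm(nn.Module):
    def __init__(self, dim, n_q_heads=16, n_kv_heads=4):
        super().__init__()
        assert n_q_heads % n_kv_heads == 0
        self.hd = dim // n_q_heads
        self.n_q, self.n_kv = n_q_heads, n_kv_heads
        self.q_proj = nn.Linear(dim, n_q_heads * self.hd)
        self.kv_proj = nn.Linear(dim, 2 * n_kv_heads * self.hd)
        self.out_proj = nn.Linear(n_q_heads * self.hd, dim)
        self.q_norm = nn.RMSNorm(self.hd)      # QK-Norm keeps attention entropy in check
        self.k_norm = nn.RMSNorm(self.hd)

    def forward(self, x):                      # x: (B, S, dim)
        B, S, _ = x.shape
        q = self.q_proj(x).view(B, S, self.n_q, self.hd).transpose(1, 2)
        kv = self.kv_proj(x).view(B, S, 2, self.n_kv, self.hd)
        k, v = kv.permute(2, 0, 3, 1, 4)       # each (B, n_kv, S, hd)
        q, k = self.q_norm(q), self.k_norm(k)
        rep = self.n_q // self.n_kv            # share each kv head across a query group
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v)
        return self.out_proj(out.transpose(1, 2).reshape(B, S, -1))

x = torch.randn(2, 128, 1024)
print(GQAWithQKNorm(1024)(x).shape)            # torch.Size([2, 128, 1024])
```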

Training Process

Training uses a conditional flow‑matching objective in latent space. A linear interpolation between Gaussian noise and the clean target latent z defines intermediate latents z_t along a forward process, which induces a time‑varying velocity field. J‑DiT parameterizes this velocity field, and the loss directly regresses the target velocity (the conditional flow‑matching loss).
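
As a concrete illustration, here is a minimal training‑step sketch under the common linear‑interpolation convention (pure noise at t = 0, clean data at t = 1); the exact schedule, conditioning format, and model signature in LaVin‑DiT may differ.

```python
# Conditional flow-matching loss sketch: regress the constant velocity of a
# straight-line path from noise to the clean target latent.
import torch

def flow_matching_loss(model, z_target, context):
    b = z_target.shape[0]
    t = torch.rand(b, device=z_target.device)           # one time per sample
    t_ = t.view(b, *([1] * (z_target.dim() - 1)))       # broadcastable shape
    noise = torch.randn_like(z_target)
    z_t = (1.0 - t_) * noise + t_ * z_target            # linear interpolation path
    velocity_target = z_target - noise                  # path velocity (constant in t)
    velocity_pred = model(z_t, context, t)              # J-DiT predicts the velocity
    return torch.mean((velocity_pred - velocity_target) ** 2)
```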

Inference

At inference, a query and a randomly sampled set of input‑target pairs for the desired task are combined with Gaussian noise and processed by J‑DiT to generate a clean latent representation. The ST‑VAE decoder then maps this latent back to the pixel domain, yielding the task‑specific prediction.
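
A matching sampling sketch, using a simple Euler integration of the learned velocity field from noise (t = 0) to data (t = 1), is shown below; the number of steps, the solver, and the `model`/`context` interface are placeholders, not necessarily what LaVin‑DiT uses.

```python
# Euler sampler sketch for the flow-matching model above.
import torch

@torch.no_grad()
def sample(model, context, latent_shape, num_steps=50, device="cpu"):
    z = torch.randn(latent_shape, device=device)        # start from pure noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((latent_shape[0],), i * dt, device=device)
        v = model(z, context, t)                         # predicted velocity at time t
        z = z + dt * v                                   # Euler step toward the data
    return z                                             # clean latent; decode with the ST-VAE
```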

Experiments

LaVin‑DiT was evaluated on multiple vision tasks, including image segmentation and video prediction. It achieved superior performance and higher efficiency compared with task‑specific baselines. Detailed quantitative results and visualizations are provided in the original paper.

Experimental results
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: computer vision, diffusion model, Generative AI, vision transformer, 3D RoPE, joint diffusion, spatiotemporal VAE
Written by Data Party THU, the official platform of the Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
