How LaVin-DiT Unifies Vision Tasks with a Large Diffusion Transformer
The LaVin-DiT paper presents a large vision diffusion transformer that unifies multi-task generation for images and videos. It combines a spatio-temporal variational auto-encoder, a joint diffusion transformer with full-sequence joint attention, and 3D rotary position encoding; the model is trained with conditional flow matching and evaluated across a broad set of vision tasks.
Paper Information
Title: LaVin-DiT: Large Vision Diffusion Transformer
Authors: Zhaoqing Wang, Xiaobo Xia, Runnan Chen, Dongdong Yu, Changhu Wang, Mingming Gong, Tongliang Liu
Project page: https://derrickwang005.github.io/LaVin-DiT/
Key Innovations
Spatio‑Temporal Variational Auto‑Encoder (ST‑VAE): encodes images and videos into a compact continuous latent space while preserving spatio‑temporal structure, reducing computational cost.
Joint Diffusion Transformer (J‑DiT): extends the Diffusion Transformer with full‑sequence joint attention, allowing parallel denoising of conditional and noisy target latents.
In‑Context Learning: uses input‑target pairs as task context to steer the diffusion process toward task‑specific outputs in latent space.
3D Rotary Position Encoding (3D RoPE): represents visual data as a continuous 3D sequence, providing accurate spatio‑temporal position embeddings (a minimal sketch follows this list).
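To make the 3D RoPE idea concrete, here is a minimal sketch that splits each attention head's channels evenly across the temporal, height, and width axes and applies standard rotary embeddings per axis. The even channel split and all function names are assumptions for illustration, not the paper's implementation.

```python
# Minimal 3D RoPE sketch: head channels are split across (t, h, w) axes and
# standard 1D rotary embeddings are applied per axis. The even channel split
# and these helper names are illustrative assumptions, not the paper's code.
import torch

def rope_1d(pos, dim, base=10000.0):
    """cos/sin tables for one axis; pos: (n,) integer coords, returns (n, dim)."""
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))   # (dim/2,)
    angles = pos.float()[:, None] * freqs[None, :]                    # (n, dim/2)
    angles = torch.cat([angles, angles], dim=-1)                      # (n, dim)
    return angles.cos(), angles.sin()

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)

def apply_rope_3d(q, t_idx, h_idx, w_idx):
    """q: (n_tokens, head_dim); *_idx: (n_tokens,) spatio-temporal coordinates."""
    d = q.shape[-1] // 3                                  # channels per axis
    out = []
    for idx, qa in zip((t_idx, h_idx, w_idx), q.split(d, dim=-1)):
        cos, sin = rope_1d(idx, d)
        out.append(qa * cos + rotate_half(qa) * sin)      # rotary rotation per axis
    return torch.cat(out, dim=-1)

# Example: 2 frames of a 2x2 latent grid -> 8 tokens with (t, h, w) coordinates.
t, h, w = torch.meshgrid(torch.arange(2), torch.arange(2), torch.arange(2), indexing="ij")
q = torch.randn(8, 48)                 # head_dim divisible by 3, even per axis
q_rot = apply_rope_3d(q, t.flatten(), h.flatten(), w.flatten())
```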
Method
Problem Setting
Vision tasks (e.g., object detection, panoptic segmentation) are traditionally solved by task‑specific models. LaVin‑DiT formulates a conditional generation problem: given a query (image or video) and a set of input‑target pairs that define a task, the model must generate a prediction that matches the conditional distribution of the desired output.
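In notation (introduced here for exposition rather than taken verbatim from the paper), the model learns to sample from

p_\theta(y | x, \{(x_i, y_i)\}_{i=1}^{N})

where x is the query, y is the prediction, and the N input‑target pairs specify the task.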
Framework Overview
The framework combines ST‑VAE and J‑DiT (see Figure 2a). For a chosen task, a set of input‑target pairs is encoded by ST‑VAE into latent vectors. These latents are chunked, flattened, and concatenated with Gaussian‑noised target latents, forming a single sequence. J‑DiT processes the sequence with joint attention to produce a clean latent, which is finally decoded by the ST‑VAE decoder into pixel space.
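A rough sketch of this sequence construction is shown below; the 2×2 patch size matches the patch embedding described later, but the tensor layout and function name are assumptions for exposition, not the released implementation.

```python
# Illustrative sequence construction: chunk each ST-VAE latent into 2x2 patches,
# flatten them to tokens, and concatenate with the noisy target latent's tokens.
import torch

def build_sequence(cond_latents, noisy_target, patch=2):
    """cond_latents: list of (C, T, H, W) latents; noisy_target: (C, T, H, W)."""
    def tokens(z):
        c = z.shape[0]
        z = z.unfold(2, patch, patch).unfold(3, patch, patch)   # (C, T, H/p, W/p, p, p)
        return z.permute(1, 2, 3, 0, 4, 5).reshape(-1, c * patch * patch)
    seq = [tokens(z) for z in cond_latents] + [tokens(noisy_target)]
    return torch.cat(seq, dim=0)                                # (n_tokens, token_dim)

# Example: two conditional latents plus a noisy target of the same shape.
latents = [torch.randn(8, 4, 16, 16) for _ in range(2)]
seq = build_sequence(latents, torch.randn(8, 4, 16, 16))        # (768, 32)
```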
ST‑VAE
ST‑VAE compresses the spatial and temporal dimensions using causal 3‑D convolutions and deconvolutions. The encoder consists of four symmetric stages: the first two down‑sample both space and time (2× each), the last two down‑sample only space, achieving an overall 4×8×8 (time × height × width) compression of the input. A KL‑regularization term enforces a Gaussian prior on the latent space. Training proceeds in two stages: (1) image‑only pre‑training, followed by (2) joint image‑video training, optimizing a weighted sum of MSE, perceptual, and adversarial losses.
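As a sketch of the causal convolution idea, the block below pads only on the past side of the temporal axis so that frame t never sees future frames; kernel sizes, strides, and the class name are illustrative, not the paper's exact layers.

```python
# Hypothetical causal 3D convolution down-sampling stage: symmetric spatial
# padding, temporal padding only on the past side. Sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3dDown(nn.Module):
    def __init__(self, c_in, c_out, k=3, stride=(2, 2, 2)):
        super().__init__()
        self.k = k
        self.conv = nn.Conv3d(c_in, c_out, kernel_size=k, stride=stride)

    def forward(self, x):                       # x: (B, C, T, H, W)
        p = self.k - 1
        # F.pad order for 5D input: (W_left, W_right, H_top, H_bottom, T_past, T_future)
        x = F.pad(x, (p // 2, p // 2, p // 2, p // 2, p, 0))
        return self.conv(x)

x = torch.randn(1, 3, 9, 64, 64)                # short RGB clip
y = CausalConv3dDown(3, 64)(x)                  # roughly halves T, H, and W
print(y.shape)                                  # torch.Size([1, 64, 5, 32, 32])
```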
J‑DiT
J‑DiT builds on the Diffusion Transformer (DiT) but introduces separate 2×2 patch embeddings for clean conditional latents and noisy target latents. The core is full‑sequence joint attention, where conditional and target sequences are linearly projected, concatenated, and processed by a bidirectional attention module. To improve efficiency, grouped‑query attention replaces standard multi‑head attention, sharing key/value heads across groups of query heads. Additional stabilization techniques include QK‑Norm before the query‑key dot product and sandwich normalization around each attention and feed‑forward block.
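A minimal sketch of full-sequence joint attention with grouped-query attention and QK-Norm is given below; the dimensions, module layout, and the way key/value heads are shared are assumptions for illustration, not the paper's exact implementation.

```python
# Minimal joint-attention sketch: conditional and noisy-target tokens are
# concatenated and attended bidirectionally; key/value heads are shared across
# groups of query heads (GQA) and QK-Norm is applied before the dot product.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointAttention(nn.Module):
    def __init__(self, dim=512, n_q_heads=8, n_kv_heads=2):
        super().__init__()
        self.hd = dim // n_q_heads
        self.n_q, self.n_kv = n_q_heads, n_kv_heads
        self.q = nn.Linear(dim, n_q_heads * self.hd)
        self.kv = nn.Linear(dim, 2 * n_kv_heads * self.hd)
        self.q_norm = nn.LayerNorm(self.hd)          # QK-Norm on query heads
        self.k_norm = nn.LayerNorm(self.hd)          # QK-Norm on key heads
        self.out = nn.Linear(dim, dim)

    def forward(self, cond_tokens, target_tokens):
        # Full-sequence joint attention: one bidirectional pass over the
        # concatenation of conditional and noisy-target tokens.
        x = torch.cat([cond_tokens, target_tokens], dim=1)        # (B, L, D)
        B, L, _ = x.shape
        q = self.q(x).view(B, L, self.n_q, self.hd).transpose(1, 2)
        k, v = self.kv(x).view(B, L, 2, self.n_kv, self.hd).unbind(2)
        k, v = k.transpose(1, 2), v.transpose(1, 2)
        q, k = self.q_norm(q), self.k_norm(k)
        # GQA: replicate each KV head so a group of query heads shares it.
        rep = self.n_q // self.n_kv
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        y = F.scaled_dot_product_attention(q, k, v)                # bidirectional
        return self.out(y.transpose(1, 2).reshape(B, L, -1))

attn = JointAttention()
out = attn(torch.randn(2, 40, 512), torch.randn(2, 24, 512))       # (2, 64, 512)
```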
Training Process (Conditional Flow Matching)
Given a clean target latent z and a Gaussian noise latent \tilde{z}, flow matching defines a linear-interpolation forward process:

z_t = (1 - t) * z + t * \tilde{z},  t \in [0, 1]

This induces a time-varying velocity field v_t that satisfies the ODE dz_t/dt = v_t(z_t, t). J-DiT is parameterized to predict v_t. The Conditional Flow Matching (CFM) loss directly regresses the predicted velocity onto the ground-truth velocity \tilde{z} - z:

L_{CFM} = \mathbb{E}_{t, z, \tilde{z}} \big\| v_\theta(z_t, t) - (\tilde{z} - z) \big\|^2
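As a concrete illustration, a single CFM training step might look like the sketch below; the j_dit call signature is an assumption for exposition, not the released API.

```python
# Hypothetical single training step for conditional flow matching (CFM).
# The j_dit(cond_seq, z_t, t) interface is an assumed signature.
import torch
import torch.nn.functional as F

def cfm_loss(j_dit, cond_seq, z_clean):
    """z_clean: clean target latents (B, ...); cond_seq: conditional latent tokens."""
    b = z_clean.shape[0]
    t = torch.rand(b, device=z_clean.device)                 # t ~ U[0, 1]
    t_b = t.view(b, *([1] * (z_clean.dim() - 1)))            # broadcastable over latent dims
    noise = torch.randn_like(z_clean)                        # \tilde{z}
    z_t = (1 - t_b) * z_clean + t_b * noise                  # linear interpolation z_t
    v_pred = j_dit(cond_seq, z_t, t)                         # predicted velocity
    v_true = noise - z_clean                                 # ground-truth velocity \tilde{z} - z
    return F.mse_loss(v_pred, v_true)
```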
Generation Process
After training, sampling starts from a Gaussian noise latent z_T (with T = 1). The learned velocity field is integrated backward in time to obtain a clean latent z_0. Using Euler integration with step size \Delta t:

z_{t - \Delta t} = z_t - \Delta t \cdot v_\theta(z_t, t)

Iterating from t = T down to t = 0 yields z_0, which is decoded by the ST-VAE decoder to produce the final prediction.
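A corresponding Euler sampler can be sketched as follows, again assuming the hypothetical j_dit interface used in the training sketch above.

```python
# Hypothetical Euler sampler: integrate the learned velocity field from
# t = 1 (pure noise) down to t = 0 (clean latent).
import torch

@torch.no_grad()
def sample(j_dit, cond_seq, latent_shape, n_steps=50):
    device = cond_seq.device
    z = torch.randn(latent_shape, device=device)             # z_T: Gaussian noise at t = 1
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt                                      # current time step
        t_vec = torch.full((latent_shape[0],), t, device=device)
        v = j_dit(cond_seq, z, t_vec)                         # v_theta(z_t, t)
        z = z - dt * v                                        # z_{t-dt} = z_t - dt * v
    return z                                                  # approximately z_0
```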
Inference
For any downstream task, LaVin‑DiT samples a set of input‑target pairs that define the task, concatenates them with the query and Gaussian noise, runs J‑DiT to obtain a latent representation, and finally decodes it with ST‑VAE. No task‑specific fine‑tuning is required.
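Putting the pieces together, in-context inference might look like the sketch below, reusing the hypothetical st_vae, j_dit, and sample components introduced above; none of these names or signatures come from the released code.

```python
# Hypothetical end-to-end in-context inference. All module names and call
# signatures are illustrative assumptions.
import torch

def predict(st_vae, j_dit, task_pairs, query, target_latent_shape):
    """task_pairs: list of (input, target) examples that define the task."""
    # Encode the task-defining pairs and the query into (C, T, H, W) ST-VAE latents.
    latents = [st_vae.encode(x) for pair in task_pairs for x in pair]
    latents.append(st_vae.encode(query))
    # Flatten each latent into tokens and concatenate into one conditional sequence.
    cond_seq = torch.cat([z.flatten(1).T for z in latents], dim=0).unsqueeze(0)
    # Denoise a Gaussian latent with the Euler sampler, then decode to pixels.
    z0 = sample(j_dit, cond_seq, target_latent_shape)
    return st_vae.decode(z0)
```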
Experiments
Empirical results demonstrate that a single LaVin‑DiT model can handle diverse vision tasks (object detection, segmentation, etc.) without task‑specific fine‑tuning. Quantitative metrics (e.g., mAP, IoU) reported in the paper show competitive performance relative to specialized baselines, confirming the effectiveness of the unified architecture, 3D RoPE, and the CFM training regime.