GPDiT Sets New SOTA in Video Generation with Faster, Unified Diffusion‑Autoregressive Framework
GPDiT is an autoregressive diffusion transformer that unifies diffusion and autoregressive modeling for video generation. Its lightweight causal attention and parameter-free, rotation-based time conditioning improve temporal consistency and cut training and inference costs, yielding state-of-the-art results on multiple benchmarks.
Problem
Temporal inconsistency: Bidirectional attention in conventional diffusion models lets future frames influence current predictions, breaking causality for long videos.
Training and inference cost: Diffusion-forcing and similar methods suffer from unstable training and expensive noise-schedule computation.
Discrete-token limitation: Autoregressive models that predict discrete tokens cannot naturally represent continuous video dynamics.
GPDiT: Unified Autoregressive‑Diffusion Video Transformer
Continuous-latent autoregressive diffusion: The diffusion loss is merged with frame-wise autoregressive prediction of latent representations, preserving motion dynamics and semantic consistency (see the sketch after this list).
Full intra-frame attention: Each frame retains full self-attention, while cross-frame interaction is restricted to causal attention.
Lightweight causal attention: Exploits temporal redundancy by skipping attention computation between clean (non-noisy) frames, cutting FLOPs dramatically.
Parameter-free time conditioning: Models the forward diffusion step as a 2-D rotation in the complex plane, removing the need for adaLN-Zero parameters while still encoding the time step.
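To make the training objective concrete, here is a minimal sketch of a frame-wise autoregressive diffusion loss in continuous latent space. The `model` signature, the noise-prediction target, and the `ar_diffusion_loss` helper are illustrative assumptions rather than the paper's exact implementation; the rotation-based forward process it uses is described in the time-conditioning section below.

```python
# Minimal sketch of a frame-wise autoregressive diffusion objective in
# continuous latent space. The model signature and eps-prediction target are
# assumptions for illustration only.
import torch
import torch.nn.functional as F

def ar_diffusion_loss(model, latents: torch.Tensor, thetas: torch.Tensor) -> torch.Tensor:
    """latents: (B, T, D) clean per-frame latents; thetas: (B, T) rotation angles.

    Each frame is noised independently; the model predicts each frame's noise
    while attending (causally) to the clean latents of earlier frames.
    """
    eps = torch.randn_like(latents)
    c = torch.cos(thetas).unsqueeze(-1)   # (B, T, 1)
    s = torch.sin(thetas).unsqueeze(-1)
    noisy = c * latents + s * eps         # rotation-based forward process
    pred_eps = model(noisy, clean_context=latents, thetas=thetas)  # hypothetical signature
    return F.mse_loss(pred_eps, eps)
```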
Attention Mechanisms
Standard causal attention
Each noisy frame can attend only to earlier clean frames and to itself. This prevents future‑information leakage and works seamlessly with KV‑cache, enabling fast inference for long sequences.
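As a rough illustration, the sketch below builds a frame-level (block-causal) attention mask of this kind: full attention within a frame, causal attention across frames. The function name and token layout are assumptions, not code from the paper.

```python
import torch

def frame_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean attention mask (True = may attend).

    Every token attends to all tokens of its own frame (full intra-frame
    attention) and to all tokens of earlier frames, but never to later frames.
    """
    # Frame index of each token in the flattened (num_frames * tokens_per_frame) sequence.
    frame_id = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    # Query token i may attend to key token j iff frame(j) <= frame(i).
    return frame_id[None, :] <= frame_id[:, None]

# Example: 4 frames with 2 latent tokens each -> an 8x8 block lower-triangular mask.
print(frame_causal_mask(num_frames=4, tokens_per_frame=2).int())
```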
Lightweight causal attention
During training the attention scores between clean frames—accounting for roughly half of the total cost—are omitted. At inference time no KV cache is kept for clean frames, yielding the same computational complexity as a non‑causal model while using far less memory.
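One plausible way to express the lightweight variant is to start from the same frame-level causal mask and additionally disable cross-frame attention between clean frames, so the clean-to-clean blocks are never computed. The function name and `is_clean` flag in this sketch are assumptions for illustration.

```python
import torch

def lightweight_causal_mask(is_clean: torch.Tensor, tokens_per_frame: int) -> torch.Tensor:
    """Frame-level causal mask that drops clean-to-clean cross-frame attention.

    is_clean: bool tensor of shape (num_frames,), True for clean (context) frames.
    Noisy frames still attend to themselves and to all earlier clean frames;
    clean frames attend only within themselves, so the clean-to-clean blocks
    (roughly half the attention cost during training) are skipped.
    """
    num_frames = is_clean.numel()
    frame_id = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    clean_tok = is_clean.repeat_interleave(tokens_per_frame)

    causal = frame_id[None, :] <= frame_id[:, None]        # key frame <= query frame
    same_frame = frame_id[None, :] == frame_id[:, None]    # intra-frame attention
    clean_to_clean_cross = clean_tok[:, None] & clean_tok[None, :] & ~same_frame
    return causal & ~clean_to_clean_cross

# Example: frames 0-2 are clean context, frame 3 is the noisy frame being denoised.
mask = lightweight_causal_mask(torch.tensor([True, True, True, False]), tokens_per_frame=2)
print(mask.int())
```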
Time‑step Conditioning via Complex‑Plane Rotation
The forward diffusion process is reinterpreted as an orthogonal rotation of a 2‑D vector formed by stacking the clean sample x_0 and Gaussian noise \epsilon. With rotation angle \theta_t, the noisy sample at step t becomes:
\[
\begin{bmatrix} x_t \\ \epsilon_t \end{bmatrix}
= R(\theta_t)
\begin{bmatrix} x_0 \\ \epsilon \end{bmatrix}
\]

where R(\theta_t) is a 2-D rotation matrix. The inverse rotation recovers x_0 and \epsilon, providing a parameter-free way to inject the time step into the model. This replaces the parameter-heavy adaLN-Zero conditioning used in DiT-style diffusion models.
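Below is a minimal numerical sketch of this rotation, assuming the common convention x_t = cos(\theta_t) x_0 + sin(\theta_t) \epsilon; the exact sign convention and how the recovered pair is fed back into the network may differ from the paper.

```python
import math
import torch

def rotate_forward(x0: torch.Tensor, eps: torch.Tensor, theta_t: float):
    """Forward diffusion as a 2-D rotation of the stacked pair (x0, eps)."""
    c, s = math.cos(theta_t), math.sin(theta_t)
    x_t   =  c * x0 + s * eps
    eps_t = -s * x0 + c * eps
    return x_t, eps_t

def rotate_inverse(x_t: torch.Tensor, eps_t: torch.Tensor, theta_t: float):
    """Inverse (transposed) rotation recovers x0 and eps exactly, with no learned parameters."""
    c, s = math.cos(theta_t), math.sin(theta_t)
    x0  = c * x_t - s * eps_t
    eps = s * x_t + c * eps_t
    return x0, eps

# Sanity check: rotating forward and back reproduces the clean latent and the noise.
x0, eps = torch.randn(4, 8), torch.randn(4, 8)
x_t, eps_t = rotate_forward(x0, eps, theta_t=0.7)
x0_rec, eps_rec = rotate_inverse(x_t, eps_t, theta_t=0.7)
assert torch.allclose(x0, x0_rec, atol=1e-5) and torch.allclose(eps, eps_rec, atol=1e-5)
```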
Experimental Setup
Datasets: UCF-101 (13,320 videos, 101 action classes) and MSR-VTT (10,000 clips, 20 categories) for generation; UCF-101 for linear-probe representation; several few-shot tasks (grayscale-to-color, depth estimation, human detection, Canny-edge-to-image, style transfer), each built from 20 example video sequences.
Metrics: Video generation is evaluated with Fréchet Video Distance (FVD), Fréchet Inception Distance (FID), and Inception Score (IS). Representation quality is measured by top-1 accuracy of a linear probe on UCF-101. Few-shot performance is assessed qualitatively.
Implementation details:
Baseline GPDiT‑B: 80 M parameters, Adam optimizer, batch size 96, 400 k training steps on UCF‑101.
Large variant GPDiT-H: 2 B parameters, pretrained on LAION-Aesthetic images for 200 k steps (batch size 960), then fine-tuned on mixed image-video data (sampled in equal proportion) for another 200 k steps. Video frames are sampled with a temporal stride of 3 and cropped to 17-frame clips.
GPDiT‑H‑LONG: additional 150 k steps on pure video data (variable length 17–45 frames) with a reduced learning rate.
Results
Video Generation
On MSR‑VTT, GPDiT‑H achieves FID = 7.4 and FVD = 68, surpassing prior methods without test‑time data exposure. On UCF‑101:
GPDiT‑B: IS = 66.5, FID = 14.8, FVD = 243.
GPDiT‑H‑LONG (trained on 24 M videos): IS = 66.6, FID = 7.9, FVD = 218.
Smaller variants GPDiT‑B‑OF2 and GPDiT‑B‑OF (both 80 M parameters) still obtain competitive FVD scores of 214 and 216 respectively.
Video Representation
Linear probing on UCF‑101 shows that the OF2 attention variant consistently outperforms OF, confirming that interaction between clean context frames improves representation quality. Accuracy peaks in shallow layers and then gradually declines, mirroring findings from REPA. For GPDiT‑H‑OF2, accuracy continues to rise with training steps, and a strong positive correlation is observed between generation quality (lower FVD) and classification accuracy.
Few‑Shot Video Learning
GPDiT‑H can be fine‑tuned with batch size 4 for 500 steps on each downstream task. With as few as 20 examples, the model learns high‑quality transformations for grayscale‑to‑color, depth estimation, human detection, Canny‑edge‑to‑image reconstruction, and two style‑transfer settings, demonstrating emergent in‑context learning comparable to large language models.
Conclusion
GPDiT unifies autoregressive modeling and diffusion in a single transformer, leveraging lightweight causal attention and a rotation‑based, parameter‑free time‑conditioning scheme. These designs cut training and inference FLOPs without sacrificing generation quality, achieve state‑of‑the‑art video synthesis metrics, provide competitive video‑representation features, and enable strong few‑shot generalization across diverse video tasks.
References
[1] Generative Pre‑trained Autoregressive Diffusion Transformer
