
EasyAnimate: High‑Resolution Video Generation via Diffusion Transformers

EasyAnimate is an open-source, DiT-based video generation framework from Alibaba Cloud AI Platform PAI. It offers a complete pipeline covering data preprocessing, VAE and DiT training, LoRA fine-tuning, motion-module integration, and inference at up to 768×768 resolution and 144 frames, using Diffusion Transformers to produce longer, higher-quality videos.

Alibaba Cloud Big Data AI Platform

Jun 4, 2024


Overview

Recent interest in the Sora model has sparked a wave of open‑source projects that replace the traditional UNet baseline with a Diffusion Transformer architecture to generate longer, higher‑resolution videos. EasyAnimate, developed by Alibaba Cloud AI Platform PAI, is a DiT‑based video generation framework that provides an end‑to‑end solution.

Key Features

Inference at up to 768×768 resolution and 144 frames (512×512 inference fits on a 24 GB A10 GPU).

Training of the DiT baseline model.

LoRA fine‑tuning on a small set of images to change video style.

VAE model training and inference.

Video data preprocessing.

Data Preprocessing

EasyAnimate integrates with PAI for one-click training and deployment. For data preparation, it uses PySceneDetect to split long videos into clips of 3 to 10 seconds, then filters the clips by duration, aesthetic score, proportion of on-screen text (detected with EasyOCR), and amount of motion (estimated from optical flow). Clips that pass the filters are recaptioned with VideoChat2 and VILA; a higher-quality recaption model is under development.
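The filter chain described above can be sketched as a sequence of threshold checks on per-clip statistics. This is a minimal illustration, not EasyAnimate's code: the `ClipStats` fields, the `rejection_reason` helper, and every threshold except the 3 to 10 second duration window are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClipStats:
    """Per-clip measurements; all field names are hypothetical."""
    duration_s: float        # clip length in seconds after scene splitting
    aesthetic_score: float   # output of some aesthetic scorer (scale assumed)
    text_area_ratio: float   # fraction of frame area covered by detected text
    motion_score: float      # e.g. mean optical-flow magnitude


# Illustrative thresholds only. The article states the 3-10 s clip length;
# the other values are placeholders chosen for this sketch.
MIN_DURATION_S, MAX_DURATION_S = 3.0, 10.0
MIN_AESTHETIC = 4.5
MAX_TEXT_AREA_RATIO = 0.05
MIN_MOTION, MAX_MOTION = 0.5, 20.0


def rejection_reason(clip: ClipStats) -> Optional[str]:
    """Return the name of the first filter the clip fails, or None if it passes."""
    if not MIN_DURATION_S <= clip.duration_s <= MAX_DURATION_S:
        return "duration"
    if clip.aesthetic_score < MIN_AESTHETIC:
        return "aesthetic"
    if clip.text_area_ratio > MAX_TEXT_AREA_RATIO:
        return "text"
    if not MIN_MOTION <= clip.motion_score <= MAX_MOTION:
        return "motion"
    return None


clips = [
    ClipStats(5.0, 6.1, 0.01, 3.2),   # passes every filter
    ClipStats(1.5, 6.5, 0.00, 2.0),   # too short
    ClipStats(6.0, 5.0, 0.30, 2.0),   # mostly text (e.g. subtitles, slides)
    ClipStats(6.0, 5.0, 0.01, 0.0),   # static clip, no motion
]
kept = [c for c in clips if rejection_reason(c) is None]
```

Running the cheap checks (duration) before the expensive ones (OCR, optical flow) would let a real pipeline skip costly model calls for clips that are already rejected.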

Model Architecture

The framework builds on the PixArt‑alpha foundation model, modifying the VAE and DiT structures to better support video generation. A motion module is added to inject temporal information, and a grid‑reshape operation expands the token count for richer spatial context. Skip‑connection structures from U‑ViT are incorporated to stabilize deep DiT training.
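In U-ViT, long skip connections link shallow blocks to deep blocks in a last-in, first-out pattern, so the first block pairs with the last. The sketch below only computes which block indices would be paired; the function name is hypothetical, and how EasyAnimate wires and merges these skips inside its DiT is not specified in this article.

```python
from typing import List, Tuple


def long_skip_pairs(depth: int) -> List[Tuple[int, int]]:
    """Return (source_block, target_block) index pairs for U-ViT-style long skips.

    The first depth // 2 blocks save their outputs on a stack and the last
    depth // 2 blocks pop them, so the shallowest block feeds the deepest.
    With an odd depth, the middle block has no skip connection.
    """
    half = depth // 2
    return [(i, depth - 1 - i) for i in range(half)]


pairs = long_skip_pairs(5)  # blocks 0..4, block 2 is the middle block
```

Because each deep block receives features from a shallow block directly, gradients have a short path to the early layers, which is the usual argument for why such skips help deep transformers train stably.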

Video VAE

EasyAnimate adopts MagViT as the video VAE backbone. MagViT uses causal 3D convolutions, which place all temporal padding before the first frame so that each output frame depends only on the current and earlier frames. To handle extremely long video sequences, EasyAnimate introduces a Slice VAE that compresses video frames along the spatial and temporal dimensions, reducing memory consumption while maintaining quality.
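The effect of padding only on the past side can be shown with a one-dimensional toy along the time axis. This is a sketch under simplifying assumptions: each "frame" is a single number rather than an image, the padding value is zero (real implementations may replicate the first frame instead), and the function name is made up for the example.

```python
from typing import List, Sequence


def causal_temporal_conv(frames: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Convolve along time with all padding placed before the first frame.

    Output t is a weighted sum of frames t-k+1 .. t (k = kernel length),
    so it never reads a future frame. The output has the same length as the input.
    """
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(frames)  # pad the past side only
    return [
        sum(kernel[j] * padded[t + j] for j in range(k))
        for t in range(len(frames))
    ]


kernel = [0.25, 0.25, 0.5]              # last weight applies to the current frame
out = causal_temporal_conv([1.0, 2.0, 3.0, 4.0], kernel)

# Changing only the last frame must leave all earlier outputs untouched.
out_changed = causal_temporal_conv([1.0, 2.0, 3.0, 100.0], kernel)
```

A symmetric ("same") padding would instead let output `t` see frame `t + 1`, which breaks the property that the first frame can be encoded on its own as an image.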

Video Diffusion Transformer

Based on the image DiT, EasyAnimate adds the motion module to learn temporal dynamics and uses grid‑reshape to increase attention token coverage. This enables the model to generate coherent motion across frames.
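Motion modules of the AnimateDiff kind typically apply attention along the time axis: tokens that share a spatial position are gathered across frames into one sequence, and attention then runs over that sequence. The sketch below shows only this regrouping step with plain lists. The function name is hypothetical, and EasyAnimate's actual tensor layout and its grid-reshape step are not reproduced here.

```python
from typing import List, TypeVar

Token = TypeVar("Token")


def group_tokens_by_position(video_tokens: List[List[Token]]) -> List[List[Token]]:
    """Regroup tokens so each sequence follows one spatial position through time.

    Input:  video_tokens[t][s] is the token at frame t, spatial position s.
    Output: sequences[s][t], one length-T sequence per spatial position,
            which is the shape a temporal attention layer would consume.
    """
    num_positions = len(video_tokens[0])
    if any(len(frame) != num_positions for frame in video_tokens):
        raise ValueError("every frame must have the same number of tokens")
    return [[frame[s] for frame in video_tokens] for s in range(num_positions)]


# 3 frames, 4 spatial tokens per frame; each token is labelled (frame, position).
tokens = [[(t, s) for s in range(4)] for t in range(3)]
sequences = group_tokens_by_position(tokens)
```

With this layout each attention call sees only as many tokens as there are frames, which keeps temporal attention cheap compared with attending over every token of every frame at once.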

Training Process

Training proceeds in three stages: (1) train the DiT on image data so that it quickly adapts to the new VAE; (2) train the motion module on a large mixed image-video dataset, which lets the model generate subtle motion; (3) fine-tune the full DiT model on a curated high-quality video dataset. The training resolution (width × height × frames) progresses from 256×256×144 to 512×512×144 and finally 768×768×144.
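One way to picture the three stages is as a rule that decides which parameters receive gradients in each stage. The sketch below is an assumption-laden illustration: the parameter-name prefixes (`vae.`, `dit.`, `dit.motion_module.`) are hypothetical, and it assumes the VAE stays frozen throughout and that the motion module is not trained during the image-only stage, neither of which this article states explicitly.

```python
def is_trainable(stage: int, param_name: str) -> bool:
    """Decide whether a parameter is updated in a given training stage.

    Stage 1: DiT weights on image data (motion module assumed idle).
    Stage 2: motion module only, on mixed image-video data.
    Stage 3: the full DiT, motion module included, on curated video.
    Anything outside the DiT (e.g. the VAE) is assumed frozen in all stages.
    """
    in_dit = param_name.startswith("dit.")
    in_motion = param_name.startswith("dit.motion_module.")
    if stage == 1:
        return in_dit and not in_motion
    if stage == 2:
        return in_motion
    if stage == 3:
        return in_dit
    raise ValueError(f"unknown stage: {stage}")


# Hypothetical parameter names, for illustration only.
names = [
    "vae.encoder.conv.weight",
    "dit.blocks.0.attn.qkv.weight",
    "dit.motion_module.0.attn.qkv.weight",
]
plan = {stage: [n for n in names if is_trainable(stage, n)] for stage in (1, 2, 3)}
```

In a PyTorch training script, a rule like this would typically be applied by setting `requires_grad` on each named parameter before building the optimizer.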

Scalability

EasyAnimate supports both baseline-model training and LoRA training, and LoRA fine-tuning can use either videos or images. The project includes a minimal image dataset for LoRA demos and provides inference scripts.
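For readers unfamiliar with LoRA, the core idea is to keep a pretrained weight matrix W frozen and learn a low-rank update, so the effective weight is W + (alpha / r) · B · A, where A and B are small matrices of rank r. The pure-Python sketch below illustrates that formula in general; it is not taken from EasyAnimate's training code, and the helper names are made up.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][p] * b[p][j] for p in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_weight(w: Matrix, lora_b: Matrix, lora_a: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / r) * B @ A, where r is the LoRA rank.

    Shapes: W is (d_out x d_in), B is (d_out x r), A is (r x d_in).
    """
    rank = len(lora_a)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


w = [[1.0, 2.0], [3.0, 4.0]]   # frozen pretrained weight
a = [[1.0, 0.0]]               # rank-1 "down" projection
b_init = [[0.0], [0.0]]        # B starts at zero, so the update starts at zero
b_trained = [[1.0], [2.0]]     # pretend values after fine-tuning

at_init = lora_weight(w, b_init, a, alpha=2.0)
after_training = lora_weight(w, b_trained, a, alpha=2.0)
```

Because only A and B are trained, a style LoRA needs far fewer trainable parameters than the full model, which is why a small image set can be enough to shift the output style.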

Open‑source Resources

GitHub repository: https://github.com/aigc-apps/EasyAnimate

Technical report: https://arxiv.org/abs/2405.18991

Additional references: MagViT, PixArt‑alpha, Open‑Sora, AnimateDiff, U‑ViT, etc.
