How EasyAnimate V5 Advances AI Video Generation with Multimodal Control
EasyAnimate V5, an Alibaba Cloud AI video generation framework, expands model size to 7B/12B, introduces multimodal control, token‑length based training, and inpaint‑based image‑to‑video strategies, while providing easy deployment via PAI, DSW, and local ComfyUI integration.
In digital content creation, video plays an increasingly important role, but high‑quality video production is often time‑consuming and costly. The EasyAnimate series leverages artificial‑intelligence techniques to simplify this process, and EasyAnimate V5 builds on its predecessors with improved quality, multimodal data handling, and cross‑language support.
EasyAnimate is an Alibaba Cloud video generation framework based on DiT that offers video preprocessing, VAE training, DiT training, LoRA training, model inference, and evaluation. By fine‑tuning a pretrained model with a small number of images via LoRA, users can change video styles, greatly enhancing extensibility and competitiveness.
Integrated into the AI platform PAI for one‑click training and deployment, EasyAnimate V5 highlights the following features:
Adopts the MMDiT architecture, scaling the model to 7B and 12B parameters.
Supports various control inputs.
Implements a more extensive image‑to‑video strategy.
Utilizes more data and multi‑stage training.
Model Scale and Structure Updates
The model incorporates ideas from CogVideoX and Stable Diffusion 3, linking text and video embeddings through joint self‑attention; this reduces computation compared with the cross‑attention used in PixArt and lets the attention weights adapt to different conditions.
To align the disparate feature spaces of text and video, the MMDiT architecture is used with separate to_k, to_q, to_v, and feed‑forward networks for each modality, enabling better multimodal alignment.
Following Flux, the total parameter count is expanded to 7B and 12B.
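As a minimal illustration of this design, the PyTorch sketch below keeps separate to_q/to_k/to_v projections and feed‑forward networks per modality while running a single attention over the concatenated token sequence. The class name, hidden size, and single‑head simplification are assumptions made for clarity, not the actual EasyAnimate implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MMDiTJointAttention(nn.Module):
    # Sketch of MMDiT-style joint attention: each modality keeps its own
    # projections and feed-forward network, but attention runs over the
    # concatenated text + video token sequence (single head for brevity).
    def __init__(self, dim: int = 1152):
        super().__init__()
        self.to_q_vid, self.to_k_vid, self.to_v_vid = (nn.Linear(dim, dim) for _ in range(3))
        self.to_q_txt, self.to_k_txt, self.to_v_txt = (nn.Linear(dim, dim) for _ in range(3))
        self.ff_vid = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ff_txt = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, vid, txt):
        # Project each stream with its own weights, then concatenate along the
        # sequence axis so text and video tokens attend to each other.
        q = torch.cat([self.to_q_vid(vid), self.to_q_txt(txt)], dim=1)
        k = torch.cat([self.to_k_vid(vid), self.to_k_txt(txt)], dim=1)
        v = torch.cat([self.to_v_vid(vid), self.to_v_txt(txt)], dim=1)
        out = F.scaled_dot_product_attention(q, k, v)
        vid_out, txt_out = out.split([vid.shape[1], txt.shape[1]], dim=1)
        # Per-modality feed-forward networks with residual connections.
        return vid + self.ff_vid(vid_out), txt + self.ff_txt(txt_out)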
Video Control
The earlier EasyAnimate V3 achieved image‑to‑video via inpainting; V5 extends this idea to controllable video generation: a control signal replaces the original mask, is encoded by the VAE, and is combined with the latent variables as guidance.
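Conceptually, the control pathway looks like the sketch below: the control video (OpenPose, Depth, Canny, etc.) is encoded with the VAE and concatenated channel‑wise with the noisy latent as guidance. The function name, tensor shapes, and diffusers‑style VAE calls are illustrative assumptions rather than the project's actual API.
import torch

def build_control_conditioned_latent(noisy_latent, control_video, vae):
    # noisy_latent:  (B, C, T, H, W) latent currently being denoised
    # control_video: (B, 3, F, 8*H, 8*W) preprocessed control frames
    #                (OpenPose / Canny / Depth / ...)
    with torch.no_grad():
        # Encode the control signal into the same latent space as the video
        # (diffusers-style VAE call, used here only as an assumed interface).
        control_latent = vae.encode(control_video).latent_dist.sample()
        control_latent = control_latent * vae.config.scaling_factor
    # Channel-wise concatenation: the extra channels act as guidance, and the
    # DiT's input projection is widened to accept them.
    return torch.cat([noisy_latent, control_latent], dim=1)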
From 26 M pre‑training videos, about 443 K high‑quality clips are selected, with control conditions such as OpenPose, Scribble, Canny, Anime, MLSD, Hed, and Depth. Training proceeds in two stages: the 13312‑token stage (512×512×49) and the 53248‑token stage (1024×1024×49).
Example: the EasyAnimateV5‑12b‑Control model uses a batch size of 128 for 5000 steps in the 13312 stage, and a batch size of 96 for 2000 steps in the 53248 stage.
Trained models accept control conditions to steer the generated video.
Token‑Length Based Training
Training is divided into stages based on token length. Image‑VAE alignment uses a 10 M SAM dataset for about 120 k steps, providing faster and clearer text‑image alignment.
Video training uses three token lengths: 3328 (256×256×49), 13312 (512×512×49), and 53248 (1024×1024×49), with varying data scales and batch sizes:
3328 stage: all 26.6 M videos, batch size 1536, 66.5 k steps.
13312 stage: videos at 720p and above (≈17.9 M), batch size 768, 30 k steps; plus 0.5 M high‑quality videos for image‑to‑video, batch size 384, 5 k steps.
53248 stage: high‑quality 0.5 M videos, batch size 196, 5 k steps.
Mixed‑resolution training enables generation at any resolution between 512 and 1024, with corresponding frame counts (e.g., 49 frames at 512×512, 21 at 768×768, 9 at 1024×1024).
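The token counts above can be reproduced assuming an 8× spatial VAE downsampling, 4× temporal compression (49 frames → 13 latent frames), and 2×2 patchification; these factors are inferred from the numbers rather than stated explicitly in the source.
def dit_tokens(height, width, frames, spatial_ds=8, temporal_ds=4, patch=2):
    # Token count for one clip, assuming 8x spatial / 4x temporal VAE
    # compression and 2x2 patchification (inferred factors, not official constants).
    latent_t = (frames - 1) // temporal_ds + 1      # 49 frames -> 13 latent frames
    latent_h, latent_w = height // spatial_ds, width // spatial_ds
    return latent_t * (latent_h // patch) * (latent_w // patch)

print(dit_tokens(256, 256, 49))     # 3328
print(dit_tokens(512, 512, 49))     # 13312
print(dit_tokens(1024, 1024, 49))   # 53248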
Image‑to‑Video Strategy
The inpaint‑based approach reconstructs masked regions from VAE‑encoded reference frames concatenated with the latent vectors. In the original diagram, the black area marks the region to be reconstructed, while the white area is the reference image.
Mask information can be resized and combined with the latent to form a 33×13×48×84 tensor, which is fed into the DiT model for noise prediction.
Because mask information is flexible, users can specify start or end frames, or edit specific regions.
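One plausible breakdown of the 33×13×48×84 shape, assuming a 16‑channel VAE latent and a 384×672 training crop, is 16 channels of noisy video latent, 16 channels of VAE‑encoded reference frames, and 1 resized mask channel. The sketch below illustrates that concatenation; channel counts, mask convention, and names are assumptions, not confirmed by the source.
import torch
import torch.nn.functional as F

def build_inpaint_input(noisy_latent, ref_latent, mask):
    # noisy_latent: (B, 16, 13, 48, 84) latent being denoised
    # ref_latent:   (B, 16, 13, 48, 84) VAE encoding of the reference frames
    # mask:         (B, 1, F, H, W) pixel-space mask, 1 = region to generate
    # Resize the mask to the latent resolution so it can be concatenated.
    mask_latent = F.interpolate(mask, size=noisy_latent.shape[-3:], mode="nearest")
    # Channel-wise concatenation: 16 + 16 + 1 = 33 channels -> (B, 33, 13, 48, 84).
    return torch.cat([noisy_latent, ref_latent, mask_latent], dim=1)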
Noise sampled from a normal distribution (mean ‑3.0, std 0.5) is added to non‑background reference frames to increase motion amplitude, following practices from CogVideoX and SVD.
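A sketch of that noise augmentation is shown below, reading the normal distribution as the log of a per‑sample noise strength in the SVD style; this interpretation, the mask convention, and the helper itself are assumptions.
import torch

def add_reference_noise(ref_latent, mask_latent, mean=-3.0, std=0.5):
    # Per-sample noise strength sampled as exp(N(mean, std)), SVD-style.
    b = ref_latent.shape[0]
    strength = (torch.randn(b, device=ref_latent.device) * std + mean).exp()
    noise = torch.randn_like(ref_latent) * strength.view(b, 1, 1, 1, 1)
    # mask_latent == 0 is taken to mark the reference (white) regions;
    # only those are perturbed, which encourages larger motion.
    return ref_latent + noise * (mask_latent == 0)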
Model Usage
EasyAnimate can be launched on DSW, which provides a free 30 GB memory environment supporting EasyAnimateV5‑7b‑zh and EasyAnimateV5‑12b‑zh in qfloat8 at 512 resolution via a Gradio UI.
Local deployment is also supported. Example commands to install the EasyAnimate plugin for ComfyUI:
cd ComfyUI/custom_nodes/
# Git clone the easyanimate itself
git clone https://github.com/aigc-apps/EasyAnimate.git
# Git clone the video output node
git clone https://github.com/Kosinkadink/ComfyUI-VideoHelperSuite.git
cd EasyAnimate/
python install.py
After installation, drag the provided JSON files into the ComfyUI interface to generate videos.
Contact
Project repository: https://github.com/aigc-apps/EasyAnimate
DingTalk group: 77450006752
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.