How to Build a Real‑Time AI‑Powered Anime‑Style Video Generator for Social Apps
This technical report details the end-to-end workflow for integrating an AIGC video generation module into a social app: requirement analysis, model and hardware selection, dataset construction, and LoRA and full-parameter fine-tuning. It then covers acceleration techniques, including Sage Attention, TeaCache, XDiT, gradient-checkpointing offload, tiled VAE, and quantization, followed by performance evaluation and metric-based ranking of the final models.
Project Overview
The goal is to generate a short 720p dancing anime‑style video from a real‑person photo ("real‑person to dancing anime fairy"). The solution uses Tongyi Wanxiang (Wan) diffusion video models, custom fine‑tuning (LoRA and full‑parameter), and a suite of acceleration techniques to meet quality, speed, and GPU‑memory constraints for production deployment.
Requirements
Convert real‑person images to anime‑style characters.
Generate complex dynamic actions (e.g., dancing) with stable motion.
Produce 720p video at a stable frame rate (15 fps, 5 s, 75 frames).
Maintain consistent style, color, and motion across frames.
Model and Compute Selection
Two 14‑billion‑parameter Wan video models were evaluated:
Wan2.1-I2V-14B-720P: strong anime style; supports keyword, style-tag, and composition control.
Wan2.2-I2V-A14B: high-noise and low-noise variants; excels at realistic rendering.
Benchmarking showed that Wan2.1 performed best for the anime-style task, while Wan2.2 serves as a realistic-rendering baseline. A GPU-memory analysis led to selecting a machine type (type 4) with enough VRAM to hold a full 75-frame video (≈42 GB for Wan2.1, ≈51 GB for Wan2.2) at reasonable cost.
Dataset Construction
A multimodal dataset was built:
50 small‑sample groups for quick local validation.
5 000 full‑scale entries for production‑grade training.
Each entry contains a text prompt, the first-frame image, a control video, and a VACE video. The data are split 9:1 into training and test sets. A dataset organization example is shown below.
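The layout below is an illustrative sketch only; the directory structure and the metadata.csv column names are assumptions rather than the exact schema of the training script, so adapt them to whatever examples/wanvideo/model_training/train.py expects in your DiffSynth-Studio version.
train_dataset/
├── metadata.csv                   # one row per training sample
├── images/0001_first_frame.jpg    # first-frame image
├── videos/0001_control.mp4        # control (motion) video
└── vace/0001_vace.mp4             # VACE video
# metadata.csv, with hypothetical column names:
# video,input_image,prompt
# videos/0001_control.mp4,images/0001_first_frame.jpg,"Flower-Fairy. A radiant metamorphosis unfolds..."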
Training Strategies
Two fine‑tuning approaches were applied:
LoRA: low-rank adaptation on the attention projections (q, k, v, o) and the two linear projections of the feed-forward network (ffn.0, ffn.2). Target modules = "q,k,v,o,ffn.0,ffn.2", rank = 32.
Full-parameter: train all weights of the diffusion backbone.
Because Wan2.1 occupies ~42 GB VRAM and Wan2.2 ~51 GB, the maximum trainable frames differ (LoRA: 60 frames for Wan2.1, 45 frames for Wan2.2; full‑parameter: 41 frames for Wan2.1, 25 frames for Wan2.2). Training was run on the PAI DSW environment with the following dependencies:
torch>=2.0.0
torchvision
transformers
imageio
imageio[ffmpeg]
safetensors
einops
sentencepiece
protobuf
modelscope
ftfy
pynvml
pandas
accelerate
peft
Example LoRA training command for Wan2.1:
accelerate launch examples/wanvideo/model_training/train.py \
--dataset_base_path train_dataset \
--dataset_metadata_path train_dataset/metadata.csv \
--height 720 \
--width 1280 \
--num_frames 60 \
--dataset_repeat 10 \
--model_paths '[
["wan21/diffusion_pytorch_model-00001-of-00007.safetensors",
"wan21/diffusion_pytorch_model-00002-of-00007.safetensors",
...],
"wan21/models_t5_umt5-xxl-enc-bf16.pth",
"wan21/Wan2.1_VAE.pth",
"wan21/models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth"]' \
--learning_rate 1e-4 \
--num_epochs 2 \
--remove_prefix_in_ckpt "pipe.dit." \
--output_path "./models/train/Wan2.1-I2V-14B-720P_lora" \
--lora_base_model "dit" \
--lora_target_modules "q,k,v,o,ffn.0,ffn.2" \
--lora_rank 32 \
--extra_inputs "input_image" \
--use_gradient_checkpointing_offload
Full-parameter training uses the same script with --num_frames 41 (Wan2.1) or --num_frames 25 (Wan2.2).
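For reference, a full-parameter run for Wan2.1 could look like the sketch below. This assumes that omitting the --lora_* flags makes the script train the full DiT (check the script's --help in your version); the --model_paths list is the same as in the LoRA command and is elided here.
accelerate launch examples/wanvideo/model_training/train.py \
  --dataset_base_path train_dataset \
  --dataset_metadata_path train_dataset/metadata.csv \
  --height 720 \
  --width 1280 \
  --num_frames 41 \
  --dataset_repeat 10 \
  --model_paths '[...]' \
  --learning_rate 1e-4 \
  --num_epochs 2 \
  --remove_prefix_in_ckpt "pipe.dit." \
  --output_path "./models/train/Wan2.1-I2V-14B-720P_full" \
  --extra_inputs "input_image" \
  --use_gradient_checkpointing_offload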
Inference example (LoRA weights loaded):
import torch
from PIL import Image
from diffsynth import save_video
from diffsynth.pipelines.wan_video_new import WanVideoPipeline, ModelConfig
pipe = WanVideoPipeline.from_pretrained(
torch_dtype=torch.bfloat16,
device="cuda",
model_configs=[
ModelConfig(path=[
"wan21/diffusion_pytorch_model-00001-of-00007.safetensors",
"wan21/diffusion_pytorch_model-00002-of-00007.safetensors",
...
]),
ModelConfig(path="wan21/models_t5_umt5-xxl-enc-bf16.pth"),
ModelConfig(path="wan21/Wan2.1_VAE.pth"),
ModelConfig(path="wan21/models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth"),
],
use_usp=True,
)
pipe.load_lora(pipe.dit, "models/train/Wan2.1-I2V-14B-720P_lora/epoch-2.safetensors", alpha=1)
pipe.enable_vram_management()
image = Image.open("1.jpg")
prompt = """Flower‑Fairy. A radiant metamorphosis unfolds as the character, encircled by shimmering butterflies, rises from a whirling vortex of whimsical 2D halos..."""
video = pipe(
prompt=prompt,
    # Standard Wan negative prompt, kept in Chinese as the model expects; roughly: garish tones, overexposure, static, blurry details, subtitles, style/artwork text, stillness, overall gray cast, worst/low quality, JPEG artifacts, ugly, mutilated, extra fingers, poorly drawn hands/faces, deformed or disfigured limbs, fused fingers, frozen frame, cluttered background, three legs, crowded background, walking backwards.
    negative_prompt="色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走",
input_image=image,
seed=0,
tiled=True,
height=720,
width=1280,
)
save_video(video, "video_lora.mp4", fps=15, quality=5)
Optimization Techniques
Sage Attention (≈27% train & infer speed‑up)
Sage Attention replaces standard dot-product attention with a quantized kernel (INT8 Q/K with FP16 P/V) implemented in Triton for NVIDIA Tensor Cores. Its authors report a 2-3× kernel-level speed-up over FlashAttention on consumer GPUs; on this workload it translated into roughly a 27% end-to-end speed-up for both training and inference.
git clone https://github.com/thu-ml/SageAttention.git
cd SageAttention
export EXT_PARALLEL=4 NVCC_APPEND_FLAGS="--threads 8" MAX_JOBS=32
python setup.py install
from sageattention import sageattn
import torch.nn.functional as F
# Call the SageAttention kernel directly on q/k/v tensors
attn_output = sageattn(q, k, v, tensor_layout="HND", is_causal=False)
# Or route PyTorch's scaled-dot-product attention to the Sage kernel
F.scaled_dot_product_attention = sageattn
# Wan models already support Sage Attention; just enable it in the training command.
TeaCache (≈28% infer speed-up)
TeaCache caches latent representations of regions that change little between consecutive frames, skipping redundant diffusion computation for them. The key parameter tea_cache_l1_thresh (typically 0.02-0.1) controls the similarity threshold: larger values cache more aggressively, trading a little quality for speed.
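In DiffSynth-Studio builds that expose TeaCache through the pipeline call, enabling it is a small change to the inference call shown earlier. The snippet below is a sketch under that assumption; tea_cache_model_id is a hypothetical identifier telling TeaCache which model's fitted coefficients to use, and the argument names may differ in your version.
video = pipe(
    prompt=prompt,
    negative_prompt="...",
    input_image=image,
    seed=0,
    tiled=True,
    height=720,
    width=1280,
    tea_cache_l1_thresh=0.05,                  # 0.02-0.1: higher means more caching, faster but lossier
    tea_cache_model_id="Wan2.1-I2V-14B-720P",  # hypothetical; selects the model-specific TeaCache coefficients
)
At a conceptual level, the per-frame caching logic looks like the schematic below.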
# Schematic pseudocode only (encode_text, diffusion_model, tea_cache, vae are placeholders)
for frame_idx in range(total_frames):
    text_emb = encode_text(prompt)
    noise = get_initial_noise()
    cache_mask = None
    if frame_idx > 0:
        # Mark low-change regions whose cached results can be reused
        cache_mask = tea_cache.get_cache_mask(prev_latent, current_latent)
    latent = diffusion_model(noise, text_emb, cache_mask=cache_mask)
    tea_cache.update_cache(latent, frame_idx)
    prev_latent = latent
    frame = vae.decode(latent)
XDiT Sequence Parallelism (≈400% train & infer speed-up)
XDiT provides several parallelism primitives:
PipeFusion: patch-wise pipeline parallelism.
USP: unified sequence parallelism across long sequence dimensions.
CFG Parallel: splits the classifier-free guidance branches across devices.
DistVAE: distributed VAE encoder/decoder.
Compilation: torch.compile + OneDiff for kernel fusion.
git clone https://github.com/xdit-project/xDiT.git
cd xDiT
pip install -e .
# optional flash‑attention support
pip install -e ".[flash-attn]"
Example training command:
accelerate launch examples/wanvideo/model_training/train.py \
--dataset_base_path data/example_video_dataset \
--height 720 --width 1280 \
--num_frames 49 --dataset_repeat 100 \
--model_id_with_origin_paths "Wan-AI/Wan2.1-I2V-14B-720P:diffusion_pytorch_model*.safetensors,Wan-AI/Wan2.1-I2V-14B-720P:models_t5_umt5-xxl-enc-bf16.pth,Wan-AI/Wan2.1-I2V-14B-720P:Wan2.1_VAE.pth,Wan-AI/Wan2.1-I2V-14B-720P:models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth" \
--learning_rate 1e-4 \
--num_epochs 5 \
--output_path ./models/train/Wan2.1-I2V-14B-720P_lora \
--lora_base_model dit --lora_target_modules "q,k,v,o,ffn.0,ffn.2" --lora_rank 32 \
--extra_inputs "input_image" \
--use_gradient_checkpointing_offload
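For inference, the USP path is the one already enabled via use_usp=True in the pipeline example earlier; it only takes effect when the script runs with one process per GPU. A minimal launch sketch, assuming that inference script is saved as infer_lora.py and that the installed DiffSynth-Studio build initializes the distributed process group itself when use_usp=True (otherwise add torch.distributed.init_process_group at the top of the script):
# Launch USP sequence-parallel inference across 4 GPUs (script name and GPU count are illustrative)
torchrun --standalone --nproc_per_node=4 infer_lora.py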
Gradient Checkpointing Offload (≈13% VRAM reduction)
Activations are stored on CPU and recomputed during back-propagation, reducing GPU memory by ~20 % with a modest compute overhead.
# Enable in training configuration
self.use_gradient_checkpointing = True
self.use_gradient_checkpointing_offload = True
Tiled VAE (≈11% VRAM reduction)
VAE encoding/decoding is performed on overlapping tiles (default size 30×52, stride 15×26), lowering peak VRAM while preserving visual quality.
video = pipe(
prompt="...",
negative_prompt="...",
seed=0,
tiled=True,
tile_size=[30,52],
tile_stride=[15,26]
)
Quantization (≈33% train speed-up)
Wan2.2‑FP8 (ModelScope URL: https://modelscope.cn/models/muse/Wan2.2-I2V-A14B-FP8) reduces parameter size, aligning its VRAM footprint with Wan2.1 and allowing more frames per training step.
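The FP8 weights can be pulled with the ModelScope CLI (modelscope is already in the dependency list); the local directory name below is arbitrary, and the downloaded checkpoint is then referenced in --model_paths / ModelConfig exactly like the BF16 weights above.
# Download the FP8-quantized Wan2.2 checkpoint from ModelScope
modelscope download --model muse/Wan2.2-I2V-A14B-FP8 --local_dir ./wan22_fp8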
Results and Evaluation
Training steps are computed as frames × samples × repeats. For Wan2.1 LoRA: 60 frames × 50 samples × 10 repeats = 30 000 steps.
Quantitative metrics:
PSNR (higher is better; ≥32 dB indicates high quality).
SSIM (0–1 range; ≥0.85 indicates good structural similarity).
import cv2, numpy as np
def calculate_psnr(ref_path, gen_path):
cap_ref = cv2.VideoCapture(ref_path)
cap_gen = cv2.VideoCapture(gen_path)
psnr_list = []
while True:
ret_ref, frame_ref = cap_ref.read()
ret_gen, frame_gen = cap_gen.read()
if not ret_ref or not ret_gen:
break
if frame_ref.shape != frame_gen.shape:
frame_gen = cv2.resize(frame_gen, (frame_ref.shape[1], frame_ref.shape[0]))
        mse = np.mean((frame_ref.astype(np.float64) - frame_gen.astype(np.float64)) ** 2)  # cast to float to avoid uint8 wrap-around
psnr = float('inf') if mse == 0 else 20 * np.log10(255.0 / np.sqrt(mse))
psnr_list.append(psnr)
return np.mean(psnr_list)
psnr_value = calculate_psnr("reference_video.mp4", "generated_video.mp4")
print(f"PSNR: {psnr_value:.2f} dB")
from skimage.metrics import structural_similarity as ssim
import cv2, numpy as np
def calculate_ssim(ref_path, gen_path):
cap_ref = cv2.VideoCapture(ref_path)
cap_gen = cv2.VideoCapture(gen_path)
ssim_list = []
while True:
ret_ref, frame_ref = cap_ref.read()
ret_gen, frame_gen = cap_gen.read()
if not ret_ref or not ret_gen:
break
if frame_ref.shape != frame_gen.shape:
frame_gen = cv2.resize(frame_gen, (frame_ref.shape[1], frame_ref.shape[0]))
gray_ref = cv2.cvtColor(frame_ref, cv2.COLOR_BGR2GRAY)
gray_gen = cv2.cvtColor(frame_gen, cv2.COLOR_BGR2GRAY)
score, _ = ssim(gray_ref, gray_gen, full=True)
ssim_list.append(score)
return np.mean(ssim_list)
ssim_value = calculate_ssim("reference_video.mp4", "generated_video.mp4")
print(f"SSIM: {ssim_value:.4f}")
Subjective evaluation (1-5 score per dimension) covered dynamic performance, camera control, frame quality, and target accuracy. The final weighted score is:
total = (subjective/20)*0.5 + (PSNR/50)*0.25 + (SSIM/1)*0.25
Here the subjective score is out of 20 (four dimensions × 5 points each), PSNR is normalized by 50 dB, and SSIM is already in [0, 1]. Overall, Wan2.1-full achieved the highest visual quality but required substantially more compute. Considering training cost, inference speed, and VRAM usage, the LoRA-fine-tuned Wan2.1 model was selected as the production-ready solution.
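For concreteness, the weighting can be reproduced in a few lines; the numbers below are illustrative, not the project's actual measurements.
def total_score(subjective, psnr, ssim):
    # subjective: sum of four 1-5 ratings (max 20); psnr in dB; ssim in [0, 1]
    return (subjective / 20) * 0.5 + (psnr / 50) * 0.25 + (ssim / 1.0) * 0.25

print(total_score(subjective=16, psnr=33.0, ssim=0.88))  # ≈ 0.785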
Conclusion
The pipeline demonstrates a systematic approach to customizing large‑scale diffusion video models for a niche "real‑person to anime fairy" task. By combining model selection, tailored dataset creation, LoRA/full‑parameter fine‑tuning, and acceleration techniques (Sage Attention, TeaCache, XDiT, gradient‑checkpointing offload, tiled VAE, quantization), a production‑grade solution was achieved that runs on consumer‑level GPUs while meeting quality and latency requirements.