How Alibaba Cloud’s Open‑Source Wan 2.1 Sets New Benchmarks in Video Generation

Alibaba Cloud’s newly open‑sourced visual generation model Wan 2.1 achieves a VBench score of 86.22%, outperforms leading models, runs on consumer‑grade GPUs with only 8.2 GB VRAM, and supports multi‑task video creation, marking a significant step for open‑source video AI.


Release Overview

On February 25, 2025, Alibaba Cloud released the open‑source visual generation foundation model Wan 2.1. The source code, model weights and inference scripts are hosted at https://github.com/Wan-Video/Wan2.1.

Model Scale and Benchmark

Wan 2.1 is offered in a professional version with 14 billion parameters. On the VBench benchmark it achieves an overall score of 86.22%, outperforming Sora, Luma and Pika across motion quality, visual fidelity, style consistency and multi‑objective handling.

Hardware Requirements

The 1.3‑billion‑parameter variant can generate 480p video using no more than 8.2 GB of VRAM, allowing execution on consumer‑grade GPUs. On an NVIDIA RTX 4090, a 5‑second 480p clip is produced in approximately 4 minutes without quantization or additional optimizations.
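A rough back‑of‑envelope calculation shows why the 1.3B variant fits comfortably in that budget. The byte‑per‑parameter figure below is an assumption (fp16/bf16 weight storage), not an official breakdown:

```python
# Back-of-envelope VRAM estimate for the 1.3B-parameter variant
# (illustrative assumptions, not official numbers).

def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory for model weights alone, assuming fp16/bf16 storage."""
    return num_params * bytes_per_param / 1024**3

weights = weight_memory_gb(1.3e9)
print(f"weights: {weights:.1f} GB")  # ~2.4 GB in fp16
```

The remaining headroom within the reported 8.2 GB peak would cover latents, activations, and the text‑encoder and VAE components.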

Supported Tasks

Wan 2.1 provides a unified interface for:

Text‑to‑video

Image‑to‑video

Video editing (in‑place modification)

Text‑to‑image

Video‑to‑audio generation
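A unified multi‑task interface of this kind is commonly implemented as a task registry that dispatches to per‑task pipelines. The sketch below illustrates the pattern only; the task names, function signatures, and return values are assumptions, not the repository's actual API:

```python
# Hypothetical sketch of a unified multi-task entry point. Task names
# ("t2v", "i2v") and signatures are illustrative assumptions.
from typing import Callable, Dict

TASKS: Dict[str, Callable[..., str]] = {}

def register(name: str):
    """Decorator that records a task handler under a short task name."""
    def deco(fn):
        TASKS[name] = fn
        return fn
    return deco

@register("t2v")
def text_to_video(prompt: str) -> str:
    return f"video generated from text: {prompt!r}"

@register("i2v")
def image_to_video(image_path: str, prompt: str = "") -> str:
    return f"video generated from image: {image_path}"

def generate(task: str, **kwargs) -> str:
    """Single entry point: route a request to the registered task pipeline."""
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}; choose from {sorted(TASKS)}")
    return TASKS[task](**kwargs)

print(generate("t2v", prompt="a cat surfing"))
```

The benefit of this layout is that new tasks (e.g. video editing or video‑to‑audio) can be added by registering another handler without changing the caller‑facing interface.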

Architecture

The model builds on the DiT (Diffusion Transformer) backbone and adopts a linear‑noise‑trajectory Flow Matching training paradigm. A causal 3D VAE encodes video frames into a latent space, and a feature‑cache mechanism enables arbitrary‑length video encoding and decoding.
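The linear‑noise‑trajectory Flow Matching objective can be illustrated in a few lines: sample a time t, interpolate between a noise sample and a data sample along a straight line, and regress the model toward the trajectory's constant velocity. The toy code below shows only the target construction; the neural network that would be trained on it is omitted:

```python
# Toy illustration of Flow Matching with a linear noise trajectory
# (rectified-flow form). A network v_theta(x_t, t) would be trained with
# MSE against v_target; here we only build the training targets.
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_targets(x0: np.ndarray, x1: np.ndarray, t: np.ndarray):
    """Linear interpolation x_t = (1 - t) * x0 + t * x1 and its velocity."""
    t = t.reshape(-1, 1)
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0  # constant along the linear trajectory
    return x_t, v_target

x0 = rng.standard_normal((4, 8))  # noise samples
x1 = rng.standard_normal((4, 8))  # stand-in for data latents
t = rng.uniform(size=4)

x_t, v = flow_matching_targets(x0, x1, t)
print(np.allclose(x0 + v, x1))  # True: following v for unit time reaches the data
```

In Wan 2.1's case the `x1` samples would be latents produced by the causal 3D VAE rather than raw frames, which is what keeps arbitrary‑length video tractable.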

Training Pipeline

Data preparation follows a four‑step cleaning pipeline: (1) collection of large‑scale image and video datasets, (2) duplicate detection, (3) quality filtering using automated metrics, and (4) final curation for diversity. Training employs distributed strategies to reduce memory consumption and increase throughput, including Fully Sharded Data Parallel (FSDP) and the RingAttention and Ulysses sequence‑parallel attention strategies.
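The four cleaning steps can be sketched on toy records. Here short captions with a scalar quality score stand in for image/video metadata; the hashing scheme, threshold, and diversity heuristic are illustrative assumptions, not the paper's actual filters:

```python
# Sketch of the four-step cleaning pipeline on toy records. Step 1
# (collection) is represented by the `raw` list; the quality metric and
# diversity rule are simplified stand-ins.
import hashlib

raw = [
    {"id": 1, "caption": "a dog running in a park", "quality": 0.9},
    {"id": 2, "caption": "a dog running in a park", "quality": 0.9},  # duplicate
    {"id": 3, "caption": "blurry frame", "quality": 0.2},             # low quality
    {"id": 4, "caption": "sunset over the ocean", "quality": 0.8},
]

# Step 2: duplicate detection via content hashing.
seen, deduped = set(), []
for rec in raw:
    h = hashlib.sha256(rec["caption"].encode()).hexdigest()
    if h not in seen:
        seen.add(h)
        deduped.append(rec)

# Step 3: quality filtering with an automated score threshold.
filtered = [r for r in deduped if r["quality"] >= 0.5]

# Step 4: final curation for diversity (toy rule: distinct leading words).
curated, leads = [], set()
for r in filtered:
    lead = r["caption"].split()[0]
    if lead not in leads:
        leads.add(lead)
        curated.append(r)

print([r["id"] for r in curated])  # [1, 4]
```

At production scale the same structure holds, but each step is replaced by heavier machinery: perceptual hashing or embedding similarity for deduplication, and learned aesthetic/motion scorers for quality filtering.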

Open‑Source Impact

By releasing the code and weights, Alibaba Cloud invites researchers and developers to reproduce, extend, and integrate Wan 2.1 into downstream applications, facilitating further research on multimodal video generation.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: computer vision, video generation, benchmark, Alibaba Cloud
Written by

AI Product Manager Community

A cutting‑edge think tank for AI product innovators, focusing on AI technology, product design, and business insights. It offers deep analysis of industry trends, dissects AI product design cases, and uncovers market potential and business models.
