
Engineering Practices and Optimizations for Text‑to‑Video Generation Models (OpenSora, CogVideoX) on Bilibili TTV Team

The Bilibili TTV team optimized the OpenSora and CogVideoX text‑to‑video models by redesigning data storage with Alluxio, parallelizing VAE encoding, applying dynamic sequence parallelism and DeepSpeed‑Ulysses attention, adapting GPU code for NPU execution, and leveraging profiling‑driven kernel fusion, FlashAttention, and expandable memory segments. Together these changes substantially increased training efficiency and the trainable frame count; pipeline‑parallel and ZeRO‑3 scaling are planned next.

Bilibili Tech

1. Introduction

In recent years, the rapid development of AI‑generated content (AIGC) has attracted great attention from both academia and industry. OpenAI released the large language model GPT‑4 in early 2023 and the text‑to‑video (T2V) model Sora in early 2024, demonstrating impressive world‑simulation capabilities.

Bilibili, as a UGC‑rich video platform, possesses abundant data and diverse application scenarios for video generation. Leveraging prior experience in LLM training, the Bilibili TTV team explored several text‑to‑video models, focusing on Open‑Sora (released by Colossal‑AI) and CogVideoX (released by Zhipu AI). This document summarizes the team's experiences and insights.

2. TTV Models

2.1 OpenSora – OpenSora adopts a Spatial‑Temporal Diffusion Transformer (STDiT) architecture that combines a Diffusion Transformer (DiT) with cross‑attention to align textual semantics with video frames. Key features include:

Fusion of spatial and temporal attention modules, enabling simultaneous capture of intra‑frame spatial features and inter‑frame temporal relationships.

Cross‑attention module that tightly couples text embeddings with video generation.

Reduced computational resource demand compared with full‑attention designs.

Effective transfer of pre‑trained image DiT weights to video tasks.

The model incorporates a pre‑trained VAE video encoder and a T5 text encoder. Training first compresses video frames with the VAE encoder, embeds the corresponding text with the T5 encoder, and then trains the STDiT diffusion model in the latent space.
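This latent‑space training step can be sketched as follows, with toy stand‑ins for the real pre‑trained VAE, T5, and STDiT modules (all module names and shapes here are illustrative, not the actual OpenSora code):

```python
import torch
import torch.nn as nn

class ToyVAEEncoder(nn.Module):
    """Stand-in for the pre-trained VAE: compresses frames into a latent space."""
    def __init__(self, in_dim=64, latent_dim=8):
        super().__init__()
        self.proj = nn.Linear(in_dim, latent_dim)

    def forward(self, frames):          # (batch, frames, in_dim)
        return self.proj(frames)        # (batch, frames, latent_dim)

class ToyTextEncoder(nn.Module):
    """Stand-in for the T5 text encoder."""
    def __init__(self, vocab=100, dim=8):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)

    def forward(self, tokens):          # (batch, seq)
        return self.emb(tokens)         # (batch, seq, dim)

vae, t5 = ToyVAEEncoder(), ToyTextEncoder()
denoiser = nn.Linear(8, 8)              # stand-in for the STDiT diffusion model

frames = torch.randn(2, 16, 64)         # 2 clips, 16 frames each
tokens = torch.randint(0, 100, (2, 5))  # paired caption tokens

with torch.no_grad():                   # the encoders stay frozen
    latents = vae(frames)
    text_emb = t5(tokens)

# Diffusion training step in latent space: add noise, predict the noise.
noise = torch.randn_like(latents)
pred = denoiser(latents + noise)        # text conditioning omitted for brevity
loss = nn.functional.mse_loss(pred, noise)
loss.backward()                         # only the denoiser receives gradients
```

Because the VAE and T5 are frozen, only the latent‑space denoiser is updated, which is what makes the offline‑encoding optimization in Section 3.2 possible.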

2.2 CogVideoX – CogVideoX combines a 3D Causal VAE with an expert Transformer module. Video and text embeddings are concatenated and processed with full 3‑D attention. To handle the heterogeneous feature spaces, the Transformer employs Expert Adaptive LayerNorm and inserts 3D‑RoPE for relative positional encoding.
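The joint‑sequence idea can be sketched in a few lines. This is a simplification: the real Expert Adaptive LayerNorm also modulates by the diffusion timestep, and 3D‑RoPE is omitted here entirely.

```python
import torch
import torch.nn as nn

dim, heads = 32, 4
video = torch.randn(2, 16, dim)   # video latent tokens
text = torch.randn(2, 5, dim)     # text embedding tokens

# "Expert" normalization: each modality gets its own LayerNorm before the
# shared attention pass, compensating for their different feature statistics.
norm_video, norm_text = nn.LayerNorm(dim), nn.LayerNorm(dim)
tokens = torch.cat([norm_text(text), norm_video(video)], dim=1)  # (2, 21, dim)

# Full attention over the concatenated text+video sequence.
attn = nn.MultiheadAttention(dim, heads, batch_first=True)
out, _ = attn(tokens, tokens, tokens)
print(out.shape)  # joint sequence length = 5 text + 16 video tokens
```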

3. Engineering Practices

3.1 Data Storage & Loading – Video slice files (~1 MB each) are inefficient to fetch individually from the traditional BOSS object storage. Two solutions were evaluated:

HDFS + file packing: combine many small slices into large chunk files and implement a custom dataset reader to unpack them during training.

HDFS → Alluxio → frontend: Alluxio acts as a distributed in‑memory file system, providing high‑throughput access to small files while keeping the training framework oblivious to the underlying storage.

The team ultimately adopted the Alluxio‑fused HDFS backend, which mimics local file‑system access and synchronizes data from HDFS to local disks.

3.2 Data Pre‑processing Optimization – VAE encoding consumes ~18 % of epoch time and a large amount of GPU memory. Two optimizations were applied:

Data parallelism for the encoding stage: a global communication group distributes the encoding workload across all GPUs, gathers results on rank 0, and stores the embeddings for subsequent training phases.

Offline VAE encoding: persist the encoded embeddings to HDFS/Alluxio, allowing later training runs to bypass the costly encoding step.

The workflow is illustrated in Figure 3‑1 (data read & preprocessing diagram).
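The encoding‑stage data parallelism can be sketched with `torch.distributed`, simulated here with a single‑process gloo group (the real setup spans all training GPUs, and the toy encoder stands in for the actual VAE):

```python
import torch
import torch.distributed as dist

# Single-process group for illustration; in production this is the global
# communication group across all GPUs.
dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29513",
                        rank=0, world_size=1)

def toy_vae_encode(clips):                     # stand-in for the real VAE
    return clips.mean(dim=-1, keepdim=True)

world, rank = dist.get_world_size(), dist.get_rank()
clips = torch.randn(8, 16, 64)                 # global batch of video clips
shard = clips.chunk(world)[rank]               # each rank encodes its slice
latents = toy_vae_encode(shard)

# Collect every rank's latents; rank 0 would persist them for later phases.
gathered = [torch.empty_like(latents) for _ in range(world)]
dist.all_gather(gathered, latents)
if rank == 0:
    all_latents = torch.cat(gathered)
    print(all_latents.shape)                   # (8, 16, 1)
dist.destroy_process_group()
```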

3.3 Model Parallelism Optimization – Long video sequences lead to quadratic memory growth in attention. Four industry sequence‑parallel techniques were examined:

Ring Attention – splits the sequence into chunks processed in a ring topology.

Megatron‑SP – partitions LayerNorm and Dropout along the sequence dimension.

DeepSpeed‑Ulysses – uses all‑to‑all communication to split queries, keys, and values while preserving the original attention structure.

Dynamic Sequence Parallel (DSP) – partitions multiple sequence dimensions and employs all‑to‑all exchanges only at dimension‑switch points.

Given STDiT's separate spatial and temporal attention dimensions, DSP was selected for OpenSora, while CogVideoX (a single‑sequence‑dimension transformer) used the DeepSpeed‑Ulysses approach. These strategies increased the maximum trainable frame count on the same 16‑GPU hardware from 45 frames to 221 frames.
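The layout change performed by the Ulysses all‑to‑all can be simulated with plain reshapes. Before attention, each of P ranks holds a sequence shard with all heads; after the exchange, each rank holds the full sequence for a subset of heads, so attention itself is unchanged. P, sequence length, and head counts below are illustrative:

```python
import torch

P, seq, heads, d = 4, 16, 8, 32

# Per-rank input before attention: (seq/P, heads, d). Stack all P ranks
# into one tensor so the global exchange can be expressed as reshapes.
shards = torch.randn(P, seq // P, heads, d)

# The all-to-all: split the head dimension into P groups, then route group p
# from every source rank to destination rank p.
x = shards.reshape(P, seq // P, P, heads // P, d)
after = x.permute(2, 0, 1, 3, 4).reshape(P, seq, heads // P, d)

print(after.shape)  # each "rank" now sees the full sequence, heads/P heads
```

A second all‑to‑all after attention restores the original sequence‑sharded layout, which is why Ulysses preserves the attention computation exactly.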

3.4 NPU Adaptation – The training stack combines GPU and NPU resources. Adaptation steps include:

Model adaptation: replace CUDA‑specific operators (e.g., LayerNorm → NPURmsNorm, Conv3d precision) with NPU‑compatible equivalents.

Framework migration: port Megatron‑core components to the Huawei Megatron‑NPU codebase.

Precision verification: align random seeds, data sampling, and VAE noise across GPU and NPU runs; use Huawei’s verification tools to compare operator inputs/outputs.

Observed loss differences (max ≈ 0.46, mean ≈ 0.007) confirm functional parity.
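The precision‑verification idea can be sketched without NPU hardware by comparing two precisions of the same model, standing in for the GPU and NPU runs: fix all random state, hook every layer's output, and compare per‑layer maximum absolute differences.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)                    # align seeds across the two "backends"
model = nn.Sequential(nn.Linear(8, 8), nn.GELU(), nn.Linear(8, 1))
model_hi = copy.deepcopy(model).double()  # higher-precision reference run
x = torch.randn(4, 8)                     # same data sampling for both runs

def capture(model, x):
    """Record every layer's output via forward hooks."""
    outs = []
    hooks = [m.register_forward_hook(
                 lambda m, i, o: outs.append(o.detach().double()))
             for m in model]
    model(x)
    for h in hooks:
        h.remove()
    return outs

diffs = [(a - b).abs().max().item()
         for a, b in zip(capture(model, x), capture(model_hi, x.double()))]
print(max(diffs) < 1e-5)  # True: per-layer drift stays within tolerance
```

Huawei's verification tools automate the same comparison at the operator level across real GPU and NPU executions.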

3.5 Profiling‑Driven Optimizations – Profiling revealed that backward passes inadvertently recomputed forward layers, consuming ~20 % of time. Optimizations included:

Selective checkpointing – limit recomputation to attention layers only.

Kernel fusion – replace fragmented GELU kernels with a fused implementation, reducing launch overhead.

Tensor contiguity – apply torch.contiguous before StridedSlice operations to improve memory access patterns.
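Selective checkpointing can be sketched with a toy transformer block: only the memory‑hungry attention sub‑module is recomputed in backward, while the MLP keeps its activations. The block below is illustrative, not the team's actual model code.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    def __init__(self, dim=32, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(),
                                 nn.Linear(dim * 4, dim))

    def forward(self, x):
        # Checkpoint only the attention sub-module: its activations are
        # discarded in forward and recomputed during backward.
        attn_out = checkpoint(lambda t: self.attn(t, t, t)[0], x,
                              use_reentrant=False)
        return x + attn_out + self.mlp(x)   # MLP activations are kept

x = torch.randn(2, 16, 32, requires_grad=True)
out = Block()(x).sum()
out.backward()
print(x.grad.shape)  # gradients flow through the checkpointed attention
```

The tensor‑contiguity fix, by contrast, is a one‑liner: call `.contiguous()` on the tensor immediately before the strided slice.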

3.6 FlashAttention on NPU – Standard attention suffers from high memory usage and NaN issues on NPU. FlashAttention replaces the full attention matrix with tiled Q‑K‑V blocks, employs recomputation, and fuses kernels, yielding lower memory consumption and higher throughput.
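The tiled online‑softmax idea at the core of FlashAttention can be demonstrated in a few lines. This is a single‑head, CPU‑only sketch: the real NPU kernels fuse these loops on‑device, but the math is the same, and the full (seq × seq) score matrix is never materialized.

```python
import torch

def tiled_attention(q, k, v, block=4):
    """Attention over K/V tiles with an online (running) softmax."""
    scale = q.shape[-1] ** -0.5
    m = torch.full(q.shape[:-1], float("-inf"))   # running max per query
    l = torch.zeros(q.shape[:-1])                 # running softmax denominator
    o = torch.zeros_like(q)                       # running (unnormalized) output
    for s in range(0, k.shape[0], block):
        scores = (q @ k[s:s+block].T) * scale     # only a (seq_q, block) tile
        m_new = torch.maximum(m, scores.max(dim=-1).values)
        p = torch.exp(scores - m_new[:, None])
        correction = torch.exp(m - m_new)         # rescale previous partials
        l = l * correction + p.sum(dim=-1)
        o = o * correction[:, None] + p @ v[s:s+block]
        m = m_new
    return o / l[:, None]

q, k, v = (torch.randn(8, 16) for _ in range(3))
ref = torch.softmax((q @ k.T) * 16 ** -0.5, dim=-1) @ v
print(torch.allclose(tiled_attention(q, k, v), ref, atol=1e-5))  # True
```

Keeping the running max `m` in the softmax is also what avoids the overflow that produces NaNs in the naive implementation.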

3.7 Virtual Memory Expansion – PyTorch’s default allocator can fragment memory when batch sizes vary. The NPU’s “memory‑pool expandable segment” dynamically grows a single memory segment, reducing fragmentation and preventing OOM errors in deep models.
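Assuming torch_npu's allocator configuration mirrors PyTorch's CUDA caching allocator, enabling the expandable‑segments mode is an environment‑variable switch set before launching training (the exact variable name should be checked against the installed torch_npu version; `train.py` is a placeholder):

```shell
# Enable the expandable memory-pool segment on NPU (torch_npu allocator).
export PYTORCH_NPU_ALLOC_CONF="expandable_segments:True"
# CUDA analogue for comparison:
# export PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True"
python train.py
```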

4. Future Directions

Planned work includes:

Pipeline Parallelism to further reduce per‑device memory footprints.

Overcoming GroupNorm limitations by splitting group dimensions or redesigning statistics aggregation.

Layer‑wise Zero‑3 (ZeRO‑3) to partition weights, gradients, and optimizer states across devices, enabling larger models.
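As a hedged sketch of the ZeRO‑3 plan, a minimal DeepSpeed configuration enabling stage‑3 partitioning could look like the following (all values are illustrative, not the team's actual settings):

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "zero_optimization": {
    "stage": 3,
    "stage3_param_persistence_threshold": 10000
  },
  "bf16": { "enabled": true }
}
```

Stage 3 partitions weights, gradients, and optimizer states across all data‑parallel ranks, gathering parameters layer by layer just before they are needed.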

References

https://arxiv.org/pdf/2205.14135

https://gitee.com/ascend/cann-ops-adv/blob/master/docs/FlashAttentionScore.md

https://gitee.com/ascend/cann-ops-adv/blob/master/docs/FlashAttentionScoreGrad.md

https://gitee.com/ascend/pytorch/blob/master/torch_npu/csrc/core/npu/NPUCachingAllocator.cpp#L240

https://arxiv.org/abs/2205.05198

https://arxiv.org/abs/2309.14509

https://arxiv.org/abs/2403.10266

https://www.alphaxiv.org/abs/2408.06072v1

Written by

Bilibili Tech

Provides introductions and tutorials on Bilibili-related technologies.
