How Open‑Sora 1.0 Replicates Sora: Architecture, Training Pipeline & Performance Insights

This article provides a comprehensive technical walkthrough of Open‑Sora 1.0, covering its Diffusion‑Transformer architecture, three‑stage training strategy, data‑preprocessing scripts, generation quality, and the Colossal‑AI acceleration that together make Sora‑style video synthesis openly reproducible.


Introduction

Open‑Sora 1.0 (https://github.com/hpcaitech/Open-Sora) is an open‑source project that aims to reproduce OpenAI’s Sora text‑to‑video generation pipeline. The repository provides the full training code, data‑processing scripts, pretrained weights, and a step‑by‑step tutorial for reproducing high‑quality video synthesis.

Model Architecture

The core model is a Spatial‑Temporal Diffusion Transformer (STDiT) built on the Diffusion Transformer (DiT) architecture. It reuses the PixArt‑α image‑generation backbone, adds a temporal‑attention layer, and integrates a pretrained variational auto‑encoder (VAE) and a text encoder (T5). Each STDiT layer applies 2‑D spatial attention, followed by 1‑D temporal attention, and finally a cross‑attention module that aligns video features with text embeddings. This serial attention design reduces computational cost compared with full‑attention models while leveraging pretrained image DiT weights.
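To make the serial attention order concrete, here is a minimal PyTorch sketch of one STDiT‑style block. The class name, tensor layout, and use of `nn.MultiheadAttention` are illustrative assumptions for exposition, not the repository’s exact code:

```python
import torch
import torch.nn as nn

class STDiTBlockSketch(nn.Module):
    """Sketch of one block: spatial attn -> temporal attn -> text cross-attn."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # x: (B, T, S, D) latent video tokens (T frames, S patches per frame);
        # text: (B, L, D) T5 text embeddings.
        B, T, S, D = x.shape
        # 2-D spatial attention: patches within each frame attend to each other.
        xs = x.reshape(B * T, S, D)
        xs = xs + self.spatial_attn(xs, xs, xs)[0]
        # 1-D temporal attention: each spatial location attends across frames.
        xt = xs.reshape(B, T, S, D).permute(0, 2, 1, 3).reshape(B * S, T, D)
        xt = xt + self.temporal_attn(xt, xt, xt)[0]
        # Cross-attention aligns all video tokens with the text embedding.
        xc = xt.reshape(B, S, T, D).permute(0, 2, 1, 3).reshape(B, T * S, D)
        xc = xc + self.cross_attn(xc, text, text)[0]
        # Feed-forward with residual, restoring the (B, T, S, D) layout.
        return (xc + self.mlp(xc)).reshape(B, T, S, D)
```

Splitting full 3‑D attention into a spatial pass over S tokens and a temporal pass over T tokens is what drives the savings: attention cost drops from O((S·T)²) for one joint pass to roughly O(T·S² + S·T²) for the two serial passes.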

[Figure: STDiT architecture diagram]

Training and Inference Flow

During training, video frames are encoded by the pretrained VAE into latent representations. Text prompts are encoded by the T5 encoder. The latent frames and text embeddings are fed to the STDiT diffusion model, which learns to denoise latent video sequences. In inference, Gaussian noise is sampled in the latent space, denoised by STDiT conditioned on a prompt, and the resulting latent video is decoded by the VAE to obtain the final video.
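In pseudocode terms, sampling follows the usual latent‑diffusion loop. The names below (`t5`, `stdit`, `vae`, `scheduler`) are stand‑ins for the actual components, and the scheduler interface is a simplifying assumption:

```python
import torch

@torch.no_grad()
def generate_video(prompt, t5, stdit, vae, scheduler, latent_shape):
    """Hypothetical sketch of the inference flow described above."""
    text_emb = t5.encode(prompt)      # prompt -> T5 text embeddings
    z = torch.randn(latent_shape)     # Gaussian noise in the VAE latent space
    for t in scheduler.timesteps:     # iterative reverse-diffusion denoising
        eps = stdit(z, t, text_emb)   # predict noise, conditioned on the text
        z = scheduler.step(eps, t, z) # one denoising update of the latents
    return vae.decode(z)              # decode latents into pixel-space frames
```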

[Figure: Model training pipeline]

Three‑Stage Training Scheme

The training follows the Stable Video Diffusion (SVD) blueprint and consists of three sequential stages:

1. Large‑scale image pretraining: Train a high‑quality text‑to‑image model (PixArt‑α) to obtain strong visual priors.

2. Large‑scale video pretraining: Initialize STDiT with the image weights, add temporal attention, and train on 256×256 video clips with a T5 text encoder. This stage accelerates convergence and teaches the model spatio‑temporal dynamics (a minimal initialization sketch follows this list).

3. High‑resolution video fine‑tuning: Fine‑tune the model on a smaller, higher‑resolution dataset to improve fidelity and support longer videos.
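One detail worth making concrete for stage 2: so that the newly added temporal layers do not disturb the pretrained image weights at initialization, video DiT variants in this family commonly zero‑initialize the temporal attention’s output projection, so each block starts out behaving exactly like the image‑only model. A minimal sketch, assuming a standard PyTorch attention module (the repository’s actual initialization code may differ):

```python
import torch.nn as nn

def make_temporal_attn(dim: int, num_heads: int) -> nn.MultiheadAttention:
    """Temporal attention whose output projection starts at zero.

    Used with a residual connection (x + attn(x, x, x)[0]), a zero output
    projection makes the new layer an identity map at initialization, so
    the pretrained image weights are preserved. Illustrative sketch only.
    """
    attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
    nn.init.zeros_(attn.out_proj.weight)
    nn.init.zeros_(attn.out_proj.bias)
    return attn
```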

The video‑pretraining stage consumed 2,808 GPU‑hours on 64 NVIDIA H800 GPUs, roughly 44 hours of wall‑clock time, at a cost of about US$7,000. Fine‑tuning required another 1,920 GPU‑hours (≈ US$4,500), keeping the total cost around US$10,000.

[Figure: Three‑stage training pipeline]

Data Preprocessing Utilities

The repository includes scripts that automate the following steps:

Downloading public video datasets.

Splitting long videos into short clips at shot boundaries (a splitting sketch follows this list).

Generating detailed captions for each clip using the open‑source LLaVA model. The captioning pipeline runs in ~3 seconds per video on two GPUs and produces video‑text pairs of quality comparable to GPT‑4V.
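As an illustration of the clip‑splitting step, shot‑boundary detection can be built on PySceneDetect; the snippet below is a sketch under that assumption, and the repository’s own scripts may use different tools or thresholds:

```python
# Sketch of shot-based clip splitting (pip install scenedetect[opencv]);
# Open-Sora's own preprocessing scripts may use different tooling.
from scenedetect import detect, ContentDetector, split_video_ffmpeg

def split_into_clips(video_path: str, threshold: float = 27.0):
    # Detect cuts where frame-to-frame content change exceeds the threshold.
    scenes = detect(video_path, ContentDetector(threshold=threshold))
    # Write one clip per detected shot using ffmpeg.
    split_video_ffmpeg(video_path, scenes)
    return scenes  # list of (start, end) timecode pairs
```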

[Figure: Automated video‑text pair generation]

Model Generation Examples

Typical prompts and the corresponding video outputs demonstrate the model’s ability to generate aerial scenes, waterfalls, underwater footage, and cosmic timelapses. (Images omitted for brevity.)

Efficiency Boost from Colossal‑AI

Colossal‑AI provides an acceleration layer that applies operator‑level optimizations and hybrid parallelism. On 64‑frame, 512×512 videos, training runs 1.55× faster than the baseline, and a single 8×H800 server can train on minute‑long 1080p videos without hitting memory bottlenecks. The serial spatial‑temporal attention in STDiT also yields up to a 5× speedup over full‑attention DiT as the frame count grows.
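For a sense of how a training script hands a model to Colossal‑AI, the Booster API looks roughly like the sketch below. The plugin choice and its arguments are illustrative assumptions, not Open‑Sora’s exact configuration:

```python
import torch
import torch.nn as nn
import colossalai
from colossalai.booster import Booster
from colossalai.booster.plugin import LowLevelZeroPlugin

# Launch distributed training from torchrun-provided environment variables.
colossalai.launch_from_torch(config={})

model = nn.Linear(1024, 1024)  # stand-in for the STDiT model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# ZeRO-style plugin with bf16 mixed precision (illustrative settings only).
plugin = LowLevelZeroPlugin(stage=2, precision="bf16")
booster = Booster(plugin=plugin)
model, optimizer, *_ = booster.boost(model, optimizer)
```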

[Figure: Training acceleration illustration]

Limitations and Future Directions

The current model was trained on ~400 K video clips, which limits generation fidelity and text alignment compared with the original Sora. Observed issues include occasional anatomical errors (e.g., extra limbs) and poor performance on human faces and complex scenes. Planned improvements include scaling up the training data, improving quality at higher resolutions, and adding multi‑resolution support for broader industrial applications.

Resources

All code, pretrained weights, and documentation are available at https://github.com/hpcaitech/Open-Sora. The project is open for contributions and ongoing maintenance.


References

[1] Scalable Diffusion Models with Transformers. https://arxiv.org/abs/2212.09748
[2] PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis. https://arxiv.org/abs/2310.00426
[3] Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets. https://arxiv.org/abs/2311.15127
[4] Latte: Latent Diffusion Transformer for Video Generation. https://arxiv.org/abs/2401.03048
[5] Stable Diffusion VAE (sd-vae-ft-mse). https://huggingface.co/stabilityai/sd-vae-ft-mse-original
[6] T5: Text-To-Text Transfer Transformer. https://github.com/google-research/text-to-text-transfer-transformer
[7] LLaVA. https://github.com/haotian-liu/LLaVA
[8] Open-Sora v1.0 blog post. https://hpc-ai.com/blog/open-sora-v1.0