Unveiling Sora: How OpenAI Might Build Its Groundbreaking Text‑to‑Video Model

This article provides a detailed, step‑by‑step technical analysis of OpenAI's Sora text‑to‑video system, exploring its overall architecture, visual encoder‑decoder choices, Spacetime Latent Patch design, transformer‑based diffusion model, training strategies, and long‑time consistency mechanisms while referencing relevant research papers and open‑source techniques.

Baobao Algorithm Notes

Mar 22, 2024

Key Messages

Sora likely uses a temporally consistent transformer (TECO)‑based visual encoder‑decoder rather than MAGVIT‑v2.

The model adopts a "Spacetime Latent Patch" approach, combining space and time latents into 2×2 patches and using NaViT to support variable resolutions and aspect ratios.

Sora employs a latent diffusion model (LDM) with transformer‑based Video DiTs, replacing the traditional U‑Net backbone.

Long‑time consistency may be achieved either by brute force, extending time attention across all previous frames, or through sparser long‑range attention schemes such as Flexible Diffusion Modeling (FDM).

Overall Structure Inference

In this reconstruction, Sora takes a short user prompt, expands it into a detailed caption with a large language model (e.g., GPT), and encodes that caption with a CLIP‑style text encoder. The resulting text embedding conditions a diffusion process that generates the video frames sequentially.

Visual Encoder‑Decoder

The encoder‑decoder is most plausibly a VAE variant. A continuous‑latent VAE is favored because diffusion models work best with continuous representations, whereas a discrete VAE (VQ‑VAE or VQ‑GAN) would introduce unnecessary quantization loss. TECO, a temporally consistent transformer VAE, matches Sora's need for long‑range temporal information and aligns with OpenAI's preference for transformer‑centric architectures.
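The quantization-loss argument can be illustrated with a toy example. This is only a sketch under stated assumptions: the latents are 1-D random numbers rather than real VAE outputs, and the four-entry codebook is hypothetical. It shows that snapping values to a codebook discards information that a continuous latent keeps.

```python
import random

def quantize(z, codebook):
    """Snap each latent value to its nearest codebook entry (VQ-style)."""
    return [min(codebook, key=lambda c: abs(c - v)) for v in z]

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
z = [random.gauss(0.0, 1.0) for _ in range(1000)]  # stand-in continuous latents
codebook = [-1.5, -0.5, 0.5, 1.5]                  # hypothetical 4-entry codebook

z_q = quantize(z, codebook)
print("continuous latent error:", mse(z, z))    # zero: nothing is discarded
print("quantized latent error: ", mse(z, z_q))  # positive: lost before decoding
```

A larger codebook shrinks the gap but never closes it, which is the trade-off the paragraph above refers to.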

Spacetime Latent Patch

After VAE encoding, each frame’s space latent and time latent are merged into a single patch matrix with a 2×2 patch size. NaViT is likely used to handle arbitrary resolutions and aspect ratios: the patches are flattened into a linear sequence, and each patch receives a learned 3‑D position embedding (X, Y, Z).
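The patching step can be sketched in a few lines. The code below is an illustration, not Sora's implementation: `patchify` and `make_latent` are hypothetical helpers, channels are omitted, and the (x, y, z) tuples are only the indices that a learned position-embedding table would be looked up with.

```python
def patchify(latent, patch=2):
    """Split a latent video [T][H][W] into flattened patch tokens.

    Returns a list of (values, (x, y, z)) pairs: the flattened patch
    values plus the patch's column, row, and frame index.
    """
    tokens = []
    for z, frame in enumerate(latent):
        height, width = len(frame), len(frame[0])
        assert height % patch == 0 and width % patch == 0
        for y in range(height // patch):
            for x in range(width // patch):
                values = [frame[y * patch + dy][x * patch + dx]
                          for dy in range(patch) for dx in range(patch)]
                tokens.append((values, (x, y, z)))
    return tokens

def make_latent(frames, height, width):
    """Build a dummy latent whose values encode (frame, row, column)."""
    return [[[t * 100 + r * 10 + c for c in range(width)]
             for r in range(height)] for t in range(frames)]

wide = patchify(make_latent(2, 4, 6))  # 2 frames of 4x6 -> 2 * 2 * 3 = 12 tokens
tall = patchify(make_latent(3, 6, 2))  # 3 frames of 6x2 -> 3 * 3 * 1 = 9 tokens
print(len(wide), len(tall))
```

Because every input becomes a flat token list of whatever length its shape implies, videos with different resolutions, aspect ratios, and durations can share one sequence model, which is the property attributed to NaViT here.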

Transformer Diffusion Model (Video DiTs)

Sora replaces the conventional U‑Net diffusion backbone with a transformer‑based Video DiT. The transformer contains three sub‑modules: Local Spatial Attention (masked so patches from different frames cannot attend to each other), Causal Time Attention (allowing each frame to attend only to past frames), and an MLP for non‑linear mixing. Conditioning information (text embedding and diffusion timestep) is concatenated to the patch tokens.
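The two attention patterns described above can be written down as boolean masks. This is a minimal sketch with assumptions the article does not settle: tokens are laid out frame by frame, the time mask is expressed at frame level, and a frame is allowed to attend to itself as well as to earlier frames.

```python
def spatial_mask(num_frames, tokens_per_frame):
    """mask[i][j] is True when token i may attend to token j.

    Tokens are assumed to be ordered frame by frame, so two tokens
    share a frame when their indices fall in the same block.
    """
    n = num_frames * tokens_per_frame
    return [[(i // tokens_per_frame) == (j // tokens_per_frame)
             for j in range(n)] for i in range(n)]

def causal_time_mask(num_frames):
    """mask[i][j] is True when frame i may attend to frame j (j <= i)."""
    return [[j <= i for j in range(num_frames)] for i in range(num_frames)]

def show(mask):
    """Render a mask as rows of 1s (allowed) and 0s (blocked)."""
    return "\n".join("".join("1" if ok else "0" for ok in row) for row in mask)

print(show(spatial_mask(num_frames=2, tokens_per_frame=2)))
print()
print(show(causal_time_mask(num_frames=3)))
```

The spatial mask is block-diagonal and the time mask is lower-triangular. In a real attention layer, blocked positions would have their scores set to negative infinity before the softmax.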

Training Procedure

Training proceeds in two stages. First, a VAE is trained in a self‑supervised manner on large image and video datasets to learn the encoder‑decoder. Second, the diffusion transformer is trained on paired <text, video> data while the VAE and the CLIP text encoder stay frozen. Sora likely uses massive synthetic video‑caption data generated by a video‑caption model (VCM), similar to DALL·E 3's image‑caption pipeline, to obtain high‑quality supervision.
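The second stage can be sketched with the standard DDPM noise-prediction objective. That Sora uses exactly this objective is an assumption; the latent vector, the `alpha_bar` value, and the zero "prediction" below are placeholders for a frozen VAE output, a sampled noise level, and an untrained model.

```python
import math
import random

def add_noise(x0, alpha_bar, rng):
    """DDPM forward process: x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
          for x, e in zip(x0, eps)]
    return xt, eps

def noise_prediction_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return sum((p - e) ** 2 for p, e in zip(eps_pred, eps)) / len(eps)

rng = random.Random(0)
latent = [0.3, -1.2, 0.8, 0.1]  # stand-in for a frozen VAE's latent of one clip
alpha_bar = 0.5                 # hypothetical cumulative noise level at timestep t
noisy, eps = add_noise(latent, alpha_bar, rng)

# A diffusion transformer would predict eps from (noisy, timestep, text embedding).
# A zero prediction stands in for an untrained model here.
loss = noise_prediction_loss([0.0] * len(eps), eps)
print(loss)
```

Only the diffusion transformer's parameters would receive gradients from this loss; the VAE that produced `latent` and the text encoder that produced the conditioning are not updated.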

Long‑Time Consistency Strategies

Two plausible strategies are discussed: (1) a brute‑force approach where each frame’s time attention sees all previous frames, and (2) Flexible Diffusion Modeling (FDM) which introduces either random long‑range attention tokens or a hierarchical scheme that first predicts key future frames and then refines intermediate frames.
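The hierarchical variant of the second strategy can be sketched as a two-pass generation order. The function name and the `stride` parameter are hypothetical, and the actual FDM sampling schemes are more varied than this; the sketch only shows the "key frames first, then fill in" idea.

```python
def hierarchical_order(num_frames, stride):
    """Return (key_frames, in_between) frame indices.

    Key frames are spaced `stride` apart and always include the last
    frame; the remaining frames are filled in during a second pass.
    """
    key_frames = list(range(0, num_frames, stride))
    if key_frames[-1] != num_frames - 1:
        key_frames.append(num_frames - 1)
    in_between = [i for i in range(num_frames) if i not in key_frames]
    return key_frames, in_between

keys, fill = hierarchical_order(num_frames=9, stride=4)
print(keys)  # [0, 4, 8]
print(fill)  # [1, 2, 3, 5, 6, 7]
```

In the second pass, each in-between frame can be conditioned on the key frames on both sides of it, so distant frames stay consistent without every frame attending to every other frame.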

Bidirectional Generation

Sora supports flexible generation modes (e.g., start‑from‑image, infinite loop, reverse generation) by inserting known frames into the diffusion sequence using binary masks. This mask‑based conditioning allows the model to generate forward and backward from a fixed anchor, improving temporal coherence.
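Mask-based conditioning can be illustrated as follows. This is a sketch of one common approach from diffusion inpainting, where known positions are overwritten after each denoising step; whether Sora works this way is not stated in the source, and both helper functions are hypothetical.

```python
def frame_mask(num_frames, known_indices):
    """Binary mask over frames: 1 marks a known (fixed) frame, 0 a generated one."""
    return [1 if i in known_indices else 0 for i in range(num_frames)]

def apply_known_frames(generated, known, mask):
    """Keep the known frame where mask is 1 and the generated frame where it is 0."""
    return [k if m == 1 else g for g, k, m in zip(generated, known, mask)]

print(frame_mask(5, {0}))     # start from an image:      [1, 0, 0, 0, 0]
print(frame_mask(5, {4}))     # generate backward to it:  [0, 0, 0, 0, 1]
print(frame_mask(5, {0, 4}))  # same frame at both ends:  [1, 0, 0, 0, 1]

generated = ["g0", "g1", "g2", "g3", "g4"]  # frames from the current denoising step
known = [None, None, "anchor", None, None]  # a single fixed anchor frame
print(apply_known_frames(generated, known, frame_mask(5, {2})))
```

With the anchor in the middle of the sequence, the frames on either side are generated around it, which corresponds to the forward-and-backward generation described above.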

Conclusion

By combining a TECO‑style continuous VAE, a 2×2 Spacetime Latent Patch with NaViT‑based positional encoding, and a transformer‑driven latent diffusion model, Sora could generate high‑quality, variable‑resolution videos of up to 60 seconds while maintaining long‑range temporal consistency. The analysis highlights the trade‑offs between computational cost and quality, and points to open research directions such as more efficient long‑range attention and better synthetic data pipelines.
