Unraveling Sora: How OpenAI Might Build Its Text‑to‑Video Engine
This article gives a step-by-step technical analysis of OpenAI's Sora model. It examines the likely overall architecture, the video encoder-decoder design, the Spacetime Latent Patch mechanism, the transformer-based diffusion process, the training strategy, and long-term consistency techniques, grounding each hypothesis in publicly available reports and related research.
Inferred Architecture of Sora
Based on the OpenAI technical report, Sora can be described as a text-to-video latent diffusion system composed of four main modules (a minimal wiring sketch follows the list):
Prompt expansion and text encoding (CLIP‑style encoder).
Video VAE encoder‑decoder (continuous‑latent, likely based on the TECO model).
Spacetime Latent Patch that extracts 2×2 patches from the VAE latent map, flattens them for a transformer, and supports variable resolution and aspect ratio (NaViT‑style).
Transformer‑based diffusion backbone (Video DiT) that predicts noise for each patch conditioned on the text embedding and diffusion timestep.
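To make the data flow concrete, here is a minimal PyTorch wiring sketch of the four hypothesized modules. All class and method names are illustrative placeholders, not OpenAI's actual interfaces.

```python
import torch.nn as nn

class SoraLikePipeline(nn.Module):
    """Hypothetical composition of the four modules described above."""
    def __init__(self, text_encoder, video_vae, patchifier, video_dit):
        super().__init__()
        self.text_encoder = text_encoder   # CLIP-style prompt encoder
        self.video_vae = video_vae         # continuous-latent video VAE
        self.patchifier = patchifier       # Spacetime Latent Patch extractor
        self.video_dit = video_dit         # transformer diffusion backbone

    def denoise_step(self, noisy_latents, prompt_tokens, t):
        text_emb = self.text_encoder(prompt_tokens)    # (B, L, D)
        patches = self.patchifier(noisy_latents)       # (B, N, D)
        return self.video_dit(patches, text_emb, t)    # predicted noise
```

At sampling time this step is iterated over the diffusion timesteps, and the VAE decoder maps the final latents back to pixels.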
Visual Encoder‑Decoder (VAE)
The VAE is trained in a self-supervised manner on both images and videos. It uses a continuous latent space rather than a discrete VQ-VAE/VQGAN codebook, because diffusion operates more naturally on continuous latents. The encoder applies a causal 3-D convolution that processes each frame together with a sliding window of previous frames, producing a "Space Latent". A temporal transformer then aggregates long-range history into a "Time Latent". The two latents are summed to obtain the final latent representation for each frame.
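A minimal sketch of such an encoder, assuming a TECO-like layout; the window size, channel dimensions, and spatial pooling are illustrative choices, not confirmed details:

```python
import torch
import torch.nn as nn

class CausalVideoEncoder(nn.Module):
    """Hypothetical TECO-style encoder: a causal 3-D conv yields a per-frame
    "Space Latent"; a causal temporal transformer over frame history yields
    a "Time Latent"; the two are summed."""
    def __init__(self, in_ch=3, dim=256, window=4, n_layers=2):
        super().__init__()
        self.window = window
        # Causal in time: we pad only on the past side before this conv.
        self.space_conv = nn.Conv3d(in_ch, dim, kernel_size=(window, 3, 3),
                                    padding=(0, 1, 1))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.time_tf = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, video):                     # video: (B, C, T, H, W)
        x = nn.functional.pad(video, (0, 0, 0, 0, self.window - 1, 0))
        space = self.space_conv(x)                # (B, D, T, H, W)
        B, D, T, H, W = space.shape
        # Pool spatially, then run a causal transformer over the frame axis.
        tokens = space.mean(dim=(3, 4)).transpose(1, 2)        # (B, T, D)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        time = self.time_tf(tokens, mask=causal)               # (B, T, D)
        # Sum the Space Latent and the broadcast Time Latent per frame.
        return space + time.transpose(1, 2)[..., None, None]
```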
Spacetime Latent Patch
After VAE encoding, the latent map is partitioned into non-overlapping 2×2 patches. Each patch is linearly projected and assigned a three-dimensional position embedding (X, Y, time) so that the transformer can distinguish spatial and temporal locations even when the number of patches varies. This follows the NaViT approach: rather than resizing or padding every video to a fixed shape, the latent map is scanned with a fixed patch size and the resulting variable-length token sequences are packed together, with padding applied only at the batch level.
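A sketch of the patchify step under these assumptions; the latent/model dimensions and the use of learned per-axis embeddings are illustrative:

```python
import torch
import torch.nn as nn

class SpacetimeLatentPatch(nn.Module):
    """Illustrative patchifier: 2x2 non-overlapping patches from the VAE
    latent map, a linear projection, and learned (x, y, t) embeddings."""
    def __init__(self, latent_dim=256, model_dim=1024, max_xy=64, max_t=256):
        super().__init__()
        self.proj = nn.Linear(latent_dim * 4, model_dim)  # 2x2 patch -> token
        self.pos_x = nn.Embedding(max_xy, model_dim)
        self.pos_y = nn.Embedding(max_xy, model_dim)
        self.pos_t = nn.Embedding(max_t, model_dim)

    def forward(self, latents):                 # (B, D, T, H, W); H, W even
        B, D, T, H, W = latents.shape
        # Fold each 2x2 spatial block into the channel dim, then project.
        x = latents.reshape(B, D, T, H // 2, 2, W // 2, 2)
        x = x.permute(0, 2, 3, 5, 1, 4, 6).reshape(B, T, H // 2, W // 2, D * 4)
        tokens = self.proj(x)                   # (B, T, H/2, W/2, model_dim)
        ny, nx = H // 2, W // 2
        t_idx = torch.arange(T)[:, None, None]
        y_idx = torch.arange(ny)[None, :, None]
        x_idx = torch.arange(nx)[None, None, :]
        tokens = tokens + self.pos_t(t_idx) + self.pos_y(y_idx) + self.pos_x(x_idx)
        return tokens.reshape(B, T * ny * nx, -1)   # variable-length sequence
```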
Video DiT Diffusion Model
The diffusion backbone replaces the conventional U-Net with a transformer block that contains three sub-modules (sketched in code after the list):
Local Spatial Attention – attends only to patches belonging to the same frame.
Causal Time Attention – attends to all previous frames (or a subset, see below) to enforce temporal consistency.
MLP – non‑linear mixing of the combined spatial‑temporal features.
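A sketch of one such block, assuming standard pre-norm residual conventions; all dimensions are illustrative:

```python
import torch
import torch.nn as nn

class VideoDiTBlock(nn.Module):
    """Hypothetical block: spatial attention within a frame, causal
    attention across frames, then an MLP, each with a residual path."""
    def __init__(self, dim=1024, heads=16):
        super().__init__()
        self.n1 = nn.LayerNorm(dim)
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.n2 = nn.LayerNorm(dim)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.n3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):                    # x: (B, T, N, D) patch tokens
        B, T, N, D = x.shape
        # Local spatial attention: fold time into batch, attend within frames.
        s = x.reshape(B * T, N, D)
        h = self.n1(s)
        s = s + self.spatial(h, h, h, need_weights=False)[0]
        # Causal time attention: fold space into batch, mask future frames.
        t = s.reshape(B, T, N, D).permute(0, 2, 1, 3).reshape(B * N, T, D)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.n2(t)
        t = t + self.temporal(h, h, h, attn_mask=causal, need_weights=False)[0]
        # MLP mixing of the combined spatial-temporal features.
        t = t + self.mlp(self.n3(t))
        return t.reshape(B, N, T, D).permute(0, 2, 1, 3)  # back to (B, T, N, D)
```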
Conditioning is performed by concatenating the expanded text embedding and the sinusoidal diffusion‑step embedding to the patch sequence. A block‑diagonal 0/1 attention mask ensures that patches from different frames do not attend to each other while allowing full attention within each frame.
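A sketch of how such a mask could be built, assuming the conditioning tokens are prepended to the patch sequence; token counts are illustrative, and note that PyTorch's boolean attn_mask uses the opposite convention (True means "mask out"), so this matrix would be inverted before use:

```python
import torch

def frame_block_mask(num_frames, patches_per_frame, num_cond_tokens):
    """Illustrative 0/1 mask: True = attention allowed. Patches attend only
    within their own frame; all tokens may attend to the conditioning
    tokens (text + timestep embeddings) prepended to the sequence."""
    n = num_cond_tokens + num_frames * patches_per_frame
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:, :num_cond_tokens] = True      # everyone sees the conditioning
    mask[:num_cond_tokens, :] = True      # conditioning sees everything
    for f in range(num_frames):
        s = num_cond_tokens + f * patches_per_frame
        mask[s:s + patches_per_frame, s:s + patches_per_frame] = True
    return mask

# e.g. 4 frames x 9 patches + 2 conditioning tokens -> a (38, 38) mask
m = frame_block_mask(num_frames=4, patches_per_frame=9, num_cond_tokens=2)
```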
Long‑Time Consistency Strategies
Two families of strategies are plausible:
Brute-Force Attention: during generation of frame i, the model attends to all previous frames (1 … i−1). This maximizes consistency but scales quadratically with video length.
Flexible Diffusion Modeling (FDM):
Long‑Range Attention – in addition to recent frames, a small set of distant frames is sampled for attention.
Hierarchical (Key‑frame) Generation – a coarse temporal outline is generated first (key frames), then intermediate frames are filled in, reducing the effective attention horizon.
Both approaches can be combined with the TECO encoder’s long‑range temporal embeddings.
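A sketch of FDM-style context selection: always keep the most recent frames and sample a few distant ones, rather than attending to the whole history (the counts and uniform sampling are illustrative choices):

```python
import random

def fdm_context_frames(current, n_recent=4, n_distant=2, rng=random):
    """Illustrative FDM-style long-range attention: frame `current` attends
    to its recent neighbours plus a small random sample of distant frames."""
    recent = list(range(max(0, current - n_recent), current))
    pool = list(range(max(0, current - n_recent)))
    distant = sorted(rng.sample(pool, min(n_distant, len(pool))))
    return distant + recent

# Generating frame 40: two sampled distant frames plus frames 36-39.
print(fdm_context_frames(40))
```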
Training Procedure
Sora likely follows a two-stage training pipeline; a sketch of the stage-2 objective appears after the list:
VAE pre-training: self-supervised reconstruction of images and videos to learn the encoder-decoder. The VAE is frozen after convergence.
Diffusion fine-tuning: the transformer diffusion model learns to predict the added noise given (i) the latent patches, (ii) the CLIP-style text embedding, and (iii) the diffusion timestep. Position embeddings for variable-resolution patches are also learned.
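A sketch of the stage-2 objective, assuming a standard epsilon-prediction DDPM loss with the frozen VAE; the model and vae interfaces are hypothetical:

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, vae, text_emb, video, alphas_cumprod):
    """Hypothetical stage-2 step: the frozen VAE encodes the clip; the
    transformer predicts the noise added at a random timestep."""
    with torch.no_grad():
        z0 = vae.encode(video)                     # frozen encoder
    # alphas_cumprod: precomputed 1-D noise schedule on the same device
    t = torch.randint(0, len(alphas_cumprod), (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)
    a = alphas_cumprod[t].view(-1, *([1] * (z0.dim() - 1)))
    zt = a.sqrt() * z0 + (1 - a).sqrt() * eps      # forward noising
    eps_hat = model(zt, text_emb, t)               # epsilon prediction
    return F.mse_loss(eps_hat, eps)
```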
Large‑scale paired <text, video> data are generated synthetically. A video‑caption model (VCM) is first trained on a modest manually annotated set, then used to produce detailed captions for massive unlabeled video collections, similar to the synthetic data pipeline used in DALL·E 3.
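In code the hypothesized pipeline is a fine-tune-then-pseudo-label loop; the vcm interface below is entirely hypothetical:

```python
def build_paired_dataset(vcm, seed_pairs, unlabeled_videos):
    """Sketch of the synthetic-caption pipeline: fine-tune a video-caption
    model on a small hand-annotated set, then caption a large corpus."""
    vcm.finetune(seed_pairs)                 # small manual <text, video> set
    return [(vcm.caption(v), v) for v in unlabeled_videos]
```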
Training also employs a bidirectional generation scheme: known frames (e.g., a static image or a key frame) are inserted into the noise sequence with a binary mask, and the diffusion model learns to generate both forward and backward in time. This enables flexible generation modes such as image‑to‑video, infinite looping, and seamless video stitching.
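A sketch of the masking trick: at each denoising step, frames flagged as known are clamped to their clean latents, so the model in-fills the remaining frames in either temporal direction (shapes and interfaces are illustrative):

```python
import torch

def apply_known_frames(noisy_latents, known_latents, known_mask):
    """noisy_latents, known_latents: (B, T, ...); known_mask: (B, T) floats
    in {0, 1}, 1 marking a frame that is given (e.g. a source image)."""
    m = known_mask.view(*known_mask.shape, *([1] * (noisy_latents.dim() - 2)))
    return m * known_latents + (1 - m) * noisy_latents
```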
Key Takeaways
TECO‑style VAE provides continuous latents with long‑range temporal context.
NaViT‑inspired Spacetime Latent Patch enables variable resolution and aspect‑ratio handling.
Transformer‑based diffusion (Video DiT) with block‑diagonal attention masks replaces U‑Net.
Long‑time consistency is achieved either by brute‑force attention or by FDM techniques.
Synthetic video-caption data produced by a VCM is crucial for scaling training.