Beyond Sora: Exploring Cutting-Edge Video Reconstruction Techniques

This article surveys recent advances in video reconstruction sparked by OpenAI's Sora. It examines the technical challenges of unified latent representations, long‑sequence consistency, and variable resolution, and reviews a range of transformer‑based, diffusion, and masked‑generation models together with their core code and directions for future research.

Alipay Experience Technology

Preface

When OpenAI released Sora, a breakthrough video‑generation model, the AI community was stunned. Sora makes 60‑second video synthesis practical and sets a new benchmark for speed and quality. Our team aims not only to catch up with Sora but to surpass it by focusing on video reconstruction, which balances information compression and diffusion efficiency while preserving spatio‑temporal consistency.

Sora’s Requirements for Video Reconstruction

Effective video reconstruction must compress rich video data into a low‑dimensional latent space that supports efficient generation, editing, and precise control over time and space. Sora raises the bar for reconstruction quality and efficiency, demanding:

Unified representation for images and videos.

Temporal consistency of objects across frames.

Scalable handling of long sequences (e.g., 60‑second clips).

Choice between discrete and continuous latent spaces.

Support for variable‑resolution video.

Related Work Survey

We examined several recent video‑reconstruction approaches and extracted core code snippets.

Video‑GPT

Video‑GPT adapts VQ‑VAE and a Transformer to model natural video. It learns a discrete latent representation with 3‑D convolutions and axial self‑attention, then uses an autoregressive decoder to generate video from latent tokens.

Video‑GPT architecture

Learning Latent Codes

The VQ‑VAE encoder consists of 3‑D convolutions that down‑sample spatial‑temporal dimensions, followed by attention residual blocks (LayerNorm + axial attention). The decoder mirrors the encoder with transposed 3‑D convolutions. Learned spatio‑temporal position embeddings are shared across all attention layers.
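As a rough illustration of the attention residual block described above, here is a minimal NumPy sketch of axial attention over a (T, H, W, C) latent volume. The single head and identity Q/K/V projections are simplifications of ours, not Video‑GPT's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axial_attention(x, axis):
    """Self-attention applied along one axis of a (T, H, W, C) tensor only,
    so cost is linear in the other axes instead of quadratic in T*H*W."""
    x = np.moveaxis(x, axis, -2)                      # (..., L, C)
    scores = x @ x.swapaxes(-1, -2) / np.sqrt(x.shape[-1])
    out = softmax(scores) @ x
    return np.moveaxis(out, -2, axis)

def attention_residual_block(x):
    """Normalise, run axial attention over T, H, W in turn, add a residual."""
    normed = (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)
    for axis in (0, 1, 2):                            # time, height, width
        normed = axial_attention(normed, axis)
    return x + normed

z = np.random.randn(4, 8, 8, 16)                      # (T, H, W, C) latent volume
out = attention_residual_block(z)
print(out.shape)                                      # (4, 8, 8, 16)
```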

VQ‑VAE attention residual block

Learning a Prior

The second stage trains a prior over VQ‑VAE latents using a GPT‑style Transformer, adding dropout for regularisation. Both unconditional and conditional priors are possible, the latter using cross‑attention with a 3‑D ResNet or conditional norms.
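The prior's training objective is ordinary next-token cross-entropy over the flattened latent indices; a minimal NumPy sketch, with function name and shapes of our choosing:

```python
import numpy as np

def prior_nll(logits, tokens):
    """Average negative log-likelihood of next-token prediction over
    flattened VQ-VAE indices; logits is (L, vocab), tokens is (L,)."""
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    return -logp[np.arange(len(tokens)), tokens].mean()

vocab, L = 512, 64                          # codebook size, flattened latent length
rng = np.random.default_rng(0)
tokens = rng.integers(0, vocab, L)          # target latent codes
logits = rng.standard_normal((L, vocab))    # what the GPT-style prior would emit
print(prior_nll(logits, tokens))
```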

Perceiver‑AR

Perceiver‑AR is a modality‑agnostic autoregressive architecture that maps long inputs to a compact latent space via cross‑attention, then performs causal masked self‑attention within that space. This decouples computational cost from input length while preserving order‑sensitive generation.

Introduce ordered latent processing.

Apply causal‑masked cross‑attention.

Use causal‑masked self‑attention in the latent stack.

Key contributions include a scalable autoregressive framework, validation of long‑context utility, and decoupling of input size from compute.
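The three steps above can be sketched as follows; projections are identity, attention is single-head, and all names are illustrative rather than Perceiver‑AR's actual code:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def perceiver_ar_step(inputs, n_latents):
    """Causally masked cross-attention from a long input into n_latents
    latents aligned with the final input positions, followed by causal
    self-attention among the latents."""
    L, C = inputs.shape
    latents = inputs[-n_latents:]                       # (N, C) ordered queries
    # Latent i may only see input positions up to L - N + i.
    visible = np.arange(L)[None, :] <= (L - n_latents + np.arange(n_latents))[:, None]
    scores = latents @ inputs.T / np.sqrt(C)
    latents = softmax(np.where(visible, scores, -np.inf)) @ inputs
    # Causal self-attention within the latent stack.
    causal = np.tril(np.ones((n_latents, n_latents), dtype=bool))
    scores = latents @ latents.T / np.sqrt(C)
    return softmax(np.where(causal, scores, -np.inf)) @ latents

x = np.random.randn(1024, 32)      # long input; latent stack stays cheap
out = perceiver_ar_step(x, n_latents=16)
print(out.shape)                   # (16, 32)
```

Note how the quadratic self-attention cost depends on `n_latents`, not on the 1024-step input length, which is the decoupling the paper emphasises.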

Key Perceiver‑AR Components

Position encoding (absolute or rotary).

def generate_sinusoidal_features(size, max_len, min_scale, max_scale):
    """Generate sinusoidal position encodings"""
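One way to fill in that body, sketched in NumPy; the geometric scale spacing and the sin/cos channel split follow the common convention, and the concrete values below are our own:

```python
import numpy as np

def generate_sinusoidal_features(size, max_len, min_scale, max_scale):
    """Sinusoidal position encodings of shape (max_len, size): sine in the
    first half of the channels, cosine in the second, with geometrically
    spaced wavelengths between min_scale and max_scale."""
    pos = np.arange(max_len)[:, None].astype(np.float64)
    num_scales = size // 2
    log_inc = np.log(max_scale / min_scale) / max(num_scales - 1, 1)
    inv_scales = 1.0 / (min_scale * np.exp(np.arange(num_scales) * log_inc))
    angles = pos * inv_scales[None, :]                 # (max_len, size // 2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

pe = generate_sinusoidal_features(size=64, max_len=128, min_scale=1.0, max_scale=10000.0)
print(pe.shape)   # (128, 64)
```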

Magvit‑2

Magvit‑2 introduces a 3‑D causal convolution tokenizer and lookup‑free quantisation (LFQ), whose implicit codebook can grow far larger than a learned embedding table, enabling higher‑capacity discrete representations.
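A minimal sketch of the LFQ idea, assuming sign-based binarisation per channel (function name and dimensions are illustrative):

```python
import numpy as np

def lfq_quantize(z):
    """Lookup-free quantisation sketch: each latent channel is binarised
    to +/-1 by its sign, so a D-dim latent indexes an implicit codebook
    of size 2**D without any learned embedding table."""
    codes = np.where(z >= 0, 1.0, -1.0)                 # (..., D) in {-1, +1}
    bits = (codes > 0).astype(np.int64)                 # {0, 1} per channel
    index = (bits * (2 ** np.arange(z.shape[-1]))).sum(-1)
    return codes, index

z = np.random.randn(4, 10)       # 10 channels -> 1024-entry implicit codebook
codes, idx = lfq_quantize(z)
print(codes.shape, idx.min() >= 0, idx.max() < 1024)
```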

VideoPoet

VideoPoet integrates multiple video‑generation capabilities into a single LLM, using a transformer backbone rather than diffusion.

Multimodal Vocabulary

A multimodal token vocabulary jointly represents video and audio, enabling a pretrained LLM to generate synchronized audiovisual streams.
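One common way to build such a joint vocabulary is to shift each modality's token ids into a disjoint range of a single vocabulary; a sketch with made-up vocabulary sizes and ordering:

```python
import numpy as np

VIDEO_VOCAB, AUDIO_VOCAB, TEXT_VOCAB = 8192, 4096, 32000   # illustrative sizes

def to_unified_ids(video_tokens, audio_tokens, text_tokens):
    """Shift each modality's ids into a disjoint range so one LLM
    vocabulary covers text, video, and audio tokens."""
    video = np.asarray(video_tokens)                        # [0, VIDEO_VOCAB)
    audio = np.asarray(audio_tokens) + VIDEO_VOCAB
    text = np.asarray(text_tokens) + VIDEO_VOCAB + AUDIO_VOCAB
    return np.concatenate([text, video, audio])             # one token stream

stream = to_unified_ids([1, 2], [3], [4, 5])
print(stream)
```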

MaskGIT

MaskGIT predicts a subset of tokens in parallel, iteratively refining masked positions until the full sequence is generated, achieving an order‑of‑magnitude speedup over pure autoregressive decoding.
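The iterative refinement can be sketched with a cosine unmasking schedule; the model's confidences and token predictions are replaced by random stand-ins here, so only the decoding loop itself is faithful:

```python
import numpy as np

def maskgit_schedule(seq_len, num_steps):
    """Cosine schedule: how many tokens remain masked after each of
    num_steps parallel decoding iterations."""
    t = np.arange(1, num_steps + 1) / num_steps
    return np.floor(seq_len * np.cos(0.5 * np.pi * t)).astype(int)

def parallel_decode(seq_len, num_steps, rng):
    """Each step, predict all masked tokens at once and keep only the
    most confident; confidence here is random, standing in for model scores."""
    tokens = np.full(seq_len, -1)                        # -1 marks a masked slot
    for n_remaining in maskgit_schedule(seq_len, num_steps):
        masked = np.flatnonzero(tokens == -1)
        conf = rng.random(masked.size)                   # stand-in confidences
        keep = masked[np.argsort(-conf)[: masked.size - n_remaining]]
        tokens[keep] = rng.integers(0, 1024, keep.size)  # stand-in predictions
    return tokens

out = parallel_decode(seq_len=256, num_steps=8, rng=np.random.default_rng(0))
print((out == -1).sum())   # 0: all 256 tokens decoded in 8 parallel steps
```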

GumbelVQ vs. VQ

Gumbel‑VQ uses a Gumbel‑Softmax for soft quantisation with a reconstruction loss only, while traditional VQ employs hard nearest‑neighbor assignment with both reconstruction and commitment losses.

# Gumbel‑Softmax soft quantisation: perturb negative distances with Gumbel
# noise, then take a differentiable soft assignment over the codebook
noise = jax.random.gumbel(key, distances.shape)
weights = jax.nn.softmax(-distances + noise, axis=-1)
quantized = jnp.matmul(weights, codebook)
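For contrast, the traditional hard-assignment VQ step with both losses, as a NumPy sketch; the stop-gradients that actually distinguish the codebook and commitment terms during training are omitted, so the two values coincide numerically here:

```python
import numpy as np

def vq_quantize(z, codebook):
    """Hard VQ: nearest-neighbour codebook lookup plus the two standard
    losses. In a trained model stop-gradients make the codebook loss update
    the codes and the commitment loss update the encoder; this sketch
    computes only the values."""
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)   # (N, K) distances
    idx = d.argmin(-1)                                          # hard assignment
    q = codebook[idx]
    codebook_loss = ((q - z) ** 2).mean()      # pulls codes toward encodings
    commitment_loss = ((z - q) ** 2).mean()    # pulls encodings toward codes
    return q, idx, codebook_loss, commitment_loss

rng = np.random.default_rng(0)
z = rng.standard_normal((16, 8))
codebook = rng.standard_normal((64, 8))
q, idx, cb_loss, commit_loss = vq_quantize(z, codebook)
print(q.shape, idx.shape)
```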

Mask Transformer

Mask‑parallel decoding samples a set of mask positions, embeds the partially masked sequence, runs self‑attention over it, and predicts the masked tokens iteratively, improving efficiency for both language and vision tasks.

# Generate mask tokens: sample distinct positions to mask out
num_masked = int(mask_ratio * seq_len)
masked_indices = np.random.choice(seq_len, num_masked, replace=False)
mask = np.ones(seq_len)
mask[masked_indices] = 0

Muse

Muse follows a generate‑then‑super‑resolve pipeline similar to DALL·E 2, using a T5‑XXL text encoder to condition a low‑resolution transformer, then a super‑resolution transformer to upscale.

TECO

TECO employs a Temporal Transformer with causal masks to model long video sequences efficiently, using a VQ‑VAE encoder, a start‑of‑sequence token, and down‑sampling/up‑sampling stages to balance compute and temporal receptive field.

def temporal_transformer(self, z_embeddings, actions, cond, deterministic=False):
    # Prepend the conditioning frames to the embedded latents along time.
    inp = jnp.concatenate([cond, z_embeddings], axis=1)
    # Broadcast per-step actions over the spatial grid, then pair each
    # frame with the action that follows it.
    actions = jnp.tile(actions[:, :, None, None], (1, 1, *inp.shape[2:4], 1))
    inp = jnp.concatenate([inp[:, :-1], actions[:, 1:]], axis=-1)
    # Prepend a learned start-of-sequence token.
    sos = jnp.tile(self.sos[None, None], (z_embeddings.shape[0], 1, 1, 1, 1))
    inp = jnp.concatenate([sos, inp], axis=1)
    # Causally masked temporal transformer, then un-project the outputs,
    # dropping the conditioning positions.
    deter = self.z_tfm(inp, mask=self._init_mask(), deterministic=deterministic)
    deter = jax.vmap(self.z_unproj, 1, 1)(deter[:, self.config.n_cond:])
    return deter

Genie

Genie is a generative interactive environment trained unsupervised on raw internet videos, combining a spatio‑temporal video tokenizer, an autoregressive dynamics model, and a simple latent‑action model to enable controllable video synthesis.

Summary of Related Work

Two dominant paradigms exist in video generation: diffusion‑based pipelines and transformer‑centric models that unify multiple modalities into a single LLM. Core challenges include long‑range temporal consistency, multimodal encoding, spatio‑temporal integration, and efficient high‑quality decoding.

Innovation Roadmap for Video Reconstruction

Unified image‑video representation via causal VQ‑VAE or temporal transformers.

Enhanced spatio‑temporal fusion using causal convolutions and attention.

Improved long‑sequence handling (e.g., Perceiver‑AR, FDM).

High‑efficiency decoding with MaskGIT‑style parallel generation.

Cascade generation (low‑res → high‑res) inspired by Muse and VAR’s next‑scale prediction.

Diverse mask‑training strategies (MLM, random mask, multi‑scale mask).

Unsupervised learning to reduce reliance on labeled data.

Exploration of discrete vs. continuous latent spaces (GumbelVQ, LFQ).

Our Short‑Term and Long‑Term Goals

Short‑term: Support Sora and multimodal tasks, achieve 16‑64× spatial and 4‑8× temporal compression.

Long‑term: Increase compression without quality loss, boost inference efficiency for video understanding and generation.
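Assuming the stated factors apply per spatial axis and to the frame count, the resulting latent grid is simple arithmetic; a sketch at the low end of the 16‑64× spatial, 4‑8× temporal target range:

```python
def latent_shape(frames, height, width, spatial_ds, temporal_ds):
    """Latent grid size for given spatial/temporal down-sampling factors,
    assuming the spatial factor applies to each of height and width."""
    return (frames // temporal_ds, height // spatial_ds, width // spatial_ds)

# e.g. a 10-second 24 fps 512x512 clip at 16x spatial, 4x temporal compression
print(latent_shape(240, 512, 512, spatial_ds=16, temporal_ds=4))  # (60, 32, 32)
```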
