How Sora’s Text-to-Video Model Is Redefining AI‑Generated Video

Sora, a new text‑to‑video AI model, can create one‑minute videos from textual prompts or static images, delivering industry‑leading fidelity, resolution, and coherent motion by using spatial‑temporal patches inspired by ViViT, and shows emergent capabilities that hint at universal physical simulation.

Open Source Linux

Apr 16, 2024

Sora, a new text‑to‑video large model, can generate one‑minute videos from textual instructions or static images. The generated videos feature intricate scenes, expressive characters, and complex camera movements, and the model can also extend existing videos or fill missing frames.

Overall, Sora achieves industry‑leading performance in video fidelity, length, stability, consistency, resolution, and text understanding, marking a significant advance for multimodal AI. When trained on sufficiently large datasets, it exhibits emergent abilities, which suggests that video generation models could eventually act as universal simulators of the physical world.

Sora borrows a core idea from LLMs. Where an LLM converts text into tokens, Sora converts visual data into patches, which gives videos and images a unified representation that the model can process and generate in many forms. A compression network first reduces the video to a lower‑dimensional latent representation, which is then decomposed into spatial‑temporal patches so that information can be exchanged and manipulated across both time and space.
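The two-stage idea (compress, then cut into patches) can be made concrete with some simple arithmetic. What follows is a minimal Python sketch, not Sora's implementation. The compression factors and patch sizes are hypothetical placeholders, since Sora's actual values are not given here, and `PatchConfig` and `num_spacetime_patches` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatchConfig:
    # All values are illustrative assumptions, not Sora's real settings.
    temporal_compression: int = 4  # input frames merged into one latent frame
    spatial_compression: int = 8   # input pixels merged into one latent cell, per axis
    patch_t: int = 1               # latent frames per patch
    patch_h: int = 2               # latent rows per patch
    patch_w: int = 2               # latent columns per patch


def num_spacetime_patches(frames: int, height: int, width: int,
                          cfg: PatchConfig = PatchConfig()) -> int:
    """Count the patches (tokens) one clip yields under the assumed config."""
    # Stage 1: the compression network shrinks the clip to a latent grid.
    lat_t = frames // cfg.temporal_compression
    lat_h = height // cfg.spatial_compression
    lat_w = width // cfg.spatial_compression
    # Stage 2: the latent grid is cut into non-overlapping spacetime patches.
    return (lat_t // cfg.patch_t) * (lat_h // cfg.patch_h) * (lat_w // cfg.patch_w)


# A 16-frame 256x256 clip: latent grid 4 x 32 x 32, so 4 * 16 * 16 patches.
video_tokens = num_spacetime_patches(frames=16, height=256, width=256)  # 1024

# A still image handled as a one-frame video with no temporal compression.
image_tokens = num_spacetime_patches(
    frames=1, height=256, width=256,
    cfg=PatchConfig(temporal_compression=1),
)  # 256
```

In this sketch an image is just a one-frame video, which illustrates how a patch-based representation can cover both images and videos with the same token format.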

According to Sora’s technical report, its spatial‑temporal patches draw inspiration from Google’s ViViT. ViViT adapts ViT’s image‑patch tokenization to video by dividing the input video into small spatial‑temporal blocks called “tubelets.” Each tubelet becomes a token, and spatial‑temporal attention over these tokens produces an effective video representation.
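The tokenization step described here can be sketched with NumPy. This is an illustrative example under stated assumptions, not ViViT's or Sora's actual code: the function names `extract_tubelets` and `embed_tubelets`, the tubelet size, the embedding width, and the random projection matrix (standing in for a learned one) are all hypothetical.

```python
import numpy as np


def extract_tubelets(video: np.ndarray, t: int, h: int, w: int) -> np.ndarray:
    """Split a (T, H, W, C) video into flattened, non-overlapping tubelets.

    Returns an array of shape (num_tubelets, t * h * w * C).
    """
    T, H, W, C = video.shape
    if T % t or H % h or W % w:
        raise ValueError("video dimensions must be divisible by the tubelet size")
    nt, nh, nw = T // t, H // h, W // w
    # Expose the tubelet grid as separate axes: (nt, t, nh, h, nw, w, C).
    x = video.reshape(nt, t, nh, h, nw, w, C)
    # Move the grid axes to the front and the within-tubelet axes to the back.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # One row per tubelet, holding all of its pixels.
    return x.reshape(nt * nh * nw, t * h * w * C)


def embed_tubelets(tubelets: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Linearly project each flattened tubelet to a token embedding."""
    return tubelets @ weight


# Toy input: 4 frames of 4x4 RGB, cut into 2x2x2 tubelets.
video = np.arange(4 * 4 * 4 * 3, dtype=np.float64).reshape(4, 4, 4, 3)
tubelets = extract_tubelets(video, t=2, h=2, w=2)   # shape (8, 24)

# Random weights stand in for a learned projection (assumption).
rng = np.random.default_rng(0)
weight = rng.standard_normal((tubelets.shape[1], 16))
tokens = embed_tubelets(tubelets, weight)            # shape (8, 16)
```

Each row of `tokens` covers a block that spans two frames as well as a 2x2 spatial region, so a downstream attention layer sees motion and position information inside every token rather than only across tokens.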

Traditional methods often split a video into a sequence of frames and process each frame largely on its own, which loses how object positions and motion evolve from one frame to the next. Because consecutive frames are continuous in both space and time, Sora’s patches capture temporal and spatial relationships together. This enables more precise video generation that captures subtle motion, stays coherent over longer durations, and produces rich visual effects for a wide range of uses.
