LTX-2 Open-Sourced: The First Model That Generates Video and Audio Together
LTX-2, an open-source multimodal diffusion model from Lightricks, jointly generates synchronized video and audio using an asymmetric dual-stream architecture. It runs 49.18 diffusion steps per minute, far faster than many pure video models, while supporting about 20 seconds of high-resolution output per inference.
Problem
Most video generation models produce visual content only, while most audio generation models produce sound only, leaving a gap for synchronized audio‑video synthesis.
LTX‑2 Overview
LTX‑2 is an open‑source multimodal diffusion model that learns a joint distribution over audio and video, enabling a single forward pass to generate speech, ambient sounds, actions, and temporal dynamics together.
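To make "joint distribution over audio and video" concrete, below is a minimal sketch of what joint sampling can look like: one denoising loop in which a single model call updates the video and audio latents together. The names, latent shapes, and the simplified Euler update are illustrative assumptions, not LTX-2's actual API.

```python
import torch

# Minimal sketch of joint audio-video diffusion sampling.
# All names and shapes are illustrative, not LTX-2's real interface.

@torch.no_grad()
def sample_joint(model, text_emb, num_steps=50, device="cuda"):
    # Start both modalities from pure noise in their latent spaces.
    video_lat = torch.randn(1, 16, 32, 64, 64, device=device)  # (B, C, T, H, W)
    audio_lat = torch.randn(1, 8, 1024, device=device)         # (B, C, T_audio)

    for step in reversed(range(num_steps)):
        t = torch.full((1,), step, device=device)
        # One forward pass predicts denoising directions for BOTH modalities,
        # so their content stays aligned at every step.
        video_eps, audio_eps = model(video_lat, audio_lat, t, text_emb)
        video_lat = video_lat - video_eps / num_steps  # simplified Euler update
        audio_lat = audio_lat - audio_eps / num_steps
    return video_lat, audio_lat
```

Because both modalities are denoised by the same network at every step, alignment such as lip sync or sound effects timed to motion emerges from training rather than from post-hoc stitching.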
Architecture
Asymmetric dual‑stream diffusion transformer.
Video stream: 14 billion parameters (high capacity).
Audio stream: 5 billion parameters (lightweight).
Bidirectional audiovisual cross-attention links the two streams, avoiding redundant computation (see the sketch after this list).
Deep multilingual text encoder processes input prompts.
Introduces a “thinking token” that improves semantic stability and phonetic accuracy of generated speech.
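Here is a hedged sketch of one such dual-stream block: a wide video branch, a narrower audio branch, and cross-attention in both directions. The dimensions, head count, and layer layout are assumptions for illustration; LTX-2's published architecture will differ in detail.

```python
import torch
import torch.nn as nn

# Sketch of one asymmetric dual-stream block. Widths and head counts
# are illustrative assumptions, not LTX-2's published configuration.

class DualStreamBlock(nn.Module):
    def __init__(self, video_dim=4096, audio_dim=1536, heads=16):
        super().__init__()
        self.video_self = nn.MultiheadAttention(video_dim, heads, batch_first=True)
        self.audio_self = nn.MultiheadAttention(audio_dim, heads, batch_first=True)
        # Cross-attention in both directions; kdim/vdim let each stream
        # attend across mismatched widths without a shared projection size.
        self.v_from_a = nn.MultiheadAttention(video_dim, heads, kdim=audio_dim,
                                              vdim=audio_dim, batch_first=True)
        self.a_from_v = nn.MultiheadAttention(audio_dim, heads, kdim=video_dim,
                                              vdim=video_dim, batch_first=True)

    def forward(self, v, a):
        v = v + self.video_self(v, v, v)[0]   # video tokens attend to video
        a = a + self.audio_self(a, a, a)[0]   # audio tokens attend to audio
        v = v + self.v_from_a(v, a, a)[0]     # video queries audio context
        a = a + self.a_from_v(a, v, v)[0]     # audio queries video context
        return v, a
```

The asymmetry mirrors the parameter split above: most capacity sits in the video branch (14B) while the audio branch stays comparatively small (5B), with the cross-attention as the point where the two streams exchange information.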
Performance
Processes 49.18 diffusion steps per minute, versus WAN 2.2 14B’s 2.69 steps per minute.
Generates approximately 20 seconds of synchronized high‑resolution, high‑frame‑rate audio‑video per inference.
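Taking the quoted throughputs at face value, the gap works out to roughly 18x; a quick arithmetic check:

```python
# Quick check on the quoted throughput numbers.
ltx2_steps_per_min = 49.18
wan22_steps_per_min = 2.69

speedup = ltx2_steps_per_min / wan22_steps_per_min
print(f"LTX-2 runs ~{speedup:.1f}x more diffusion steps per minute")  # ~18.3x
```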
Qualitative Capability
Joint training lets the model align sound and image intrinsically, e.g., hand motion synchronized with clapping sounds or lip movements matching spoken words.
Resources
Code and model weights: https://github.com/Lightricks/LTX-2