ByteDance’s Open‑Source 12B‑Parameter Video Model “Alive” Runs on a Single RTX 3090/4090
ByteDance has open‑sourced the 12‑billion‑parameter video generation model Alive, which supports text‑to‑video/audio, image‑to‑video/audio, pure text‑to‑video and text‑to‑audio modes, runs on a 24 GB GPU, outperforms competitors in cross‑modal synchronization, and includes novel TA‑CrossAttn and UniTemp‑RoPE techniques.
After releasing the closed‑source model Seedance 2.0, ByteDance open‑sourced the video generation model Alive. The model has 12 billion parameters and runs on consumer GPUs such as the RTX 3090 or RTX 4090 with 24 GB VRAM.
Alive supports four generation modes: text‑to‑video + audio (T2VA), image‑to‑video + audio (I2VA), pure text‑to‑video (T2V), and text‑to‑audio (T2A). It accepts flexible resolutions, imposes no fixed video‑length limit, and can produce character‑referenced audio‑video animations.
Compared with the 20-billion-parameter LTX-2, Alive is a lighter-weight alternative that considerably lowers the hardware barrier.
The most notable advances are audio‑prompt following and audio‑video synchronization, which give Alive a clear edge over competing models in cross‑modal understanding.
Official evaluations pitted Alive against Veo 3.1, Kling 2.6, Wan 2.6, Sora 2, and LTX‑2. Across all test metrics, Alive ranked first, showing balanced capabilities without obvious weaknesses.
Technically, Alive is built on the MMDiT framework. Its 12 B‑parameter VideoDiT module handles video generation, while a 2 B‑parameter AudioDiT module processes audio. The two modules are linked by TA‑CrossAttn (temporal‑alignment cross‑attention), which resolves the mismatch in time granularity between audio and video streams.
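The release doesn't spell out TA-CrossAttn's internals, but a minimal sketch conveys the idea: audio tokens attend to video tokens after the video stream is resampled to the audio stream's time granularity. Everything below, from the class name `TemporalAlignCrossAttn` to the token rates, is an illustrative assumption rather than ByteDance's code.

```python
import torch
import torch.nn as nn

class TemporalAlignCrossAttn(nn.Module):
    """Sketch of temporal-alignment cross-attention (TA-CrossAttn).

    Audio tokens (queries) attend to video tokens (keys/values) after the
    video stream is resampled to the audio stream's time granularity.
    Names, rates, and shapes are assumptions, not the released code.
    """
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # audio: (B, T_a, D) at e.g. 40 tokens/s; video: (B, T_v, D) at e.g. 24 fps.
        T_a, T_v = audio.shape[1], video.shape[1]
        # Nearest-neighbor alignment: map each audio step to the video frame
        # covering the same instant, so attention compares like with like.
        idx = torch.linspace(0, T_v - 1, T_a, device=video.device).round().long()
        video_aligned = video[:, idx, :]                      # (B, T_a, D)
        out, _ = self.attn(query=audio, key=video_aligned, value=video_aligned)
        return self.norm(audio + out)                         # residual + norm

# Smoke test with toy shapes: 2 s of audio tokens vs. 2 s of video frames.
x_audio = torch.randn(2, 80, 512)   # 40 tokens/s
x_video = torch.randn(2, 48, 512)   # 24 fps
print(TemporalAlignCrossAttn()(x_audio, x_video).shape)  # torch.Size([2, 80, 512])
```

Nearest-neighbor resampling is the simplest possible alignment; the production model presumably learns a richer mapping, but the sketch shows why the two streams' time granularities must be reconciled before cross-attention can link them.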
UniTemp‑RoPE further maps audio and video data of different formats onto a shared temporal coordinate system, enabling precise correspondence between sound events and visual content for true audio‑video sync.
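One plausible reading of UniTemp-RoPE is rotary position embedding indexed by absolute time in seconds rather than by token index, so an audio token and a video frame occurring at the same instant receive identical positional phases even though their streams run at different rates. The sketch below, including the function name `unitemp_rope` and the sample rates, is an assumption for illustration.

```python
import torch

def unitemp_rope(x: torch.Tensor, t_sec: torch.Tensor, base: float = 10000.0):
    """Rotary position embedding indexed by wall-clock time (a sketch).

    x: (B, T, D) with D even; t_sec: (T,) timestamps in seconds.
    Rotating by time instead of token index puts audio and video on one
    shared temporal coordinate system. Details are assumptions.
    """
    B, T, D = x.shape
    freqs = base ** (-torch.arange(0, D, 2, dtype=torch.float32) / D)  # (D/2,)
    angles = t_sec[:, None] * freqs[None, :]                           # (T, D/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Audio tokens at 40/s and video frames at 24 fps share one time axis:
audio_t = torch.arange(80) / 40.0    # 2 s of audio timestamps
video_t = torch.arange(48) / 24.0    # 2 s of video timestamps
a = unitemp_rope(torch.randn(1, 80, 64), audio_t)
v = unitemp_rope(torch.randn(1, 48, 64), video_t)
```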
Data processing receives special attention: beyond standard visual quality filtering, the team applies dual filtering to both audio and video, annotates objects with visual + audio keywords, and optimizes the mapping of characters to speech in multi‑person dialogue scenarios.
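The article doesn't give the pipeline's concrete rules, but as a rough illustration, "dual filtering" could mean that a training clip survives only when both modalities clear a quality bar and carry keyword annotations. The fields and thresholds below are invented for illustration, not taken from ByteDance's pipeline.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    video_quality: float          # e.g. aesthetic/sharpness score in [0, 1]
    audio_quality: float          # e.g. SNR-derived score in [0, 1]
    visual_keywords: list[str]    # e.g. ["dog", "park"]
    audio_keywords: list[str]     # e.g. ["bark", "wind"]

def keep_clip(clip: Clip, v_thresh: float = 0.6, a_thresh: float = 0.6) -> bool:
    """Dual filtering sketch: keep a clip only if BOTH the visual and the
    audio stream pass quality thresholds AND both modalities are annotated,
    so every surviving sample teaches cross-modal correspondence."""
    return (
        clip.video_quality >= v_thresh
        and clip.audio_quality >= a_thresh
        and bool(clip.visual_keywords)
        and bool(clip.audio_keywords)
    )
```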
During training, the authors observed that the audio branch is highly sensitive to changes in data distribution and tends to “forget” previously learned information. They mitigated this issue with an asymmetric learning‑rate schedule, preventing degradation of audio quality during joint training.
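One straightforward way to realize such an asymmetric schedule is optimizer parameter groups with a much smaller learning rate on the forgetting-prone audio branch. The module names and the 10x ratio below are illustrative assumptions, not the paper's recipe.

```python
import torch
import torch.nn as nn

class JointAVModel(nn.Module):
    """Stand-in for the joint model; submodule names are assumptions."""
    def __init__(self):
        super().__init__()
        self.video_dit = nn.Linear(512, 512)   # placeholder for the 12B VideoDiT
        self.audio_dit = nn.Linear(512, 512)   # placeholder for the 2B AudioDiT

model = JointAVModel()

# Asymmetric learning rates via parameter groups: the audio branch trains
# 10x slower so joint updates don't overwrite what it has already learned.
optimizer = torch.optim.AdamW([
    {"params": model.video_dit.parameters(), "lr": 1e-4},
    {"params": model.audio_dit.parameters(), "lr": 1e-5},
])
```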
To produce high‑resolution output, Alive includes a cascade refiner that upsamples from 480 p to 1080 p. The video branch refines visual quality, while the audio branch keeps the AudioDiT module frozen to preserve audio fidelity and synchronization.
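In code, "keeping the AudioDiT module frozen" typically amounts to disabling gradients on that branch so only video parameters reach the refiner's optimizer. The sketch below uses the same assumed module names as above.

```python
import torch
import torch.nn as nn

# Minimal stand-in for the joint model (assumed names, as above).
model = nn.ModuleDict({
    "video_dit": nn.Linear(512, 512),
    "audio_dit": nn.Linear(512, 512),
})

# Refiner stage: freeze AudioDiT entirely so the 480p -> 1080p video
# fine-tuning cannot disturb audio fidelity or A/V synchronization.
for p in model["audio_dit"].parameters():
    p.requires_grad_(False)
model["audio_dit"].eval()   # fixes norm/dropout behavior as well

# Only still-trainable (video-branch) parameters reach the optimizer.
refiner_opt = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=5e-5
)
```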
Community feedback noted that the ice-cream-holding hand in the demo looked stiff, sparking debate over realism. Some developers remarked that Alive, as a stripped-down version of a closed model, may offer reduced functionality, yet its lower hardware requirement (24 GB versus Seedance 2.0's 96 GB demand) makes it far more accessible to hobbyists.
While the smaller parameter count may limit generation of highly complex scenes, it enables a broader audience to experiment with video generation.
Project repository: https://foundationvision.github.io/Alive/
AI Engineering
Focused on cutting-edge products, technologies, and hands-on experience in the AI field (large models, MLOps/LLMOps, AI application development, AI infrastructure).
