Unlock 1‑Minute AI Video Generation with TTT‑Video‑Dit: Break the 3‑Second Limit
TTT-Video-Dit is an open-source framework that uses test-time training and hierarchical attention to generate coherent, style-consistent 63-second videos. By dramatically reducing GPU memory requirements, it lets a single RTX 4090 stand in for costly H100 clusters, enabling creators and developers to produce long AI videos efficiently.
Overview
TTT-Video-Dit is an open-source framework for generating long video clips (up to 63 seconds) with consistent style, using test-time training (TTT) and a hierarchical attention architecture. It builds on the CogVideoX-5B diffusion model and runs on a single consumer-grade GPU.
Key Technical Features
Segment-extension + global attention: Starts from a 3-second seed clip, fine-tunes style, then iteratively expands to 9 s, 18 s, 30 s and finally 63 s. Reported continuity improvement is ≈80% over naive stitching.
Unified style transfer: Local attention preserves short-clip detail, while an added TTT layer enforces global style consistency across segments (a minimal sketch of the idea follows this list). Reported fine-detail fidelity is ~3× that of conventional style-transfer pipelines.
GPU memory optimisation: Hierarchical training and parameter reuse cut peak memory from ~48 GB (H100) to ~24 GB; an optional low-memory mode reduces it further to ~16 GB, enabling inference on an RTX 4090.
Extensible configuration: Video length (3-63 s), resolution (up to 1080p), and style template (10+ built-in) are configurable via a YAML file.
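To make the TTT idea concrete, here is a toy PyTorch sketch of a linear test-time-training layer: the layer's hidden state is itself a weight matrix, updated by one self-supervised gradient step per video segment, so early segments influence later ones at constant memory. This illustrates the general TTT mechanism only; it is not the repository's TTT-MLP kernel, and all names are invented for the example.

import torch
import torch.nn as nn

class TTTLinearSketch(nn.Module):
    """Toy linear TTT layer (illustrative, not the repo's TTT-MLP kernel)."""
    def __init__(self, dim: int, inner_lr: float = 0.1):
        super().__init__()
        self.k_proj = nn.Linear(dim, dim, bias=False)  # self-supervised input view
        self.v_proj = nn.Linear(dim, dim, bias=False)  # reconstruction target view
        self.q_proj = nn.Linear(dim, dim, bias=False)  # read-out view
        self.inner_lr = inner_lr
        self.dim = dim

    def forward(self, segments):
        # segments: list of (batch, tokens, dim) tensors, one per video chunk,
        # in temporal order (e.g., one chunk per 3-second segment).
        batch = segments[0].shape[0]
        # Hidden state = weights of a per-sample linear model, initialised to zero.
        W = segments[0].new_zeros(batch, self.dim, self.dim)
        outputs = []
        for x in segments:
            k, v, q = self.k_proj(x), self.v_proj(x), self.q_proj(x)
            err = torch.bmm(k, W) - v                  # reconstruction error on this chunk
            grad = torch.bmm(k.transpose(1, 2), err)   # grad of 0.5*||kW - v||^2 w.r.t. W
            W = W - self.inner_lr * grad / k.shape[1]  # one inner-loop gradient step
            outputs.append(torch.bmm(q, W))            # read with the updated state
        return outputs

In the full framework, segment-local attention preserves short-clip detail while the TTT state carries style information across segments, matching the feature descriptions above.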
Typical Use Cases
Social‑media content creation
Generate a 63‑second cartoon clip from a single text prompt.
python sample.py --prompt "A ginger cat sunbathing on a balcony in Ghibli style, warm colors" --duration 63 --style ghibli

Corporate marketing – style transfer
Apply a vintage film style to existing 1-minute footage while preserving detail.
python scripts/style_transfer.py --input_video ./factory.mp4 --style vintage --output ./factory_vintage.mp4

Research – custom long-video model fine-tuning
Fine‑tune on a domain‑specific dataset (e.g., 100 three‑second pet clips) using progressive training stages.
# Example training pipeline (pseudo-commands; script names are illustrative)
# 1. Prepare data: a domain-specific set of short clips (e.g., 100 three-second pet clips)
# 2. Activate the conda environment (requires CUDA 12.3+ and gcc 11+)
# 3. Train sequentially, resuming each stage from the previous checkpoint:
#    3 s → 9 s → 18 s → 30 s → 63 s
# 4. The resulting model generates 63-second videos with ~22 GB peak memory

Quick-Start Guide (5 steps)
Step 1 – Clone repository and create environment
# Clone the project (including submodules)
git clone --recursive https://github.com/test-time-training/ttt-video-dit.git
cd ttt-video-dit
# Create conda environment
conda env create -f environment.yaml
conda activate ttt-video
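# Optional sanity check (assumes PyTorch is installed by environment.yaml):
# confirm the CUDA build PyTorch sees and GPU visibility before compiling the kernel
python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"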
# Install TTT‑MLP kernel (requires CUDA 12.3+ and gcc 11+)
cd ttt-tk && python setup.py install
cd ..

Step 2 – Download pretrained weights
Obtain CogVideoX‑5B safetensors, VAE and T5 encoder from HuggingFace and place them under models/:
models/
├─ vae/
├─ t5-encoder/
├─ diffusion_pytorch_model-00001-of-00002.safetensors
└─ diffusion_pytorch_model-00002-of-00002.safetensors

Step 3 – Create generation config

Save the following as configs/sample_63s.yaml (used in Step 4):
prompt: "A white puppy chasing butterflies in a park, bright Disney style"
duration: 63 # seconds
resolution: "1080p"
style: "disney"
batch_size: 1
device: "cuda"Step 4 – Run generation
python sample.py --config configs/sample_63s.yaml

Step 5 – Inspect output
The resulting MP4 file is saved in outputs/ and can be played directly or imported into editing software.
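To confirm the clip really spans the requested 63 seconds, its container metadata can be read with ffprobe; a small sketch (assumes ffmpeg is installed; the output filename is illustrative):

import json
import subprocess

# Query container metadata for the generated clip (filename is illustrative).
result = subprocess.run(
    ["ffprobe", "-v", "error", "-print_format", "json", "-show_format",
     "outputs/sample_63s.mp4"],
    capture_output=True, text=True, check=True,
)
duration = float(json.loads(result.stdout)["format"]["duration"])
print(f"Generated clip duration: {duration:.1f} s")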
Performance Notes
Peak GPU memory is ≈22 GB for 63-second generation on an RTX 4090.
Low-memory mode can be enabled with --low_mem true, reducing memory to ~16 GB at the cost of slower inference.
Generation time for a 63-second clip on an RTX 4090 is roughly 10 minutes, depending on prompt complexity.
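To verify the peak-memory figures on your own hardware, PyTorch's allocator statistics are the simplest check; a sketch in which sample_video is a hypothetical stand-in for the repository's generation entry point:

import torch

def report_peak_memory(generate_fn):
    """Run a generation callable and report peak allocated GPU memory."""
    torch.cuda.reset_peak_memory_stats()
    generate_fn()
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Peak GPU memory: {peak_gb:.1f} GB")

# Usage (sample_video is hypothetical):
# report_peak_memory(lambda: sample_video("configs/sample_63s.yaml"))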
Repository
https://github.com/test-time-training/ttt-video-dit