Unlock 1‑Minute AI Video Generation with TTT‑Video‑Dit: Break the 3‑Second Limit

TTT‑Video‑Dit is an open‑source framework that uses test‑time training and hierarchical attention to generate coherent, style‑consistent videos up to 63 seconds long. By dramatically reducing GPU memory requirements, it lets a single RTX 4090 stand in for costly H100 clusters, enabling creators and developers to produce long AI videos efficiently.

Old Meng AI Explorer

Overview

TTT-Video-Dit is an open‑source framework for generating long video clips (up to 63 s) with consistent style using test‑time training (TTT) and a layered attention architecture. It builds on the CogVideoX‑5B diffusion model and runs on a single consumer‑grade GPU.

Key Technical Features

Segment extension + global attention: starts from a 3‑second seed clip, fine‑tunes its style, then iteratively expands to 9 s, 18 s, 30 s and finally 63 s. Reported continuity improvement is roughly 80 % over naïve stitching.
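The staged expansion can be sketched as a simple schedule in which each stage conditions on the previous stage's output. Only the stage lengths come from the project itself; `generate_stage` below is a hypothetical stand-in for the actual model call:

```python
# Sketch of the progressive extension schedule described above.
# Stage durations (seconds) are from the article; how each stage really
# conditions on the previous one is internal to the model, so
# generate_stage is a placeholder.

STAGES = [3, 9, 18, 30, 63]

def generate_stage(duration, context=None):
    """Hypothetical stand-in: returns a 'clip' labelled with its length
    and the length of the clip it was seeded from."""
    return {
        "duration": duration,
        "seeded_from": context["duration"] if context else None,
    }

def progressive_generate(stages=STAGES):
    clip = None
    for duration in stages:
        clip = generate_stage(duration, context=clip)  # each stage extends the last
    return clip

final = progressive_generate()
```

The point of the schedule is that no stage ever has to bridge more than roughly a 2x jump in length, which is where the reported continuity gain over one-shot stitching comes from.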

Unified style transfer: local attention preserves short‑clip detail, while an added TTT layer enforces global style consistency. Improves fine‑detail fidelity by ~3× compared with conventional style‑transfer pipelines.
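The article does not spell out the TTT layer's internals, but the general test-time-training idea can be sketched in a few lines: the layer's hidden state is itself a tiny linear model, updated by one gradient step per token on a self-supervised loss, so information propagates across the whole sequence. The projection matrices and learning rate below are purely illustrative, not the project's code:

```python
import numpy as np

def ttt_layer(X, Wk, Wv, Wq, lr=0.5):
    """Sketch of a test-time-training layer: the hidden state is a small
    linear model W that is 'trained' as the sequence streams past.
    For each token we take one gradient step on a self-supervised
    reconstruction loss, then read out through the updated state."""
    T, d = X.shape
    W = np.zeros((d, d))                # hidden state = inner model's weights
    outputs = np.empty_like(X)
    for t in range(T):
        k, v, q = Wk @ X[t], Wv @ X[t], Wq @ X[t]
        err = W @ k - v                 # inner model's prediction error for v given k
        W -= lr * np.outer(err, k)      # SGD step: gradient of 0.5 * ||W k - v||^2
        outputs[t] = W @ q              # read out with the freshly updated state
    return outputs
```

Because the state is a trained model rather than a fixed-size attention window, its cost stays linear in sequence length, which is what makes minute-scale context practical.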

GPU memory optimisation: hierarchical training and parameter reuse cut peak memory from ~48 GB (H100) to ~24 GB; an optional low‑memory mode reduces it to ~16 GB, enabling inference on an RTX 4090.
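The article does not say exactly how peak memory is cut, but activation checkpointing, storing only every k-th activation and recomputing the rest during the backward pass, is a standard ingredient of this kind of hierarchical training trade-off. A minimal NumPy sketch of the idea (not the project's implementation):

```python
import numpy as np

def layer_fwd(W, x):
    return np.maximum(W @ x, 0.0)       # simple ReLU layer

def forward_checkpointed(weights, x, every=4):
    """Store only every `every`-th activation (assumes len(weights) % every == 0);
    the rest are recomputed during the backward pass."""
    ckpts = {0: x}
    h = x
    for i, W in enumerate(weights):
        h = layer_fwd(W, h)
        if (i + 1) % every == 0:
            ckpts[i + 1] = h            # checkpoint at segment boundaries only
    return h, ckpts

def backward_checkpointed(weights, ckpts, grad_out, every=4):
    """Recompute activations segment by segment from the nearest checkpoint,
    then backprop through each segment. Returns (weight grads, input grad)."""
    n = len(weights)
    g = grad_out
    grads = [None] * n
    for seg_end in range(n, 0, -every):
        seg_start = max(seg_end - every, 0)
        acts = [ckpts[seg_start]]       # rebuild this segment's activations
        for i in range(seg_start, seg_end):
            acts.append(layer_fwd(weights[i], acts[-1]))
        for i in range(seg_end - 1, seg_start - 1, -1):
            pre = weights[i] @ acts[i - seg_start]
            gz = g * (pre > 0)          # grad through ReLU
            grads[i] = np.outer(gz, acts[i - seg_start])
            g = weights[i].T @ gz       # grad wrt this layer's input
    return grads, g
```

The gradients are identical to full-storage backprop; the saving is that only the segment boundaries stay resident, at the cost of one extra forward pass per segment.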

Extensible configuration: video length (3–63 s), resolution (up to 1080p), and style template (10+ built‑in) are configurable via a YAML file.

Typical Use Cases

Social‑media content creation

Generate a 63‑second cartoon clip from a single text prompt.

python sample.py --prompt "A ginger cat sunbathing on a balcony in Ghibli style, warm colors" --duration 63 --style ghibli

Corporate marketing – style transfer

Apply a vintage film style to existing 1‑minute footage while preserving detail.

python scripts/style_transfer.py --input_video ./factory.mp4 --style vintage --output ./factory_vintage.mp4

Research – custom long‑video model fine‑tuning

Fine‑tune on a domain‑specific dataset (e.g., 100 three‑second pet clips) using progressive training stages.

# Example training pipeline (pseudo‑commands)
# 1. Prepare data
# 2. Activate conda environment with CUDA 12.3+ and gcc 11+
# 3. Train sequentially: 3s → 9s → 18s → 30s → 63s
# 4. Resulting model generates 63‑second videos with ~22 GB peak memory
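The staged pipeline above might be driven by a small script like the following. Note that `train.py` and every flag shown are hypothetical placeholders; check the repository's actual training entry point and arguments before running anything:

```python
# Hypothetical driver for the staged fine-tuning pipeline sketched above.
# Builds the command for each stage, warm-starting from the previous one.

STAGES = [3, 9, 18, 30, 63]  # seconds, per the progressive training stages

def stage_commands(data_dir="./pet_clips", out_dir="./ckpts"):
    cmds = []
    prev_ckpt = None
    for secs in STAGES:
        cmd = ["python", "train.py",                 # placeholder entry point
               "--data", data_dir,
               "--duration", str(secs),
               "--output", f"{out_dir}/stage_{secs}s"]
        if prev_ckpt:
            cmd += ["--resume_from", prev_ckpt]      # warm-start from previous stage
        cmds.append(cmd)
        prev_ckpt = f"{out_dir}/stage_{secs}s"
    return cmds

for cmd in stage_commands():
    print(" ".join(cmd))    # inspect before actually launching training
```

Each command could then be launched with `subprocess.run(cmd, check=True)` so a failed stage stops the pipeline instead of silently feeding a bad checkpoint forward.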

Quick‑Start Guide (5 steps)

Step 1 – Clone repository and create environment

# Clone the project (including submodules)
git clone --recursive https://github.com/test-time-training/ttt-video-dit.git
cd ttt-video-dit

# Create conda environment
conda env create -f environment.yaml
conda activate ttt-video

# Install TTT‑MLP kernel (requires CUDA 12.3+ and gcc 11+)
cd ttt-tk && python setup.py install
cd ..

Step 2 – Download pretrained weights

Obtain the CogVideoX‑5B safetensors, VAE, and T5 encoder from Hugging Face and place them under models/:

models/
├─ vae/
├─ t5-encoder/
├─ diffusion_pytorch_model-00001-of-00002.safetensors
└─ diffusion_pytorch_model-00002-of-00002.safetensors

Step 3 – Create generation config (e.g. configs/sample_63s.yaml)

prompt: "A white puppy chasing butterflies in a park, bright Disney style"
duration: 63          # seconds
resolution: "1080p"
style: "disney"
batch_size: 1
device: "cuda"
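Before committing to a roughly ten-minute generation run, the config can be sanity-checked against the documented limits (duration 3–63 s, resolution up to 1080p). The keys below mirror the YAML above; the exact set of accepted resolutions short of 1080p is an assumption:

```python
def validate_config(cfg):
    """Check a generation config against the documented limits:
    duration 3-63 s, resolution up to 1080p, non-empty prompt.
    The list of sub-1080p resolutions is an assumption, not from the docs."""
    errors = []
    if not 3 <= cfg.get("duration", 0) <= 63:
        errors.append("duration must be between 3 and 63 seconds")
    if cfg.get("resolution") not in {"480p", "720p", "1080p"}:  # assumed choices
        errors.append("resolution must be one of 480p/720p/1080p")
    if not cfg.get("prompt"):
        errors.append("prompt must be non-empty")
    return errors

# Dict mirroring the YAML config from Step 3
cfg = {
    "prompt": "A white puppy chasing butterflies in a park, bright Disney style",
    "duration": 63,
    "resolution": "1080p",
    "style": "disney",
    "batch_size": 1,
    "device": "cuda",
}
```

An empty list means the config passes; otherwise the messages say what to fix before launching the run.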

Step 4 – Run generation

python sample.py --config configs/sample_63s.yaml

Step 5 – Inspect output

The resulting MP4 file is saved in outputs/ and can be played directly or imported into editing software.

Performance Notes

Peak GPU memory ≈22 GB for 63‑second generation on RTX 4090.

Low‑memory mode can be enabled with --low_mem true, reducing memory to ~16 GB at the cost of slower inference.

Generation time for a 63‑second clip on RTX 4090 is roughly 10 minutes (depends on prompt complexity).

Repository

https://github.com/test-time-training/ttt-video-dit

Tags: open-source, GPU Optimization, Style Transfer, long video AI, TTT-Video-Dit
Written by Old Meng AI Explorer

Tracking global AI developments 24/7, focusing on large model iterations, commercial applications, and tech ethics. We break down hardcore technology into plain language, providing fresh news, in-depth analysis, and practical insights for professionals and enthusiasts.
