FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance
FancyVideo is an open‑source UNet‑based video generation model that supports arbitrary resolutions, aspect ratios, styles, and motion dynamics. It introduces a Cross‑frame Textual Guidance Module (CTGM) built from a temporal injector, refiner, and booster, achieves state‑of‑the‑art results on multiple benchmarks, and enables versatile applications such as video extension, backtracking, and frame interpolation.
Recently, the open‑source community gained a powerful video generation tool that runs on consumer‑grade GPUs (e.g., a GeForce RTX 3090) and produces videos of any resolution, aspect ratio, style, and motion intensity. The model, called FancyVideo, is a UNet‑based video generation system developed jointly by the 360AI team and Sun Yat‑sen University.
The authors built upon the publicly available 61‑frame model and demonstrated its ability to adapt to different resolutions and aspect ratios, support various artistic styles, and generate videos with varying degrees of motion.
Cross‑frame Textual Guidance Module (CTGM)
Existing text‑to‑video (T2V) approaches typically use spatial cross‑attention, applying the same textual condition to every frame, which limits temporal flexibility. FancyVideo addresses this limitation by designing CTGM, which introduces frame‑specific textual guidance.
CTGM consists of three sub‑modules:
Temporal Information Injector (TII) – injects frame‑specific latent information into the textual condition to obtain cross‑frame text guidance.
Temporal Affinity Refiner (TAR) – refines the correlation matrix between cross‑frame textual conditions and latent features along the time dimension.
Temporal Feature Booster (TFB) – enhances temporal consistency of the latent features.
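The three sub‑modules can be pictured as successive modifications of an ordinary cross‑attention step. The following is a minimal numpy sketch, not the official implementation: the function name `ctgm_sketch`, the mean‑pooled frame summary used for injection, the temperature knob standing in for affinity refinement, and the previous‑frame blend standing in for the feature booster are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def ctgm_sketch(latents, text_emb, temperature=1.0):
    """Illustrative cross-frame textual guidance (assumed, simplified).

    latents:  (F, N, D) per-frame latent tokens (F frames, N tokens, dim D)
    text_emb: (T, D)    shared text-condition embedding (T text tokens)
    Returns:  (F, N, D) per-frame attended features.
    """
    F, N, D = latents.shape
    out = np.empty_like(latents)
    for f in range(F):
        # TII (sketch): inject frame-specific latent information into the
        # shared text condition to get a per-frame text guidance signal.
        frame_summary = latents[f].mean(axis=0, keepdims=True)      # (1, D)
        frame_text = text_emb + frame_summary                       # (T, D)
        # TAR (sketch): compute and refine the text-latent affinity matrix;
        # here refinement is reduced to a temperature on scaled dot-product.
        affinity = latents[f] @ frame_text.T / (np.sqrt(D) * temperature)
        weights = softmax(affinity, axis=-1)                        # (N, T)
        attended = weights @ frame_text                             # (N, D)
        # TFB (sketch): boost temporal consistency by blending with the
        # previous frame's output along the time dimension.
        out[f] = attended if f == 0 else 0.5 * (attended + out[f - 1])
    return out
```

The key contrast with plain spatial cross‑attention is that `frame_text` differs per frame, so each frame attends to its own textual condition rather than one shared prompt embedding.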
The overall training pipeline inserts these temporal layers and the CTGM‑based motion module into a 2D text‑to‑image (T2I) backbone, forming a T2V model. During inference, the first frame is generated via T2I, followed by image‑to‑video (I2V) generation, preserving high image quality while reducing training cost.
To enable motion control, FancyVideo incorporates optical‑flow information extracted by RAFT and time embeddings into the network during training.
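One plausible way to combine flow information with time embeddings is to reduce the flow field to a scalar motion score and embed it like a timestep. This is a hedged sketch under that assumption; the function names, the sinusoidal embedding, and the additive combination are illustrative, not FancyVideo's actual conditioning scheme.

```python
import numpy as np

def sinusoidal_embedding(value, dim=8):
    # standard sinusoidal embedding of a scalar (dim must be even)
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = value * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def motion_conditioned_time_embedding(timestep, flow_fields, dim=8):
    """Illustrative motion conditioning (assumed, simplified).

    flow_fields: (F, H, W, 2) optical flow between frames, e.g. from RAFT
    Returns a (dim,) vector combining the diffusion timestep with an
    average motion-magnitude score, to be injected into the UNet
    alongside the usual time embedding.
    """
    # collapse the flow field to a single motion-intensity scalar
    motion_score = np.linalg.norm(flow_fields, axis=-1).mean()
    t_emb = sinusoidal_embedding(float(timestep), dim)
    m_emb = sinusoidal_embedding(float(motion_score), dim)
    return t_emb + m_emb
```

At inference time, varying the motion score lets a user request calmer or more dynamic videos without changing the prompt.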
Experimental Results
Quantitative and qualitative evaluations on the EvalCrafter benchmark show that FancyVideo outperforms existing T2V models in video quality, text consistency, motion realism, and temporal coherence. Zero‑shot evaluations on UCF‑101 and MSR‑VTT also achieve state‑of‑the‑art scores on IS (video richness) and CLIPSIM (text‑video alignment). Human studies confirm FancyVideo’s superiority across the same dimensions.
Applications
Thanks to its training pipeline, FancyVideo can perform both T2V and I2V tasks, as well as frame interpolation on generated keyframes. It also supports video extension and backtracking operations.
Within a week of release, the community created a ComfyUI plugin for FancyVideo, allowing users to run the model locally. The team plans to release longer, higher‑quality models and a free web interface.
Conclusion
Compared with commercial video‑generation products like Sora, open‑source models evolve more slowly, but FancyVideo gives ordinary users a powerful, free alternative. With continued community effort, video generation can become a practical tool for everyday creative and professional workflows.
360 Tech Engineering
The official technology channel of 360, sharing engineering practice from the brand.