
FancyVideo: Towards Dynamic and Consistent Video Generation via Cross-frame Textual Guidance

FancyVideo is an open‑source, UNet‑based video generation model that supports arbitrary resolutions, aspect ratios, styles, and motion dynamics. It introduces a Cross‑frame Textual Guidance Module (CTGM) built from temporal injector, refiner, and booster sub‑modules, achieves state‑of‑the‑art results on multiple benchmarks, and enables versatile applications such as video extension, backtracking, and frame interpolation.

360 Tech Engineering

Recently, the open‑source community has welcomed a powerful video generation tool that can run on consumer‑grade GPUs (e.g., GeForce RTX 3090) to produce videos of any resolution, aspect ratio, style, and motion intensity. The model, called FancyVideo, is a UNet‑based video generation system jointly developed by the 360AI team and Sun Yat‑sen University.

The authors built upon the publicly available 61‑frame model and demonstrated its ability to adapt to different resolutions and aspect ratios, support various artistic styles, and generate videos with varying degrees of motion.

Cross‑frame Textual Guidance Module (CTGM)

Existing text‑to‑video (T2V) approaches typically use spatial cross‑attention, applying the same textual condition to every frame, which limits temporal flexibility. FancyVideo addresses this limitation by designing CTGM, which introduces frame‑specific textual guidance.
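To make the limitation concrete, the following is a minimal numpy sketch of conventional spatial cross‑attention, where every frame attends to the same text embedding. The shapes and names are illustrative stand‑ins, not FancyVideo's actual tensors.

```python
# Illustrative sketch: in standard T2V cross-attention, all frames share
# one text condition, so the text side carries no temporal information.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

F, N, L, d = 4, 16, 8, 32                  # frames, spatial tokens, text tokens, dim
rng = np.random.default_rng(0)
latents = rng.standard_normal((F, N, d))   # per-frame visual queries
text = rng.standard_normal((L, d))         # ONE text condition, shared by all frames

# Every frame computes keys/values from the identical `text` tensor.
attn = softmax(latents @ text.T / np.sqrt(d))   # (F, N, L)
out = attn @ text                               # (F, N, d)

# The text-side input is frame-independent: no frame-specific guidance.
print(out.shape)  # (4, 16, 32)
```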

CTGM consists of three sub‑modules:

Temporal Information Injector (TII) – injects frame‑specific latent information into the textual condition to obtain cross‑frame text guidance.

Temporal Affinity Refiner (TAR) – refines the correlation matrix between cross‑frame textual conditions and latent features along the time dimension.

Temporal Feature Booster (TFB) – enhances temporal consistency of the latent features.
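The three stages above can be sketched as follows, with simple stand‑ins for each sub‑module: TII mixes a per‑frame latent summary into the text embedding, TAR smooths the latent–text correlation matrix along the time axis, and TFB blends the attended features temporally. These operators are hypothetical approximations for illustration, not the paper's exact designs.

```python
# Hypothetical CTGM sketch: TII -> TAR -> TFB, with toy stand-in operators.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

F, N, L, d = 4, 16, 8, 32                  # frames, spatial tokens, text tokens, dim
rng = np.random.default_rng(1)
latents = rng.standard_normal((F, N, d))
text = rng.standard_normal((L, d))

# TII: inject frame-specific latent info into the text condition,
# turning one shared condition into F cross-frame text conditions.
frame_summary = latents.mean(axis=1, keepdims=True)       # (F, 1, d)
cross_frame_text = text[None] + 0.1 * frame_summary       # (F, L, d)

# TAR: refine the latent-text correlation matrix along the time dimension
# (here, a simple temporal smoothing of the attention weights).
attn = softmax(np.einsum('fnd,fld->fnl', latents, cross_frame_text) / np.sqrt(d))
attn = 0.5 * attn + 0.25 * (np.roll(attn, 1, 0) + np.roll(attn, -1, 0))
attn /= attn.sum(-1, keepdims=True)       # renormalize rows

# TFB: boost temporal consistency of the attended latent features.
feat = np.einsum('fnl,fld->fnd', attn, cross_frame_text)
feat = 0.8 * feat + 0.2 * feat.mean(axis=0, keepdims=True)

print(feat.shape)  # (4, 16, 32)
```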

The overall training pipeline inserts these temporal layers and the CTGM‑based motion module into a 2D text‑to‑image (T2I) backbone, forming a T2V model. During inference, the first frame is generated via T2I, followed by image‑to‑video (I2V) generation, preserving high image quality while reducing training cost.
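The two‑stage inference flow described above can be outlined as below; `t2i_generate` and `i2v_generate` are made‑up stand‑ins (stubbed so the flow is runnable), not FancyVideo's actual API.

```python
# Hypothetical outline of the T2I-then-I2V inference flow, with stub models.
def t2i_generate(prompt):
    # stand-in for the T2I backbone producing a high-quality first frame
    return f"frame0({prompt})"

def i2v_generate(prompt, first_frame, num_frames):
    # stand-in for the temporal module animating the first frame
    return [first_frame] + [f"frame{i}({prompt})" for i in range(1, num_frames)]

def generate_video(prompt, num_frames=8):
    first = t2i_generate(prompt)               # preserves T2I image quality
    return i2v_generate(prompt, first, num_frames)

clip = generate_video("a cat surfing", num_frames=4)
print(len(clip))  # 4
```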

To enable motion control, FancyVideo incorporates optical‑flow information extracted by RAFT and time embeddings into the network during training.
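One plausible way to reduce optical flow (e.g., as estimated by RAFT) to a motion‑intensity condition is the mean flow magnitude over a clip; the sketch below illustrates that idea, though FancyVideo's exact conditioning scheme may differ.

```python
# Hedged sketch: mean optical-flow magnitude as a scalar motion signal.
import numpy as np

def motion_score(flow):
    """flow: (T, H, W, 2) per-pixel displacement vectors for T frames."""
    mag = np.linalg.norm(flow, axis=-1)   # (T, H, W) flow magnitudes
    return float(mag.mean())              # clip-level motion intensity

flow = np.ones((8, 4, 4, 2))              # uniform (1, 1) flow -> magnitude sqrt(2)
print(round(motion_score(flow), 3))       # 1.414
```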

Experimental Results

Quantitative and qualitative evaluations on the EvalCrafter benchmark show that FancyVideo outperforms existing T2V models in video quality, text consistency, motion realism, and temporal coherence. Zero‑shot evaluations on UCF‑101 and MSR‑VTT also achieve state‑of‑the‑art scores on IS (video richness) and CLIPSIM (text‑video alignment). Human studies confirm FancyVideo’s superiority across the same dimensions.

Applications

Thanks to its training pipeline, FancyVideo can perform both T2V and I2V tasks, as well as frame interpolation on generated keyframes. It also supports video extension and backtracking operations.

Within a week of release, the community created a ComfyUI plugin for FancyVideo, allowing users to run the model locally. The team plans to release longer, higher‑quality models and a free web interface.

Conclusion

Compared with commercial video‑generation products such as Sora, open‑source models evolve more slowly, but FancyVideo provides ordinary users with a powerful, free alternative. Continued community effort is expected to make video generation a practical tool for everyday creative and professional workflows.

video generation, AI research, temporal modeling, cross-frame guidance, UNet
Written by 360 Tech Engineering

Official tech channel of 360, building the most professional technology aggregation platform for the brand.