FlashVideo Sets New SOTA for Faster, High‑Fidelity High‑Resolution Video Generation

FlashVideo introduces a two‑stage diffusion framework that first ensures prompt fidelity at low resolution with a 5‑billion‑parameter DiT, then efficiently adds fine details at high resolution using flow matching, achieving state‑of‑the‑art quality with dramatically lower compute cost.

Overview

Recent advances in diffusion modeling, large‑scale architectures, and massive datasets have propelled text‑to‑video (T2V) generation forward. While DiT‑based models scale well, achieving high prompt fidelity and fine visual detail at high resolution still demands huge parameter counts and a large number of function evaluations (NFE). FlashVideo addresses this with a novel two‑stage pipeline that strategically allocates model capacity and NFE across stages.

Method and Model

1. Framework Overview

FlashVideo first compresses video frames into latent features with a 3D causal variational auto‑encoder (VAE) (Yang et al., 2024). The goal is to generate 6‑second, 1080p videos at 8 fps. As illustrated in Figure 2, the pipeline consists of a low‑resolution stage I (a 5 B‑parameter DiT) and a high‑resolution stage II (a 2 B‑parameter DiT), both employing 3D RoPE for spatio‑temporal position encoding. Stage I operates at 270p and secures prompt fidelity with an ample NFE budget; stage II performs flow matching to upscale to full resolution in only about four function evaluations.
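To make the capacity and NFE split concrete, here is a minimal sketch of the two‑stage inference flow in PyTorch‑style Python. The method names (`stage1.sample`, `stage2.flow_match`, `vae.decode`), the latent layout, and the 4× latent upsampling are hypothetical placeholders for illustration, not FlashVideo's actual API.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate_1080p(prompt_emb, stage1, stage2, vae,
                   nfe_stage1=50, nfe_stage2=4):
    """Hypothetical two-stage sampler: many NFE at 270p, few NFE at 1080p."""
    # Stage I: full diffusion sampling at 270p with a generous step budget;
    # this is where prompt fidelity and coherent motion are established.
    z_lr = stage1.sample(prompt_emb, num_steps=nfe_stage1)  # (B, C, T, h, w)

    # Upsample the 270p latent to the 1080p latent grid (4x spatially);
    # this coarse estimate, not pure noise, is the starting point of stage II.
    z0 = F.interpolate(z_lr, scale_factor=(1, 4, 4),
                       mode="trilinear", align_corners=False)

    # Stage II: integrate the learned flow from z0 toward the sharp
    # high-resolution latent in only ~4 Euler steps.
    z_hr = stage2.flow_match(prompt_emb, z0, num_steps=nfe_stage2)

    return vae.decode(z_hr)  # pixel-space 6-second, 1080p, 8 fps clip
```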

2. Low‑Resolution Stage I

Stage I uses the large‑capacity CogVideoX‑5B model (5 B parameters), fine‑tuned via parameter‑efficient fine‑tuning (PEFT). LoRA adapters of rank 128 are inserted into all attention layers (Vaswani et al., 2017), feed‑forward networks, and adaptive layer norms (Perez et al., 2018). PEFT proves more robust than full‑parameter fine‑tuning: at a batch size of 32, full‑parameter updates noticeably degrade quality.
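A minimal sketch of such a rank‑128 adapter wrapped around a frozen linear layer is shown below, in plain PyTorch. The alpha/rank scaling convention follows the original LoRA paper; the wrapper itself is illustrative rather than FlashVideo's code.

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable rank-r update: W x + (alpha/r) B A x."""

    def __init__(self, base: nn.Linear, rank: int = 128, alpha: float = 128.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)         # PEFT: the 5 B backbone stays frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))
```

In practice, such wrappers would replace the query/key/value/output projections, the FFN linears, and the adaptive layer‑norm modulation layers of the backbone, leaving only the low‑rank factors trainable.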

3. High‑Resolution Stage II

Stage II adopts a CogVideoX‑2B‑style architecture but replaces the original positional‑frequency embedding with 3D RoPE (Su et al., 2024) for better scalability. Unlike methods that rely on spatio‑temporal decomposition and time‑slice attention (He et al., 2024), FlashVideo uses full 3D attention to maintain consistent fine‑grained details across large motions and scale changes.
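A compact sketch of a factorized 3D RoPE follows: the per‑head channel dimension is split into three chunks, and each chunk is rotated by the token's temporal, height, or width index. The even three‑way split is an assumption for illustration; CogVideoX‑style models may allocate channels across axes differently.

```python
import torch

def rope_1d(x, pos, base=10000.0):
    """Standard interleaved rotary embedding over the last dimension of x."""
    d = x.shape[-1]
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = pos[..., None].float() * freqs       # (..., d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]           # consecutive channel pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin          # 2-D rotation per pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x, t_idx, h_idx, w_idx):
    """Rotate three chunks of the head dim by the token's (t, h, w) indices."""
    d = x.shape[-1]
    assert d % 6 == 0, "head dim must split into three even chunks"
    c = d // 3
    return torch.cat([rope_1d(x[..., :c], t_idx),
                      rope_1d(x[..., c:2 * c], h_idx),
                      rope_1d(x[..., 2 * c:], w_idx)], dim=-1)
```

For a token at latent‑grid position (t, h, w), `t_idx`, `h_idx`, and `w_idx` are the flattened per‑token coordinate tensors, and the same rotation is applied to queries and keys before attention.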

To avoid running a costly diffusion process from scratch, FlashVideo employs flow matching (Liu et al., 2022; Lipman et al., 2022) to map low‑resolution latents to high‑resolution latents along a straight path defined by linear interpolation between the two latent points, eliminating redundant sampling steps and extra control parameters (Zhang et al., 2023a; Yu et al., 2024; He et al., 2024). At inference, a simple Euler solver with 4–6 steps suffices.
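The training objective and the few‑step sampler can be sketched as below; `model` is a stand‑in for the stage II DiT predicting a velocity field, and the straight‑line (rectified‑flow) parameterization matches the linear interpolation described above.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, z_lr_up, z_hr, cond):
    """Rectified-flow objective: regress the constant velocity along the
    straight line from the upsampled low-res latent to the high-res latent."""
    t = torch.rand(z_hr.shape[0], device=z_hr.device)    # t ~ U(0, 1)
    t_ = t.view(-1, *([1] * (z_hr.dim() - 1)))           # broadcast over latent dims
    z_t = (1.0 - t_) * z_lr_up + t_ * z_hr               # linear interpolation
    v_target = z_hr - z_lr_up                            # dz/dt along the line
    return F.mse_loss(model(z_t, t, cond), v_target)

@torch.no_grad()
def euler_sample(model, z_lr_up, cond, num_steps=4):
    """Integrate the learned ODE from t=0 to t=1 with a plain Euler solver."""
    z, dt = z_lr_up, 1.0 / num_steps
    for k in range(num_steps):
        t = torch.full((z.shape[0],), k * dt, device=z.device)
        z = z + dt * model(z, t, cond)
    return z
```

Because the path starts at the upsampled low‑resolution latent rather than at Gaussian noise, most of the large‑scale structure is already in place and the solver only has to traverse a short, nearly straight trajectory, which is why so few steps are enough.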

Experiments and Results

1. Data Collection

A curated corpus of 2 M high‑quality 1080p clips was assembled, filtered by aesthetic scores and by motion scores computed with RAFT optical flow (Teed & Deng, 2020) using a motion‑score threshold of 1.1. Additionally, 1.5 M high‑resolution images (2560 × 1440) were collected. Detailed captions generated by an internal captioning model annotate all media, and a human‑aligned subset of 50 k videos was selected for fine‑tuning.
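As a hedged illustration of the filtering pass: the article specifies only the motion cutoff of 1.1, so the aesthetic threshold below and the assumption that the cutoff discards near‑static clips are illustrative choices, not the authors' exact pipeline.

```python
def filter_clips(clips, motion_min=1.1, aesthetic_min=5.0):
    """Keep clips that are both aesthetic and non-static.

    `clips` is assumed to be an iterable of dicts carrying precomputed
    'aesthetic' and 'motion' (mean RAFT flow magnitude) scores.
    """
    return [c for c in clips
            if c["motion"] >= motion_min and c["aesthetic"] >= aesthetic_min]
```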

2. Training Settings

Stage I was trained on 270p video frames for 50 k iterations (batch size 32, learning rate 4e‑4, AdamW, weight decay 0.01, gradient clipping 0.1). Stage II underwent three pre‑training phases: (1) 25 k iterations on 540 × 960 patches cropped from 2048 × 2048 images; (2) 30 k iterations on a mixed image‑video dataset (1:2 ratio); (3) 5 k iterations on full‑resolution videos, followed by 700 iterations of human‑preference alignment. Latent‑degradation noise steps were initially sampled from the range 600‑900 and later narrowed to 650‑750 based on the ablation in Table 10.
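A sketch of one stage I optimization step with the hyperparameters above (AdamW, learning rate 4e‑4, weight decay 0.01, gradient‑norm clipping at 0.1); `model.loss` is a hypothetical stand‑in for the diffusion training objective.

```python
import torch

def train_step(model, batch, optimizer, max_grad_norm=0.1):
    """One stage I update with the listed hyperparameters; `model.loss`
    is a hypothetical stand-in for the diffusion training objective."""
    optimizer.zero_grad(set_to_none=True)
    loss = model.loss(batch)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(
        [p for p in model.parameters() if p.requires_grad], max_grad_norm)
    optimizer.step()
    return loss.item()

# Optimizer as described above, over the trainable LoRA parameters only:
# torch.optim.AdamW(lora_params, lr=4e-4, weight_decay=0.01)
```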

3. Qualitative Results

Figure 4 shows stage I outputs with accurate motion and prompt fidelity but lacking fine structures (red boxes). Stage II enriches textures, restores small object details, and removes artifacts (green boxes), improving faces, fur, foliage, and fabric. Additional examples in Figure 5 highlight artifact correction and detail enhancement.

4. Quantitative Results

Using VBench‑Long (Huang et al., 2024) with five videos per prompt, FlashVideo achieved semantic scores above 81 at both 8 fps and 24 fps. Stage I alone scored 60.47 (270p) on aesthetics and 61.39 on imaging quality; after stage II, the scores rose to 62.29 and 66.21 respectively (Table 1). End‑to‑end inference takes ~2 minutes, far faster than HunyuanVideo (a single‑stage 13 B model that takes 1742 s at 720p). Users can preview a 270p clip in ~30 s before committing to stage II.

Comprehensive image‑quality metrics (MUSIQ↑, MANIQA↑, CLIPIQA↑, NIQE↓) in Table 2 confirm consistent improvements across all metrics. Video‑level DOVER (Wu et al., 2023) also shows significant gains after stage II.

5. Comparison with Video‑Enhancement Methods

FlashVideo was benchmarked against VEnhancer (He et al., 2024), Upscale‑a‑Video (Zhou et al., 2024), and RealBasicVSR (Chan et al., 2022) on a curated 100‑prompt “Texture100” set. FlashVideo consistently outperformed competitors in both quantitative scores and visual fidelity, while being ~7× faster than VEnhancer. Notably, VEnhancer’s time‑slice attention caused identity drift in long‑range motion sequences (Figure 7), whereas FlashVideo’s full 3D attention preserved consistent facial features and textures.

Conclusion

FlashVideo presents a decoupled two‑stage framework that first secures prompt fidelity at low resolution and then efficiently injects high‑frequency details at high resolution. Extensive ablations demonstrate that strategic allocation of model capacity and NFE yields state‑of‑the‑art video quality with dramatically reduced compute, offering a practical solution for commercial deployment.
