FlightVGM: FPGA-Accelerated Inference for Video Generation Models Wins Best Paper at FPGA 2025
The FlightVGM paper, awarded Best Paper at FPGA 2025, details a novel FPGA-based inference accelerator for video generation models that leverages temporal-spatial activation sparsity, mixed-precision DSP58 extensions, and adaptive scheduling to achieve a 1.30× speedup and 4.49× better energy efficiency than an NVIDIA 3090 GPU while preserving model accuracy.
At the FPGA 2025 conference, the Best Paper award was presented to FlightVGM, a video-generation model inference accelerator jointly developed by researchers from Shanghai Jiao Tong University, Tsinghua University, and the startup 无问芯穹 (Infinigence AI). This marks the first time the award has gone to a work led entirely by a mainland Chinese team.
FlightVGM implements efficient inference for Video Generation Models (VGMs) on FPGA, following the team’s earlier work FlightLLM for large language models. Running on an AMD V80 FPGA, FlightVGM delivers a 1.30× speedup and a 4.49× improvement in energy efficiency over an NVIDIA 3090 GPU, even though the GPU’s peak compute throughput exceeds the FPGA’s by more than 21×.
The paper’s core contributions are threefold:
1. Temporal‑Spatial Activation Sparsity: By measuring cosine similarity between token activations across frames (temporal) and within frames (spatial), highly similar, redundant activations are skipped, dramatically reducing compute load.
2. Float‑Int Mixed‑Precision DSP58 Extension: Critical modules (e.g., attention) retain FP16 precision while less‑critical parts are quantized to INT8, realized through a novel DSP‑Expansion (DSP‑E) architecture that dynamically reconfigures DSP58 resources.
3. Dynamic‑Static Adaptive Scheduling: The scheduler reorders operations based on actual workload to mitigate load imbalance caused by sparsity.
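The temporal side of the sparsity idea can be illustrated with a short sketch. This is a simplified illustration of cosine-similarity-based token skipping, not the paper's actual implementation; the array layout, the similarity threshold, and the frame-to-frame comparison scheme are all assumptions made here for clarity.

```python
import numpy as np

def temporal_sparsity_mask(frames, threshold=0.9):
    """Mark which tokens must be recomputed in each frame.

    frames: array of shape (num_frames, num_tokens, dim) holding
            token activations per video frame (illustrative layout).
    Returns a boolean mask of shape (num_frames, num_tokens);
    False means the token is similar enough to the same token in the
    previous frame that its previous result can be reused.
    """
    recompute = np.ones(frames.shape[:2], dtype=bool)  # frame 0: compute all
    for t in range(1, frames.shape[0]):
        prev, cur = frames[t - 1], frames[t]
        # Cosine similarity per token between consecutive frames.
        num = (prev * cur).sum(axis=-1)
        denom = (np.linalg.norm(prev, axis=-1)
                 * np.linalg.norm(cur, axis=-1) + 1e-8)
        sim = num / denom
        # Skip (reuse) tokens whose similarity exceeds the threshold.
        recompute[t] = sim < threshold
    return recompute
```

Tokens flagged `False` reuse the previous frame's activations, which is where the compute savings come from; the paper applies the same similarity test spatially within a frame as well.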
Experimental results show that FlightVGM incurs negligible accuracy loss (average drop of 0.008, or 0.042 with full INT8 quantization) while delivering superior performance and energy efficiency compared with GPU baselines. The sparsity‑aware and mixed‑precision design enables the FPGA implementation to outperform the GPU despite the latter’s higher raw compute capability.
Looking forward, the authors propose exploring AI Engine (AIE) plus High‑Bandwidth Memory (HBM) architectures to further boost efficiency for video‑generation workloads, positioning FPGA as a key platform for future large‑model inference.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.