FingER: Fine-Grained, Reasoning‑Based Evaluation of AI‑Generated Videos
This article introduces FingER, a novel entity‑level evaluation framework and the FingER‑Instruct‑60k dataset for assessing AI‑generated video quality with fine‑grained reasoning, and demonstrates its state‑of‑the‑art performance on multiple benchmarks using advanced training strategies such as GRPO.
Conference and Paper Information
ACM International Conference on Multimedia (ACM MM) is a top‑tier conference in the multimedia field. ACM MM 2025 will be held in Dublin, Ireland; it received 4,672 submissions and accepted 1,251 papers (a 26% acceptance rate). Two papers from the Gaode team were accepted.
Paper Title: FingER: Content Aware Fine‑grained Evaluation with Reasoning for AI‑Generated Videos
Paper Link: https://arxiv.org/pdf/2504.10358
Code Repository: https://github.com/AMAP-ML/FingER
Research Background
Current video generation models (e.g., Kling, Vidu) produce videos of generally good overall quality but exhibit local defects such as hand distortion and temporal inconsistency. Multimodal large models like GPT‑4o fail to detect these fine‑grained issues when prompted with simple overall scoring or entity‑level questions alone.
Our work, FingER, addresses this gap by providing entity‑level questions and an interpretable reasoning process.
Core Contributions
Entity‑level evaluation framework: FingER combines an entity‑level question generation module with a multimodal model that yields explanations, assessing visual quality, text‑to‑video alignment, temporal consistency, factual consistency, and dynamic degree.
Fine‑grained reasoning dataset: FingER‑Instruct‑60k contains 3.3k AI‑generated videos (from Kling, Vidu, Luma, etc.) and 60k entity‑level Q/A pairs with reasoning, covering complex scenes and verified for annotation quality.
Training methods to enhance logical reasoning: We explore supervised fine‑tuning (SFT), SFT + Reasoning, and GRPO with a cold‑start strategy, showing that GRPO improves reasoning and generalization.
State‑of‑the‑art performance: Using only one‑tenth of the training videos, FingER outperforms baselines such as VideoScore on public benchmarks.
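The key idea behind GRPO is to replace a learned value critic with a group‑relative baseline: sample several responses per prompt, then normalize each response's reward by the group's mean and standard deviation. A minimal sketch of that advantage computation (the reward values are hypothetical, and this is an illustration of the general technique, not the paper's training code):

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each sampled response's reward
    by the mean and (population) std of its group -- no value critic."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Hypothetical rewards for 4 sampled answers to one entity-level question
advs = group_relative_advantages([1.0, 0.5, 0.0, 0.5])
```

Responses scoring above the group mean receive positive advantages and are reinforced; below‑mean responses are penalized, which is what pushes the model toward better‑reasoned answers after the cold start.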
Experimental Results
Zero‑shot evaluation on Qwen2.5‑VL‑7B shows that entity‑level scoring yields higher SRCC/PLCC correlation with human judgments than overall or dimension‑level scoring. Adding reasoning improves visual quality scores but reduces factual consistency.
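For readers unfamiliar with the metrics: SRCC is the Spearman rank correlation (monotonic agreement with human rankings) and PLCC is the Pearson linear correlation (linear agreement with human scores). A self‑contained sketch with hypothetical scores; in practice `scipy.stats.spearmanr` / `pearsonr` do the same job:

```python
def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

def ranks(x):
    """Ranks starting at 1, averaging over ties."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tied run i..j
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def srcc_plcc(pred, human):
    """SRCC is Pearson correlation applied to the ranks."""
    return pearson(ranks(pred), ranks(human)), pearson(pred, human)

pred = [3.8, 2.1, 4.5, 1.0]   # hypothetical model scores
human = [4.0, 2.5, 4.2, 1.5]  # hypothetical human ratings
s, p = srcc_plcc(pred, human)
```

Both metrics lie in [-1, 1]; higher values mean the automatic scores track human judgments more closely, which is the axis along which the tables below compare methods.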
Table 1 compares zero‑shot performance on the FingER‑test dataset, demonstrating superior correlation.
Table 2 shows that fine‑tuning with reasoning (SFT + Reasoning) improves text‑to‑video alignment (SRCC/PLCC from 73.85/77.98 to 79.34/83.16). Incorporating cold‑start GRPO further boosts temporal and factual consistency.
Table 3 reports zero‑shot results on public benchmarks (GenAI‑Bench, MonetBench), where FingER achieves SRCC/PLCC of 57.03/56.59 and tau/diff of 58.00/62.80, surpassing VideoScore and VQAScore.
Conclusion
We emphasize the importance of fine‑grained reasoning for AI‑generated video quality assessment and introduce FingER, a five‑dimension entity‑level evaluation framework. The high‑quality FingER‑Instruct‑60k dataset and training strategies (including GRPO with cold‑start) enable state‑of‑the‑art performance on multiple benchmarks while using only 3.3k video samples.
Amap Tech
Official Amap technology account showcasing all of Amap's technical innovations.