FingER: Fine-Grained, Reasoning‑Based Evaluation of AI‑Generated Videos
This article introduces FingER, a novel entity‑level evaluation framework and the FingER‑Instruct‑60k dataset for assessing AI‑generated video quality with fine‑grained reasoning, and demonstrates its state‑of‑the‑art performance on multiple benchmarks using advanced training strategies such as GRPO.
Conference and Paper Information
ACM International Conference on Multimedia (ACM MM) is a top‑tier conference in the multimedia field. ACM MM 2025 will be held in Dublin, Ireland; it received 4,672 submissions and accepted 1,251 papers (a 26% acceptance rate). Two papers from the Gaode team were accepted.
Paper Title: FingER: Content Aware Fine‑grained Evaluation with Reasoning for AI‑Generated Videos
Paper Link: https://arxiv.org/pdf/2504.10358
Code Repository: https://github.com/AMAP-ML/FingER
Research Background
Current video generation models (e.g., Kling, Vidu) produce videos of generally good overall quality but exhibit local defects such as hand distortion and temporal inconsistency. Multimodal large models like GPT‑4o fail to detect these fine‑grained issues when prompted with simple overall scoring or entity‑level questions alone.
Our work, FingER, addresses this gap by providing entity‑level questions and an interpretable reasoning process.
Core Contributions
Entity‑level evaluation framework: FingER combines an entity‑level question generation module with a multimodal model that yields explanations, assessing visual quality, text‑to‑video alignment, temporal consistency, factual consistency, and dynamic degree.
Fine‑grained reasoning dataset: FingER‑Instruct‑60k contains 3.3k AI‑generated videos (from Kling, Vidu, Luma, etc.) and 60k entity‑level Q/A pairs with reasoning, covering complex scenes and verified for annotation quality.
Training methods to enhance logical reasoning: We explore supervised fine‑tuning (SFT), SFT + Reasoning, and GRPO with a cold‑start strategy, showing that GRPO improves reasoning and generalization.
State‑of‑the‑art performance: Using only one‑tenth of the training videos, FingER outperforms baselines such as VideoScore on public benchmarks.
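The key idea behind GRPO is to replace a learned value critic with a group‑relative baseline: sample several responses per prompt, then normalize each response's reward by the group's mean and standard deviation. A minimal sketch of that advantage computation (the reward values are hypothetical, and this is an illustration of the general technique, not the paper's training code):

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantages: normalize each sampled response's reward
    by the mean and (population) std of its group -- no value critic."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]

# Hypothetical rewards for 4 sampled answers to one entity-level question
advs = group_relative_advantages([1.0, 0.5, 0.0, 0.5])
```

Responses scoring above the group mean receive positive advantages and are reinforced; below‑mean responses are penalized, which is what pushes the model toward better‑reasoned answers after the cold start.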
Experimental Results
Zero‑shot evaluation on Qwen2.5‑VL‑7B shows that entity‑level scoring yields higher SRCC/PLCC correlation with human judgments than overall or dimension‑level scoring. Adding reasoning improves visual quality scores but reduces factual consistency.
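For readers unfamiliar with the metrics: SRCC is the Spearman rank correlation (monotonic agreement with human rankings) and PLCC is the Pearson linear correlation (linear agreement with human scores). A self‑contained sketch with hypothetical scores; in practice `scipy.stats.spearmanr` / `pearsonr` do the same job:

```python
def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

def ranks(x):
    """Ranks starting at 1, averaging over ties."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for the tied run i..j
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def srcc_plcc(pred, human):
    """SRCC is Pearson correlation applied to the ranks."""
    return pearson(ranks(pred), ranks(human)), pearson(pred, human)

pred = [3.8, 2.1, 4.5, 1.0]   # hypothetical model scores
human = [4.0, 2.5, 4.2, 1.5]  # hypothetical human ratings
s, p = srcc_plcc(pred, human)
```

Both metrics lie in [-1, 1]; higher values mean the automatic scores track human judgments more closely, which is the axis along which the tables below compare methods.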
Table 1 compares zero‑shot performance on the FingER‑test dataset, demonstrating superior correlation.
Table 2 shows that fine‑tuning with reasoning (SFT + Reasoning) improves text‑to‑video alignment (SRCC/PLCC from 73.85/77.98 to 79.34/83.16). Incorporating cold‑start GRPO further boosts temporal and factual consistency.
Table 3 reports zero‑shot results on public benchmarks (GenAI‑Bench, MonetBench), where FingER achieves SRCC/PLCC of 57.03/56.59 and tau/diff of 58.00/62.80, surpassing VideoScore and VQAScore.
Conclusion
We emphasize the importance of fine‑grained reasoning for AI‑generated video quality assessment and introduce FingER, a five‑dimension entity‑level evaluation framework. The high‑quality FingER‑Instruct‑60k dataset and training strategies (including GRPO with cold‑start) enable state‑of‑the‑art performance on multiple benchmarks while using only 3.3k video samples.
Amap Tech
Official Amap technology account showcasing all of Amap's technical innovations.