FingER: Fine-Grained Evaluation and Reasoning for AI-Generated Videos
The paper introduces FingER, an entity-level evaluation framework and the FingER-Instruct-60k dataset for assessing AI-generated video quality with fine-grained reasoning, and demonstrates state-of-the-art zero-shot performance on multiple benchmarks using novel training strategies.
Research Background
Current video generation models such as Kling and Vidu produce videos that look good overall but still contain local defects (e.g., distorted hands). Detecting these fine-grained quality issues requires deep semantic understanding; multimodal large models such as GPT-4o cannot reliably identify these defects even with detailed prompts.
Core Contributions
Entity-level evaluation framework (FingER): Generates entity-level questions and provides an explainable reasoning process using a multimodal LLM. It assesses AI-generated videos across five dimensions: visual quality, text-to-video alignment, temporal consistency, factual consistency, and dynamic degree.
FingER-Instruct-60k dataset: Contains 3.3k AI-generated videos from models such as Kling, Vidu, and Luma, together with 60k entity-level Q/A pairs and reasoning steps, covering complex scenes and verified for annotation quality.
Training methods to enhance logical reasoning: Explores SFT (supervised fine-tuning), SFT + Reasoning, and GRPO (Group Relative Policy Optimization) with a cold-start strategy, demonstrating that GRPO significantly improves reasoning and generalization.
State-of-the-art performance: With only about one-tenth of the training videos used by prior work, FingER outperforms existing methods (e.g., VideoScore) on public benchmarks, demonstrating strong generalization.
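The entity-level idea can be sketched roughly as follows. This is a hypothetical illustration, not the paper's actual pipeline: it assumes each entity-level question is answered with a score in [0, 1], and that per-dimension and overall scores come from simple means (equal weighting is an assumption here).

```python
from statistics import mean

# Hypothetical entity-level answers, grouped by the five FingER
# dimensions; each value scores one entity-level question, e.g.
# "Are the person's hands rendered without distortion?"
answers = {
    "visual_quality": [1.0, 0.5, 1.0],
    "text_to_video_alignment": [1.0, 1.0],
    "temporal_consistency": [0.5, 1.0],
    "factual_consistency": [1.0],
    "dynamic_degree": [0.5],
}

# Per-dimension score: mean over that dimension's entity-level answers.
dimension_scores = {dim: mean(scores) for dim, scores in answers.items()}

# Overall score: unweighted mean over the five dimensions (assumption).
overall = mean(dimension_scores.values())
```

The point of the sketch is the granularity: a defect on a single entity (one low answer) lowers its dimension score without being washed out by an otherwise good video.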
Experimental Results
Zero-shot tests on Qwen2.5-VL-7B show that entity-level scoring yields higher SRCC/PLCC correlations with human judgments than overall or dimension-level scoring. Adding a reasoning step improves correlation on visual quality but slightly reduces it on factual consistency.
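For reference, SRCC (Spearman rank correlation) and PLCC (Pearson linear correlation) measure how well model scores track human scores in rank order and in linear fit, respectively. A minimal stdlib-only sketch (in practice one would use `scipy.stats.spearmanr`/`pearsonr`; the tie-free rank computation below is a simplification):

```python
from statistics import mean, pstdev

def pearson(x, y):
    """PLCC: Pearson linear correlation coefficient."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

def spearman(x, y):
    """SRCC: Pearson correlation computed on ranks (no tie handling)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(x), ranks(y))

# Illustrative scores for five videos (made-up numbers).
model_scores = [0.81, 0.42, 0.77, 0.55, 0.90]
human_scores = [0.80, 0.40, 0.70, 0.60, 0.95]
srcc = spearman(model_scores, human_scores)  # rank agreement
plcc = pearson(model_scores, human_scores)   # linear agreement
```

Here the two score lists agree perfectly in rank order, so SRCC is 1.0 even though the values themselves differ; PLCC is high but below 1.0.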
Fine-tuned models using SFT + Reasoning achieve notable gains in text-to-video alignment (SRCC/PLCC from 73.85/77.98 to 79.34/83.16). Incorporating cold-start GRPO further boosts temporal and factual consistency, confirming the benefit of reinforcement learning for deep reasoning.
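The "group-wise" part of GRPO can be sketched in a few lines. This is a simplified illustration, not the paper's implementation: the full GRPO objective also involves a clipped policy ratio and a KL penalty, both omitted here.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: each sampled response's reward is
    normalized against the mean/std of its group, i.e. the set of
    responses sampled for the same prompt. Responses better than the
    group average get positive advantage, worse ones negative."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled answers to one prompt, scored by a rule-based reward
# (e.g., agreement of predicted entity-level answers with annotations).
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Because the baseline is the group mean rather than a learned value function, no separate critic model is needed, which is what makes GRPO comparatively cheap to apply on top of a cold-start SFT model.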
On unseen video generation models, FingER achieves state-of-the-art zero-shot results on GenAI-Bench (SRCC/PLCC = 57.03/56.59) and MonetBench (tau/diff = 58.00/62.80), surpassing VideoScore and VQAScore.
Conclusion
We emphasize the importance of fine-grained reasoning for AI-generated video quality assessment and introduce FingER, an entity-level evaluation framework with five assessment dimensions. The high-quality FingER-Instruct-60k dataset and the explored training strategies (SFT + Reasoning, Zero-GRPO, cold-start GRPO) enable the model to achieve state-of-the-art performance on the FingER-test benchmark and two public benchmarks, despite using only 3.3k training videos.
Amap Tech
Official Amap technology account showcasing all of Amap's technical innovations.
