FingER: Fine-Grained Evaluation and Reasoning for AI-Generated Videos

The paper introduces FingER, an entity-level evaluation framework and the FingER-Instruct-60k dataset for assessing AI-generated video quality with fine-grained reasoning, and demonstrates state-of-the-art zero-shot performance on multiple benchmarks using novel training strategies.


Research Background

Current video generation models such as Kling and Vidu produce videos that look good overall but still contain local defects (e.g., a distorted hand). Detecting these fine-grained quality issues requires deeper semantic understanding; even multimodal large models like GPT-4o cannot reliably identify such defects, despite detailed prompting.

Core Contributions

Entity-level evaluation framework (FingER): generates entity-level questions and provides an explainable reasoning process using a multimodal LLM. It assesses AI-generated videos across five dimensions: visual quality, text-to-video alignment, temporal consistency, factual consistency, and dynamic degree.
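
To make the idea concrete, here is a minimal Python sketch of entity-level scoring, assuming a generic multimodal-LLM client. The question wording, the ask_mllm callable, and the yes/no aggregation are illustrative assumptions, not the paper's actual pipeline.

```python
# Minimal sketch of entity-level, per-dimension scoring. Everything here is
# a simplified assumption; the paper's real prompts and scoring differ.

DIMENSIONS = [
    "visual quality",
    "text-to-video alignment",
    "temporal consistency",
    "factual consistency",
    "dynamic degree",
]

def entity_questions(entities, dimension):
    """Build one yes/no question per entity for a given dimension."""
    return [
        f"Regarding {dimension}: is the {e} in the video free of defects?"
        for e in entities
    ]

def evaluate_video(video_path, entities, ask_mllm):
    """Score each dimension as the fraction of entity questions answered 'yes'."""
    scores = {}
    for dim in DIMENSIONS:
        answers = [ask_mllm(video_path, q) for q in entity_questions(entities, dim)]
        scores[dim] = sum(a.lower().startswith("yes") for a in answers) / len(answers)
    return scores

# Stubbed usage: any callable answering (video, question) -> text will work.
print(evaluate_video("clip.mp4", ["dog", "frisbee"], lambda v, q: "Yes, it looks fine."))
```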

FingER-Instruct-60k dataset: contains 3.3k AI-generated videos from models such as Kling, Vidu, and Luma, together with 60k entity-level Q/A pairs and reasoning steps, covering complex scenes and verified for annotation quality.
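
For concreteness, here is one hypothetical record shape; the field names are assumptions inferred from the description above, not the released schema.

```python
# Hypothetical FingER-Instruct-60k record; field names and values are
# illustrative assumptions, not the actual dataset format.
record = {
    "video_id": "kling_00123",          # hypothetical identifier
    "source_model": "Kling",
    "prompt": "A chef chopping vegetables in a sunlit kitchen",
    "entity": "chef's hands",
    "dimension": "visual quality",
    "question": "Are the chef's hands rendered without distortion?",
    "reasoning": "Several frames show six fingers on the left hand, so ...",
    "answer": "no",
}
```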

Training methods to enhance logical reasoning: explores SFT (supervised fine-tuning), SFT + Reasoning, and GRPO (Group Relative Policy Optimization) with a cold-start strategy, demonstrating that GRPO significantly improves reasoning and generalization.
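
GRPO's key mechanic is scoring each sampled response against the mean reward of its own sample group, removing the need for a separate critic model. A minimal sketch of one common form of that group-relative advantage computation follows; the reward values are made up.

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantage (one common GRPO formulation): normalize each
    sampled response's reward by the mean and std of its own group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four sampled answers to one entity-level question, rewarded 1.0
# for a correct, well-formatted answer and 0.0 otherwise (made-up values).
print(grpo_advantages([1.0, 0.0, 1.0, 1.0]))
```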

State-of-the-art performance: with only one-tenth of the training videos used by existing methods, FingER outperforms them (e.g., VideoScore) on public benchmarks, showing strong generalization.

Experimental Results

Zero-shot tests on Qwen2.5-VL-7B show that entity-level scoring yields higher SRCC/PLCC correlation with human judgments than overall or dimension-level scoring. Adding reasoning improves visual quality scores but slightly reduces factual consistency.
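
SRCC and PLCC are the Spearman rank and Pearson linear correlations between model scores and human ratings. A self-contained example with dummy numbers, using scipy.stats:

```python
from scipy.stats import pearsonr, spearmanr

# Dummy numbers only: model scores vs. human ratings for five videos.
model_scores = [0.62, 0.81, 0.45, 0.90, 0.73]
human_scores = [3.1, 4.2, 2.5, 4.6, 3.8]

srcc, _ = spearmanr(model_scores, human_scores)  # rank (monotonic) agreement
plcc, _ = pearsonr(model_scores, human_scores)   # linear agreement
print(f"SRCC={srcc:.3f}, PLCC={plcc:.3f}")
```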

Fine-tuned models using SFT + Reasoning achieve notable gains in text-to-video alignment (SRCC/PLCC from 73.85/77.98 to 79.34/83.16). Incorporating cold-start GRPO further boosts temporal and factual consistency, confirming the benefit of reinforcement learning for deep reasoning.

On unseen video generation models, FingER achieves state-of-the-art zero-shot results on GenAI-Bench (SRCC/PLCC = 57.03/56.59) and MonetBench (tau/diff = 58.00/62.80), surpassing VideoScore and VQAScore.

Conclusion

The paper emphasizes the importance of fine-grained reasoning for AI-generated video quality assessment and introduces FingER, an entity-level evaluation framework with five assessment dimensions. The high-quality FingER-Instruct-60k dataset and the explored training strategies (SFT + Reasoning, Zero-GRPO, cold-start GRPO) enable the model to achieve state-of-the-art performance on the FingER-test benchmark and two public benchmarks, despite using only 3.3k training videos.

Tags: reasoning, zero-shot, dataset, multimodal LLM, AI-generated video, fine-grained evaluation
Written by Amap Tech

Official Amap technology account showcasing all of Amap's technical innovations.