How ViDA-UGC Leverages Large Multimodal Models for Fine-Grained Visual Quality Assessment
The article introduces ViDA-UGC, a large-scale UGC visual-quality dataset, and its companion benchmark ViDA-Bench; explains the MILP-driven sampling, the expert annotation pipeline, and the CoT-based evaluation framework; and shows how fine-tuning popular multimodal LLMs on this data markedly improves low-level quality perception, grounding, and description capabilities.
Background
Visual quality is critical for multimedia platforms. Subjective tests are accurate but low‑throughput; CNN/Transformer‑based objective models lack explainability. Recent multimodal large models can perceive quality and generate natural‑language explanations, suggesting a new approach for objective assessment.
ViDA‑UGC Dataset Construction
ViDA‑UGC is a UGC‑focused visual‑quality dataset with 587,000 image‑text pairs and over 100,000 fine‑grained annotations. The construction pipeline has four stages:
Raw data sampling: collected ~100k images (or video frames) from four public image-quality datasets, two super-resolution datasets, several video-quality sets, and indoor image collections.
Quantitative filtering: measured each image on color, sharpness, brightness, contrast, entropy, and aspect ratio. A mixed-integer linear programming (MILP) sampler then selected 11,534 images whose feature distributions stay approximately normal across all dimensions, ensuring diversity (a sketch of this step follows the list).
Annotation: defined ten common UGC defects (low clarity, motion blur, blocky artifacts, etc.). Annotators performed three-phase labeling (pre-labeling, formal labeling, re-labeling) to produce MOS scores, defect categories, and bounding-box coordinates.
Data synthesis: used GPT‑4o to generate detailed degradation descriptions from the MOS, defect-type, and bounding-box metadata, producing three instruction-type subsets: ViDA‑Grounding (localization), ViDA‑Perception (multiple‑choice perception), and ViDA‑Description (full‑image quality narration).
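The paper's exact feature definitions and MILP formulation are not reproduced in this summary. The sketch below, assuming OpenCV/NumPy proxies for the six filtering dimensions and PuLP as the MILP solver, only illustrates the general idea: pick a fixed-size subset whose per-dimension histograms stay close to a target (roughly normal) profile. All function names, bin counts, and target shapes are illustrative assumptions.

```python
# Sketch of distribution-preserving MILP sampling (illustrative, not the paper's exact setup).
import cv2
import numpy as np
import pulp

def image_features(path):
    """Rough proxies for the six filtering dimensions used in the paper."""
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    hist = cv2.calcHist([gray], [0], None, [256], [0, 256]).ravel()
    p = hist / hist.sum()
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return {
        "color": img.std(axis=(0, 1)).mean(),            # crude colour spread
        "sharpness": cv2.Laplacian(gray, cv2.CV_64F).var(),
        "brightness": gray.mean(),
        "contrast": gray.std(),
        "entropy": entropy,
        "aspect_ratio": img.shape[1] / img.shape[0],
    }

def milp_sample(features, k, n_bins=10):
    """Select k items whose per-dimension bin counts track a roughly normal target."""
    n = len(features)
    dims = list(features[0].keys())
    x = [pulp.LpVariable(f"x_{i}", cat="Binary") for i in range(n)]
    prob = pulp.LpProblem("vida_sampling", pulp.LpMinimize)
    prob += pulp.lpSum(x) == k                            # fixed selection budget
    deviations = []
    for d in dims:
        vals = np.array([f[d] for f in features])
        edges = np.quantile(vals, np.linspace(0, 1, n_bins + 1))
        bins = np.digitize(vals, edges[1:-1])             # bin index per image, 0..n_bins-1
        # Target profile: a (roughly) normal shape over the bins, scaled to k picks.
        target = np.exp(-0.5 * ((np.arange(n_bins) - (n_bins - 1) / 2) / 2) ** 2)
        target = target / target.sum() * k
        for b in range(n_bins):
            count = pulp.lpSum(x[i] for i in range(n) if bins[i] == b)
            t = float(target[b])
            dev = pulp.LpVariable(f"dev_{d}_{b}", lowBound=0)
            prob += count - t <= dev                      # dev >= |count - target|
            prob += t - count <= dev
            deviations.append(dev)
    prob += pulp.lpSum(deviations)                        # minimise total deviation
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [i for i in range(n) if x[i].value() > 0.5]
```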
ViDA‑Bench Benchmark
From the annotated pool, 476 images were selected for evaluation. The benchmark contains 476 full‑image descriptions, 2,567 perception questions, and 3,106 grounding entries. Distribution analysis shows low clarity as the most frequent defect, matching real‑world UGC patterns.
Chain‑of‑Thought (CoT) Quality Evaluation Framework
Simple “think step by step” prompts failed to elicit accurate answers. Inspired by how human experts reason, a custom CoT framework was designed that decomposes a quality‑assessment query into four steps: (1) defect identification, (2) severity estimation, (3) impact on the visual experience, and (4) a natural‑language explanation. Applying this framework to baseline multimodal models (Qwen‑VL‑Chat, Qwen2‑VL‑7B‑Instruct, InternVL2.5‑8B, InternVL3‑8B) significantly raised answer accuracy across grounding, perception, and description tasks.
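The paper's exact prompt wording is not given in this summary; the minimal sketch below only mirrors the four-step structure described above. The template text and the helper name are illustrative assumptions.

```python
# Minimal sketch of the four-step CoT prompt structure (wording is illustrative).
COT_TEMPLATE = """You are an expert in image quality assessment. Before answering,
reason through the following steps:
1. Defect identification: list the visible quality defects and where they occur.
2. Severity estimation: judge how severe each defect is.
3. Impact analysis: explain how these defects affect the overall viewing experience.
4. Conclusion: give the final answer to the question below.

Question: {question}
"""

def build_cot_prompt(question: str) -> str:
    return COT_TEMPLATE.format(question=question)

# Example: wrap a ViDA-Bench-style multiple-choice perception question.
print(build_cot_prompt(
    "Which defect is most prominent in this image? "
    "(A) motion blur (B) overexposure (C) blocky artifacts (D) none"
))
```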
Experiments and Results
Baseline models were fine‑tuned on either the public Q‑Instruct dataset or the newly released ViDA‑UGC. Evaluation used both Q‑Bench and ViDA‑Bench.
On ViDA‑Bench, baseline models scored on average 29% lower than on Q‑Bench, confirming the benchmark's higher difficulty and finer granularity.
Training on ViDA‑UGC consistently improved low‑level perception and overall quality scores; Q‑Instruct sometimes caused performance degradation.
Strong models (Qwen2‑VL‑7B, InternVL3‑8B) degraded when fine‑tuned on Q‑Instruct but gained further improvements with ViDA‑UGC. Weaker models (Qwen‑VL) also saw notable gains, surpassing their Q‑Instruct results.
The CoT framework alone boosted models without any quality‑specific fine‑tuning across all dimensions; on reasoning tasks, some of them even outperformed their Q‑Instruct‑fine‑tuned counterparts.
Analysis of Training and Benchmark Comparisons
Comparing training sets (ViDA‑UGC vs. Q‑Instruct) shows that ViDA‑UGC raises both overall perception and low‑level quality perception. In contrast, Q‑Instruct can reduce performance, especially for models already strong in quality perception (e.g., Qwen2‑VL‑7B, InternVL3‑8B). For weaker models (e.g., Qwen‑VL), ViDA‑UGC yields the largest relative gains.
Benchmark comparison (ViDA‑Bench vs. Q‑Bench) reveals an average score drop of 29% for the same baselines, indicating that ViDA‑Bench exposes deficiencies hidden by coarser benchmarks.
Implications
ViDA‑UGC enriches multimodal LLMs with low‑level visual knowledge, while ViDA‑Bench provides a robust, fine‑grained evaluation suite that reveals hidden deficiencies. The open‑source code and dataset are available at:
https://github.com/DYEvaLab/ViDA-MIPI-code
arXiv paper: https://arxiv.org/pdf/2508.12605