How ViDA-UGC Leverages Large Multimodal Models for Fine-Grained Visual Quality Assessment
The article introduces ViDA-UGC, a large-scale UGC visual-quality dataset, and its companion benchmark ViDA-Bench; explains the MILP-driven sampling, the expert annotation pipeline, and the CoT-based evaluation framework; and shows how fine-tuning popular multimodal LLMs on this data markedly improves low-level quality perception, grounding, and description capabilities.
Background
Visual quality is critical for multimedia platforms. Subjective tests are accurate but low‑throughput; CNN/Transformer‑based objective models lack explainability. Recent multimodal large models can perceive quality and generate natural‑language explanations, suggesting a new approach for objective assessment.
ViDA‑UGC Dataset Construction
ViDA‑UGC is a UGC‑focused visual‑quality dataset with 587,000 image‑text pairs and over 100,000 fine‑grained annotations. The construction pipeline has four stages:
Raw data sampling: collected ~100k images (or video frames) from four public image-quality datasets, two super-resolution datasets, several video-quality sets, and indoor image collections.
Quantitative filtering: measured each image on color, sharpness, brightness, contrast, entropy, and aspect ratio. A mixed-integer linear programming (MILP) sampler then selected 11,534 images whose feature distributions stay approximately normal across all dimensions, ensuring diversity (a sketch of this step follows the list).
Annotation: defined ten common UGC defects (low clarity, motion blur, blocky artifacts, etc.). Annotators performed three-phase labeling (pre-labeling, formal labeling, re-labeling) to produce MOS scores, defect categories, and bounding-box coordinates.
Data synthesis: used GPT‑4o to generate detailed degradation descriptions from the MOS, defect-type, and bounding-box metadata, producing three instruction-type subsets: ViDA‑Grounding (localization), ViDA‑Perception (multiple‑choice perception), and ViDA‑Description (full‑image quality narration).
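The paper's exact feature definitions and MILP formulation are not reproduced in this summary. The sketch below, assuming OpenCV/NumPy proxies for the six filtering dimensions and PuLP as the MILP solver, only illustrates the general idea: pick a fixed-size subset whose per-dimension histograms stay close to a target (roughly normal) profile. All function names, bin counts, and target shapes are illustrative assumptions.

```python
# Sketch of distribution-preserving MILP sampling (illustrative, not the paper's exact setup).
import cv2
import numpy as np
import pulp

def image_features(path):
    """Rough proxies for the six filtering dimensions used in the paper."""
    img = cv2.imread(path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    hist = cv2.calcHist([gray], [0], None, [256], [0, 256]).ravel()
    p = hist / hist.sum()
    entropy = -np.sum(p[p > 0] * np.log2(p[p > 0]))
    return {
        "color": img.std(axis=(0, 1)).mean(),            # crude colour spread
        "sharpness": cv2.Laplacian(gray, cv2.CV_64F).var(),
        "brightness": gray.mean(),
        "contrast": gray.std(),
        "entropy": entropy,
        "aspect_ratio": img.shape[1] / img.shape[0],
    }

def milp_sample(features, k, n_bins=10):
    """Select k items whose per-dimension bin counts track a roughly normal target."""
    n = len(features)
    dims = list(features[0].keys())
    x = [pulp.LpVariable(f"x_{i}", cat="Binary") for i in range(n)]
    prob = pulp.LpProblem("vida_sampling", pulp.LpMinimize)
    prob += pulp.lpSum(x) == k                            # fixed selection budget
    deviations = []
    for d in dims:
        vals = np.array([f[d] for f in features])
        edges = np.quantile(vals, np.linspace(0, 1, n_bins + 1))
        bins = np.digitize(vals, edges[1:-1])             # bin index per image, 0..n_bins-1
        # Target profile: a (roughly) normal shape over the bins, scaled to k picks.
        target = np.exp(-0.5 * ((np.arange(n_bins) - (n_bins - 1) / 2) / 2) ** 2)
        target = target / target.sum() * k
        for b in range(n_bins):
            count = pulp.lpSum(x[i] for i in range(n) if bins[i] == b)
            t = float(target[b])
            dev = pulp.LpVariable(f"dev_{d}_{b}", lowBound=0)
            prob += count - t <= dev                      # dev >= |count - target|
            prob += t - count <= dev
            deviations.append(dev)
    prob += pulp.lpSum(deviations)                        # minimise total deviation
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [i for i in range(n) if x[i].value() > 0.5]
```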
ViDA‑Bench Benchmark
From the annotated pool, 476 images were selected for evaluation. The benchmark contains 476 full‑image descriptions, 2,567 perception questions, and 3,106 grounding entries. Distribution analysis shows low clarity as the most frequent defect, matching real‑world UGC patterns.
Chain‑of‑Thought (CoT) Quality Evaluation Framework
Simple “think step by step” prompts failed to elicit accurate answers. Inspired by how human experts reason, a custom CoT framework was designed that decomposes a quality‑assessment query into four steps: (1) defect identification, (2) severity estimation, (3) impact on the visual experience, and (4) a natural‑language explanation. Applying this framework to baseline multimodal models (Qwen‑VL‑Chat, Qwen2‑VL‑7B‑Instruct, InternVL2.5‑8B, InternVL3‑8B) significantly raised answer accuracy across grounding, perception, and description tasks.
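The paper's exact prompt wording is not given in this summary; the minimal sketch below only mirrors the four-step structure described above. The template text and the helper name are illustrative assumptions.

```python
# Minimal sketch of the four-step CoT prompt structure (wording is illustrative).
COT_TEMPLATE = """You are an expert in image quality assessment. Before answering,
reason through the following steps:
1. Defect identification: list the visible quality defects and where they occur.
2. Severity estimation: judge how severe each defect is.
3. Impact analysis: explain how these defects affect the overall viewing experience.
4. Conclusion: give the final answer to the question below.

Question: {question}
"""

def build_cot_prompt(question: str) -> str:
    return COT_TEMPLATE.format(question=question)

# Example: wrap a ViDA-Bench-style multiple-choice perception question.
print(build_cot_prompt(
    "Which defect is most prominent in this image? "
    "(A) motion blur (B) overexposure (C) blocky artifacts (D) none"
))
```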
Experiments and Results
Baseline models were fine‑tuned on either the public Q‑Instruct dataset or the newly released ViDA‑UGC. Evaluation used both Q‑Bench and ViDA‑Bench.
On ViDA‑Bench, baseline models scored on average 29% lower than on Q‑Bench, confirming the benchmark's higher difficulty and finer granularity.
Training on ViDA‑UGC consistently improved low‑level perception and overall quality scores; Q‑Instruct sometimes caused performance degradation.
Strong models (Qwen2‑VL‑7B, InternVL3‑8B) degraded when fine‑tuned on Q‑Instruct but gained further improvements with ViDA‑UGC. Weaker models (Qwen‑VL) also saw notable gains, surpassing their Q‑Instruct results.
The CoT framework alone boosted models without any quality‑specific fine‑tuning across all dimensions; on reasoning tasks, some of them even outperformed their Q‑Instruct‑fine‑tuned counterparts.
Analysis of Training and Benchmark Comparisons
Comparing training sets (ViDA‑UGC vs. Q‑Instruct) shows that ViDA‑UGC raises both overall perception and low‑level quality perception. In contrast, Q‑Instruct can reduce performance, especially for models already strong in quality perception (e.g., Qwen2‑VL‑7B, InternVL3‑8B). For weaker models (e.g., Qwen‑VL), ViDA‑UGC yields the largest relative gains.
Benchmark comparison (ViDA‑Bench vs. Q‑Bench) reveals an average score drop of 29% for the same baselines, indicating that ViDA‑Bench exposes deficiencies hidden by coarser benchmarks.
Implications
ViDA‑UGC enriches multimodal LLMs with low‑level visual knowledge, while ViDA‑Bench provides a robust, fine‑grained evaluation suite that reveals hidden deficiencies. The open‑source code and dataset are available at:
https://github.com/DYEvaLab/ViDA-MIPI-code
arXiv paper: https://arxiv.org/pdf/2508.12605