Teaching Large Language Models to Predict Image Quality Scores with DeQA-Score
DeQA-Score, a CVPR 2025 work, trains multimodal large language models to regress continuous image quality scores by discretizing scores into soft-label level tokens, preserving the Gaussian statistics of human annotations and achieving state-of-the-art score-regression performance.
Motivation
Image quality assessment (IQA) requires a numeric score that downstream pipelines can consume directly. Human annotations are typically aggregated into a Gaussian distribution: the mean opinion score (MOS) is its mean, and annotator disagreement gives its variance. Multimodal large language models (MLLMs) operate on discrete tokens, so the core problem is how to regress a continuous, Gaussian-distributed score from discrete token predictions.
Discretizing the Continuous Score
The continuous MOS is mapped to five textual level tokens: "bad", "poor", "fair", "good", "excellent". The prior work Q-Align (ICML 2024) splits the MOS range uniformly and assigns a one-hot label to the interval containing the MOS. DeQA-Score instead integrates the Gaussian probability mass over each interval, producing a soft label that retains the full distributional information.
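A minimal sketch of this interval-integration step, assuming a 1–5 score range split into five uniform bins (function names are illustrative, not from the paper's code):

```python
import math

def gaussian_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def soft_label(mos, std, levels=5, lo=1.0, hi=5.0):
    """Integrate the Gaussian annotation distribution over each level's
    interval; returns one probability per level token."""
    edges = [lo + i * (hi - lo) / levels for i in range(levels + 1)]
    edges[0], edges[-1] = -math.inf, math.inf  # outer bins absorb the tails
    return [gaussian_cdf(b, mos, std) - gaussian_cdf(a, mos, std)
            for a, b in zip(edges[:-1], edges[1:])]
```

Because the outer bins are extended to infinity, the five probabilities always sum to one, no matter where the Gaussian sits in the score range.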
Reconstructing a Continuous Score
Both Q‑Align and DeQA‑Score map the five level tokens to the integer scores 1–5. DeQA‑Score then computes a weighted average of these integers using the soft‑label probabilities, yielding an estimate of the original MOS. A one‑hot label, in contrast, simply returns the integer associated with the selected token.
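A sketch of the reconstruction step; `expected_score` is an illustrative name, and the soft-label values below are made-up examples:

```python
def expected_score(probs, scores=(1, 2, 3, 4, 5)):
    """Weighted average of the integer level scores under the
    predicted token probabilities."""
    return sum(p * s for p, s in zip(probs, scores))

# Soft label: recovers a continuous score between the integers (close to 3.2).
print(expected_score([0.0, 0.1, 0.6, 0.3, 0.0]))
# One-hot label: can only ever return the selected integer, here 3.0.
print(expected_score([0.0, 0.0, 1.0, 0.0, 0.0]))
```

This is why soft labels reconstruct the MOS far more accurately: the expectation interpolates between levels, while one-hot rounds every image to the nearest integer.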
Advantages of Soft Labels
Higher discretization accuracy: reconstructed MOS error ≈ 0.01 versus ≈ 0.30 for one‑hot (≈ 30× improvement).
Variance preservation: Jensen‑Shannon divergence between reconstructed and ground‑truth Gaussian is only 0.001; one‑hot discards variance entirely.
Inter‑image relationship fidelity: soft labels differentiate a large quality gap (image A vs B) while grouping images of similar quality (B vs C), which one‑hot fails to do.
Token relational structure: one‑hot assumes equal distances between all token pairs (orthogonal embedding), whereas soft labels partially retain the true ordinal distances among level tokens.
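The Jensen–Shannon divergence used in these comparisons can be sketched for discrete level-token distributions as follows (a standard textbook definition, not code from the paper):

```python
import math

def kl_div(p, q):
    """KL(p || q) for discrete distributions; 0*log(0/q) is taken as 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_div(p, q):
    """Symmetric Jensen-Shannon divergence via the mixture distribution."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_div(p, m) + 0.5 * kl_div(q, m)
```

JSD is symmetric, always finite, and zero only for identical distributions, which makes it a natural metric for comparing a reconstructed level-token distribution against the ground-truth one.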
Model Training
Standard language tokens are trained with next‑token prediction. For level tokens, a KL‑divergence loss forces the predicted token distribution toward the constructed soft label. Because different datasets exhibit varying annotator variance, a fidelity loss from the UNIQUE framework (TIP 2021) is added to supervise relative image‑quality ordering, encouraging the model to capture pairwise preferences in addition to absolute scores.
Figure: standard tokens are trained with next‑token prediction; level tokens with a KL‑divergence loss.
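The two level-token losses can be sketched in plain Python as follows (the actual training uses tensor ops; the `eps` smoothing and pairwise-preference inputs are assumptions for illustration):

```python
import math

def kl_loss(soft_label, pred_probs, eps=1e-12):
    """KL(soft_label || predicted) over the five level tokens,
    pulling the predicted token distribution toward the soft label."""
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(soft_label, pred_probs))

def fidelity_loss(pred_pref, true_pref):
    """Fidelity loss (UNIQUE, TIP 2021) on the probability that one
    image of a pair is preferred over the other; zero when they match."""
    return 1.0 - (math.sqrt(pred_pref * true_pref)
                  + math.sqrt((1.0 - pred_pref) * (1.0 - true_pref)))
```

The KL term supervises each image's absolute score distribution, while the fidelity term supervises relative ordering across image pairs, which is what makes training robust to datasets with different annotator variances.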
Experimental Results
Qualitative inspection shows that models trained with one‑hot labels collapse to a single level token, deviating from the ground‑truth Gaussian distribution. DeQA‑Score’s predictions closely match the full distribution.
Quantitatively, DeQA‑Score achieves state‑of‑the‑art regression performance:
Mean‑score reconstruction error ≈ 0.01 (vs 0.30 for one‑hot).
Jensen‑Shannon divergence for the full Gaussian ≈ 0.001.
Variance reconstruction aligns with ground truth, enabling full distribution recovery.
References
[1] Q‑Align: Teaching LMMs for Visual Scoring via Discrete Text‑Defined Levels, ICML 2024.
[2] Uncertainty‑aware Blind Image Quality Assessment in the Laboratory and Wild, TIP 2021.
[3] Depicting Beyond Scores: Advancing Image Quality Assessment through Multi‑modal Language Models, ECCV 2024.
AIWalker
Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.