Teaching Large Language Models to Predict Image Quality Scores with DeQA-Score

DeQA-Score, a CVPR 2025 work, shows how to train multimodal large language models to regress continuous image quality scores. It discretizes each score into level tokens with soft labels that preserve the statistics of the underlying Gaussian distribution, and it achieves state-of-the-art performance.

AIWalker

Motivation

Image quality assessment (IQA) requires a numeric score that is easy to use in downstream pipelines. Human annotations are typically aggregated into a Gaussian distribution: the mean opinion score (MOS) is the mean and annotator disagreement forms the variance. Multimodal large language models (LLMs) operate on discrete tokens, so the core problem is how to regress a continuous Gaussian score from discrete token predictions.
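As a minimal sketch of the aggregation step described above, the snippet below summarizes one image's raw ratings as a Gaussian (MOS and standard deviation). The function names are hypothetical, and the linear rescaling onto a 1 to 5 scale is an assumption of this sketch rather than a detail stated here.

```python
import statistics


def ratings_to_gaussian(ratings):
    """Summarize annotator ratings as (MOS, standard deviation)."""
    mos = statistics.fmean(ratings)
    std = statistics.stdev(ratings)  # sample std across annotators; needs >= 2 ratings
    return mos, std


def rescale_gaussian(mos, std, lo, hi, new_lo=1.0, new_hi=5.0):
    """Assumed linear map of a Gaussian from a dataset's [lo, hi] scale to [new_lo, new_hi]."""
    scale = (new_hi - new_lo) / (hi - lo)
    # The mean is shifted and scaled; the standard deviation is only scaled.
    return new_lo + (mos - lo) * scale, std * scale


if __name__ == "__main__":
    mos, std = ratings_to_gaussian([3, 4, 4, 5, 4])  # hypothetical ratings on a 1-5 scale
    print(mos, std)
    print(rescale_gaussian(62.0, 12.0, 0.0, 100.0))  # hypothetical 0-100 dataset
```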

Discretizing the Continuous Score

The continuous MOS is mapped to five textual level tokens: bad, poor, fair, good, and excellent. Prior work, Q‑Align (ICML 2024) [1], splits the MOS range uniformly into five intervals and assigns a one‑hot label to the interval that contains the MOS. DeQA‑Score instead integrates the Gaussian probability mass over each interval, producing a soft label that retains the full distributional information.
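The two discretization schemes can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the levels are assumed to sit at the integers 1 to 5, the Q‑Align-style hard label splits [1, 5] into five equal-width intervals, and each soft-label interval is assumed to span ±0.5 around its level, with the two outer intervals left open-ended so the probabilities sum to one. All function names are hypothetical.

```python
import math

LEVELS = ["bad", "poor", "fair", "good", "excellent"]
CENTERS = [1.0, 2.0, 3.0, 4.0, 5.0]  # assumed integer score of each level token


def normal_cdf(x, mu, sigma):
    """CDF of N(mu, sigma^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def one_hot_label(mos, lo=1.0, hi=5.0):
    """Hard label: split [lo, hi] into five equal intervals and mark the one containing mos."""
    width = (hi - lo) / len(LEVELS)
    idx = int((mos - lo) / width)
    idx = max(0, min(idx, len(LEVELS) - 1))  # clamp, so mos == hi falls in the top interval
    return [1.0 if i == idx else 0.0 for i in range(len(LEVELS))]


def soft_label(mu, sigma):
    """Soft label: Gaussian probability mass inside each level's interval."""
    # Assumed boundaries: midway between adjacent centers; outer intervals open-ended.
    edges = [-math.inf] + [c + 0.5 for c in CENTERS[:-1]] + [math.inf]
    probs = []
    for a, b in zip(edges[:-1], edges[1:]):
        cdf_a = 0.0 if a == -math.inf else normal_cdf(a, mu, sigma)
        cdf_b = 1.0 if b == math.inf else normal_cdf(b, mu, sigma)
        probs.append(cdf_b - cdf_a)
    return probs


if __name__ == "__main__":
    mu, sigma = 3.3, 0.6  # a hypothetical Gaussian annotation
    print(dict(zip(LEVELS, one_hot_label(mu))))
    print({name: round(p, 3) for name, p in zip(LEVELS, soft_label(mu, sigma))})
```

For a MOS of 3.3, the hard label marks only "fair", whereas the soft label also places mass on the neighboring "poor" and "good" levels.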

Figure: Discretization comparison

Reconstructing a Continuous Score

Both Q‑Align and DeQA‑Score map the five tokens to the integer scores 1 to 5. DeQA‑Score then computes a weighted average of these integers using the soft‑label probabilities, which recovers an estimate of the original MOS. A one‑hot label, by contrast, can only return the integer associated with its selected token.
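A minimal sketch of the reconstruction, assuming the level tokens map to the integers 1 to 5: the estimated MOS is the probability-weighted mean of those integers, and the same weights also give a variance, which is how a soft label can carry annotator disagreement that a one‑hot label loses. The helper name is hypothetical, and because of binning the recovered variance only approximates the original σ².

```python
LEVEL_SCORES = (1.0, 2.0, 3.0, 4.0, 5.0)  # bad, poor, fair, good, excellent (assumed mapping)


def reconstruct(probs, scores=LEVEL_SCORES):
    """Return (mean, variance) of the level scores under the given token probabilities."""
    total = sum(probs)  # normalize defensively in case probs do not sum exactly to 1
    mean = sum(p * s for p, s in zip(probs, scores)) / total
    var = sum(p * (s - mean) ** 2 for p, s in zip(probs, scores)) / total
    return mean, var


if __name__ == "__main__":
    print(reconstruct([0.0, 0.0, 1.0, 0.0, 0.0]))  # one-hot "fair": mean 3.0, variance 0.0
    print(reconstruct([0.02, 0.14, 0.54, 0.27, 0.03]))  # hypothetical soft label
```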

Advantages of Soft Labels

Higher discretization accuracy: reconstructed MOS error ≈ 0.01 versus ≈ 0.30 for one‑hot (≈ 30× improvement).

Variance preservation: the Jensen‑Shannon divergence between the reconstructed and ground‑truth Gaussians is only 0.001; one‑hot labels discard the variance entirely.

Inter‑image relationship fidelity: soft labels separate images with a large quality gap (image A vs. B) while keeping images of similar quality close together (B vs. C), which one‑hot labels fail to do.

Token relational structure: one‑hot labels treat every pair of tokens as equally distant (orthogonal embeddings), whereas soft labels partially retain the true ordinal distances among level tokens.
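The Jensen‑Shannon divergence used above to compare the reconstructed and ground‑truth Gaussians can be approximated numerically. This sketch uses a midpoint Riemann sum over a fixed grid with natural logarithms; the grid bounds, resolution, log base, and function names are assumptions, so treat it as an illustration of the quantity rather than a reproduction of the reported 0.001.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def js_divergence_gaussians(mu1, s1, mu2, s2, lo=-10.0, hi=15.0, n=20000):
    """JS divergence (natural log) between two 1-D Gaussians via a midpoint Riemann sum."""
    dx = (hi - lo) / n
    js = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        p = normal_pdf(x, mu1, s1)
        q = normal_pdf(x, mu2, s2)
        m = 0.5 * (p + q)  # mixture density
        # JS = 0.5 * KL(P || M) + 0.5 * KL(Q || M); zero densities contribute nothing.
        if p > 0.0:
            js += 0.5 * p * math.log(p / m) * dx
        if q > 0.0:
            js += 0.5 * q * math.log(q / m) * dx
    return js


if __name__ == "__main__":
    # Hypothetical ground-truth vs. reconstructed Gaussians on the 1-5 scale.
    print(js_divergence_gaussians(3.30, 0.60, 3.31, 0.62))
```

With natural logarithms the divergence lies between 0 (identical distributions) and ln 2 (non-overlapping distributions).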

Figure: Soft label advantages

Model Training

Standard language tokens are trained with the usual next‑token prediction loss. For level tokens, a KL‑divergence loss pulls the predicted token distribution toward the constructed soft label. Because annotator variance differs across datasets, DeQA‑Score also adds the fidelity loss from the UNIQUE framework (TIP 2021) [2], which supervises the relative quality ordering of image pairs and so encourages the model to capture pairwise preferences in addition to absolute scores.
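The two level-token losses can be sketched as follows in plain Python rather than a deep-learning framework. Assumptions: the KL term is taken as KL(soft label || prediction), and the fidelity loss follows UNIQUE's Thurstone-style formulation, in which the probability that image x is better than image y is Φ((μx − μy) / √(σx² + σy²)), computed once from the ground-truth Gaussians and once from the model's predicted means and standard deviations. How DeQA‑Score derives those predicted statistics (presumably from the level-token probabilities) is not detailed here, so they are passed in directly. The next-token prediction loss for ordinary text tokens is omitted, and all names are hypothetical.

```python
import math


def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred) over the five level tokens; eps guards against log(0)."""
    return sum(t * math.log((t + eps) / (p + eps)) for t, p in zip(target, pred) if t > 0)


def std_normal_cdf(z):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_x_better(mu_x, sd_x, mu_y, sd_y):
    """Thurstone-style probability that image x has higher quality than image y."""
    return std_normal_cdf((mu_x - mu_y) / math.sqrt(sd_x ** 2 + sd_y ** 2))


def fidelity_loss(gt_x, gt_y, pred_x, pred_y):
    """Fidelity loss for one image pair; each argument is a (mean, std) tuple."""
    p = prob_x_better(*gt_x, *gt_y)          # ground-truth preference
    p_hat = prob_x_better(*pred_x, *pred_y)  # predicted preference
    # 1 minus the Bhattacharyya coefficient between two Bernoulli distributions.
    return 1.0 - math.sqrt(p * p_hat) - math.sqrt((1.0 - p) * (1.0 - p_hat))


if __name__ == "__main__":
    soft = [0.02, 0.14, 0.54, 0.27, 0.03]       # hypothetical soft label
    predicted = [0.05, 0.15, 0.45, 0.30, 0.05]  # hypothetical model output
    print("KL loss:", kl_divergence(soft, predicted))
    print("fidelity loss:", fidelity_loss((3.4, 0.6), (2.1, 0.7), (3.2, 0.5), (2.4, 0.6)))
```

The fidelity loss is zero when the predicted pairwise preference matches the ground-truth one and grows as the predicted ordering diverges from it.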

Figure: Training pipeline (next‑token prediction + KL‑divergence loss)

Figure: Fidelity loss

Experimental Results

Qualitative inspection shows that models trained with one‑hot labels collapse onto a single level token, so their predicted distributions deviate from the ground‑truth Gaussian. DeQA‑Score's predictions closely match the full distribution.

Figure: Distribution comparison

Quantitatively, DeQA‑Score achieves state‑of‑the‑art regression performance:

Mean‑score reconstruction error ≈ 0.01 (vs 0.30 for one‑hot).

Jensen‑Shannon divergence for the full Gaussian ≈ 0.001.

Variance reconstruction aligns with ground truth, enabling full distribution recovery.

Figure: Mean score results
Figure: Variance reconstruction

References

[1] Q‑Align: Teaching LMMs for Visual Scoring via Discrete Text‑Defined Levels, ICML 2024.

[2] Uncertainty‑aware Blind Image Quality Assessment in the Laboratory and Wild, TIP 2021.

[3] Depicting Beyond Scores: Advancing Image Quality Assessment through Multi‑modal Language Models, ECCV 2024.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by AIWalker

Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.
