CVPR 2025: DeQA-Score Lets LLMs Predict Image Quality Score Distributions
DeQA-Score introduces a soft-label discretization that lets multimodal large language models regress continuous image-quality scores as Gaussian distributions. Trained with KL-divergence and fidelity losses, it cuts the mean reconstruction error roughly 30-fold while preserving score variance and inter-image relationships, reaching state-of-the-art performance.
Why this project?
DepictQA demonstrated that natural‑language descriptions can capture the nuanced aspects of image quality, but such descriptions cannot serve directly as a numeric metric. Review feedback highlighted the need for an easy‑to‑use, accurate scalar score. DeQA‑Score therefore regresses a precise image‑quality score from a multimodal large language model (LMM) while still preserving the expressive power of language.
Main challenge
The target quality score is modeled as a Gaussian distribution: the mean equals the mean opinion score (MOS) aggregated from multiple annotators, and the variance reflects annotator disagreement. LMMs, however, generate discrete token sequences, creating a mismatch between a continuous Gaussian target and a discrete vocabulary.
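As a concrete sketch of the target construction (assuming raw annotator ratings on the same 1–5 scale; names are illustrative, not the paper's code):

```python
import statistics

def gaussian_target(ratings):
    """Fit the Gaussian score target for one image: the mean is the
    MOS over annotators, the variance captures their disagreement."""
    mu = statistics.mean(ratings)        # mean opinion score (MOS)
    var = statistics.pvariance(ratings)  # annotator disagreement
    return mu, var

# e.g. five annotators rating the same image
mu, var = gaussian_target([4, 4, 3, 5, 4])
```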
Discretizing the continuous score into level tokens
To bridge the gap, the continuous score is mapped to five ordered level tokens – bad, poor, fair, good, excellent. Prior work (Q‑Align) uses uniform binning and assigns a one‑hot label to each interval, effectively treating the token index as the score (1–5). DeQA‑Score instead integrates the Gaussian probability mass over each interval, producing a soft label – a probability distribution over the five tokens.
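A minimal sketch of the soft-label construction (the unit-wide bin edges centred on scores 1–5 are an assumption for illustration; the paper's exact binning may differ):

```python
import math

LEVELS = ["bad", "poor", "fair", "good", "excellent"]  # scored 1..5

def gaussian_cdf(x, mu, sigma):
    """Gaussian CDF expressed through the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def soft_label(mu, sigma):
    """Integrate N(mu, sigma^2) over five unit-wide bins centred on
    1..5, then renormalize so the clipped tails do not leak mass."""
    edges = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5]
    probs = [gaussian_cdf(edges[i + 1], mu, sigma) - gaussian_cdf(edges[i], mu, sigma)
             for i in range(5)]
    total = sum(probs)
    return [p / total for p in probs]
```

A one-hot label is then the degenerate case sigma → 0, where all probability mass falls into a single bin.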
Reconstructing a Gaussian from level tokens
For a one‑hot label, the predicted token index is directly taken as the score (e.g., good → 4). With a soft label, DeQA‑Score multiplies each token’s probability by its integer score and sums the results to obtain the mean. The same probabilities are used to compute the variance, yielding a full Gaussian prediction.
Advantages of the soft‑label approach
More accurate discretization: the mean absolute error between the reconstructed mean and the MOS drops from ~0.30 (one-hot) to ~0.01, a 30× improvement.
Preserves Gaussian variance: the KL divergence between reconstructed and ground-truth Gaussians is only 0.001, whereas one-hot discretization discards variance entirely.
Maintains inter-image relationships: when images A and B have a large quality gap but their scores fall into the same bin, one-hot assigns both the same token, while the soft label correctly yields A < B. Conversely, for images B and C with similar quality straddling a bin boundary, the soft labels are nearly identical, unlike the inconsistent one-hot assignment.
Reflects ordinal token distances: one-hot labels place all token pairs at equal Euclidean distance (orthogonal embeddings), contradicting the ordinal nature of the quality levels; the soft-label distribution partially retains these ordinal relationships.
Model training
Standard tokens are trained with the usual next‑token prediction objective. Level tokens are trained with a KL‑divergence loss that pushes the predicted token distribution toward the constructed soft label. Because different datasets exhibit varying annotator variance, DeQA‑Score also incorporates the fidelity loss from the UNIQUE framework (TIP 2021) to supervise relative quality ordering across datasets, encouraging the model to learn image‑wise relationships rather than absolute scores.
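The two level-token losses can be sketched as follows (plain Python with illustrative names; the fidelity term follows the UNIQUE formulation, with the pairwise preference probability derived from the two Gaussian score models):

```python
import math

def kl_loss(target, pred, eps=1e-8):
    """KL(target || pred) between the soft label and the model's
    predicted distribution over the five level tokens."""
    return sum(t * math.log((t + eps) / (p + eps)) for t, p in zip(target, pred))

def pref_prob(mu_a, var_a, mu_b, var_b):
    """P(quality of A > quality of B) under independent Gaussian scores."""
    return 0.5 * (1.0 + math.erf((mu_a - mu_b) / math.sqrt(2.0 * (var_a + var_b))))

def fidelity_loss(p_gt, p_pred):
    """Fidelity loss (UNIQUE, TIP 2021) between the ground-truth and
    predicted pairwise preference probabilities; zero when they match."""
    return 1.0 - math.sqrt(p_gt * p_pred) - math.sqrt((1.0 - p_gt) * (1.0 - p_pred))
```

Because the fidelity term only constrains the relative order within an image pair, it transfers across datasets whose absolute score scales and annotator variances differ.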
Experimental results
Visualization shows that a model trained with one‑hot labels collapses to a single token, deviating from the ground‑truth Gaussian. In contrast, DeQA‑Score’s predictions align closely with the true distribution.
Quantitatively, DeQA‑Score achieves state‑of‑the‑art regression performance on both mean score and variance. The reconstructed Gaussian matches the ground‑truth distribution, confirming accurate capture of both central tendency and uncertainty.
Accurate variance prediction further confirms that the model captures the full score distribution rather than only its mean.
References
Q‑Align: Teaching LMMs for Visual Scoring via Discrete Text‑Defined Levels, ICML 2024.
Uncertainty‑aware Blind Image Quality Assessment in the Laboratory and Wild, TIP 2021.
Depicting Beyond Scores: Advancing Image Quality Assessment through Multi‑modal Language Models, ECCV 2024.
AIWalker
Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.