A Survey of Image Quality Evaluation Metrics for Text-to-Image Generation
This survey traces the evolution of image-quality evaluation for text-to-image generation: from early handcrafted edge and color cues, through GAN-era distribution metrics such as IS, FID, and KID, to modern perceptual and CLIP-based measures like LPIPS, CLIPScore, TRIQ, IQT, and human-preference models. It highlights a shift toward semantic, aesthetic, and text-image alignment criteria, and anticipates domain-specific metrics for future diffusion models.
This article reviews evaluation metrics used in image generation models over the past decade, addressing two questions: how evaluation standards have changed across periods, and how image quality assessment supports model iteration.
Before 2016, image quality assessment relied on handcrafted features such as edge density, color distribution, and blur detection. Methods included Laplacian edge extraction, color histograms, and frequency analysis to quantify clarity and composition.
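As an illustration, the two cues mentioned above can be sketched in a few lines of NumPy. The kernel, image sizes, and test images below are illustrative assumptions, not drawn from any specific system of that era:

```python
import numpy as np

def laplacian_variance(img: np.ndarray) -> float:
    """Sharpness cue: variance of the 3x3 Laplacian response (higher = sharper)."""
    k = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)
    h, w = img.shape
    out = np.empty((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(img[i:i + 3, j:j + 3] * k)
    return float(out.var())

def color_histogram(img: np.ndarray, bins: int = 8) -> np.ndarray:
    """Composition cue: normalized per-channel intensity histogram for an (H, W, 3) image."""
    hists = [np.histogram(img[..., c], bins=bins, range=(0, 256))[0]
             for c in range(img.shape[-1])]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()

# A sharp checkerboard vs. a neighbor-averaged (blurred) copy:
sharp = (np.indices((32, 32)).sum(axis=0) % 2) * 255.0
blurred = sum(np.roll(sharp, s, axis=a) for s in (-1, 1) for a in (0, 1))
blurred = (sharp + blurred) / 5.0
print(laplacian_variance(sharp) > laplacian_variance(blurred))  # True
```

The appeal of such cues was that they required no training data; their weakness, as the next era showed, was that sharp, colorful images can still be semantically wrong.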
From 2016 to 2019, with the rise of GANs, metrics shifted to measuring similarity between generated and real images. Inception Score (IS) evaluates quality and diversity using a pretrained Inception-v3 classifier, while Fréchet Inception Distance (FID) compares feature distributions under a Gaussian assumption. Both have limitations: IS is insensitive to intra-class mode collapse, and neither can detect overfitting, since a model that memorizes the training set scores well on both.
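Both scores follow directly from their definitions. A minimal NumPy sketch operating on precomputed classifier outputs and Gaussian statistics (real pipelines extract these from Inception-v3; the arrays below are placeholders):

```python
import numpy as np

def inception_score(probs: np.ndarray) -> float:
    """IS = exp(E_x[KL(p(y|x) || p(y))]); probs is (N, classes), rows sum to 1."""
    p_y = probs.mean(axis=0)
    kl = (probs * (np.log(probs) - np.log(p_y))).sum(axis=1)
    return float(np.exp(kl.mean()))

def _sqrtm_psd(m: np.ndarray) -> np.ndarray:
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(m)
    return vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def fid(mu1, sigma1, mu2, sigma2) -> float:
    """FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)) between two Gaussians."""
    diff = mu1 - mu2
    s1_half = _sqrtm_psd(sigma1)
    covmean = _sqrtm_psd(s1_half @ sigma2 @ s1_half)  # symmetric form of (S1 S2)^(1/2)
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Peaked, diverse class predictions score higher than uniform ones:
diverse = np.eye(4) * 0.96 + 0.01   # each row near one-hot, all classes covered
uniform = np.full((4, 4), 0.25)     # every image classified as "anything"
print(inception_score(diverse) > inception_score(uniform))  # True
```

Note that FID requires only means and covariances, which is exactly where the Gaussian assumption criticized by KID enters.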
Kernel Inception Distance (KID) replaces the Gaussian assumption with an unbiased Maximum Mean Discrepancy estimate, offering a more reliable distance at higher computational cost.
Perceptual metrics such as Learned Perceptual Image Patch Similarity (LPIPS) use deep feature differences to align with human perception, and CLIPScore measures image‑text alignment via cosine similarity of CLIP embeddings.
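Both ideas reduce to simple operations on embeddings. The sketch below assumes precomputed features (in practice from an AlexNet/VGG backbone for LPIPS and from CLIP's image and text encoders for CLIPScore); the `lpips_like` layer weights are illustrative, while the 2.5 rescaling follows the original CLIPScore formulation:

```python
import numpy as np

def lpips_like(feats_a, feats_b, layer_weights):
    """LPIPS-style distance: per-layer weighted squared difference between
    unit-normalized feature vectors (real LPIPS learns per-channel weights)."""
    d = 0.0
    for fa, fb, w in zip(feats_a, feats_b, layer_weights):
        fa = fa / np.linalg.norm(fa)
        fb = fb / np.linalg.norm(fb)
        d += w * np.sum((fa - fb) ** 2)
    return float(d)

def clip_score(img_emb: np.ndarray, txt_emb: np.ndarray, w: float = 2.5) -> float:
    """CLIPScore = w * max(cos(image_emb, text_emb), 0)."""
    cos = img_emb @ txt_emb / (np.linalg.norm(img_emb) * np.linalg.norm(txt_emb))
    return float(w * max(cos, 0.0))

v = np.array([1.0, 0.0, 0.0])
print(clip_score(v, v))                            # 2.5 (perfect alignment)
print(clip_score(v, np.array([0.0, 1.0, 0.0])))    # 0.0 (orthogonal embeddings)
```

Unlike FID and KID, CLIPScore is reference-free: it needs only the prompt and the generated image, which is why it became standard for text-to-image evaluation.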
Recent approaches (2020 onward) incorporate large language‑vision models and transformers. TRIQ and IQT combine CNN backbones with shallow transformers for full‑reference quality prediction. CLIP‑based aesthetic predictors (LAION‑AESTHETICS, CLIP+MLP) and human‑preference models (ImageReward, HPS, X‑IQE) fine‑tune CLIP on human‑rated data to predict aesthetic appeal and text‑image alignment.
Implementation repository for TRIQ: https://github.com/junyongyou/triq
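The CLIP+MLP aesthetic predictors mentioned above share a simple architecture: a small regression head on frozen CLIP embeddings. A forward-pass sketch with hypothetical dimensions and randomly initialized weights (actual predictors train these weights on human-rated images):

```python
import numpy as np

def aesthetic_head(clip_emb, W1, b1, W2, b2):
    """Tiny MLP regression head: frozen CLIP embedding -> scalar aesthetic score."""
    h = np.maximum(clip_emb @ W1 + b1, 0.0)   # hidden layer with ReLU
    return float(h @ W2 + b2)                 # linear output, e.g. a 1-10 rating

rng = np.random.default_rng(0)
emb_dim, hidden = 512, 64                     # hypothetical sizes
W1 = rng.normal(scale=0.02, size=(emb_dim, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.02, size=hidden)
b2 = 5.0                                      # bias near mid-scale

emb = rng.normal(size=emb_dim)                # stand-in for a CLIP image embedding
score = aesthetic_head(emb, W1, b1, W2, b2)
```

Because the backbone stays frozen, such heads are cheap to train, which is one reason this recipe spread quickly across aesthetic and preference models.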
Overall, evaluation has evolved from low‑level handcrafted cues to high‑level semantic and perceptual measures, reflecting the increasing sophistication of text‑to‑image diffusion models. Future metrics will likely become domain‑specific, balancing realism, aesthetics, and alignment with user intent.
DaTaobao Tech