A Survey of Image Quality Evaluation Metrics for Text-to-Image Generation
This survey traces the evolution of image-quality evaluation for text-to-image generation: from early handcrafted edge and color cues, through GAN-era distribution metrics such as IS, FID, and KID, to modern perceptual and CLIP-based measures like LPIPS, CLIPScore, TRIQ, IQT, and human-preference models. It highlights a shift toward semantic, aesthetic, and text-image alignment criteria, and anticipates domain-specific metrics for future diffusion models.
This article reviews evaluation metrics used in image generation models over the past decade, addressing two questions: how evaluation standards have changed across periods, and how image quality assessment supports model iteration.
Before 2016, image quality assessment relied on handcrafted features such as edge density, color distribution, and blur detection. Methods included Laplacian edge extraction, color histograms, and frequency analysis to quantify clarity and composition.
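As an illustration, the two cues mentioned above can be sketched in a few lines of NumPy. The kernel, image sizes, and test images below are illustrative assumptions, not drawn from any specific system of that era:

```python
import numpy as np

def laplacian_variance(img: np.ndarray) -> float:
    """Sharpness cue: variance of the 3x3 Laplacian response (higher = sharper)."""
    k = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=float)
    h, w = img.shape
    out = np.empty((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(img[i:i + 3, j:j + 3] * k)
    return float(out.var())

def color_histogram(img: np.ndarray, bins: int = 8) -> np.ndarray:
    """Composition cue: normalized per-channel intensity histogram for an (H, W, 3) image."""
    hists = [np.histogram(img[..., c], bins=bins, range=(0, 256))[0]
             for c in range(img.shape[-1])]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()

# A sharp checkerboard vs. a neighbor-averaged (blurred) copy:
sharp = (np.indices((32, 32)).sum(axis=0) % 2) * 255.0
blurred = sum(np.roll(sharp, s, axis=a) for s in (-1, 1) for a in (0, 1))
blurred = (sharp + blurred) / 5.0
print(laplacian_variance(sharp) > laplacian_variance(blurred))  # True
```

The appeal of such cues was that they required no training data; their weakness, as the next era showed, was that sharp, colorful images can still be semantically wrong.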
From 2016 to 2019, with the rise of GANs, metrics shifted to measuring similarity between generated and real images. Inception Score (IS) evaluates quality and diversity using a pretrained Inception-v3 classifier, while Fréchet Inception Distance (FID) compares feature distributions under a Gaussian assumption. Both have limitations: IS is insensitive to intra-class mode collapse, and neither can detect overfitting, since a model that memorizes the training set scores well on both.
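Both scores follow directly from their definitions. A minimal NumPy sketch operating on precomputed classifier outputs and Gaussian statistics (real pipelines extract these from Inception-v3; the arrays below are placeholders):

```python
import numpy as np

def inception_score(probs: np.ndarray) -> float:
    """IS = exp(E_x[KL(p(y|x) || p(y))]); probs is (N, classes), rows sum to 1."""
    p_y = probs.mean(axis=0)
    kl = (probs * (np.log(probs) - np.log(p_y))).sum(axis=1)
    return float(np.exp(kl.mean()))

def _sqrtm_psd(m: np.ndarray) -> np.ndarray:
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(m)
    return vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def fid(mu1, sigma1, mu2, sigma2) -> float:
    """FID = ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)) between two Gaussians."""
    diff = mu1 - mu2
    s1_half = _sqrtm_psd(sigma1)
    covmean = _sqrtm_psd(s1_half @ sigma2 @ s1_half)  # symmetric form of (S1 S2)^(1/2)
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Peaked, diverse class predictions score higher than uniform ones:
diverse = np.eye(4) * 0.96 + 0.01   # each row near one-hot, all classes covered
uniform = np.full((4, 4), 0.25)     # every image classified as "anything"
print(inception_score(diverse) > inception_score(uniform))  # True
```

Note that FID requires only means and covariances, which is exactly where the Gaussian assumption criticized by KID enters.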
Kernel Inception Distance (KID) replaces the Gaussian assumption with an unbiased Maximum Mean Discrepancy estimate, offering a more reliable distance at higher computational cost.
Perceptual metrics such as Learned Perceptual Image Patch Similarity (LPIPS) use deep feature differences to align with human perception, and CLIPScore measures image‑text alignment via cosine similarity of CLIP embeddings.
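Both ideas reduce to simple operations on embeddings. The sketch below assumes precomputed features (in practice from an AlexNet/VGG backbone for LPIPS and from CLIP's image and text encoders for CLIPScore); the `lpips_like` layer weights are illustrative, while the 2.5 rescaling follows the original CLIPScore formulation:

```python
import numpy as np

def lpips_like(feats_a, feats_b, layer_weights):
    """LPIPS-style distance: per-layer weighted squared difference between
    unit-normalized feature vectors (real LPIPS learns per-channel weights)."""
    d = 0.0
    for fa, fb, w in zip(feats_a, feats_b, layer_weights):
        fa = fa / np.linalg.norm(fa)
        fb = fb / np.linalg.norm(fb)
        d += w * np.sum((fa - fb) ** 2)
    return float(d)

def clip_score(img_emb: np.ndarray, txt_emb: np.ndarray, w: float = 2.5) -> float:
    """CLIPScore = w * max(cos(image_emb, text_emb), 0)."""
    cos = img_emb @ txt_emb / (np.linalg.norm(img_emb) * np.linalg.norm(txt_emb))
    return float(w * max(cos, 0.0))

v = np.array([1.0, 0.0, 0.0])
print(clip_score(v, v))                            # 2.5 (perfect alignment)
print(clip_score(v, np.array([0.0, 1.0, 0.0])))    # 0.0 (orthogonal embeddings)
```

Unlike FID and KID, CLIPScore is reference-free: it needs only the prompt and the generated image, which is why it became standard for text-to-image evaluation.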
Recent approaches (2020 onward) incorporate large language‑vision models and transformers. TRIQ and IQT combine CNN backbones with shallow transformers for full‑reference quality prediction. CLIP‑based aesthetic predictors (LAION‑AESTHETICS, CLIP+MLP) and human‑preference models (ImageReward, HPS, X‑IQE) fine‑tune CLIP on human‑rated data to predict aesthetic appeal and text‑image alignment.
Implementation repository for TRIQ: https://github.com/junyongyou/triq
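The CLIP+MLP aesthetic predictors mentioned above share a simple architecture: a small regression head on frozen CLIP embeddings. A forward-pass sketch with hypothetical dimensions and randomly initialized weights (actual predictors train these weights on human-rated images):

```python
import numpy as np

def aesthetic_head(clip_emb, W1, b1, W2, b2):
    """Tiny MLP regression head: frozen CLIP embedding -> scalar aesthetic score."""
    h = np.maximum(clip_emb @ W1 + b1, 0.0)   # hidden layer with ReLU
    return float(h @ W2 + b2)                 # linear output, e.g. a 1-10 rating

rng = np.random.default_rng(0)
emb_dim, hidden = 512, 64                     # hypothetical sizes
W1 = rng.normal(scale=0.02, size=(emb_dim, hidden))
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.02, size=hidden)
b2 = 5.0                                      # bias near mid-scale

emb = rng.normal(size=emb_dim)                # stand-in for a CLIP image embedding
score = aesthetic_head(emb, W1, b1, W2, b2)
```

Because the backbone stays frozen, such heads are cheap to train, which is one reason this recipe spread quickly across aesthetic and preference models.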
Overall, evaluation has evolved from low‑level handcrafted cues to high‑level semantic and perceptual measures, reflecting the increasing sophistication of text‑to‑image diffusion models. Future metrics will likely become domain‑specific, balancing realism, aesthetics, and alignment with user intent.
DaTaobao Tech