Q-Eval-100K Dataset and Q-Eval-Score Evaluation Framework for Text-to-Visual Generation

The Q‑Eval‑100K dataset comprises 100K AIGC images and videos with separate visual‑quality and textual‑consistency annotations. It powers the open‑source Q‑Eval‑Score framework, which fine‑tunes multimodal models to deliver state‑of‑the‑art, scalable, and objective evaluation, including a “vague‑to‑specific” strategy for long prompts, surpassing existing benchmarks.

Meituan Technology Team

In 2025, CVPR received 13,008 valid submissions, of which 2,878 papers were accepted (a 22.1% acceptance rate). Multimodal research remains a major focus.

The Shanghai Jiao Tong University‑Meituan Computing and Intelligence Joint Lab published a paper (arXiv:2503.02357) that introduces the Q‑Eval‑100K dataset and the Q‑Eval‑Score evaluation framework.

The paper addresses two critical problems with existing text‑to‑visual evaluation datasets: (1) they lack systematic evaluation dimensions, cannot separate visual quality from textual consistency, and are insufficient in scale; (2) their evaluation processes are complex and ambiguous, which hinders the application of large‑model‑based evaluators.

Experiments show that both the dataset and the method achieve state‑of‑the‑art performance in evaluation quality and generalization.

Q‑Eval‑100K contains 100K AIGC samples (60K images and 40K videos), far surpassing existing datasets in both instance count and human annotations.

Cross‑dataset validation demonstrates that models trained on Q‑Eval‑100K outperform current best methods on the GenAI‑Bench dataset, confirming the dataset’s strong generalization value.

The dataset enables a new era of text‑to‑visual content evaluation, while Q‑Eval‑Score provides an open‑source, accurate, and objective scoring framework for AIGC image and video generation models.

Dataset Construction Principles

1) Diversity: Prompts are designed across three major dimensions: entity generation (people, objects, animals, etc.), entity attribute generation (clothing, color, material, etc.), and cross‑ability items (background, spatial relationship, etc.). Visual content is then generated from these prompts using SOTA AIGC models such as FLUX, Lumina‑T2X, PixArt, Stable Diffusion 3, CogVideoX, Runway GEN‑3, Kling, etc.

2) High‑quality annotation: Over 200 trained annotators provided more than 960K rating entries, which were filtered to produce the final 100K samples with consistency and quality scores.

3) Decoupled visual quality and textual consistency labeling: The two dimensions are annotated separately, allowing Q‑Eval‑Score to evaluate them independently.

The dataset is now available on the AGI‑Eval community evaluation hub.

Unified Evaluation Framework (Q‑Eval‑Score)

Q‑Eval‑Score converts the dataset into a supervised fine‑tuning (SFT) format for large multimodal models (LMMs). A context‑prompt template is built for SFT, and human scores (1–5) are mapped to five levels {Bad, Poor, Fair, Good, Excellent}. The final score is computed by applying a softmax over the logits of the five level tokens and taking the probability‑weighted average of the corresponding level values.
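As a rough illustration of this level‑to‑score mapping (a minimal sketch: the level ordering, weights, and function names below are assumptions for clarity, not the released implementation):

```python
import torch

# Hypothetical logits produced by the LMM for the five rating tokens
# at the position where the model emits its verdict.
LEVELS = ["Bad", "Poor", "Fair", "Good", "Excellent"]
LEVEL_VALUES = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])  # mapped back to the 1-5 scale

def logits_to_score(level_logits: torch.Tensor) -> torch.Tensor:
    """Convert the logits of the five level tokens into a continuous 1-5 score.

    level_logits: tensor of shape (5,) holding the logits for
    [Bad, Poor, Fair, Good, Excellent] in that order.
    """
    probs = torch.softmax(level_logits, dim=-1)   # probability of each level
    return (probs * LEVEL_VALUES).sum()           # probability-weighted score

# Example: a sample the model considers "Good", with some mass on "Excellent".
score = logits_to_score(torch.tensor([0.1, 0.3, 1.0, 3.0, 2.2]))
print(f"predicted score: {score:.2f}")
```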

The Qwen2‑VL‑7B‑Instruct model is fine‑tuned with both a cross‑entropy (CE) loss and a mean‑squared‑error (MSE) loss to enhance its scoring ability.
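A minimal sketch of how such a combined objective could be written, assuming the CE term supervises the discrete level token and the MSE term supervises the soft‑weighted score (the loss weighting and names are illustrative, not the paper's exact recipe):

```python
import torch
import torch.nn.functional as F

def combined_loss(level_logits: torch.Tensor,
                  target_level: torch.Tensor,
                  target_score: torch.Tensor,
                  mse_weight: float = 1.0) -> torch.Tensor:
    """CE on the discrete level plus MSE on the soft-weighted score.

    level_logits: (batch, 5) logits over {Bad, Poor, Fair, Good, Excellent}
    target_level: (batch,) index of the human-annotated level (0-4)
    target_score: (batch,) human mean opinion score on the 1-5 scale
    """
    ce = F.cross_entropy(level_logits, target_level)
    probs = torch.softmax(level_logits, dim=-1)
    values = torch.arange(1.0, 6.0, device=level_logits.device)
    pred_score = (probs * values).sum(dim=-1)
    mse = F.mse_loss(pred_score, target_score)
    return ce + mse_weight * mse
```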

Long‑Prompt Alignment Issue

When prompts exceed 25 words, the model tends to underestimate scores due to the scarcity of long prompts in the training set. To address this, the authors propose a “Vague‑to‑Specific” strategy: split a long prompt into a vague prompt and multiple specific prompts, evaluate each separately, and combine the results.

For the specific prompts, inspired by VQAScore, the question is reformulated (e.g., “Does the image/video show [prompt]?”), and a weighted aggregation of the vague and specific scores yields the final alignment score, as sketched below.
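The aggregation step might look roughly like the following sketch; the splitting function, question template, and equal weighting here are illustrative assumptions rather than the paper's exact formulation:

```python
from typing import Callable, List, Tuple

def vague_to_specific_score(long_prompt: str,
                            split_prompt: Callable[[str], Tuple[str, List[str]]],
                            score_fn: Callable[[str], float],
                            vague_weight: float = 0.5) -> float:
    """Alignment score for a long prompt via the Vague-to-Specific idea.

    split_prompt: returns (vague_prompt, [specific_prompt, ...]); how the split
                  is performed (e.g., by an LLM) is outside this sketch.
    score_fn:     scores one prompt against the generated image/video, e.g. by
                  asking the LMM "Does the image/video show [prompt]?".
    """
    vague_prompt, specific_prompts = split_prompt(long_prompt)
    vague_score = score_fn(vague_prompt)
    # Average the per-aspect scores from the specific sub-prompts.
    specific_score = sum(score_fn(p) for p in specific_prompts) / len(specific_prompts)
    # Weighted aggregation of the two views; the 0.5/0.5 split is an assumption.
    return vague_weight * vague_score + (1 - vague_weight) * specific_score
```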

Experimental Conclusions

Q‑Eval‑Score achieves superior performance on both visual quality and textual consistency evaluation. It surpasses all current SOTA models in Spearman rank correlation coefficient (SRCC) and Pearson linear correlation coefficient (PLCC) for images and videos. On textual consistency, it leads other models by 6% (image) and 12% (video) in instance‑level SRCC.
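For reference, instance‑level SRCC and PLCC between model outputs and human scores can be computed with SciPy (the numbers below are illustrative, not results from the paper):

```python
from scipy.stats import spearmanr, pearsonr

human_scores = [3.2, 4.5, 2.1, 4.8, 3.9]   # illustrative human MOS values
model_scores = [3.0, 4.6, 2.4, 4.7, 3.5]   # illustrative Q-Eval-Score outputs

srcc, _ = spearmanr(human_scores, model_scores)   # rank correlation
plcc, _ = pearsonr(human_scores, model_scores)    # linear correlation
print(f"SRCC={srcc:.3f}, PLCC={plcc:.3f}")
```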

Ablation studies confirm that each proposed strategy and loss function contributes significantly to performance gains. The “Vague‑to‑Specific” strategy notably improves evaluation on long‑prompt subsets.

The release of Q‑Eval‑100K and Q‑Eval‑Score provides a more reliable and comprehensive solution for evaluating text‑to‑visual models, fostering further development and practical deployment of generative AI. The AGI‑Eval community continues to support open, fair, and scientific model evaluation.

An online sharing session on ICLR & CVPR papers, featuring this work, will be organized by the Meituan tech team in April 2025.

machine learning, multimodal, AIGC, evaluation, Vision-Language, dataset
Written by

Meituan Technology Team

Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform, supporting hundreds of millions of consumers and millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
