MME-CRS: Multi-Metric Evaluation with Correlation Re-Scaling for Open-Domain Dialogue Evaluation

The paper presents MME-CRS, the winning method in the DSTC10 open-domain dialogue evaluation track. It combines seven metrics (fluency, relevance, topic coherence, engagement, and three specificity measures) and weights them with a correlation re-scaling algorithm, achieving the highest average Spearman correlation among participants and top rankings across multiple evaluation dimensions.

Meituan Technology Team

This article introduces MME-CRS (Multi-Metric Evaluation with Correlation Re-Scaling), the winning method in the DSTC10 Open-Domain Dialogue Evaluation track. The method designs multiple evaluation metrics and uses a correlation re-scaling algorithm to integrate their scores, offering a reference for designing more effective dialogue evaluation metrics.

Background : The Dialog System Technology Challenge (DSTC) was launched in 2013 by Microsoft and Carnegie Mellon University to promote advances in dialogue systems. DSTC10, the tenth edition, includes five tracks, with Track 5 Task 1 focusing on automatic open‑domain dialogue evaluation. Automatic evaluation aims to replace costly and slow human annotation with efficient, low‑cost scoring that correlates with human judgments.

Problem Statement : Existing evaluation methods suffer from two main issues: (1) insufficiently comprehensive metrics that cannot fully capture dialogue quality, and (2) lack of effective metric integration techniques, especially given the large number of evaluation dimensions (37 in the validation sets, 11 in the test sets).

Related Work : The paper reviews three families of automatic evaluation methods: Overlap‑based (e.g., BLEU, ROUGE), Embedding‑based (e.g., Greedy Matching, BERTScore), and Learning‑based (e.g., ADEM, USL‑H). It also notes the limitations of each approach, such as dependence on reference responses and limited coverage of dialogue aspects.

Proposed Method : MME-CRS groups seven basic metrics into five categories:

Fluency Metric (FM): assesses the fluency of the response using a model fine-tuned from SimCSE on a synthetic fluency dataset built from DailyDialog.

Relevance Metric (RM): measures the relevance between context and response, also built on SimCSE with carefully constructed negative samples.

Topic Coherence Metric (TCM): evaluates topic consistency using a graph-based approach (GRADE) with ConceptNet embeddings and graph attention networks (GATs).

Engagement Metric (EM): predicts user/agent engagement level based on ConvAI2 data.

Specificity Metrics (SM‑NLL, SM‑NCE, SM‑PPL): quantify the amount of detail in the response via masked language modeling losses.
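As a rough illustration of how language-modeling losses can be turned into a score, the sketch below computes a mean negative log-likelihood and a perplexity from the probabilities a masked language model assigns to the true tokens of a response. This is a generic sketch, not the paper's exact formulation: the function names, the sample probabilities, and the reading that less predictable tokens indicate more detail are all assumptions made here for illustration.

```python
import math


def mean_nll(token_probs):
    """Mean negative log-likelihood of the true tokens (hypothetical helper)."""
    # Each entry is the probability the model assigned to the correct token
    # when that token was masked; lower probability means higher loss.
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(token_probs):
    """Perplexity is the exponential of the mean negative log-likelihood."""
    return math.exp(mean_nll(token_probs))


# Invented probabilities for two responses (illustrative only).
generic_response = [0.9, 0.8, 0.85]    # tokens the model predicts easily
detailed_response = [0.2, 0.1, 0.3]    # tokens the model finds hard to predict

print(mean_nll(generic_response), mean_nll(detailed_response))
print(perplexity(generic_response), perplexity(detailed_response))
```

Under the assumption stated above, the second response would receive the higher specificity score because its tokens are harder to predict.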

To integrate these metrics, the authors propose the Correlation Re‑Scaling (CRS) method. First, the correlation between each basic metric and each evaluation dimension is computed on the validation sets. These correlations are raised to a power (set to 2) to emphasize stronger relationships, then normalized to obtain a weight distribution for each dimension. The final score for a dialogue dimension is the weighted sum of the seven metric scores.
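The CRS steps described above can be sketched in Python for a single evaluation dimension. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names `crs_weights` and `combine` are hypothetical, the correlation and score values are invented, and clamping negative correlations to zero and falling back to uniform weights are choices made here that the text does not specify.

```python
def crs_weights(correlations, power=2.0):
    """Map per-metric validation correlations to normalized weights."""
    # Assumption: negative correlations are clamped to zero before scaling.
    scaled = {name: max(corr, 0.0) ** power for name, corr in correlations.items()}
    total = sum(scaled.values())
    if total == 0.0:
        # Assumption: with no positive correlation, weight all metrics equally.
        return {name: 1.0 / len(scaled) for name in scaled}
    # Normalize so the weights for this dimension sum to 1.
    return {name: value / total for name, value in scaled.items()}


def combine(metric_scores, weights):
    """Final dimension score: weighted sum of the basic metric scores."""
    return sum(weights[name] * metric_scores[name] for name in weights)


# Invented validation correlations between three metrics and one dimension.
correlations = {"FM": 0.1, "RM": 0.3, "TCM": 0.2}
weights = crs_weights(correlations, power=2.0)

# Invented metric scores for one response.
scores = {"FM": 0.8, "RM": 0.6, "TCM": 0.4}
print(weights)
print(combine(scores, weights))
```

Squaring widens the gap between metrics: RM's correlation is three times FM's, but its weight ends up nine times larger.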

Experiments : The metrics are trained primarily on DailyDialog (EM uses ConvAI2) and evaluated on the DSTC10 test sets. MME-CRS achieves an average Spearman correlation of 0.3104, ranking first among all participants. Detailed results show first-place performance on six of the eleven evaluation dimensions across five test datasets.
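For readers unfamiliar with the evaluation measure, Spearman correlation is the Pearson correlation of the ranks of two score lists. The sketch below is a small standard-library version with average ranks for ties; the helper names and the sample scores are hypothetical and are not DSTC10 data. It does not handle constant inputs, where the correlation is undefined.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length lists (non-constant inputs)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented automatic scores and human ratings for four responses.
metric_scores = [0.2, 0.5, 0.5, 0.9]
human_ratings = [1, 3, 2, 5]
print(spearman(metric_scores, human_ratings))
```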

Ablation Study : Removing metrics one at a time shows that TCM, RM, and EM contribute the most to performance (drops of 3.26%, 1.56%, and 1.01%, respectively). The combination of RM and TCM is especially critical; removing both reduces the average correlation to 11.07%, compared with 31.04% (0.3104) for the full method.

CRS Effectiveness : Comparing MME‑CRS with a simple average (MME‑Avg) shows a 3.49% improvement, confirming the advantage of correlation‑based re‑scaling.

Conclusion : The paper addresses the two main challenges in open-domain dialogue evaluation by (1) designing a comprehensive set of metrics and (2) proposing an effective integration method. The approach achieves state-of-the-art results in DSTC10. Future work will explore additional metrics and integration strategies, and apply the technology to Meituan’s voice interaction products, such as intelligent outbound-call robots and customer service.
