Beyond ROUGE: GLUE, SuperGLUE, MMLU, C‑Eval & HELM Transform NLP Evaluation

Evaluating language models solely with ROUGE or BLEU is insufficient, so comprehensive benchmarks like GLUE, SuperGLUE, MMLU, C‑Eval, and HELM provide diverse tasks and metrics that more accurately assess linguistic understanding, knowledge acquisition, and robustness across English and Chinese NLP systems.


When assessing a model, relying only on ROUGE or BLEU scores is too narrow to reflect its full range of capabilities. Comprehensive evaluation requires a set of broader benchmarks such as GLUE, SuperGLUE, HELM, MMLU, and C‑Eval.

Natural Language Understanding Benchmarks: GLUE and SuperGLUE

GLUE (General Language Understanding Evaluation) was introduced in 2018 by researchers from institutions including NYU and the University of Washington. It comprises nine tasks (a loading sketch follows the list):

CoLA – binary classification of grammatical acceptability.

SST – sentiment classification of movie‑review sentences (GLUE uses the binary SST‑2 variant; SST‑5 is the fine‑grained five‑class version).

MRPC – paraphrase detection between sentence pairs.

STS‑B – semantic textual similarity regression (0‑5 score) on sentence pairs.

QQP – question similarity detection from Quora.

MNLI – three‑way natural language inference on multi‑genre premises.

QNLI – binary answerability classification derived from SQuAD.

RTE – binary textual entailment.

WNLI – Winograd Schema natural language inference.
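
If you want to inspect these tasks directly, the public GLUE splits are easy to pull down. The sketch below is illustrative only: it assumes the Hugging Face datasets library (not mentioned in the original article) and simply prints one validation example for two of the nine tasks.

```python
# Illustrative sketch, assuming the Hugging Face `datasets` package is installed
# and the public GLUE configs are reachable over the network.
from datasets import load_dataset

for task in ("cola", "sst2"):               # GLUE config names are lowercase task names
    ds = load_dataset("glue", task)         # splits: train / validation / test
    print(task, "->", ds["validation"][0])  # e.g. {'sentence': ..., 'label': ..., 'idx': ...}
```

The other seven tasks follow the same pattern, with config names such as "mrpc", "stsb", "qqp", "mnli", "qnli", "rte", and "wnli".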

On the GLUE leaderboard (as of August 2023), many models surpass the human baseline on several tasks, but this does not imply true language mastery. To avoid such misleading conclusions, SuperGLUE was introduced: it keeps RTE, replaces GLUE's problematic WNLI with WSC (both built on Winograd schemas), and adds more challenging tasks, including the following (a small metric sketch follows the list):

CB (CommitmentBank) – three‑way entailment judgments over short discourses, evaluated by accuracy and unweighted (macro‑averaged) F1.

COPA – causal reasoning with two answer choices, evaluated by accuracy.

GAP – gender‑balanced pronoun coreference, evaluated by F1 and bias ratio.

MultiRC – multi‑sentence reading comprehension, evaluated by F1 over all answer options and exact match over each question's answer set.

WiC – word‑in‑context disambiguation, evaluated by accuracy.
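
Most of these tasks are scored with ordinary classification metrics, so spot‑checking a model's output is straightforward. As a toy illustration (invented labels, scikit‑learn assumed; this is not the official SuperGLUE scorer), the two metrics reported for CB can be computed like this:

```python
# Toy illustration of CB's two metrics: accuracy and unweighted (macro) F1.
# Labels are invented; scikit-learn is assumed to be installed.
from sklearn.metrics import accuracy_score, f1_score

gold = [0, 1, 2, 1, 0, 2]   # hypothetical classes: entailment / contradiction / neutral
pred = [0, 1, 1, 1, 0, 2]

print("accuracy:", accuracy_score(gold, pred))
print("macro F1:", f1_score(gold, pred, average="macro"))  # unweighted mean of per-class F1
```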

Knowledge Acquisition Benchmarks: MMLU and C‑Eval

MMLU (Massive Multitask Language Understanding) measures knowledge acquisition via zero‑ and few‑shot performance across 57 tasks covering elementary math, history, computer science, law, ethics, and more.
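
In practice, few‑shot evaluation on MMLU‑style questions boils down to placing k solved demonstrations in front of the target question and reading off the model's chosen letter. The sketch below is a hypothetical prompt builder with invented questions; it is not the official MMLU harness.

```python
# Hypothetical few-shot prompt builder for MMLU-style multiple-choice questions.
# The demonstration questions are invented placeholders, not real MMLU items.
def build_prompt(demos, question, choices):
    blocks = []
    for d in demos:  # k in-context demonstrations, each shown with its answer
        opts = "\n".join(f"{letter}. {c}" for letter, c in zip("ABCD", d["choices"]))
        blocks.append(f"{d['question']}\n{opts}\nAnswer: {d['answer']}\n")
    opts = "\n".join(f"{letter}. {c}" for letter, c in zip("ABCD", choices))
    blocks.append(f"{question}\n{opts}\nAnswer:")  # the model completes the letter
    return "\n".join(blocks)

demos = [{"question": "What is 2 + 2?", "choices": ["3", "4", "5", "6"], "answer": "B"}]
print(build_prompt(demos, "What is 3 * 3?", ["6", "9", "12", "18"]))
```

The predicted letter (or the answer option assigned the highest probability) is then compared against the answer key to produce per‑subject accuracy.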

Researchers from Tsinghua University and Shanghai Jiao Tong University released C‑Eval, a Chinese‑language counterpart to MMLU containing 13,948 multiple‑choice questions across 52 subjects and four difficulty levels.

C‑Eval accepts submissions and provides rankings for popular models; detailed results are available on its official website.
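
Leaderboards like C‑Eval's typically break results down by subject and by difficulty level. A hypothetical aggregation (invented records, plain Python, not C‑Eval's official scoring code) looks like this:

```python
# Hypothetical per-question records aggregated into per-subject and per-level
# accuracy, the way C-Eval-style leaderboards summarize results. Data is invented.
from collections import defaultdict

records = [
    {"subject": "high_school_physics", "level": "high school",  "correct": True},
    {"subject": "high_school_physics", "level": "high school",  "correct": False},
    {"subject": "law",                 "level": "professional", "correct": True},
]

def accuracy_by(key):
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r["correct"])
    return {k: sum(v) / len(v) for k, v in groups.items()}

print(accuracy_by("subject"))  # per-subject accuracy
print(accuracy_by("level"))    # per-difficulty-level accuracy
```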

Holistic Multi‑Metric Benchmark: HELM

HELM (Holistic Evaluation of Language Models) aggregates seven evaluation dimensions—accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency—to provide a comprehensive, standardized assessment of model performance across diverse scenarios.
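
HELM's public summaries condense these dimensions into head‑to‑head comparisons, commonly reported as a mean win rate. The sketch below uses invented scores and a deliberately simplified win‑rate calculation; it is not HELM's actual implementation, and it ignores metrics where lower is better (such as toxicity).

```python
# Simplified "mean win rate" sketch with invented scores (not HELM's real code).
# For each metric, a model wins against a rival if its score is higher; the win
# fraction is then averaged over metrics. All metrics here are higher-is-better.
scores = {
    "model_a": {"accuracy": 0.71, "robustness": 0.60, "fairness": 0.55},
    "model_b": {"accuracy": 0.65, "robustness": 0.68, "fairness": 0.50},
    "model_c": {"accuracy": 0.60, "robustness": 0.52, "fairness": 0.62},
}

def mean_win_rate(model):
    rivals = [m for m in scores if m != model]
    per_metric = []
    for metric, value in scores[model].items():
        wins = sum(value > scores[r][metric] for r in rivals)
        per_metric.append(wins / len(rivals))
    return sum(per_metric) / len(per_metric)

for m in scores:
    print(m, f"{mean_win_rate(m):.2f}")
```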

Choosing appropriate benchmarks ensures that models encounter novel data, yielding more reliable and objective performance feedback.
