How PertEval Reveals the Real Knowledge Limits of Large Language Models

At NeurIPS 2024, Alibaba Cloud's PAI team presented the Spotlight paper PertEval, which introduces knowledge‑invariant perturbations to expose the true knowledge capacity of LLMs, critiques over‑optimistic static benchmarks, and showcases responsible AI solutions and platform demos for enterprise use.

Alibaba Cloud Big Data AI Platform

NeurIPS 2024 opened in Vancouver on December 10. The conference received over 15,000 paper submissions, and the Alibaba Cloud PAI team’s paper “PertEval: Unveiling Real Knowledge Capacity of LLMs with Knowledge‑Invariant Perturbations” was selected as a Spotlight (3% acceptance rate).

Innovative Evaluation Method

PertEval introduces “knowledge‑invariant perturbations” that rewrite static benchmark questions without altering the underlying knowledge, mitigating memorization and data‑contamination effects and providing a more reliable measure of LLM knowledge.
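To make the idea concrete, here is a minimal sketch of one simple perturbation that leaves the tested knowledge untouched: reordering the options of a multiple-choice question while tracking where the correct answer lands. This is an illustration only, not the paper's implementation; the `MCQ` type, the `permute_options` function, and the sample question are hypothetical names and data introduced for this example.

```python
import random
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MCQ:
    """Hypothetical container for one multiple-choice benchmark item."""
    question: str
    options: Tuple[str, ...]
    answer_index: int  # position of the correct option in `options`


def permute_options(item: MCQ, seed: int) -> MCQ:
    """Return a copy of `item` with its options in a shuffled order.

    The question text and the set of options are unchanged, so the
    knowledge needed to answer is the same; only the layout differs.
    """
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    order = list(range(len(item.options)))
    rng.shuffle(order)
    new_options = tuple(item.options[i] for i in order)
    # The correct option moved; find its new position.
    new_answer_index = order.index(item.answer_index)
    return MCQ(item.question, new_options, new_answer_index)


if __name__ == "__main__":
    original = MCQ(
        question="Which planet is closest to the Sun?",
        options=("Venus", "Mercury", "Mars", "Earth"),
        answer_index=1,
    )
    perturbed = permute_options(original, seed=7)
    print(perturbed.options, perturbed.answer_index)
```

A model that has only memorized "the answer to this question is B" can fail on the perturbed item, while a model that knows the fact should be unaffected.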

Findings on Existing Benchmarks

A re‑evaluation of six representative LLMs, including GPT‑4, found that performance on static benchmarks such as MMLU is significantly over‑estimated (by 26% for GPT‑4). The study attributes the inflation to two behaviors: models hesitate when their knowledge is uncertain, and they memorize correct answers by rote.
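The article does not say how the over-estimation figure is computed. One plausible way to quantify such a gap, sketched below under that assumption, is to compare accuracy on the original questions with a stricter score that only credits a question when the model answers both the original and the perturbed version correctly. The function names and the toy correctness flags are hypothetical and are not results from the paper.

```python
from typing import Sequence


def accuracy(correct: Sequence[bool]) -> float:
    """Fraction of questions answered correctly."""
    if not correct:
        raise ValueError("need at least one result")
    return sum(correct) / len(correct)


def consistent_accuracy(
    original: Sequence[bool], perturbed: Sequence[bool]
) -> float:
    """Fraction of questions answered correctly in BOTH versions.

    `original[i]` and `perturbed[i]` must refer to the same question.
    """
    if len(original) != len(perturbed):
        raise ValueError("result lists must be aligned and equal in length")
    if not original:
        raise ValueError("need at least one result")
    both = sum(o and p for o, p in zip(original, perturbed))
    return both / len(original)


if __name__ == "__main__":
    # Toy per-question correctness flags, invented for illustration only.
    on_original = [True, True, True, False]
    on_perturbed = [True, False, True, False]

    static_score = accuracy(on_original)
    robust_score = consistent_accuracy(on_original, on_perturbed)
    print(f"static accuracy:     {static_score:.2f}")
    print(f"consistent accuracy: {robust_score:.2f}")
    print(f"gap:                 {static_score - robust_score:.2f}")
```

The gap between the two scores is the part of the static benchmark result that does not survive perturbation.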

Impact and Future Work

The results highlight the need for evaluation methods that reflect true model capabilities. PertEval will be integrated into the PAI platform, enabling one‑click real‑knowledge assessment for any model and publishing a leaderboard of “knowledge ability” scores.

Responsible AI Demo at NeurIPS

The team also delivered a keynote on “Core Technical Interpretation and Best Practices of Responsible AI,” showcasing Alibaba Cloud’s enterprise‑grade trustworthy AI solution. The solution includes a T‑shaped safety architecture and fairness and error analysis, and the accompanying demos covered fine‑tuning Qwen2.5‑Coder, building RAG systems, and AI‑generated content tools.
