How ReLE Redefines Chinese LLM Evaluation and Reveals Capability Anisotropy
The ReLE framework introduces a dynamic, variance‑aware evaluation system that diagnoses capability anisotropy across 304 Chinese large language models, exposing ranking instability, commercial‑vs‑open‑source gaps, and format barriers while cutting evaluation cost by 70%.
Background and Motivation
Traditional Chinese LLM benchmarks (e.g., C‑Eval, CLUE, AGIEval) have reached a performance ceiling, causing score distributions to collapse. This “evaluation crisis” makes a single scalar score insufficient to reflect true model capabilities.
ReLE: Robust Efficient Live Evaluation
ReLE (Robust Efficient Live Evaluation) is a scalable evaluation system and structured benchmark that diagnoses capability anisotropy in Chinese LLMs. The study evaluated 304 models (189 commercial, 115 open‑source) on 207,843 samples, achieving a 70% reduction in compute cost while preserving ranking relevance.
Core Methodology
Dynamic Variance‑Aware Scheduler: Evaluation is cast as a stratified sequential estimation problem. A two‑stage Neyman allocation sampling strategy is used (a code sketch follows the two stages below):
Stage 1 – variance probing: a small sample is run for each model to estimate performance variance across dimensions.
Stage 2 – dynamic allocation: models with high variance receive additional budget, while low‑variance models are pruned, reducing total cost to roughly $20,700.
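To make the scheduler concrete, here is a minimal Python sketch of the two‑stage scheme under toy assumptions: classic Neyman allocation gives stratum h a budget share proportional to N_h · s_h (stratum size times its estimated standard deviation). The simulator, function names, and numbers are illustrative, not taken from the ReLE codebase.

```python
# Minimal sketch of two-stage Neyman allocation for evaluation budgeting.
# All names and numbers are illustrative, not from the ReLE implementation.
import numpy as np

rng = np.random.default_rng(0)

def two_stage_neyman(strata_sizes, probe_per_stratum, total_budget, draw_score):
    """Allocate an evaluation budget across strata (e.g., model x dimension cells).

    Stage 1: spend a small uniform probe in every stratum to estimate the
    per-stratum standard deviation of item scores.
    Stage 2: split the remaining budget in proportion to N_h * s_h
    (classic Neyman allocation), so high-variance strata get more samples.
    """
    n_strata = len(strata_sizes)
    probe_scores = [
        np.array([draw_score(h) for _ in range(probe_per_stratum)])
        for h in range(n_strata)
    ]
    stds = np.array([s.std(ddof=1) for s in probe_scores])  # unbiased sample std
    weights = strata_sizes * stds
    if weights.sum() == 0:          # every stratum looks deterministic
        return np.full(n_strata, probe_per_stratum)
    remaining = total_budget - n_strata * probe_per_stratum
    extra = np.floor(remaining * weights / weights.sum()).astype(int)
    return probe_per_stratum + extra  # low-variance strata keep only the probe

# Toy usage: 4 strata; the last two are noisy and should receive more budget.
sizes = np.array([500, 500, 500, 500])
noise = [0.01, 0.02, 0.30, 0.40]

def draw_score(h):
    return float(np.clip(rng.normal(0.7, noise[h]), 0.0, 1.0))

print(two_stage_neyman(sizes, probe_per_stratum=20, total_budget=1000, draw_score=draw_score))
```

On the toy strata, nearly all of the remaining budget flows to the two noisy strata, which is the same prune‑and‑reallocate behavior described above.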
Symbolic‑Grounded Hybrid Scoring: To avoid self‑preference bias and false positives from embedding similarity, a three‑layer pipeline is applied:
Semantic filtering.
LLM judge.
Bias calibration using adversarial samples.
Objective tasks (≈68% of the benchmark) are verified with symbolic solvers such as SymPy. Semi‑objective tasks (≈24%) rely on the calibrated LLM judge, reaching a Cohen’s κ of 0.81 against human experts.
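To illustrate why symbolic grounding avoids the false positives of embedding similarity, a SymPy check for an objective math task might look like the minimal sketch below; `symbolically_equivalent` is a hypothetical helper, not the paper’s actual scorer.

```python
# Minimal sketch of symbolic answer verification with SymPy, assuming an
# objective task reduces to checking mathematical equivalence of expressions.
from sympy import simplify
from sympy.parsing.sympy_parser import parse_expr

def symbolically_equivalent(model_answer: str, gold_answer: str) -> bool:
    """Return True only if two answer strings are provably the same expression."""
    try:
        a, b = parse_expr(model_answer), parse_expr(gold_answer)
    except Exception:  # unparseable output (SyntaxError, SympifyError, ...)
        return False
    # simplify(a - b) reduces to 0 only when the expressions are equal,
    # so near-misses that fool embedding similarity are rejected here.
    return simplify(a - b) == 0

# "2*(x + 1)" and "2*x + 2" differ as strings but are the same expression.
print(symbolically_equivalent("2*(x + 1)", "2*x + 2"))  # True
print(symbolically_equivalent("2*x + 1", "2*x + 2"))    # False
```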
Domain‑Capability Orthogonal Matrix: The benchmark defines a matrix with 7 industry domains (vertical axis) and 22 cognitive capability dimensions (horizontal axis). This orthogonal layout enables precise attribution of failures, e.g., distinguishing lack of legal knowledge from insufficient logical reasoning.
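A toy sketch of the attribution this layout enables, with invented domain labels and scores: a depressed row mean points to a missing domain (a knowledge gap), while a depressed column mean points to a weak capability that cuts across domains.

```python
# Toy failure attribution on a domain x capability score matrix; the labels
# and scores are invented for illustration, not ReLE's actual axes.
import numpy as np

domains = ["finance", "legal", "medical", "education",
           "government", "industry", "internet"]
capabilities = [f"cap_{i:02d}" for i in range(22)]  # 22 cognitive dimensions

rng = np.random.default_rng(1)
scores = rng.uniform(0.55, 0.9, size=(len(domains), len(capabilities)))
scores[1, :] -= 0.25   # weak across all of "legal": a domain-knowledge gap
scores[:, 3] -= 0.25   # weak on one capability across every domain

# Orthogonality is what lets the two failure modes separate cleanly:
# a low row mean flags missing domain knowledge, a low column mean flags
# a capability deficit such as insufficient logical reasoning.
domain_means = scores.mean(axis=1)
cap_means = scores.mean(axis=0)
print("weakest domain:", domains[int(domain_means.argmin())])
print("weakest capability:", capabilities[int(cap_means.argmin())])
```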
Key Findings from the 304‑Model Study
Ranking Stability (RSA): The Ranking Stability Amplitude (RSA) quantifies how much model rankings shift when evaluation weights change. Traditional benchmarks report RSA ≈ 5.0; ReLE observes an average RSA of 11.4, indicating that a model ranked 8th on a balanced leaderboard could fall to 32nd in a domain‑specific scenario.
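The paper’s exact RSA formula is not reproduced here; the sketch below assumes an RSA‑style instability can be proxied by the mean absolute rank shift between a balanced leaderboard and randomly re‑weighted ones.

```python
# Sketch of an RSA-like ranking-stability measure, assuming it is the mean
# absolute rank shift under re-weighted dimensions; the paper's exact
# definition may differ. Scores are simulated for illustration.
import numpy as np

rng = np.random.default_rng(2)
n_models, n_dims = 30, 22
scores = rng.uniform(0.4, 0.9, size=(n_models, n_dims))  # model x dimension

def ranks(weighted_scores):
    """Rank 1 = best model under the given weighted scores."""
    order = np.argsort(-weighted_scores)
    r = np.empty(len(order), dtype=int)
    r[order] = np.arange(1, len(order) + 1)
    return r

balanced = ranks(scores.mean(axis=1))  # equal-weight leaderboard

shifts = []
for _ in range(200):                    # random re-weightings of dimensions
    w = rng.dirichlet(np.ones(n_dims))  # weights drawn from the simplex
    shifts.append(np.abs(ranks(scores @ w) - balanced).mean())

print("RSA-like instability:", round(float(np.mean(shifts)), 2))
```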
Commercial vs. Open‑Source Models: Commercial models retain an advantage of ~12 points in specialized domains (medical, legal). Open‑source models close the gap in general reasoning but still lag on long‑chain logical tasks. Price correlates only weakly with capability: models priced at 1–5 CNY perform within 3.2% of models priced above 5 CNY on 8 of 22 dimensions.
Agent Format Barrier: Specialized agent models achieve 74.8% on tool‑use tasks, while general commercial models score 62.4%. The gap is largely due to format misalignment: general models often emit verbose explanations instead of the required JSON calls, exposing a disconnect between latent capability and interface compliance.
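A minimal sketch of the kind of format gate that produces this barrier, assuming the harness demands a bare JSON object with `name` and `arguments` keys (an assumed schema, not necessarily ReLE’s):

```python
# Format-compliance check for tool-use outputs. The required schema here
# ("name" + "arguments") is an assumption for illustration.
import json

REQUIRED_KEYS = {"name", "arguments"}

def is_valid_tool_call(output: str) -> bool:
    """Accept only a parseable JSON object matching the tool-call schema."""
    try:
        call = json.loads(output)
    except json.JSONDecodeError:
        return False  # prose wrapped around the JSON fails at this gate
    return isinstance(call, dict) and REQUIRED_KEYS <= call.keys()

# A capable model that wraps its call in explanation still scores zero:
verbose = 'Sure! I will call the weather tool: {"name": "weather", "arguments": {"city": "Beijing"}}'
strict = '{"name": "weather", "arguments": {"city": "Beijing"}}'
print(is_valid_tool_call(verbose))  # False: latent capability, wrong format
print(is_valid_tool_call(strict))   # True
```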
Conclusion and Outlook
ReLE shifts evaluation from static leaderboards to dynamic diagnostic reporting, providing a detailed health report for 304 LLMs and confirming that capability anisotropy is inherent to current models. The authors advocate “Capability Portfolio Management” – selecting models that best match specific business requirements rather than chasing a single “perfect” model.
Resources: arXiv preprint https://arxiv.org/abs/2601.17399; GitHub repository https://github.com/jeinlee1991/chinese-llm-benchmark (contains failure cases and evaluation scripts).
