LLM Ranking Arena: Elo‑Based Competitive Evaluation of Open‑Source Chatbots
A recent release from the LMSYS organization introduces an Elo‑rated, 1v1 battle arena for large language models, ranking open‑source chatbots such as Vicuna, Koala, and ChatGLM, while discussing the limitations of traditional benchmarks and the advantages of crowd‑sourced, scalable evaluation.
Researchers from the LMSYS organization (a UC Berkeley–led group) have launched a novel "LLM Ranking Arena" in which large language models (LLMs) compete in randomized 1v1 battles and receive Elo scores that yield a clear ranking.
The arena collects anonymous user votes after each duel, allowing participants to choose which model performed better, and has gathered over 4.7k valid votes. All evaluation code and data analysis are publicly available.
The current leaderboard shows Vicuna (13B parameters) at the top with 1169 points, followed by Koala, Open Assistant, and ChatGLM (6B parameters) in the top five, while Meta's LLaMA and Stability AI's StableLM rank lower.
The authors argue that traditional academic benchmarks (e.g., HELM) struggle with subjectivity, data leakage, and limited task coverage, making crowd‑sourced, battle‑based evaluation a more scalable and incremental solution.
Key advantages of the arena include scalability to many models, incremental evaluation of new models with few trials, and a unique ordering that can compare any two models regardless of future additions.
Elo rating mechanics are explained: the win probability is computed using a logistic curve, and scores are updated linearly after each match, mirroring systems used in games like League of Legends and Dota 2.
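The mechanics described above can be sketched with the standard Elo formulas: a logistic curve gives the expected win probability, and ratings move linearly after each match. This is a minimal illustration using the conventional base‑10, scale‑400 form; the function names and the K‑factor of 32 are my own choices, not taken from the arena's code.

```python
def elo_expected(r_a: float, r_b: float) -> float:
    """Expected score (win probability) of player A against player B
    under the Elo logistic model: 1 / (1 + 10^((Rb - Ra) / 400))."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def elo_update(r_a: float, r_b: float, score_a: float, k: float = 32):
    """Linear rating update after one match.

    score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if A loses.
    Returns the updated (r_a, r_b) pair; the two changes are symmetric.
    """
    e_a = elo_expected(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b
```

For example, two models both rated 1000 have a 0.5 expected score against each other; if the first one wins, it gains 16 points and the other loses 16 (with K = 32).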
Statistical analysis shows that Elo predictions align well with actual win rates across model pairings, confirming the method's reliability.
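A check of this kind can be reproduced from raw battle records: tally each pair's empirical win rate and compare it to the Elo‑predicted probability. The sketch below assumes a simple `(model_a, model_b, winner)` record format and a ratings dictionary; these names are illustrative, not the arena's actual data schema.

```python
from collections import defaultdict


def predicted_win_rate(r_a: float, r_b: float) -> float:
    # Elo logistic prediction (base-10, scale-400 convention).
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def empirical_vs_predicted(battles, ratings):
    """Compare observed and Elo-predicted win rates per model pairing.

    battles: iterable of (model_a, model_b, winner), winner in {'a', 'b'}.
    ratings: dict mapping model name to its Elo score.
    Returns {(model_a, model_b): (empirical_win_rate, predicted_win_rate)}.
    """
    wins, totals = defaultdict(int), defaultdict(int)
    for a, b, winner in battles:
        totals[(a, b)] += 1
        if winner == "a":
            wins[(a, b)] += 1
    return {
        pair: (wins[pair] / n, predicted_win_rate(ratings[pair[0]], ratings[pair[1]]))
        for pair, n in totals.items()
    }
```

If the two numbers track each other across pairings, as the blog post reports, the Elo scores are a faithful summary of head‑to‑head outcomes.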
The arena was created by LMSYS Org, founded by UC Berkeley Ph.D. student Lianmin Zheng and UCSD assistant professor Hao Zhang, aiming to democratize access to large‑scale model evaluation.
For more details, see the original blog post at https://lmsys.org/blog/2023-05-03-arena/ .