
How LMSYS Chatbot Arena Ranks Yi‑Large Among Global LLMs: Insights & Methodology

The LMSYS Chatbot Arena benchmark, which uses blind user voting and an Elo scoring system, has placed China's Yi‑Large model among the top global large language models. This article details the arena's methodology, the latest ranking results, and the broader implications for the AI industry.

NewBeeNLP

Background

Chatbot Arena, run by LMSYS Org, provides an open blind‑test platform where real users anonymously compare responses from paired large language models (LLMs). Models are submitted by developers and randomly paired; users submit prompts without seeing model identities and vote for the better answer (A, B, tie, or both bad). The platform records millions of votes and updates model Elo ratings after each match.
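The blind-comparison flow described above can be sketched in a few lines. This is an illustrative outline, not LMSYS's actual code: the function names (`get_response`, `record_vote`) and the returned record format are assumptions.

```python
import random

def run_match(models, prompt, get_response, record_vote):
    """One blind comparison: sample two models at random, collect their
    anonymized responses, and record the user's vote.

    `get_response(model, prompt)` and `record_vote(prompt, responses)` are
    hypothetical callbacks standing in for the real serving and UI layers.
    The user sees only "A" and "B" and votes "A", "B", "tie", or "both_bad".
    """
    a, b = random.sample(models, 2)            # random anonymous pairing
    responses = {"A": get_response(a, prompt),
                 "B": get_response(b, prompt)}
    vote = record_vote(prompt, responses)      # identities hidden from voter
    return {"model_a": a, "model_b": b, "vote": vote}
```

Only after the vote is cast are the two model identities associated with the result, which is what makes the comparison blind.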

Methodology

Each match uses an Elo rating system: a lower‑rated model gains more points when it wins, while a higher‑rated model gains fewer. LMSYS introduced a duplicate‑query removal filter that discards overly repetitive prompts (e.g., repeated “Hello”) to improve data diversity. All vote data are publicly released after anonymisation.
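The asymmetric point exchange described above falls directly out of the standard Elo formulas. This is a minimal sketch of the textbook update rule; the K-factor of 32 is a conventional illustrative choice, not a parameter published by LMSYS.

```python
def expected_score(r_a, r_b):
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, score_a, k=32):
    """Return both models' ratings after one match.

    score_a is 1.0 for an A win, 0.5 for a tie, 0.0 for a loss.
    k (illustrative here) controls how much a single vote moves a rating.
    """
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# An upset: the 1200-rated model beats the 1287-rated one, so it gains
# more points than the favorite would have gained by winning.
low, high = elo_update(1200, 1287, 1.0)
```

Because the expected score already encodes the rating gap, the update is zero-sum and automatically rewards upsets more than expected wins.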

Latest Rankings (May 20, 2024)

Global leaderboard (Elo scores):

- GPT‑4o – 1287 (rank 1)
- Yi‑Large (Zero‑One Wanwu) – ~1240 (rank 7; highest‑ranked Chinese model)
- GPT‑4‑Turbo – ~1240
- Gemini 1.5 Pro – ~1240
- Claude 3 Opus – ~1240
- Llama‑3‑70B‑Instruct – ~1200
- Claude 3 Sonnet – ~1200

Chinese‑language leaderboard: Yi‑Large and GPT‑4o tie for first place, followed by Qwen‑Max and GLM‑4.

Category‑Specific Results

LMSYS publishes three challenging evaluation tracks:

- Coding – Yi‑Large scores above Claude 3 Opus and ties with GPT‑4‑Turbo and GPT‑4 for second place.
- Longer Query – Yi‑Large ranks second globally, alongside GPT‑4‑Turbo, GPT‑4, and Claude 3 Opus.
- Hard Prompts – Yi‑Large again takes second position, matching GPT‑4‑Turbo, GPT‑4, and Claude 3 Opus.

Technical Observations

The arena includes 44 participating models, covering open‑source (e.g., Llama‑3‑70B) and proprietary systems from major AI labs. Duplicate‑query removal has shifted Yi‑Large’s Elo upward, placing it jointly fourth with Claude 3 Opus and GPT‑4‑0125‑preview after cleaning.
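A duplicate-query filter of the kind described could be as simple as a frequency cap over normalized prompts. This is a sketch under stated assumptions: the normalization (strip and lowercase) and the repeat threshold are illustrative, not LMSYS's published procedure.

```python
from collections import Counter

def dedupe_prompts(votes, max_repeats=3):
    """Keep at most `max_repeats` votes per distinct prompt.

    `votes` is a list of dicts with a "prompt" key. Normalizing by
    strip()/lower() and capping at 3 repeats are illustrative assumptions;
    the goal is simply to stop prompts like a repeated "Hello" from
    dominating the rating data.
    """
    seen = Counter()
    kept = []
    for vote in votes:
        key = vote["prompt"].strip().lower()
        seen[key] += 1
        if seen[key] <= max_repeats:
            kept.append(vote)
    return kept
```

Filtering like this changes which matches feed the Elo updates, which is why cleaning the data can move a model's rating even though no new votes were cast.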

Implications

The blind‑test design reduces “ranking‑gaming” such as training‑set leakage or curated prompt sets, providing a more objective measure of real‑world performance. The public Elo scores enable researchers and developers to compare capabilities across models without relying on synthetic benchmarks.

Access

Live voting interface: https://arena.lmsys.org/

Continuously updated leaderboard: https://chat.lmsys.org/?leaderboard
