How LMSYS Chatbot Arena Ranks Yi‑Large Among Global LLMs: Insights & Methodology
Using blind user voting and an Elo scoring system, the LMSYS Chatbot Arena benchmark has placed China's Yi-Large model among the top global large language models. This article covers the platform's methodology, the latest ranking results, and the broader implications for the AI industry.
Background
Chatbot Arena, run by LMSYS Org, is an open blind-test platform where real users compare responses from anonymised pairs of large language models (LLMs). Developers submit models, which are randomly paired; users enter prompts without seeing either model's identity and vote for the better answer (A, B, tie, or both bad). The platform has recorded millions of votes and updates each model's Elo rating after every match.
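To make that flow concrete, here is a minimal sketch of a single match (the names run_match, get_answer, and ask_user are hypothetical illustrations, not LMSYS's actual code):

```python
import random
from dataclasses import dataclass

@dataclass
class Vote:
    """One blind-test outcome."""
    model_a: str
    model_b: str
    outcome: str  # 'A', 'B', 'tie', or 'both_bad'

def run_match(models, prompt, get_answer, ask_user):
    # Randomly pair two distinct models; identities stay hidden from the voter.
    model_a, model_b = random.sample(models, 2)
    answer_a = get_answer(model_a, prompt)
    answer_b = get_answer(model_b, prompt)
    # The voter only ever sees the two anonymised answers.
    outcome = ask_user(answer_a, answer_b)
    return Vote(model_a, model_b, outcome)
```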
Methodology
Ratings follow an Elo system: after each match, a lower-rated model gains more points for a win than a higher-rated one would, and the loser drops correspondingly. LMSYS also introduced a duplicate-query filter that discards overly repetitive prompts (e.g., repeated "Hello") to improve data diversity. All vote data are released publicly after anonymisation.
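The article describes the scoring only at a high level. As a rough illustration, a textbook Elo update looks like the sketch below (the function names and the K-factor of 32 are assumptions for demonstration; LMSYS's published ratings are fitted statistically over the full vote history rather than updated one match at a time):

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Update both ratings after one match.

    score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie.
    An upset moves ratings more than an expected result, because the
    actual score deviates further from the expected score.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b

# Example: a ~1200 model beating a ~1280 model gains about 19.6 points,
# more than the ~16 it would gain against an equally rated opponent.
print(elo_update(1200, 1280, score_a=1.0))  # ≈ (1219.6, 1260.4)
```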
Latest Rankings (May 20, 2024)
Global leaderboard (Elo scores):
GPT‑4o – 1287 (rank 1)
Yi-Large (Zero-One Wanwu, i.e. 01.AI) – ~1240 (rank 7, the highest-ranked Chinese model)
GPT‑4‑Turbo – ~1240
Gemini 1.5 Pro – ~1240
Claude 3 Opus – ~1240
Llama‑3‑70B‑Instruct – ~1200
Claude 3 Sonnet – ~1200
Chinese‑language leaderboard: Yi‑Large and GPT‑4o tie for first place, followed by Qwen‑Max and GLM‑4.
Category‑Specific Results
LMSYS also publishes results for three harder evaluation tracks:
Coding – Yi-Large scores above Claude 3 Opus, tying GPT-4-Turbo and GPT-4 for second place.
Longer Query – Yi-Large ranks second globally, alongside GPT-4-Turbo, GPT-4, and Claude 3 Opus.
Hard Prompts – Yi-Large again ranks second, level with GPT-4-Turbo, GPT-4, and Claude 3 Opus.
Technical Observations
The arena currently includes 44 participating models, spanning open-source systems (e.g., Llama-3-70B) and proprietary ones from major AI labs. After duplicate-query removal, Yi-Large's Elo shifts upward, placing it joint fourth with Claude 3 Opus and GPT-4-0125-preview on the cleaned data.
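The duplicate-query filter is described only in outline. A minimal sketch of the idea might cap how many votes any one normalised prompt can contribute; the normalisation rule and the cap of 100 below are hypothetical choices, not LMSYS's published criteria:

```python
from collections import Counter

def deduplicate_prompts(votes, max_per_prompt=100):
    """Keep at most max_per_prompt votes per normalised prompt.

    Floods of near-identical prompts (e.g., bare "Hello") add little
    signal, so capping them improves the diversity of the vote data.
    """
    seen = Counter()
    kept = []
    for vote in votes:
        key = " ".join(vote["prompt"].lower().split())  # crude normalisation
        if seen[key] < max_per_prompt:
            kept.append(vote)
            seen[key] += 1
    return kept
```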
Implications
The blind-test design reduces ranking-gaming tactics such as training-set leakage or hand-curated prompt sets, providing a more objective measure of real-world performance. The public Elo scores let researchers and developers compare capabilities across models without relying on synthetic benchmarks.
Access
Live voting interface: https://arena.lmsys.org/
Continuously updated leaderboard: https://chat.lmsys.org/?leaderboard