Grok‑3 Evaluation Controversy and Community Reactions
Three days after Grok‑3’s launch, xAI was accused of inflating the model’s benchmark scores by using a “cons@64” method that aggregates 64 answers, a practice critics say unfairly skews comparisons with single‑shot models such as o3‑mini; meanwhile, developers have already begun experimenting with the model in simple games.
Three days after its launch, Grok‑3 was accused of manipulating benchmark results by using a “cons@64” evaluation method, which aggregates 64 answers and reports the most frequent one.
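For readers unfamiliar with the metric, cons@64 (consensus at 64) can be sketched as a simple majority vote over repeated samples. The function name and the sample answers below are hypothetical illustrations, not taken from either lab’s evaluation code:

```python
from collections import Counter

def cons_at_k(answers):
    """Majority-vote aggregation: sample the model k times on the same
    question and report the most frequent answer (cons@64 uses k = 64).
    `answers` is the list of k sampled outputs."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical run: 64 sampled answers to one benchmark question.
samples = ["42"] * 40 + ["41"] * 15 + ["40"] * 9
print(cons_at_k(samples))  # -> "42"
```

The vote can rescue a question the model answers correctly only some of the time, which is why critics consider cons@64 scores incomparable to single‑shot (pass@1) scores.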
An OpenAI applied‑team lead pointed out that the lighter‑colored portion of the Grok‑3 performance chart represents scores obtained with cons@64 rather than single‑shot answers, and called the presentation misleading.
Critics argue that comparing Grok‑3 (evaluated with cons@64) against models such as o3‑mini, o1, DeepSeek‑R1, and Gemini 2.0 Flash, all of which were evaluated with single answers, is unfair. Data from the o3‑mini blog post shows it outperforms Grok‑3 on single‑shot tasks.
Further analysis notes that o1 achieves comparable results when it too is evaluated with cons@64, suggesting the method confers a real advantage. OpenAI has not published cons@64 results for o3‑mini, reinforcing concerns about inconsistent comparisons.
Beyond the controversy, developers have quickly experimented with Grok‑3, creating simple games (e.g., a breakout clone) using short prompts in Replit, and even a classic brick‑breaker demo by former Windows engineer Dave Plummer.