Tagged articles

benchmark cheating

3 articles · Page 1 of 1

Apr 7, 2025 · Artificial Intelligence

Llama 4 Open‑Source Release Marred by Performance Failures and Alleged Training‑Data Cheating

Meta's newly released Llama 4 quickly drew controversy: alleged internal leaks describe training‑data cheating and benchmark over‑optimization, and its disappointing code‑generation performance fails to match even older models, prompting resignations and widespread criticism from the AI community.

AI model performance · Llama 4 · Meta AI

0 likes · 7 min read


Java Tech Enthusiast

Feb 22, 2025 · Artificial Intelligence

Grok‑3 Evaluation Controversy and Community Reactions

Three days after Grok‑3’s launch, xAI was accused by OpenAI staff of inflating the model’s benchmark scores by reporting “cons@64” results, which take the consensus of 64 sampled answers, a practice critics say unfairly skews comparisons with single‑shot scores for models like o3‑mini. Developers, meanwhile, have already begun experimenting with the model in simple games.

AI · Grok 3 · OpenAI

0 likes · 5 min read


Baobao Algorithm Notes

Nov 11, 2024 · Artificial Intelligence

Sneaky Tricks to Inflate Deep Learning Model Scores (And Why They’re Misleading)

The article enumerates a series of dubious techniques—from inflating batch sizes and hidden compute to hyper‑parameter tricks and fabricated evaluation methods—designed to artificially boost deep‑learning model scores on benchmarks, exposing how easy it is to game performance metrics.

AI tricks · Deep Learning · benchmark cheating

0 likes · 9 min read

