Why Some AI Agents Are Gaming the GAIA Benchmark – A Deep Dive

The article reveals how the GAIA agent benchmark’s publicly available validation set enables participants to cheat by submitting scores derived from known answers, exposing unprofessional practices by teams like Manus and OpenAI and urging the community to rely only on hidden test data for fair evaluation.

Baobao Algorithm Notes

Mar 15, 2025

GAIA is a benchmark designed to assess the capabilities of AI agents by giving them tasks such as analyzing an Excel sheet and reporting the highest math score for a specific class. The model must invoke external tools to complete the task and return the final result.

Typical leaderboards consist of two parts: a visible validation set and a hidden test set. The best leaderboards follow a Kaggle‑style workflow where participants submit models, code, or APIs; the test set remains unseen, ensuring the credibility of the scores. Users can evaluate their performance on the validation set locally before submitting for the hidden test set.
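The two-split workflow described above can be sketched as follows. This is a minimal illustration, not GAIA's actual scorer or submission format: the function names `score_locally` and `build_submission`, the `task_id` and `model_answer` field names, and the case-insensitive exact-match rule are all assumptions made for the example.

```python
import json


def normalize(answer: str) -> str:
    """Assumed normalization: trim whitespace and ignore case."""
    return answer.strip().lower()


def score_locally(predictions: dict[str, str], gold: dict[str, str]) -> float:
    """Percentage of validation questions answered correctly (exact match)."""
    correct = sum(
        normalize(predictions.get(task_id, "")) == normalize(answer)
        for task_id, answer in gold.items()
    )
    return 100.0 * correct / len(gold)


def build_submission(predictions: dict[str, str]) -> str:
    """Serialize test-set predictions as JSON Lines; no gold answers are involved."""
    return "\n".join(
        json.dumps({"task_id": task_id, "model_answer": answer})
        for task_id, answer in sorted(predictions.items())
    )


# Validation split: answers are public, so the score can be computed locally.
validation_gold = {"v1": "92", "v2": "Paris"}
validation_predictions = {"v1": "92", "v2": "London"}
validation_score = score_locally(validation_predictions, validation_gold)
print(f"validation score: {validation_score:.2f}")

# Test split: only predictions are sent out; the organizers do the scoring.
test_predictions = {"t1": "42", "t2": "blue"}
submission = build_submission(test_predictions)
print(submission)
```

The point of the design is that the test-split path never touches gold answers on the participant's side, so a reported test score has to come from the organizers.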

The problem with GAIA is that its validation data is openly downloadable, effectively giving each participant a full set of questions and answers. This allows some participants to compute perfect scores on the validation set and submit those numbers as leaderboard results.
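To see why a score on a public validation set proves little on its own, consider the toy example below. It is purely illustrative: `LEAKED_ANSWERS`, `lookup_agent`, and the sample questions are hypothetical and do not come from GAIA, and the sketch does not claim that any team worked this way. It only shows that an "agent" which looks up published answers reaches a perfect score without invoking a single tool.

```python
# Hypothetical public validation data: question -> published answer.
LEAKED_ANSWERS = {
    "What is the highest math score in class 2?": "98",
    "How many rows does the sheet contain?": "40",
}


def lookup_agent(question: str) -> str:
    """A fake 'agent' that returns the memorized answer instead of solving the task."""
    return LEAKED_ANSWERS.get(question, "")


def accuracy(agent, gold: dict[str, str]) -> float:
    """Percentage of questions for which the agent's answer equals the gold answer."""
    hits = sum(agent(question) == answer for question, answer in gold.items())
    return 100.0 * hits / len(gold)


validation_accuracy = accuracy(lookup_agent, LEAKED_ANSWERS)
print(validation_accuracy)  # 100.0 on the leaked split

# On unseen questions the same agent has nothing to look up.
hidden_gold = {"What is the median physics score?": "77"}
hidden_accuracy = accuracy(lookup_agent, hidden_gold)
print(hidden_accuracy)  # 0.0
```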

Evidence shows that teams such as Manus and OpenAI reported unusually high scores (e.g., 67.92, 67.44, 42.31) that match the validation‑set results from the H2O benchmark, suggesting they evaluated on the validation data rather than the hidden test set.
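One way to sanity-check this claim is to ask which question counts could produce the quoted percentages. The sketch below rests on two assumptions that are not stated in this article and should be checked against the official GAIA dataset card: that the three quoted scores are the Level 1 to Level 3 results in that order, and that the per-level split sizes are 53/86/26 questions for the validation set and 93/159/49 for the test set. The helper name `matching_counts` is hypothetical.

```python
def matching_counts(score: float, total: int, tol: float = 0.005) -> list[int]:
    """Return every count k of correct answers where 100 * k / total rounds to score."""
    return [k for k in range(total + 1) if abs(100.0 * k / total - score) < tol]


# Assumed per-level question counts (levels 1-3); verify against the dataset card.
VALIDATION_SIZES = [53, 86, 26]
TEST_SIZES = [93, 159, 49]
REPORTED = [67.92, 67.44, 42.31]  # scores quoted in the article

validation_hits = [matching_counts(s, n) for s, n in zip(REPORTED, VALIDATION_SIZES)]
test_hits = [matching_counts(s, n) for s, n in zip(REPORTED, TEST_SIZES)]

print(validation_hits)
print(test_hits)
```

Under these assumed sizes, each quoted score corresponds to a whole number of correct answers on the validation split (36/53, 58/86, 11/26) and to none on the test split, which is consistent with the scores having been computed on validation data.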

The author condemns this behavior as unprofessional, calls it a “bad practice” originating from OpenAI, and urges all participants to submit only on the official test set. The author also notes that GAIA officials have been consulted and appear to confirm the issue.

import pandas as pd

# Read the Excel file
# Assumption: test.xlsx holds only the Class 2, Grade 3 records (no class filter is applied)
df = pd.read_excel('test.xlsx')
# Compute the total score for each student across all five subjects
df['total_score'] = df[['math', 'english', 'chinese', 'physics', 'chemistry']].sum(axis=1)
# Select the row of the student with the highest total score
max_total_score_student = df.loc[df['total_score'].idxmax()]
# Output that student's math score
math_score_of_max_total_score_student = max_total_score_student['math']
print(f"Math score of the student with the highest total score in Class 2, Grade 3: {math_score_of_max_total_score_student}")

Written by

Baobao Algorithm Notes

Author of the BaiMian large model, offering technology and industry insights.
