Why Some AI Agents Are Gaming the GAIA Benchmark – A Deep Dive
The article shows how the GAIA agent benchmark’s validation set, whose answers are publicly available, lets participants cheat by submitting results derived from those known answers. It calls out what it describes as unprofessional practices by teams such as Manus and OpenAI, and urges the community to rely only on the hidden test set for fair evaluation.
