How a Chinese Team Reclaimed the Top Spot on the AI Agent Leaderboard After the OpenAI Ranking Scandal

The article analyzes the MLE‑Bench benchmark, Baidu's Famou 2.0 agent achieving a new SOTA score, the controversy over Disarray's cheating, and real‑world deployments in automotive, banking, and aerospace, illustrating how Harness Engineering is becoming the decisive factor in AI agent performance.

Machine Heart

Leaderboard Turmoil: The Battle Over AI Evaluation Standards

In October of the previous year, Baidu's Famou Agent scored 43.56 on the OpenAI‑run MLE‑Bench benchmark, briefly topping the leaderboard and prompting a surge of interest from nearly ten top teams.

In December, Baidu released Famou 2.0, improving the score to 59.56 and reclaiming the top position.

During the same period, a startup called Disarray submitted a score of 77.78 that appeared to exploit a loophole: its agent received binary feedback from a private test set and called external data sources, which yielded a "0.0 error" on GPS tasks and unrealistically low scores on image tasks.

MLE‑Bench responded on March 23 by adding a clean track (No Private LB) that separates out and labels methods with suspected data leakage. On this track, Baidu's Famou 2.0 posted a score of 64.44 without any private feedback or external data, securing an undisputed first place.

Why Famou 2.0 Won

Baidu’s success is attributed to three core advances in Harness Engineering:

Multi‑agent parallel exploration: The system spawns multiple initial algorithmic solutions, distributes them across "islands" as a population, and iteratively evolves them using large‑scale mutation and crossover on a distributed cluster, converging toward a global optimum without manual engineering of each capability.

Long‑term memory mechanism: This component mitigates the "forget‑after‑the‑first‑step" issue of large models, allowing the agent to retain analysis, decisions, and intermediate results across long‑chain tasks, keeping reasoning coherent.

Optimized infrastructure: Leveraging Baidu’s AI Cloud stack, the system achieves superior resource scheduling, parallel task execution, and fault‑tolerant recovery, making the overall pipeline faster, more stable, and more reliable.
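The island-based evolution described in the first item can be illustrated with a toy sketch. This is not Famou's implementation, which has not been published in this article. It is a generic island-model evolutionary loop under stated assumptions: candidates are single numbers rather than algorithmic solutions, the objective `fitness` is invented for illustration, islands run sequentially instead of on a distributed cluster, and every function name and parameter here is hypothetical.

```python
import random


def fitness(x):
    # Toy objective (assumption): higher is better, optimum at x = 3.
    return -(x - 3.0) ** 2


def evolve_island(population, rng, mutation_scale=0.5):
    """Run one generation on one island: selection, crossover, mutation."""
    ranked = sorted(population, key=fitness, reverse=True)
    parents = ranked[: max(2, len(ranked) // 2)]  # keep the better half
    children = []
    while len(children) < len(population) - 1:
        a, b = rng.sample(parents, 2)
        child = (a + b) / 2.0                     # crossover: blend two parents
        child += rng.gauss(0.0, mutation_scale)   # mutation: Gaussian noise
        children.append(child)
    return [ranked[0]] + children                 # elitism: best one survives


def migrate(islands):
    """Ring migration: each island's best replaces the next island's worst."""
    bests = [max(island, key=fitness) for island in islands]
    for i, island in enumerate(islands):
        worst = min(range(len(island)), key=lambda j: fitness(island[j]))
        island[worst] = bests[i - 1]


def run(num_islands=4, pop_size=8, generations=40, migrate_every=5, seed=0):
    rng = random.Random(seed)
    # Each island starts from its own random population of candidates.
    islands = [
        [rng.uniform(-10.0, 10.0) for _ in range(pop_size)]
        for _ in range(num_islands)
    ]
    for gen in range(1, generations + 1):
        # In a real system each island would evolve in parallel.
        islands = [evolve_island(island, rng) for island in islands]
        if gen % migrate_every == 0:
            migrate(islands)
    return max((x for island in islands for x in island), key=fitness)


if __name__ == "__main__":
    best = run()
    print(f"best candidate: {best:.3f}, fitness: {fitness(best):.5f}")
```

Keeping the islands mostly isolated lets each one explore a different region of the search space, while occasional migration spreads good candidates without collapsing diversity too early. In an agent setting, the mutation and crossover steps would presumably be model-driven edits to code rather than arithmetic on numbers.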

Benchmarks Validate, Industry Applies

Beyond the leaderboard, Famou’s Harness Engineering has been deployed in several real‑world scenarios:

Automotive design : The Chinese design firm Al‑te integrated Famou with its AI core to create the "Yufeng" predictive system, reducing aerodynamic analysis from ten hours to a few minutes with less than 5% error, cutting vehicle development cycles by 25%.

Banking risk control : CITIC Baixin Bank uses Famou as an autonomous "strategy evolution master" for feature mining, doubling feature‑extraction efficiency and improving risk‑segmentation by 2.41%.

Aerospace air‑quality monitoring : Beijing University of Technology applied Famou to micro‑air‑quality sensors on the Chinese space station, optimizing chromatographic column flow fields and markedly increasing gas‑separation efficiency.

Disaster prediction : Tianjin University employed Famou for landslide displacement and rock‑burst model selection, compressing a weeks‑long manual trial process into six hours.

These deployments illustrate how AI agents, when equipped with robust Harness Engineering, can free human experts from repetitive trial‑and‑error, allowing them to focus on defining scientific problems and discovering new principles.

Conclusion

The progression from model‑centric competition to framework‑centric competition marks a watershed in AI engineering. Baidu’s Famou demonstrates that a complete AI agent architecture, built on systematic Harness Engineering, can autonomously evolve optimal solutions across diverse, high‑complexity tasks, establishing a new paradigm for production‑level AI.
