How a Chinese Team Reclaimed the Top Spot on the AI Agent Leaderboard After the OpenAI Ranking Scandal

The article analyzes the MLE‑Bench benchmark, Baidu's Famou 2.0 agent achieving a new SOTA score, the controversy over Disarray's cheating, and real‑world deployments in automotive, banking, and aerospace, illustrating how Harness Engineering is becoming the decisive factor in AI agent performance.

Machine Heart

Apr 11, 2026

Leaderboard Turmoil: The Battle Over AI Evaluation Standards

In October 2025, Baidu's Famou Agent scored 43.56 on MLE‑Bench, the benchmark run by OpenAI, briefly topping the leaderboard and prompting a surge of interest from nearly ten top teams.

In December, Baidu released Famou 2.0, improving the score to 59.56 and reclaiming the top position.

During the same period, a startup called Disarray submitted a score of 77.78 that appeared to exploit a loophole. Its agent reported "0.0 error" on GPS tasks and unrealistically low scores on image tasks, apparently by receiving binary feedback from a private test set and by calling external data sources.

MLE‑Bench responded on March 23 by adding a clean track (No Private LB) that isolates methods with suspected data leakage and labels them accordingly. After the clean track was introduced, Baidu's Famou 2.0 submitted a score of 64.44 without any private feedback or external data, securing an undisputed first place.
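To make the effect of the clean track concrete, here is a minimal sketch of how such a filter could work. It is not MLE‑Bench's actual code or data format: the `Submission` class, its two flag fields, and the `clean_track` function are hypothetical names chosen for illustration, and the flags on the Disarray entry only encode the suspicions reported above. The two scores are the ones quoted in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Submission:
    team: str
    score: float
    # Hypothetical flags; the real leaderboard's labeling scheme may differ.
    suspected_private_feedback: bool
    suspected_external_data: bool


def clean_track(submissions):
    """Keep only unflagged submissions, ranked by score (higher is better)."""
    clean = [
        s for s in submissions
        if not (s.suspected_private_feedback or s.suspected_external_data)
    ]
    return sorted(clean, key=lambda s: s.score, reverse=True)


submissions = [
    # Flags reflect the suspicions described in the article, not a verified finding.
    Submission("Disarray", 77.78, True, True),
    Submission("Famou 2.0", 64.44, False, False),
]

for rank, s in enumerate(clean_track(submissions), start=1):
    print(rank, s.team, s.score)
```

Under these assumptions the higher raw score drops out of the clean ranking, which is why a lower number can still take first place once flagged entries are separated.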

Why Famou 2.0 Won

Baidu’s success is attributed to three core advances in Harness Engineering:

Multi‑agent parallel exploration: The system spawns multiple initial algorithmic solutions, distributes them across "islands" as a population, and iteratively evolves them using large‑scale mutation and crossover on a distributed cluster, converging toward a global optimum without manual engineering of each capability.

Long‑term memory mechanism: This component mitigates the "forget‑after‑the‑first‑step" issue of large models, allowing the agent to retain analysis, decisions, and intermediate results across long‑chain tasks, keeping reasoning coherent.

Optimized infrastructure: Leveraging Baidu’s AI Cloud stack, the system achieves superior resource scheduling, parallel task execution, and fault‑tolerant recovery, making the overall pipeline faster, more stable, and more reliable.
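The "islands" idea in the first point matches a well‑known pattern, the island‑model evolutionary algorithm. The article does not describe Famou's internals, so the following is only a generic, single‑process sketch of that pattern. Every name here is hypothetical, and a toy one‑dimensional objective stands in for the candidate ML solutions a real agent would score. Each island evolves its own population through selection, crossover, and mutation, and the islands periodically exchange their best candidates.

```python
import random


def fitness(x):
    # Toy objective (an assumption for illustration): higher is better, peak at x = 3.
    return -(x - 3.0) ** 2


def mutate(x, rng, scale=0.5):
    # Perturb a candidate with Gaussian noise.
    return x + rng.gauss(0.0, scale)


def crossover(a, b, rng):
    # Blend two parents with a random weight.
    w = rng.random()
    return w * a + (1.0 - w) * b


def evolve_island(pop, rng):
    """One generation: keep the better half, refill with mutated crossovers."""
    ranked = sorted(pop, key=fitness, reverse=True)
    survivors = ranked[: len(ranked) // 2]
    children = []
    while len(survivors) + len(children) < len(pop):
        a, b = rng.sample(survivors, 2)
        children.append(mutate(crossover(a, b, rng), rng))
    return survivors + children


def migrate(islands):
    """Ring migration: each island's worst member is replaced by its neighbor's best."""
    bests = [max(pop, key=fitness) for pop in islands]
    for i, pop in enumerate(islands):
        worst = min(range(len(pop)), key=lambda j: fitness(pop[j]))
        pop[worst] = bests[i - 1]


def run(n_islands=4, pop_size=8, generations=40, migrate_every=5, seed=0):
    rng = random.Random(seed)
    islands = [
        [rng.uniform(-10.0, 10.0) for _ in range(pop_size)]
        for _ in range(n_islands)
    ]
    for gen in range(1, generations + 1):
        # In a distributed setting each island would run on its own worker;
        # here they are evolved one after another for simplicity.
        islands = [evolve_island(pop, rng) for pop in islands]
        if gen % migrate_every == 0:
            migrate(islands)
    return max((x for pop in islands for x in pop), key=fitness)


if __name__ == "__main__":
    print(run())
```

Keeping the populations separate between migrations preserves diversity, so one early lucky candidate does not take over the whole search. That is the usual motivation for islands over a single large population.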

Benchmarks Validate, Industry Applies

Beyond the leaderboard, Famou’s Harness Engineering has been deployed in several real‑world scenarios:

Automotive design : The Chinese design firm Al‑te integrated Famou with its AI core to create the "Yufeng" predictive system, reducing aerodynamic analysis from ten hours to a few minutes with less than 5% error, cutting vehicle development cycles by 25%.

Banking risk control : CITIC Baixin Bank uses Famou as an autonomous "strategy evolution master" for feature mining, doubling feature‑extraction efficiency and improving risk‑segmentation by 2.41%.

Aerospace air‑quality monitoring : Beijing University of Technology applied Famou to micro‑air‑quality sensors on the Chinese space station, optimizing chromatographic column flow fields and markedly increasing gas‑separation efficiency.

Disaster prediction : Tianjin University employed Famou for landslide displacement and rock‑burst model selection, compressing a weeks‑long manual trial process into six hours.

These deployments illustrate how AI agents, when equipped with robust Harness Engineering, can free human experts from repetitive trial‑and‑error, allowing them to focus on defining scientific problems and discovering new principles.

Conclusion

The progression from model‑centric competition to framework‑centric competition marks a watershed in AI engineering. Baidu’s Famou demonstrates that a complete AI agent architecture, built on systematic Harness Engineering, can autonomously evolve optimal solutions across diverse, high‑complexity tasks, establishing a new paradigm for production‑level AI.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
