Machine Heart
Mar 31, 2026 · Artificial Intelligence

What Does DeepResearch Bench Measure? Toward Human‑Level AI Agent Evaluation

DeepResearch Bench and DeepResearch Bench II, open‑source benchmarks from a USTC team, evaluate deep‑research AI agents on report quality, citation reliability, and information recall. They use the RACE and FACT evaluation frameworks, with the goal of aligning automated scores with the judgments of human experts.

AI Agent Evaluation · DeepResearch Bench · FACT