Tagged articles

AI Agent Evaluation

Mar 31, 2026 · Artificial Intelligence

What Does DeepResearch Bench Measure? Toward Human‑Level AI Agent Evaluation

DeepResearch Bench and DeepResearch Bench II, open‑source benchmarks from the USTC team, evaluate deep‑research AI agents on report quality, citation reliability, and information recall. They use the RACE and FACT frameworks and aim to align automated scores with human expert judgments.

AI Agent Evaluation · DeepResearch Bench · FACT

12 min read