First Enterprise IT Ops Agent Benchmark Shows Claude Leads with Just 47% Score
The ITBench-AA benchmark, the first evaluation designed specifically for enterprise IT operations agents, tests agents on 59 SRE scenarios. Even top models such as Claude Opus 4.7 score only 47%, which highlights both the difficulty of the tasks and the cost-effectiveness gap between proprietary and open-source agents.
