Open‑Source AI‑Infra Ops Agent Benchmark Built on Nearly a Hundred Billion Real Ops Records

This article introduces AISHPerf, described as the first open‑source benchmark for AI‑infra operations agents, built from nearly a hundred billion real‑world ops records. It covers the data pipeline, the multi‑layer coverage, the evaluation metrics, experimental results showing that current models lag behind human experts, and plans to expand and refine the benchmark.

Machine Heart

Jun 29, 2026

Background and Motivation

Global AI infrastructure demand is growing rapidly: Morgan Stanley forecasts cumulative AI‑infra investment of $2.9 trillion by 2028. Industry cost analysis attributes 15%–20% of total cost to operations labor, failure losses, and idle clusters. Applied to the $2.9 trillion forecast, that share is roughly $435 billion to $580 billion, so even the low end of the range represents a $435 billion optimization opportunity.

Benchmark Development (AISHPerf)

The China Academy of Information and Communications Technology (CAICT) launched the first AI‑Infra ops‑agent benchmark, AISHPerf, with technical support from Wu Wen Xin Qiong. The benchmark is open‑source and defines a realistic problem space for AI agents in AI‑native infrastructure.

Dataset and Coverage

The dataset starts from nearly a hundred billion raw ops records collected from 2024 to January 2026. A strict three‑stage data‑engineering process filtered these down to 100,000 high‑quality, complete incidents, which were then abstracted into 103 test cases. Each case contains the observed symptom, a full troubleshooting chain, and a verified root cause. The benchmark covers five categories across the stack (host, high‑performance devices, container platform, training/inference scripts, security/operator), 44 problem types, 22 sub‑fault domains, and five domestic chips (including Tianjin, Wall‑e, Mo Er, and Ascend). Problems are split into three difficulty levels, and the average manual handling time is 1.5 hours.
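To make the shape of a test case concrete, here is a minimal Python sketch of how one case might be represented and checked for completeness. The class name, field names, and the sample case are illustrative assumptions; the article does not publish AISHPerf's actual schema, and the example content is invented.

```python
from dataclasses import dataclass
from enum import Enum


class Difficulty(Enum):
    EASY = "easy"
    MEDIUM = "medium"
    HARD = "hard"


@dataclass(frozen=True)
class OpsTestCase:
    """One benchmark case. Field names are hypothetical, not AISHPerf's schema."""
    case_id: str
    category: str                           # e.g. "container platform"
    difficulty: Difficulty
    symptom: str                            # what the operator first observes
    troubleshooting_chain: tuple[str, ...]  # ordered diagnostic steps
    root_cause: str                         # verified ground truth


def validate(case: OpsTestCase) -> list[str]:
    """Return a list of problems; an empty list means the case is complete."""
    problems = []
    if not case.symptom.strip():
        problems.append("missing symptom")
    if not case.troubleshooting_chain:
        problems.append("empty troubleshooting chain")
    if not case.root_cause.strip():
        problems.append("missing root cause")
    return problems


# Invented example, for illustration only.
example = OpsTestCase(
    case_id="demo-001",
    category="container platform",
    difficulty=Difficulty.MEDIUM,
    symptom="Training pod restarts repeatedly after scheduling",
    troubleshooting_chain=(
        "inspect pod events",
        "check node resource pressure",
        "review container memory limits",
    ),
    root_cause="container memory limit set below the job's working set",
)
```

Calling `validate(example)` returns an empty list, while a case with a blank root cause or an empty chain would be rejected, which mirrors the "complete incidents" requirement described above.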

Evaluation Framework

AISHPerf adopts a multi‑dimensional evaluation system. The primary metric is a composite score that weights tasks by difficulty; auxiliary metrics include average latency (seconds per task), average token consumption (tokens per task), and tool‑call count. The benchmark provides a full evaluation stack (AIops‑Eval) consisting of five modules: User (input handling), Agent (the target LLM or agent), Env (environment construction and cleanup), Evaluator (trajectory scoring, including LLM‑as‑judge), and Tracing (execution trace collection via Langfuse).
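As a rough illustration of a difficulty‑weighted composite score with the auxiliary metrics alongside it, consider the sketch below. The weights, the 0 to 100 scale, the per‑task score range, and all names are assumptions made for the example; the article does not state the actual formula, and the sample results are invented.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical difficulty weights; the real ones are not given in the article.
WEIGHTS = {"easy": 1.0, "medium": 2.0, "hard": 3.0}


@dataclass(frozen=True)
class TaskResult:
    difficulty: str   # "easy", "medium", or "hard"
    score: float      # evaluator score for the task, assumed to lie in [0, 1]
    latency_s: float  # wall-clock seconds spent on the task
    tokens: int       # tokens consumed on the task
    tool_calls: int   # number of tool invocations


def composite_score(results: list[TaskResult]) -> float:
    """Difficulty-weighted mean of per-task scores, scaled to 0-100."""
    total_weight = sum(WEIGHTS[r.difficulty] for r in results)
    if total_weight == 0:
        return 0.0  # no tasks were run
    weighted = sum(WEIGHTS[r.difficulty] * r.score for r in results)
    return 100.0 * weighted / total_weight


def auxiliary_metrics(results: list[TaskResult]) -> dict[str, float]:
    """Per-task averages; raises statistics.StatisticsError on an empty list."""
    return {
        "avg_latency_s": mean(r.latency_s for r in results),
        "avg_tokens": mean(r.tokens for r in results),
        "avg_tool_calls": mean(r.tool_calls for r in results),
    }


# Invented results, for illustration only.
results = [
    TaskResult("easy", 1.0, 30.0, 4000, 3),
    TaskResult("medium", 0.5, 60.0, 6000, 5),
    TaskResult("hard", 0.0, 90.0, 8000, 10),
]
```

With these invented numbers the weighted sum is 1·1.0 + 2·0.5 + 3·0.0 = 2 out of a total weight of 6, so `composite_score(results)` is about 33.3. Weighting by difficulty means that failing a hard task costs more than failing an easy one.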

Experimental Results

We evaluated a ReAct‑loop baseline agent with several domestic and international models. All models scored below 50 points. They showed notable latency improvements, but their success rates remain lower than those of human experts. On medium and hard tasks, accuracy drops below 50% and tool‑call time rises sharply, while token consumption stays comparable across difficulty levels. Models handle pure code bugs better than hardware faults: on hardware faults, accuracy is low and token usage is higher, indicating a confidence gap for hardware‑related issues.
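For readers unfamiliar with the pattern, a ReAct loop alternates between the model choosing an action, a tool executing it, and the resulting observation being fed back to the model until it commits to an answer. The sketch below is a generic, minimal version with a scripted stand‑in for the model; the step protocol, function names, tool, and task text are all assumptions for illustration and do not describe the baseline agent's actual implementation.

```python
from typing import Callable

# Hypothetical step protocol: the model returns either
# ("act", tool_name, tool_input) or ("finish", answer).
Model = Callable[[list[str]], tuple]
Tool = Callable[[str], str]


def react_loop(model: Model, tools: dict[str, Tool], task: str,
               max_steps: int = 8) -> dict:
    """Run act/observe cycles until the model finishes or a limit is hit."""
    transcript = [f"Task: {task}"]
    tool_calls = 0
    for _ in range(max_steps):
        step = model(transcript)
        kind = step[0] if step else None
        if kind == "finish" and len(step) == 2:
            return {"answer": step[1], "tool_calls": tool_calls,
                    "status": "finished"}
        if kind == "act" and len(step) == 3 and step[1] in tools:
            _, name, arg = step
            observation = tools[name](arg)
            tool_calls += 1
            transcript.append(f"Action: {name}({arg!r})")
            transcript.append(f"Observation: {observation}")
            continue
        # Anything else is treated as malformed output and ends the run early.
        return {"answer": None, "tool_calls": tool_calls,
                "status": "malformed_step"}
    return {"answer": None, "tool_calls": tool_calls, "status": "step_limit"}


def scripted_model(transcript: list[str]) -> tuple:
    """Stand-in for an LLM: read one log, then answer."""
    if not any(line.startswith("Observation:") for line in transcript):
        return ("act", "read_log", "trainer.log")
    return ("finish", "NCCL timeout on rank 3")


# Invented tool and task, for illustration only.
tools = {"read_log": lambda path: f"{path}: NCCL watchdog timeout, rank 3"}
result = react_loop(scripted_model, tools, "Training job hangs at step 1200")
```

Here `result["status"]` is `"finished"` after a single tool call. The `malformed_step` and `step_limit` exits show one way a harness could record early termination and count tool calls per task.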

Observed Failure Modes

Task stability issues: malformed tokens or unsafe operations cause premature termination.

Poor reasoning chains: superficial fixes, unverified conclusions, or overly generic troubleshooting steps.

Unsafe decisions: dangerous tool calls that can crash the physical environment and require manual intervention.

Future Directions

We will continuously enrich the dataset, expand coverage to more tech stacks and domestic chip families, and improve the AIops‑Chaos fault‑injection suite for richer, more robust failure scenarios. The evaluation framework will be opened to additional agent paradigms beyond ReAct, fostering a public baseline for AI‑infra ops agents. Ongoing collaborations with CAICT, Tsinghua University, and the broader community aim to advance the co‑evolution of AI and infrastructure.
