
How Transparent AI Boosts Trust in AIOps: Explainable Root‑Cause Solutions

This article examines the rapid growth of the Chinese IT operations market, explains why AIOps faces trust challenges due to opaque deep‑learning models, and presents AsiaInfo's transparent‑model architecture and post‑hoc explanation engine together with three concrete explainable root‑cause analysis methods, concluding with an outlook on trustworthy AIOps.


Background

AIOps augments traditional IT automation with a machine‑learning layer that analyses monitoring data, makes decisions and triggers remediation scripts. In deep‑learning‑based AIOps, the latent layers between inputs and outputs are opaque, which hampers trust and limits adoption.

Explainable AI (XAI) Approaches

Two major families of XAI are used:

Model‑based explainability: design intrinsically interpretable models that output reasons together with predictions.

Result‑based explainability: treat the model as a black box and infer explanations from observed input‑output behaviour (e.g., SHAP, LIME).
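As a small illustration of the result‑based family, the sketch below applies SHAP's TreeExplainer to a toy classifier; the synthetic data, feature meanings and model choice are assumptions made purely for illustration and are not taken from the original material.

```python
# A minimal sketch of result-based (post-hoc) explanation with SHAP.
# The synthetic data, feature meanings and model are illustrative assumptions.
import numpy as np
import shap
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                    # columns: cpu, latency, error_rate (hypothetical)
y = (X[:, 1] + 0.5 * X[:, 2] > 0).astype(int)    # label driven mostly by "latency"

model = RandomForestClassifier(random_state=0).fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X[:5])       # per-feature contributions for 5 samples
print(shap_values)                               # larger magnitudes => stronger influence on the prediction
```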

Transparent‑Model Architecture

The proposed architecture combines a transparent‑model layer (pre‑explainable rule‑based features) with a post‑hoc explanation engine. This makes the model construction process and inference path observable, satisfying compliance requirements and improving user confidence.

Solution 1 – Random‑Walk Call‑Chain Root‑Cause Model

Applicable to micro‑service environments that emit call‑chain logs. The workflow is:

Collect real‑time call‑chain events and compute per‑node performance metrics (latency, error rate, etc.).

Build a transition probability matrix P where P[i][j] is the probability of a call from node i to node j.

Perform a large number of random walks on the graph; the visitation count v_i for each node indicates how often the node participates in abnormal paths.

The node with the highest v_i during an incident is flagged as the root cause. Anomaly detection on its performance metrics (e.g., latency spikes) provides a concrete explanation.
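A minimal sketch of this idea follows, assuming incident‑period call‑chain events have already been aggregated into pairwise call counts; the service names, counts and the choice to terminate walks at dead ends are illustrative assumptions, not AsiaInfo's exact pipeline.

```python
# A minimal sketch of the random-walk call-chain idea, assuming call-chain events
# collected during the incident have been aggregated into pairwise call counts.
# Service names and counts are invented; note that calls die at the failing
# "payment" service, so it has no outgoing edges in the incident-period data.
import numpy as np

nodes = ["gateway", "order", "payment", "inventory"]
calls = np.array([[0, 80, 10, 10],                  # calls[i][j]: incident-period calls i -> j
                  [0,  0, 60, 40],
                  [0,  0,  0,  0],
                  [0,  0,  0,  0]], dtype=float)

# Transition probability matrix P: normalise each row of the call-count matrix.
row_sums = calls.sum(axis=1, keepdims=True)
P = np.divide(calls, row_sums, out=np.zeros_like(calls), where=row_sums > 0)

rng = np.random.default_rng(0)
visits = np.zeros(len(nodes))
for _ in range(5000):                               # many walks from the entry node
    node = 0
    while P[node].sum() > 0:                        # stop when the walk reaches a dead end
        node = rng.choice(len(nodes), p=P[node])
        visits[node] += 1

suspect = nodes[int(np.argmax(visits))]
print(dict(zip(nodes, visits)), "-> most visited node (suspected root cause):", suspect)
# A separate anomaly check on the suspect's metrics (e.g., a latency spike)
# then supplies the human-readable explanation, as described above.
```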

Solution 2 – Star‑Graph Alarm Root‑Cause Model

This model uses the network topology and alarm streams:

Construct a graph where vertices are services/components and edges represent communication links.

Apply a centrality metric (e.g., eigenvector or betweenness centrality) to each vertex.

When an alarm is raised, the alarm stream is folded into the graph, so the centrality of the faulty node increases while correlated nodes exhibit abnormal metrics.

The node with the highest centrality change is identified as the root cause.
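A rough sketch of this comparison is shown below using networkx eigenvector centrality; the topology, alarm counts and the way alarms are folded into edge weights are illustrative assumptions rather than the original implementation.

```python
# A rough sketch of the centrality-change idea with networkx; topology, alarm counts
# and the alarm-to-edge-weight mapping are illustrative assumptions.
import networkx as nx

topology = [("lb", "web"), ("web", "db"), ("web", "cache"), ("web", "auth")]  # hypothetical links

def weighted_centrality(alarm_counts):
    """Eigenvector centrality with edges weighted by alarms seen at either endpoint."""
    g = nx.Graph()
    for a, b in topology:
        g.add_edge(a, b, weight=1.0 + alarm_counts.get(a, 0) + alarm_counts.get(b, 0))
    return nx.eigenvector_centrality(g, weight="weight", max_iter=1000)

baseline = weighted_centrality({})                      # quiet period
incident = weighted_centrality({"db": 12, "web": 3})    # alarms observed during the incident

delta = {n: incident[n] - baseline[n] for n in baseline}
print(delta, "-> largest centrality change:", max(delta, key=delta.get))
```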

Solution 3 – Alarm Aggregation & Association‑Rule Method

Steps:

Cluster incoming alarms by content, timestamp and frequency using a clustering algorithm (e.g., DBSCAN).

Generate heat‑maps to visualise alarm density.

Mine association rules (e.g., Apriori) that link aggregated alarm clusters to underlying root‑cause events.

Present the rules with support and confidence values as interpretable explanations.
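An illustrative sketch of the rule‑mining step follows; the alarm names, window contents and thresholds are invented, and a clustering pass (e.g., DBSCAN over timestamps and content) is assumed to have already produced the per‑window alarm sets.

```python
# An illustrative sketch of association-rule mining over aggregated alarms.
# Alarm names, window contents and thresholds are invented for illustration.
from collections import Counter
from itertools import combinations

# Alarms already grouped into time windows by an upstream clustering pass.
windows = [
    {"db_latency_high", "api_timeout", "queue_backlog"},
    {"db_latency_high", "api_timeout"},
    {"disk_full", "db_latency_high", "api_timeout"},
    {"network_flap"},
]

item_counts = Counter()
pair_counts = Counter()
for w in windows:
    item_counts.update(w)
    pair_counts.update(combinations(sorted(w), 2))

n = len(windows)
for (a, b), c in pair_counts.items():
    support = c / n                         # fraction of windows containing both alarms
    confidence = c / item_counts[a]         # P(b | a)
    if support >= 0.5 and confidence >= 0.8:
        print(f"{a} => {b}  support={support:.2f}  confidence={confidence:.2f}")
```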

Implementation Workflow

Ingest call‑chain or alarm data, compute per‑node metrics, and run anomaly detection (e.g., statistical thresholding, isolation forest) for each node.

Construct the probability transition matrix P and calculate node scores with the random‑walk formula score_i = Σ_j P[i][j] × v_j (see the sketch after this workflow).

Generate anomaly tables, build a causal graph (nodes = abnormal metrics, edges = parent‑to‑child relationships), and visualise fault propagation to aid engineers in tracing the chain of events.
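The steps above can be tied together in a compact sketch: per‑node anomaly detection with an isolation forest, followed by the score formula score_i = Σ_j P[i][j] × v_j expressed as a matrix product. The latency samples, transition matrix and visit counts below are placeholders, not real production data.

```python
# A compact sketch of the workflow: isolation-forest anomaly detection per node,
# then the random-walk score formula score_i = sum_j P[i][j] * v[j].
# Latency samples, transition matrix P and visit counts v are placeholders.
import numpy as np
from sklearn.ensemble import IsolationForest

latencies = {                                   # hypothetical per-node latency samples (ms)
    "order":   [48, 52, 50, 49, 51, 240],
    "payment": [30, 31, 29, 33, 30, 32],
}
for node, values in latencies.items():
    x = np.array(values, dtype=float).reshape(-1, 1)
    labels = IsolationForest(random_state=0).fit_predict(x)   # -1 marks anomalous samples
    print(node, "anomalous latencies:", x[labels == -1].ravel())

P = np.array([[0.0, 0.8, 0.2],                  # transition probabilities over three nodes
              [0.0, 0.0, 1.0],
              [0.3, 0.3, 0.4]])
v = np.array([120.0, 300.0, 90.0])              # visit counts from the random-walk step
scores = P @ v                                  # score_i = sum_j P[i][j] * v[j]
print("node scores:", scores, "-> highest-scoring node index:", int(np.argmax(scores)))
```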

Application Scenario: Fault‑Convergence AI Explainability

The transparent‑model pipeline is used for real‑time alarm convergence and root‑cause positioning. It outputs:

Textual similarity scores between alarms (e.g., cosine similarity of vectorised messages; see the sketch after this list).

Confidence metrics for each association rule.

Visualisations of the causal graph and propagation paths (Figures 9‑13 in the original material).
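As an illustration of the first output, the sketch below computes cosine similarity between TF‑IDF vectors of alarm messages; the sample alarm texts are invented, and TF‑IDF is one reasonable vectorisation choice rather than necessarily the one used in production.

```python
# A minimal sketch of the textual-similarity output; alarm messages are invented
# and TF-IDF is an assumed (not confirmed) choice of vectorisation.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

alarms = [
    "db connection pool exhausted on payment service",
    "payment service db connection timeout",
    "disk usage above 90 percent on log node",
]
vectors = TfidfVectorizer().fit_transform(alarms)
similarity = cosine_similarity(vectors)          # pairwise similarity between alarm texts
print(similarity.round(2))                       # high off-diagonal values suggest related alarms
```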

Future Outlook

Explainable AI will become a prerequisite for trustworthy AIOps. Ongoing research aims to improve model auditability, provide richer confidence indicators, and integrate XAI into automated decision loops. Continued collaboration between academia and industry is expected to deliver more reliable, transparent operations.

Tags: operations, AIOps, root cause analysis, explainable AI, AI trust, transparent models
Written by

AsiaInfo Technology: New Tech Exploration

AsiaInfo's cutting‑edge ICT viewpoints and industry insights, featuring its latest technology and product case studies.
