Tag

root cause analysis


Efficient Ops
Apr 22, 2025 · Operations

How AI Agents Are Transforming IT Operations and Fault Management

This article explores how AI agents powered by large models can predict failures, perform root‑cause analysis, enhance knowledge‑based Q&A, automate change releases, and enable intelligent decision‑making, dramatically improving efficiency and reliability in modern IT operations.

AI Ops · Automation · Fault Prediction
0 likes · 7 min read

Aikesheng Open Source Community
Mar 25, 2025 · Databases

ChatDBA vs DeepSeek: AI‑Driven Diagnosis of OceanBase Backup Cluster Tenant Sync Issue (Case Study)

This case study demonstrates how the AI assistant ChatDBA identifies and resolves a tenant data‑synchronization failure in an OceanBase primary‑backup cluster, detailing four interactive troubleshooting rounds, the final SQL fix, and a comparative analysis with the DeepSeek‑R1 model.

AI Assistant · ChatDBA · Database Troubleshooting
0 likes · 5 min read

Xiaohongshu Tech REDtech
Oct 9, 2024 · Operations

AIOps Implementation at Xiaohongshu: Fault Localization and Intelligent Operations

Xiaohongshu’s AIOps initiative builds a four‑layer framework that leverages machine‑learning‑driven anomaly detection, causal analysis, and trace‑based fault localization to automatically identify root‑cause services in micro‑service environments. It achieves over 80% accuracy across 1,000 daily diagnoses and points to future enhancements in change correlation and automated remediation.

AIOps · Anomaly Detection · DevOps
0 likes · 28 min read

DataFunSummit
Sep 19, 2024 · Artificial Intelligence

AI-Powered Anomaly Diagnosis and Root Cause Analysis for Gaming Business Intelligence

This article presents 37 Mobile Games' exploration of AI‑driven intelligent analysis, covering anomaly diagnosis, root‑cause analysis, QBI fluctuation insights, AI data analysis reports, and a multi‑agent workflow for generating analytical reports within a gaming BI platform.

AI · Anomaly Detection · Business Intelligence
0 likes · 12 min read

Continuous Delivery 2.0
Jul 1, 2024 · Artificial Intelligence

How Meta Uses Llama2 to Accelerate Incident Response and Root‑Cause Analysis in AIOps

This article explains how Meta applies AI, specifically a fine‑tuned Llama2 model, to improve AIOps by automating incident monitoring, providing real‑time summaries, assisting responders with contextual information, and efficiently narrowing down root‑cause changes, ultimately reducing incident resolution time from hours to minutes.

AI · AIOps · Llama2
0 likes · 13 min read

Efficient Ops
Jun 20, 2024 · Operations

How Intelligent Ops Platforms Transform Distributed Banking Systems

This article explains how Chinese commercial banks are adopting intelligent operation platforms to collect, analyze, and visualize distributed system data in real time, enabling rapid root‑cause detection, full‑link tracing, and automated solution recommendations for complex financial services.

Intelligent Monitoring · banking · distributed systems
0 likes · 8 min read

Qunar Tech Salon
May 13, 2024 · Operations

Root Cause Analysis of Intermittent Timeout Issues in the Sirius Service Caused by RAID Card Consistency Checks

This article details the investigation of sporadic interface timeouts in the Sirius real‑time pricing service, revealing a weekly pattern linked to RAID controller consistency checks that cause IO spikes, logback queue blockage, and ultimately Dubbo client‑side timeouts, and proposes mitigation steps and general performance‑troubleshooting guidelines.

Logback · Monitoring · RAID
0 likes · 22 min read

Wukong Talks Architecture
Apr 15, 2024 · Operations

Post‑mortem of the April 8 Tencent Cloud API Outage and Improvement Measures

On April 8, a Tencent Cloud API outage caused console login failures for nearly 2,000 customers and affected several dependent services for 87 minutes. This post presents the detailed root‑cause analysis and the follow‑up improvement actions intended to strengthen system resilience and change management.

API · Incident Response · Tencent Cloud
0 likes · 8 min read

High Availability Architecture
Jan 9, 2024 · Operations

AIOps Practices for Incident Management at Meituan: From Risk Prevention to Post‑Operation

This article presents Meituan's two‑year exploration of AIOps in incident management, detailing risk‑prevention change detection, real‑time anomaly discovery, automated root‑cause diagnosis, multi‑dimensional KPI analysis, and similar‑event recommendation, while sharing architectural designs, algorithmic techniques, performance results, and future directions.

AIOps · Anomaly Detection · NLP
0 likes · 24 min read

Bilibili Tech
Dec 15, 2023 · Operations

Bilibili Alert Monitoring System: Design, Optimization, and Root‑Cause Analysis

Bilibili revamped its alert monitoring platform to meet rapid growth, focusing on effectiveness, timeliness, and coverage; it introduced a closed‑loop design and governance that cut weekly alerts by 90%, built a knowledge‑graph root‑cause system achieving 87.9% accuracy with sub‑minute latency, and integrated AIOps for ongoing refinement.

AIOps · Bilibili · SRE
0 likes · 21 min read

Qunar Tech Salon
Nov 22, 2023 · Operations

Optimizing Qunar's Monitoring System for Faster Fault Detection and Root‑Cause Analysis

This article details Qunar's comprehensive overhaul of its monitoring platform: introducing second‑granularity metrics, redesigning storage with VictoriaMetrics, optimizing client and server data collection, and building a root‑cause analysis tool. Together these changes cut order‑related fault discovery time from minutes to under one minute.

Microservices · Monitoring · TSDB
0 likes · 22 min read

Efficient Ops
Nov 15, 2023 · Operations

How a Unified Metadata Platform Boosts SRE Efficiency and Cuts Costs

This article describes how Huya built a unified metadata platform to break down data silos across its many operations systems, enabling standardized data ingestion, association, visualization, and analysis that improve resource governance, root‑cause diagnosis, and overall cost control for SRE teams.

SRE · graph database · metadata platform
0 likes · 13 min read

Selected Java Interview Questions
Oct 28, 2023 · Backend Development

Analyzing and Resolving an R2M Cache Usage Alert Before the 618 Promotion

This article walks through a real‑world R2M (Redis‑like) cache alert, detailing the email notification, large‑key analysis, code inspection, root‑cause identification, and both immediate and long‑term solutions that reduced cache usage by over 97% and prevented future incidents.

Cache Optimization · Performance Tuning · Redis
0 likes · 12 min read

Ctrip Technology
Oct 19, 2023 · Artificial Intelligence

Anomaly Detection and Root Cause Analysis System for Ctrip Train Ticket Business Metrics

This article presents an AI‑driven system that automatically detects anomalies in over 1,000 Ctrip train‑ticket business metrics using six unsupervised algorithms and locates their root causes through a hard‑voting ensemble of four specialized methods, demonstrating practical results and future enhancements.

Anomaly Detection · Ctrip · root cause analysis
0 likes · 18 min read

360 Tech Engineering
Oct 8, 2023 · Fundamentals

Data Anomaly Analysis: Methods, Process, and Case Studies

This article systematically outlines the thinking, step‑by‑step process, and practical methods for identifying and diagnosing data anomalies, and illustrates the approach with three detailed case studies covering video playback spikes, app retention drops, and community conversion declines.

Anomaly Detection · Business Intelligence · Case Study
0 likes · 16 min read

Qunar Tech Salon
Sep 28, 2023 · Operations

Automated Root Cause Analysis for Flight Ticket Transaction Interception at Qunar: Design, Algorithm, and Performance Optimizations

This article describes how Qunar implemented an automated root‑cause analysis system for flight‑ticket transaction interception, detailing the problem background, system research, a custom algorithm focusing on explanatory power, performance optimizations that reduced analysis time from five minutes to under ten seconds, and the resulting operational improvements.

Automation · Performance Optimization · algorithm
0 likes · 13 min read

Ximalaya Technology Team
Sep 13, 2023 · Operations

Cache Instance Failure Incident Analysis and Root Cause Investigation

During a night‑time outage, an XCache (Codis + Pika) instance hung when a massive write load triggered low‑level protection, causing Sentinel to switch masters. The proxy’s accept queue then filled with timed‑out sockets and blocked new connections. Scaling the proxy layer and expanding capacity restored service, and the incident prompted follow‑up work on automation, health checks, and queue‑overflow alerts.

Cache · Incident · Proxy
0 likes · 7 min read

Qunar Tech Salon
Jul 12, 2023 · Operations

Design and Implementation of Qunar's Root Cause Analysis System for Microservice Fault Diagnosis

This article describes Qunar's comprehensive root cause analysis platform, covering its background, data‑driven fault categorization, and architecture (trace, runtime, middleware, and event analysis modules), and reports its high accuracy and practical impact on reducing incident resolution times across microservices.

DevOps · Microservices · Monitoring
0 likes · 20 min read

Didi Tech
Jul 4, 2023 · Cloud Native

eBPF Technology and Its Application in Didi's Cloud-Native Observability: HuaTuo Platform Practice

eBPF is a safe, high‑performance Linux kernel extension mechanism that evolved from the 1993 Berkeley Packet Filter into a modern dynamic‑tracing technology. It underpins Didi’s HuaTuo platform, which consolidates bytecode management, fast data processing, stability self‑healing, and container insight to address traffic replay, topology, security, and root‑cause analysis challenges across cloud‑native services. Didi plans to broaden business use and community collaboration.

Container Security · HuaTuo · Kernel Tracing
0 likes · 12 min read

DataFunSummit
Jun 2, 2023 · Artificial Intelligence

Knowledge Graph–Based Root Cause Analysis for Intelligent Manufacturing

This article explains how knowledge‑graph technology combined with artificial‑intelligence methods can enhance intelligent manufacturing by improving quality and reliability through advanced root‑cause analysis, detailing development trends, analytical techniques, challenges, practical frameworks, and real‑world case studies.

Artificial Intelligence · big data · intelligent manufacturing
0 likes · 17 min read