Tagged articles
83 articles
Woodpecker Software Testing
May 12, 2026 · Operations

How AI Cut CI/CD Build Time from 12 Minutes to 98 Seconds in a FinTech Team

A FinTech team's CI pipeline saw build time jump to 12 minutes 37 seconds and test failures rise to 18%. After the team deployed a lightweight AI analysis engine, hidden resource contention caused by a JUnit parameterized test was identified, prioritized fixes were generated, and overall build duration fell to under two minutes.

AI · DevOps · Performance Optimization
0 likes · 9 min read
Woodpecker Software Testing
Apr 15, 2026 · Artificial Intelligence

How AI Testing Tools Redefine Performance Optimization: A New Paradigm

Amid exploding large‑model deployments, AI teams struggle with slow test feedback, but AI‑native testing tools—through intelligent load modeling, inference‑layer root‑cause analysis, and self‑healing loops—demonstrate concrete latency reductions, resource savings, and faster issue remediation.

AI testing · MLOps · Observability
0 likes · 6 min read
DevOps Coach
Mar 31, 2026 · Operations

How AI‑Driven Observability Can Cut MTTR: A 12‑Step Investigation Framework

This article explains how modern SRE teams can combine AI‑assisted observability with structured critical thinking to build a 12‑step investigation model that accelerates fault detection, hypothesis generation, telemetry validation, root‑cause analysis, and automated remediation, ultimately reducing MTTR and improving reliability.

AI · Observability · Operations
0 likes · 9 min read
DevOps Coach
Mar 26, 2026 · Operations

Can an AI Agent Replace Your SRE Night‑Shift? Inside Google’s Remote MCP‑Powered Autonomous SRE Agent

The article examines the chronic pain points of on‑call SRE teams—alert fatigue, long MTTR, inconsistent RCA, and communication bottlenecks—and presents a detailed, four‑layer architecture that uses Google’s Remote MCP server and an AI‑driven autonomous SRE agent to automate log retrieval, knowledge lookup, root‑cause analysis, and stakeholder notifications, dramatically improving reliability and efficiency.

Google Cloud · MCP · Operations
0 likes · 21 min read
Woodpecker Software Testing
Mar 5, 2026 · Artificial Intelligence

How AI Is Transforming Regression Testing: Current Practices and Future Outlook

The article examines how AI-driven techniques are reshaping regression testing—from intelligent test case selection and self‑healing UI scripts to root‑cause analysis and risk prediction—illustrating real‑world results from fintech, automotive, and government projects and outlining the next three years of evolution.

AI · Root Cause Analysis · Self-Healing UI
0 likes · 7 min read
Raymond Ops
Jan 28, 2026 · Artificial Intelligence

From Alert Storms to Smart Ops: Unlocking AIOps for Modern IT Operations

This guide walks through the evolution from noisy alert storms to intelligent AIOps, covering AIOps fundamentals, why it matters now, core capabilities like anomaly detection, root‑cause analysis, capacity forecasting and self‑healing, a practical implementation roadmap, toolchain suggestions, common pitfalls, and future trends.

Capacity Prediction · Root Cause Analysis · aiops
0 likes · 22 min read
Huya Tech Engineering
Nov 28, 2025 · Operations

How LLMs Accelerate Root‑Cause Diagnosis in Large‑Scale Microservices

By abstracting a massive microservice system as a dynamic multi‑layer graph and integrating large language models, the article outlines three evolution stages—from manual expert debugging to rule‑based AIOps and finally LLM‑driven cognitive reasoning—detailing practical workflows, context engineering, and real‑world case studies that dramatically improve MTTR and accuracy.

Context Engineering · LLM · Microservices
0 likes · 20 min read
Instant Consumer Technology Team
Nov 3, 2025 · Artificial Intelligence

Large Language Models Power Big Data SRE Knowledge & Root‑Cause Automation

Facing the growing complexity of big‑data platforms, the SRE team adopted large‑language‑model agents to automate knowledge management and root‑cause analysis, employing Retrieval‑Augmented Generation, a vector store, and the Model Context Protocol to enable intelligent, scalable, and efficient incident diagnosis and resolution.

AI · MCP · RAG
0 likes · 12 min read
Wukong Talks Architecture
Sep 22, 2025 · Databases

How AI‑Powered AIOps Transforms TiDB Database Operations

This article explores how integrating AI‑driven AIOps with the TiDB distributed database can automate monitoring, enable proactive anomaly detection, streamline root‑cause analysis, and optimize capacity planning, ultimately shifting database operations from manual firefighting to intelligent, data‑driven management.

Database operations · Root Cause Analysis · TiDB
0 likes · 12 min read
Ops Community
Sep 16, 2025 · Operations

Mastering SRE: Fast Incident Response and Prevention Strategies

This guide walks SRE engineers through a complete incident lifecycle—preventive multi‑layer monitoring, chaos‑testing drills, rapid 10‑minute response tactics, systematic root‑cause analysis, effective communication roles, post‑mortem reviews, and practical case studies—helping teams minimize downtime and business loss.

Root Cause Analysis · SRE · incident management
0 likes · 11 min read
MaGe Linux Operations
Sep 12, 2025 · Operations

From Alert Storms to Intelligent Ops: A Practical AIOps Journey

This article explores how AIOps transforms traditional IT operations by using AI for anomaly detection, root‑cause analysis, capacity forecasting, and self‑healing, offering a step‑by‑step roadmap, real‑world code examples, toolchain recommendations, common pitfalls, and future trends for building intelligent, automated operations.

Root Cause Analysis · aiops · anomaly detection
0 likes · 24 min read
Data Party THU
Jul 31, 2025 · Industry Insights

How a 30‑Minute Steel Melt Can Unlock a 10% Production Boost – Insights from Industrial Data Analysis

The article explores real‑world industrial cases—from steel furnace timing and historic lithography to modern manufacturing—showing how continuous improvement, root‑cause analysis, and careful handling of correlation versus causation can reveal hidden inefficiencies, while highlighting the limits of traditional statistics and the emerging role of AI in industrial data analytics.

AI · Big Data · Continuous Improvement
0 likes · 14 min read
Ops Development Stories
Jul 1, 2025 · Artificial Intelligence

From Lean to AIOps: How AI is Transforming Modern Operations

This comprehensive guide walks through the evolution from Lean and Agile practices to DevOps and finally AIOps, explaining core concepts, key algorithms, the role of large language models, RAG‑based root‑cause analysis, and practical implementation steps for intelligent operations.

Lean · RAG · Root Cause Analysis
0 likes · 19 min read
Efficient Ops
Apr 22, 2025 · Operations

How AI Agents Are Transforming IT Operations and Fault Management

This article explores how AI agents powered by large models can predict failures, perform root‑cause analysis, enhance knowledge‑based Q&A, automate change releases, and enable intelligent decision‑making, dramatically improving efficiency and reliability in modern IT operations.

AI Ops · Root Cause Analysis · fault prediction
0 likes · 7 min read
Alibaba Cloud Developer
Jan 2, 2025 · Operations

Mastering Error and Latency Diagnosis for Online Applications

This article presents a systematic root‑cause diagnosis framework for online applications, covering how to identify and resolve both error ("wrong") and performance ("slow") problems using trace links, associated data, high‑quality observability, and large‑language‑model‑driven intelligence.

Performance Monitoring · Root Cause Analysis · Trace Analysis
0 likes · 12 min read
Xiaohongshu Tech REDtech
Oct 9, 2024 · Operations

AIOps Implementation at Xiaohongshu: Fault Localization and Intelligent Operations

Xiaohongshu’s AIOps initiative builds a four‑layer framework that leverages machine‑learning‑driven anomaly detection, causal analysis, and trace‑based fault localization to automatically identify root‑cause services in micro‑service environments, achieving over 80% accuracy across 1,000 daily diagnoses while guiding future enhancements in change correlation and automated remediation.

DevOps · Fault Localization · Intelligent Operations
0 likes · 28 min read
Architect
Sep 27, 2024 · Artificial Intelligence

How AI Detects and Diagnoses Anomalies in Ctrip Train Ticket Metrics

This article presents a comprehensive AI‑driven system for automatically detecting anomalies in over 1,000 Ctrip train‑ticket business metrics and pinpointing their root causes, detailing the background, unsupervised algorithms, detection and attribution pipelines, practical results, and future improvements.

AI anomaly detection · Ctrip · Root Cause Analysis
0 likes · 21 min read
Huolala Tech
Sep 19, 2024 · Operations

How to Build a Team‑Wide Incident Response Platform for Seamless Online Ops

This article details XiaoBai's journey from struggling with ad‑hoc incident handling to designing a comprehensive platform that captures anomaly data, diagnoses root causes, and enables every team member to respond quickly and consistently, ultimately achieving an "everyone can respond" operation model.

Backend · Root Cause Analysis · incident response
0 likes · 14 min read
Tech Architecture Stories
Sep 14, 2024 · Operations

Why Most Incident Postmortems Miss the Mark and How to Fix Them

This article reveals three common pitfalls in daily incident postmortems—overlooking minor failures, confusing root causes with triggers, and weak improvement actions—and offers practical steps like the 5 Whys method and essential corrective measures to truly reduce online outages.

Continuous Improvement · Root Cause Analysis · SRE
0 likes · 5 min read
Continuous Delivery 2.0
Jul 1, 2024 · Artificial Intelligence

How Meta Uses Llama2 to Accelerate Incident Response and Root‑Cause Analysis in AIOps

This article explains how Meta applies AI, specifically a fine‑tuned Llama2 model, to improve AIOps by automating incident monitoring, providing real‑time summaries, assisting responders with contextual information, and efficiently narrowing down root‑cause changes, ultimately reducing incident resolution time from hours to minutes.

AI · Llama2 · Meta
0 likes · 13 min read
Efficient Ops
Jun 20, 2024 · Operations

How Intelligent Ops Platforms Transform Distributed Banking Systems

This article explains how Chinese commercial banks are adopting intelligent operation platforms to collect, analyze, and visualize distributed system data in real time, enabling rapid root‑cause detection, full‑link tracing, and automated solution recommendations for complex financial services.

Banking · Distributed Systems · Root Cause Analysis
0 likes · 8 min read
dbaplus Community
Jan 29, 2024 · Artificial Intelligence

How Meituan Uses AIOps to Revolutionize Incident Management

This article details Meituan's two‑year exploration of AIOps for incident management, covering the challenges of massive, real‑time operational data, the AI‑driven modules for risk prevention, fault detection, diagnosis, and similar‑incident recommendation, and future directions such as intelligent log detection and change recognition.

Operations · Root Cause Analysis · aiops
0 likes · 22 min read
High Availability Architecture
Jan 9, 2024 · Operations

AIOps Practices for Incident Management at Meituan: From Risk Prevention to Post‑Operation

This article presents Meituan's two‑year exploration of AIOps in incident management, detailing risk‑prevention change detection, real‑time anomaly discovery, automated root‑cause diagnosis, multi‑dimensional KPI analysis, and similar‑event recommendation, while sharing architectural designs, algorithmic techniques, performance results, and future directions.

NLP · Operations · Root Cause Analysis
0 likes · 24 min read
Meituan Technology Team
Dec 21, 2023 · Operations

AIOps for Incident Management: Practices and Insights from Meituan

Meituan’s service‑operations team applies AIOps across prevention, detection, and post‑incident stages—using change‑risk analysis, real‑time graph‑based anomaly detection, similarity‑driven root‑cause diagnosis, and NLP‑powered incident recommendation—to achieve sub‑second detection, high precision, 28% faster fault handling, and plans for intelligent log and change recognition.

Operations · Root Cause Analysis · aiops
0 likes · 24 min read
Bilibili Tech
Dec 15, 2023 · Operations

Bilibili Alert Monitoring System: Design, Optimization, and Root‑Cause Analysis

Bilibili revamped its alert monitoring platform to meet rapid growth, focusing on effectiveness, timeliness, and coverage; it introduced a closed‑loop design and governance that cut weekly alerts by 90%, built a knowledge‑graph root‑cause system achieving 87.9% accuracy with sub‑minute latency, and integrated AIOps for ongoing refinement.

Alert Monitoring · Bilibili · Root Cause Analysis
0 likes · 21 min read
Efficient Ops
Nov 15, 2023 · Operations

How a Unified Metadata Platform Boosts SRE Efficiency and Cuts Costs

This article describes how Huya built a unified metadata platform to break data silos across its numerous operations systems, enabling standardized data ingestion, association, visualization and analysis that improve resource governance, root‑cause diagnosis, and overall cost‑control for SRE teams.

Root Cause Analysis · SRE · graph database
0 likes · 13 min read
Selected Java Interview Questions
Oct 28, 2023 · Backend Development

Analyzing and Resolving an R2M Cache Usage Alert Before the 618 Promotion

This article walks through a real‑world R2M (Redis‑like) cache alert, detailing the email notification, large‑key analysis, code inspection, root‑cause identification, and both immediate and long‑term solutions that reduced cache usage by over 97% and prevented future incidents.

Backend Development · Root Cause Analysis · cache optimization
0 likes · 12 min read

How Transparent AI Boosts Trust in AIOps: Explainable Root‑Cause Solutions

This article examines the rapid growth of the Chinese IT operations market, explains why AIOps faces trust challenges due to opaque deep‑learning models, and presents AsiaInfo's transparent‑model and post‑hoc explanation engine together with three concrete explainable root‑cause analysis methods, concluding with future outlooks for trustworthy AIOps.

AI trust · Operations · Root Cause Analysis
0 likes · 13 min read
Ctrip Technology
Oct 19, 2023 · Artificial Intelligence

Anomaly Detection and Root Cause Analysis System for Ctrip Train Ticket Business Metrics

This article presents an AI‑driven system that automatically detects anomalies in over 1,000 Ctrip train‑ticket business metrics using six unsupervised algorithms and locates their root causes through a hard‑voting ensemble of four specialized methods, demonstrating practical results and future enhancements.

Ctrip · Root Cause Analysis · Time Series
0 likes · 18 min read
360 Tech Engineering
Oct 8, 2023 · Fundamentals

Data Anomaly Analysis: Methods, Process, and Case Studies

This article systematically outlines the thinking, step‑by‑step process, and practical methods for identifying and diagnosing data anomalies, and illustrates the approach with three detailed case studies covering video playback spikes, app retention drops, and community conversion declines.

Business Intelligence · Root Cause Analysis · anomaly detection
0 likes · 16 min read
Ximalaya Technology Team
Sep 13, 2023 · Operations

Cache Instance Failure Incident Analysis and Root Cause Investigation

During a night‑time outage, an XCache (Codis + Pika) instance hung when a massive write load triggered low‑level protection, causing Sentinel to switch masters. The proxy’s accept queue then filled with timed‑out sockets and blocked new connections. Scaling out the proxy layer and expanding capacity restored service, and the incident prompted follow‑up work on automation, health checks, and queue‑overflow alerts.

Cache · Incident · Operations
0 likes · 7 min read
Qunar Tech Salon
Jul 12, 2023 · Operations

Design and Implementation of Qunar's Root Cause Analysis System for Microservice Fault Diagnosis

This article describes Qunar's comprehensive root cause analysis platform, detailing its background, data-driven fault categorization, and architecture (including trace, runtime, middleware, and event analysis modules), and demonstrates its high accuracy and practical impact on reducing incident resolution times across microservices.

DevOps · Microservices · Observability
0 likes · 20 min read
Didi Tech
Jul 4, 2023 · Cloud Native

eBPF Technology and Its Application in Didi's Cloud-Native Observability: HuaTuo Platform Practice

eBPF, a safe, high‑performance Linux kernel extension evolving from the 1993 Berkeley Packet Filter to modern dynamic tracing, underpins Didi’s HuaTuo platform, which consolidates bytecode management, fast data processing, stability self‑healing, and container insight to solve traffic replay, topology, security, and root‑cause analysis challenges across cloud‑native services, with plans to broaden business use and community collaboration.

Container Security · Huatuo · Observability
0 likes · 12 min read
DataFunSummit
Jun 2, 2023 · Artificial Intelligence

Knowledge Graph–Based Root Cause Analysis for Intelligent Manufacturing

This article explains how knowledge‑graph technology combined with artificial‑intelligence methods can enhance intelligent manufacturing by improving quality and reliability through advanced root‑cause analysis, detailing development trends, analytical techniques, challenges, practical frameworks, and real‑world case studies.

Big Data · Knowledge Graph · Root Cause Analysis
0 likes · 17 min read
Network Intelligence Research Center (NIRC)
May 22, 2023 · Artificial Intelligence

How Microsoft Leverages LLMs to Auto‑Generate Cloud Incident Root Causes and Fixes

Microsoft researchers fine‑tuned GPT‑3.x models with LoRA on over 40,000 cloud incident records, evaluated them with six NLP metrics and human interviews, and found that LLMs can generate root‑cause analyses and mitigation steps comparable to BERT models, especially for machine‑detected failures.

AI for operations · GPT-3 · LLM
0 likes · 8 min read
ITPUB
Apr 23, 2023 · Cloud Native

How Kindling Leverages eBPF to Reach 1‑5‑10 Observability Targets

This article examines the difficulty of achieving the 1‑5‑10 observability goal, reviews current tracing, logging, and metrics tools, introduces the open‑source Kindling project’s eBPF‑based trace‑profiling approach, and walks through several real‑world use cases that demonstrate faster root‑cause analysis in cloud‑native environments.

Kindling · Observability · Root Cause Analysis
0 likes · 16 min read
DevOps
Jan 18, 2023 · Operations

Qualitative Analysis as a Metric for Software Quality Measurement

The article explains how qualitative analysis serves as a measurable metric throughout the software lifecycle, outlines five key qualitative methods—interviews, root‑cause analysis, maturity assessment, reviews, and post‑mortems—and demonstrates their practical application for continuous quality improvement.

Maturity Assessment · Operations · Root Cause Analysis
0 likes · 8 min read
Efficient Ops
Jan 16, 2023 · Operations

How China Mobile’s Centralized AIOps Platform Achieved Top‑Tier Evaluation

This article presents an interview with China Mobile Information about its centralized AIOps platform, covering the recent excellent‑level assessment by the China Academy of Information and Communications Technology, the system's key modules, future plans, and the broader significance of AI‑driven IT operations.

Automation · IT Operations · Root Cause Analysis
0 likes · 11 min read
Data Thinking Notes
Jan 10, 2023 · Big Data

How Bilibili Built a Scalable Data Quality Platform for Billions of Events

This article describes Bilibili’s data quality platform, outlining its background, objectives, theoretical models, workflow stages (recording, checking, alerting), DSL for metrics, root‑cause analysis, scheduling strategies, heterogeneous source integration, rule coverage, intelligent monitoring, and future plans to achieve automated, real‑time, high‑reliability data assurance for massive daily workloads.

Automation · Big Data · Data Quality
0 likes · 21 min read
vivo Internet Technology
Jan 4, 2023 · Artificial Intelligence

Root Cause Localization Algorithm and Its Implementation for Service Fault Diagnosis

The article describes a root‑cause localization algorithm implemented in vivo’s monitoring platform that automatically analyzes latency spikes by splitting service timelines, computing variance, clustering results with K‑means, and recursively tracing downstream services, achieving over 85% accuracy for dependency failures while still requiring human verification and outlining future AI‑driven enhancements.

Fault Localization · K-Means · Root Cause Analysis
0 likes · 13 min read
Efficient Ops
Dec 30, 2022 · Operations

How China Agricultural Bank Earned Top AIOps Rating – Inside the Evaluation

An interview with senior leaders of China Agricultural Bank reveals how their AIOps‑driven operations platform achieved an Excellent rating in the CAICT root‑cause analysis module, showcasing the bank’s intelligent operations strategy, implementation details, and future plans for expanding AI‑based monitoring across cloud and micro‑service environments.

AI · Digital Transformation · IT Operations
0 likes · 9 min read
HelloTech
Nov 22, 2022 · Operations

Guidelines for Incident Postmortem and Fault Review

The incident postmortem guideline advocates a dialectical view of failures, rapid low‑severity recovery, and a structured process—covering background, impact scope, timeline replay, deep root‑cause analysis, SMART improvement actions, responsibility assignment, and PDCA‑validated closure—to enhance system resilience, team anti‑fragility, and knowledge sharing.

MTBF · MTTR · Operations
0 likes · 15 min read
ITPUB
Nov 5, 2022 · Big Data

How Bilibili Builds a Scalable, Automated, and Intelligent Data Quality Platform

This article explains how Bilibili’s data quality team designs a process‑driven, automated, and AI‑enhanced platform that monitors billions of records daily, defines quality metrics such as completeness and consistency, integrates heterogeneous data sources, and provides root‑cause analysis and real‑time alerting to ensure trustworthy data for its massive user base.

Data Quality · Root Cause Analysis · Scheduling
0 likes · 19 min read
Bilibili Tech
Nov 1, 2022 · Big Data

Design and Implementation of a Data Quality Platform for Large-Scale Data Processing

Bilibili built a scalable data‑quality platform that records metrics from heterogeneous sources, checks them with a rich DSL, alerts once with root‑cause analysis, and uses event‑driven and time‑window scheduling, automated workflows, and intelligent monitoring to ensure real‑time, accurate, trustworthy data for petabyte‑scale processing.

Automation · Data Quality · Root Cause Analysis
0 likes · 20 min read
DataFunSummit
Aug 30, 2022 · Operations

CloudRCA: A Root Cause Analysis Framework for Cloud Computing Platforms

This article presents the design, implementation, and evaluation of CloudRCA, an intelligent root cause analysis framework for Alibaba Cloud's big‑data computing services, detailing challenges such as heterogeneous data, sample imbalance, and real‑time constraints, and describing the multi‑stage data processing, hierarchical Bayesian modeling, and deployment results that reduce MTTR by 20%.

Big Data · Operations · Root Cause Analysis
0 likes · 16 min read
High Availability Architecture
Jul 12, 2022 · Operations

Postmortem of the July 13, 2021 Bilibili SLB Outage: Timeline, Root Cause, and Improvement Measures

This article details the July 13, 2021 Bilibili service outage caused by a Lua‑based SLB CPU spike, describing the incident timeline, root‑cause analysis of a weight‑zero bug, mitigation steps including new SLB deployment, and the subsequent operational and architectural improvements.

Load Balancer · Lua · Root Cause Analysis
0 likes · 17 min read
Bilibili Tech
Jul 12, 2022 · Operations

Bilibili SLB Outage Postmortem (July 13, 2021): Timeline, Root Cause, and Improvements

On July 13, 2021, Bilibili’s L7 SLB crashed when a recent Lua deployment set a balancer weight to the string “0”, producing a NaN value that triggered an infinite loop and 100% CPU, prompting emergency restarts, a fresh cluster rollout, and long‑term safeguards such as automated provisioning, stricter Lua validation, and enhanced multi‑active disaster‑recovery processes.

Load Balancer · Root Cause Analysis · SLB
0 likes · 17 min read
Architecture Digest
Jul 12, 2022 · Big Data

Intelligent Gray Release Data System for Vivo Game Center: Methodology and Solutions

This article presents Vivo Game Center's end‑to‑end intelligent gray‑release data system, detailing its experimental mindset, statistical methods, data models, and product solutions that ensure scientific version evaluation, project progress, and rapid issue closure through root‑cause analysis and full‑process automation.

A/B testing · Root Cause Analysis · data analysis
0 likes · 18 min read
ITPUB
Jul 2, 2022 · Fundamentals

How Vivo Built an Intelligent Gray‑Release Data System for Faster, Scientific Game Updates

This article details Vivo Game Center's end‑to‑end intelligent gray‑release data framework—covering experiment design, statistical methods, data models, and automated product solutions—to ensure scientific version evaluation, accelerate project timelines, and quickly close the gray‑testing loop.

A/B testing · Data Analytics · Root Cause Analysis
0 likes · 16 min read
Meituan Technology Team
May 5, 2022 · Databases

Database Autonomy Service (DAS): Architecture, Design, and Implementation

The Database Autonomy Service (DAS) is a platform that uses big‑data, machine‑learning, and expert knowledge to automatically collect, compress, and analyze MySQL metrics, providing self‑service fault detection, root‑cause diagnosis, and security management, thereby reducing manual effort, shortening MTTR, and supporting Meituan’s rapid database growth.

AI-driven ops · Database Autonomy · Performance Monitoring
0 likes · 20 min read
DataFunSummit
Dec 9, 2021 · Big Data

Diagnostic Analytics in Meituan Food Delivery: Methods and Case Studies

This talk by Meituan data analyst Wang Qing explains why diagnostic analytics is essential, outlines its methodology using logical trees and hypothesis-driven approaches, and presents two case studies—weather index modeling and an intelligent anomaly detection system—to illustrate how data-driven diagnosis can pinpoint root causes and improve decision‑making in online food delivery.

Data Science · Root Cause Analysis · anomaly detection
0 likes · 19 min read
ITPUB
May 17, 2021 · Operations

How DBSCAN Clustering and Bayesian Inference Boost Root‑Cause Detection in Securities Trading Systems

This article describes how a Chinese securities firm applied big‑data‑driven clustering and Bayesian methods to automate root‑cause analysis of trading‑system anomalies, detailing the challenges, algorithmic designs, practical implementations, and evaluation results that demonstrate significant reductions in false alarms and faster recovery.

Bayesian inference · Operations · Root Cause Analysis
0 likes · 17 min read
dbaplus Community
May 16, 2021 · Operations

How DBSCAN Clustering and Bayesian Inference Enable Fast Root‑Cause Detection in Securities Trading Systems

This article details the challenges of root‑cause identification in high‑availability securities trading platforms and presents two intelligent‑operations solutions—DBSCAN‑based clustering and Bayesian inference—to quickly locate anomalies and improve recovery efficiency.

Bayesian inference · DBSCAN · Intelligent Operations
0 likes · 17 min read
Suning Technology
Aug 29, 2020 · Artificial Intelligence

How AI Powers Large‑Scale Time Series Forecasting and Root‑Cause Analysis

This article describes Suning's AI‑driven end‑to‑end solution for massive time‑series monitoring and anomaly detection, its forecasting with DeepAR, MQ‑RNN, MQ‑CNN, and ensemble methods, its root‑cause localization using Hotspot and Monte‑Carlo Tree Search, and the evolution of its large‑scale log analytics platform.

Deep Learning · Knowledge Graph · Log Analytics
0 likes · 17 min read
Full-Stack Internet Architecture
Jun 20, 2020 · Fundamentals

Applying the 5Whys Method to Diagnose and Resolve Workplace Issues

The article illustrates how the 5Whys root‑cause analysis can turn everyday workplace conflicts—like missed deadlines and poor communication—into actionable improvement plans by repeatedly asking why, encouraging systematic task scheduling, and fostering a habit of deeper inquiry for lasting productivity gains.

5Whys · Mobile Development · Root Cause Analysis
0 likes · 9 min read
dbaplus Community
Apr 6, 2020 · Databases

How AI‑Driven Intelligent Ops Transform Database Management in Banking

This article examines the severe time‑critical pain points of bank database operations, explains why AI‑based intelligent ops are needed, describes the platform architecture, unsupervised algorithms (3σ, Isolation Forest, DBSCAN, Pearson, Apriori), and presents a real‑world case study that demonstrates anomaly detection, root‑cause analysis, and practical optimization recommendations.

Database operations · Python · Root Cause Analysis
0 likes · 23 min read
Continuous Delivery 2.0
Mar 6, 2020 · Operations

Google Incident Postmortem Checklist

The article presents a detailed Google‑derived post‑mortem checklist covering event data collection, root‑cause analysis, lessons learned, actionable improvement items, and review procedures to ensure systematic, blameless incident handling.

Operations · Root Cause Analysis · action items
0 likes · 5 min read
Efficient Ops
Feb 18, 2020 · Operations

How Intelligent Ops Transforms Monitoring: Multi‑Dimensional Anomaly Detection & Smart Alert Merging

This article presents the 2019 GOPS Global Operations Conference talk by Gong Cheng, detailing how intelligent monitoring leverages multi‑dimensional anomaly detection, machine‑learning‑based alert merging, knowledge‑graph construction, and root‑cause analysis to automate and improve large‑scale IT operations.

Knowledge Graph · Root Cause Analysis · alert merging
0 likes · 22 min read
dbaplus Community
Feb 3, 2020 · Operations

Boosting Securities Ops with AI: A Practical Intelligent Operations Platform

This article presents a comprehensive study of applying AI, big‑data analytics, and automated pipelines to improve operational efficiency in the securities industry, detailing a custom intelligent ops platform, its layered architecture, and three real‑world scenarios—root‑cause analysis, knowledge‑base assistance, and capacity forecasting—along with experimental results and practical insights.

AI · Capacity Prediction · Intelligent Operations
0 likes · 27 min read
58 Tech
Nov 4, 2019 · Operations

Intelligent Operations Practices: Multi‑Dimensional Anomaly Detection, Alarm Merging, Knowledge‑Graph Construction, and Root‑Cause Analysis

This article summarizes the keynote on intelligent operations presented at the 13th GOPS Global Operations Conference, covering multi‑dimensional anomaly detection, smart alarm aggregation, the construction of an operations knowledge graph, and AI‑driven root‑cause analysis techniques for large‑scale server environments.

Knowledge Graph · Operations · Root Cause Analysis
0 likes · 9 min read
Efficient Ops
Dec 11, 2018 · Operations

How Alibaba’s AI‑Powered Monitoring Tackles Complex Business Anomalies

In this talk, Alibaba senior tech expert Wang Zhaogang explains how intelligent monitoring, powered by machine‑learning algorithms and multi‑metric analysis, addresses the challenges of diverse business scenarios, enhances anomaly detection, improves root‑cause analysis, and shapes the future of smart operations.

Operations · Root Cause Analysis · anomaly detection
0 likes · 23 min read
Efficient Ops
Aug 28, 2018 · Operations

How to Detect and Resolve Time‑Series Anomalies in Modern AIOps

This article explains practical approaches for time‑series anomaly detection, multi‑dimensional drill‑down analysis, alarm‑convergence root‑cause analysis, and future AIOps planning, combining statistical methods, unsupervised learning, and supervised models to improve monitoring accuracy and operational efficiency.

Operations · Root Cause Analysis · Unsupervised Learning
0 likes · 20 min read
Efficient Ops
Apr 18, 2018 · Operations

Huawei’s Triple‑Play Model: Advancing AIOps for Massive K8s and Serverless

At the 9th Global Operations Conference, Huawei Cloud’s chief architect Cai Xiaogang presented a three‑pronged AIOps strategy that combines large‑scale Kubernetes management, causal tracing in Serverless environments, multi‑source root‑cause analysis, and clustering‑based black‑box network packet inspection, showcasing how academia‑industry collaboration accelerates cloud‑native operations.

Kubernetes · Root Cause Analysis · Serverless
0 likes · 8 min read
Didi Tech
Apr 16, 2018 · Fundamentals

A Structured Approach to Problem Solving and Architectural Thinking

The article presents a structured framework for problem solving and architectural thinking, defining problems as goal‑state gaps, warning against common pitfalls, introducing a “what‑how‑why” learning loop, detailing root‑cause analysis for anomalous issues and goal‑driven stakeholder mapping for improvement tasks, and emphasizing emotional intelligence in human‑centric solutions.

Learning Loop · Management · Root Cause Analysis
0 likes · 14 min read
dbaplus Community
Jan 15, 2018 · Operations

How JD Finance Achieves Real-Time Capacity Assessment and Smart Alerting

This article explains JD Finance's operational challenges in a rapidly expanding micro‑service environment and presents a comprehensive approach that combines offline and online load testing, precise capacity calculations, and intelligent root‑cause alert analysis using both rule‑based and machine‑learning techniques.

Load Testing · Operations · Root Cause Analysis
0 likes · 15 min read
dbaplus Community
Jan 1, 2018 · Big Data

How Vipshop Leverages Data Processing, Analytics, and Mining for Smarter Ops

This article summarizes Wu Xiaoguang's talk at Gdevops 2017, detailing how Vipshop integrates data processing, analysis, and mining technologies—such as Flume, Kafka, Spark, and custom scheduling—to improve operational decision‑making, performance monitoring, root‑cause analysis, and predictive modeling across its e‑commerce platform.

Big Data · Data Analytics · Operations
0 likes · 23 min read
Efficient Ops
Jun 14, 2016 · Operations

Automate Fault Root‑Cause Detection in Massive IT Operations

This article explains how large‑scale internet companies can reduce alarm storms and speed up incident resolution by creating an operations ecosystem centered on automated fault root‑cause localization, detailing the challenges, architecture, decision‑tree algorithms, and a four‑step implementation guide.

Automation · IT infrastructure · Operations
0 likes · 11 min read