Tagged articles

fault detection

16 articles

Mar 20, 2026 · Cloud Native

When a Server Silently Crashes, How Long Can Your Cluster Survive? Inside the Heartbeat Failover Mechanism

The article explains how distributed systems detect silently dead nodes using push and pull heartbeat mechanisms, covers the trade‑offs between heartbeat interval and timeout, introduces advanced techniques such as Cassandra's Φ detector, gossip protocols, and quorum rules, and shows real‑world implementations in Kubernetes and etcd.

Cassandra · distributed systems · fault detection

0 likes · 12 min read

dbaplus Community

Jul 24, 2025 · Operations

How Bilibili Scales Server Fault Management with Automated Detection and Repair

This article details Bilibili's approach to handling explosive growth in server count by classifying faults, identifying shortcomings of manual processes, and implementing an automated, end‑to‑end workflow for detection, rule‑based alerting, and repair that combines in‑band and out‑of‑band data collection to achieve near‑perfect coverage and accuracy.

Data Center · Monitoring · fault detection

0 likes · 17 min read

DataFunSummit

May 22, 2025 · Operations

Automated Fault Detection and Repair System for Grab's Data Pipelines (Hugo) – Architecture, Implementation, and Impact

This article presents Grab's Hugo platform, an automated fault‑detection and self‑healing system for over 4,000 data pipelines that combines multi‑source signal collection, intelligent diagnosis, layered auto‑repair, and a health API to dramatically improve data visibility, reduce manual intervention, and boost operational efficiency across the company.

Automation · Big Data · DataOps

0 likes · 12 min read

Qunar Tech Salon

Nov 22, 2023 · Operations

Optimizing Qunar's Monitoring System for Faster Fault Detection and Root‑Cause Analysis

This article details Qunar's comprehensive overhaul of its monitoring platform: introducing second‑level metrics, redesigning storage with VictoriaMetrics, optimizing client and server data collection, and building a root‑cause analysis tool. Together, these changes reduce order‑related fault discovery time from minutes to under one minute.

Microservices · Monitoring · Operations

0 likes · 22 min read

Didi Tech

Sep 5, 2023 · Operations

Observability and Stability Engineering in Didi Ride‑Hailing Platform

At Didi, observability and stability engineering combine automated, AI‑driven alarm generation, distributed tracing, and ChatOps‑based fault handling to manage micro‑service complexity, massive traffic spikes, and cross‑region operations. The article emphasizes systematic investment and the evolution toward AIOps, and includes a recruitment call for backend and test engineers.

AIOps · Didi · Observability

0 likes · 16 min read

JD Tech Talk

Jan 8, 2021 · Artificial Intelligence

AIOps: Background, Scenarios, Capability Building, and Practical Implementation by JD Digital Operations Team

This article explains the evolution of IT operations toward AIOps, outlines its key scenarios, describes the team roles and capability‑building roadmap, and details JD Digital Operations' practical implementations of fault detection, localization, and automated repair, which leverage AI, big data, and knowledge‑graph technologies.

AIOps · Automation · IT Operations

0 likes · 12 min read

High Availability Architecture

Oct 22, 2020 · Artificial Intelligence

AIOps at Meituan: Architecture, Design, and Practice of the Horae Time‑Series Anomaly Detection System

This article presents Meituan's AIOps exploration, focusing on the design and implementation of the Horae time‑series anomaly detection platform, covering background, technical roadmap, fault‑discovery workflow, time‑series classification, feature engineering, model training, real‑time detection, and future directions.

AIOps · Horae · Meituan

0 likes · 31 min read

360 Quality & Efficiency

Feb 28, 2020 · Operations

External Network Quality Monitoring System at 360: Architecture, Features, and Alert Strategies

The article details 360's external network quality monitoring system, explaining its background, real‑time detection features, CDN‑based source host selection, three‑layer architecture, data collection and storage pipelines, fault‑diagnosis strategies, and visualization approaches for rapid network fault localization.

Alerting · CDN · fault detection

0 likes · 10 min read

Tencent Cloud Developer

Mar 29, 2019 · Databases

Design of High‑Availability System and Fast Recovery in Tencent CynosDB

Tencent CynosDB achieves high availability and rapid recovery through an external HA service that combines a co‑located monitoring agent, a ZooKeeper‑backed scheduler for fault detection, decision making, and automated switch/rejoin/rebuild actions, and a VDL‑driven distributed storage recovery mechanism that prevents split‑brain scenarios.

Agent‑Scheduler · CynosDB · Database Architecture

0 likes · 19 min read

Efficient Ops

Feb 17, 2019 · Operations

How AI-Powered Intelligent Operations Transform Network Fault Detection

This talk explains how Guangdong Mobile uses AI‑driven intelligent operations, including a centroid‑based fault‑location algorithm, standardized event‑distance models, and clustering techniques such as DBSCAN and nearest‑neighbor, to automate network alarm correlation, improve fault resolution, and enable predictive maintenance across a massive 4G/VoLTE network.

AI · Intelligent Operations · fault detection

0 likes · 19 min read

Alibaba Cloud Infrastructure

Oct 22, 2018 · Operations

Server Downtime Diagnosis System: Architecture, Implementation, and Results

The article explains why a downtime diagnosis system is needed, outlines its architecture and implementation methods—including log sources, feature extraction, and API integration—and presents early results showing high automation coverage and significant operational cost savings.

Automation · Operations · diagnosis

0 likes · 7 min read

Alibaba Cloud Infrastructure

Feb 12, 2018 · Operations

Intelligent Network Practices for Alibaba's Double 11: Automation, Fault Detection, and Traffic Optimization

Alibaba senior technical expert Houyi explains how intelligent network automation, rapid fault detection, automatic isolation, and traffic‑optimizing technologies were applied during Double 11 to dramatically improve stability, reduce costs, and enhance overall network performance across millions of devices.

Alibaba · Operations · fault detection

0 likes · 16 min read

Alibaba Cloud Developer

Jan 5, 2018 · Operations

How Alibaba Scaled Double 11 with AI‑Driven Network Automation

Alibaba senior technologist Houyi explains how the company used AI‑powered network intelligence, automated fault detection, traffic scheduling, and smart routing to dramatically improve stability, reduce costs, and boost efficiency during the massive Double 11 shopping event.

AI · Alibaba · Operations

0 likes · 16 min read

360 Zhihui Cloud Developer

Feb 14, 2017 · Operations

How Ceph Detects Node Failures: Heartbeat, Reporting, and Monitor Strategies

This article explains Ceph's fault detection mechanism, detailing how OSD peers exchange heartbeats, report failures to the Monitor, and how the Monitor aggregates reports and applies configurable thresholds to reliably identify and handle downed OSD nodes in a distributed storage cluster.

Ceph · Monitor · OSD

0 likes · 8 min read

Alibaba Cloud Developer

Dec 7, 2016 · Operations

How Alibaba Automates Its Network for Double 11 Traffic Surges

This article outlines Alibaba researcher Zhang Ming’s presentation on the network automation system that enables Alibaba’s infrastructure to handle massive traffic and recover rapidly from faults during the Double 11 shopping festival, highlighting the challenges, detection methods, and automated tools used across routers, switches, and L4‑L7 devices.

Alibaba · Operations · fault detection

0 likes · 3 min read

Efficient Ops

Jun 14, 2016 · Operations

Automate Fault Root‑Cause Detection in Massive IT Operations

This article explains how large‑scale internet companies can reduce alarm storms and speed up incident resolution by creating an operations ecosystem centered on automated fault root‑cause localization, detailing the challenges, architecture, decision‑tree algorithms, and a four‑step implementation guide.

Automation · Decision Tree · IT infrastructure

0 likes · 11 min read
