Tag: fault detection

DataFunSummit
May 22, 2025 · Operations

Automated Fault Detection and Repair System for Grab's Data Pipelines (Hugo) – Architecture, Implementation, and Impact

This article presents Grab's Hugo platform, an automated fault‑detection and self‑healing system for over 4,000 data pipelines that combines multi‑source signal collection, intelligent diagnosis, layered auto‑repair, and a health API to dramatically improve data visibility, reduce manual intervention, and boost operational efficiency across the company.

Automation · Big Data · DataOps
12 min read
Qunar Tech Salon
Nov 22, 2023 · Operations

Optimizing Qunar's Monitoring System for Faster Fault Detection and Root‑Cause Analysis

This article details Qunar's comprehensive overhaul of its monitoring platform: introducing per‑second metrics, redesigning storage around VictoriaMetrics, optimizing client and server data collection, and building a root‑cause analysis tool. Together, these changes cut order‑related fault discovery time from minutes to under one minute.

Microservices · TSDB · cloud-native
22 min read
Didi Tech
Sep 5, 2023 · Operations

Observability and Stability Engineering in Didi Ride‑Hailing Platform

At Didi, observability and stability engineering combine automated, AI‑driven alarm generation, distributed tracing, and ChatOps‑based fault handling to manage micro‑service complexity, massive traffic spikes, and cross‑region operations. The article emphasizes systematic investment and the evolution toward AIOps, and closes with a recruitment call for backend and test engineers.

AIOps · Didi · Distributed Systems
16 min read
JD Tech Talk
Jan 8, 2021 · Artificial Intelligence

AIOps: Background, Scenarios, Capability Building, and Practical Implementation by JD Digital Operations Team

This article explains the evolution of IT operations toward AIOps, outlines its key scenarios, describes the team roles and capability‑building roadmap, and details JD Digital Operations' practical implementations—including fault detection, localization, and automated repair—leveraging AI, big data, and knowledge‑graph technologies.

AIOps · Artificial Intelligence · Automation
12 min read
High Availability Architecture
Oct 22, 2020 · Artificial Intelligence

AIOps at Meituan: Architecture, Design, and Practice of the Horae Time‑Series Anomaly Detection System

This article presents Meituan's AIOps exploration, focusing on the design and implementation of the Horae time‑series anomaly detection platform, covering background, technical roadmap, fault‑discovery workflow, time‑series classification, feature engineering, model training, real‑time detection, and future directions.

AIOps · Horae · Meituan
31 min read
360 Quality & Efficiency
Feb 28, 2020 · Operations

External Network Quality Monitoring System at 360: Architecture, Features, and Alert Strategies

The article details 360's external network quality monitoring system, explaining its background, real‑time detection features, CDN‑based source host selection, three‑layer architecture, data collection and storage pipelines, fault‑diagnosis strategies, and visualization approaches for rapid network fault localization.

CDN · Operations · alerting
10 min read
Tencent Cloud Developer
Mar 29, 2019 · Databases

Design of High‑Availability System and Fast Recovery in Tencent CynosDB

Tencent CynosDB achieves high availability and rapid recovery through an external HA service that combines a co‑located monitoring agent, a ZooKeeper‑backed scheduler for fault detection, decision making, and automated switch/rejoin/rebuild actions, and a VDL‑driven distributed storage recovery mechanism that prevents split‑brain scenarios.

Agent‑Scheduler · CynosDB · Database Architecture
19 min read
Efficient Ops
Feb 17, 2019 · Operations

How AI-Powered Intelligent Operations Transform Network Fault Detection

This talk explains how Guangdong Mobile uses AI‑driven intelligent operations, including a centroid‑based fault‑location algorithm, standardized event‑distance models, and clustering techniques such as DBSCAN and nearest‑neighbor, to automate network alarm correlation, improve fault resolution, and enable predictive maintenance across a massive 4G/VoLTE network.

AI · Automation · DevOps
19 min read
Alibaba Cloud Infrastructure
Oct 22, 2018 · Operations

Server Downtime Diagnosis System: Architecture, Implementation, and Results

The article explains why a downtime diagnosis system is needed, outlines its architecture and implementation methods—including log sources, feature extraction, and API integration—and presents early results showing high automation coverage and significant operational cost savings.

Automation · Diagnosis · Operations
7 min read
Efficient Ops
Oct 16, 2018 · Operations

How Tencent Built an AI‑Powered Network Fault Detection System in Minutes

In this talk, Tencent's infrastructure lead explains how the team built an AI‑driven, three‑minute fault detection and recovery pipeline. It combines high‑precision Meshping monitoring, multi‑KPI analytics, and automated Moveout isolation, shortening network outage resolution from hours to minutes.

AIOps · Automation · fault detection
18 min read
Alibaba Cloud Infrastructure
Feb 12, 2018 · Operations

Intelligent Network Practices for Alibaba's Double 11: Automation, Fault Detection, and Traffic Optimization

Alibaba senior technical expert Houyi explains how intelligent network automation, rapid fault detection, automatic isolation, and traffic‑optimizing technologies were applied during Double 11 to dramatically improve stability, reduce costs, and enhance overall network performance across millions of devices.

Alibaba · Double 11 · Network Automation
16 min read
360 Zhihui Cloud Developer
Feb 14, 2017 · Operations

How Ceph Detects Node Failures: Heartbeat, Reporting, and Monitor Strategies

This article explains Ceph's fault detection mechanism, detailing how OSD peers exchange heartbeats, report failures to the Monitor, and how the Monitor aggregates reports and applies configurable thresholds to reliably identify and handle downed OSD nodes in a distributed storage cluster.

Ceph · Distributed Systems · Monitor
8 min read
Efficient Ops
Jun 14, 2016 · Operations

Automate Fault Root‑Cause Detection in Massive IT Operations

This article explains how large‑scale internet companies can reduce alarm storms and speed up incident resolution by creating an operations ecosystem centered on automated fault root‑cause localization, detailing the challenges, architecture, decision‑tree algorithms, and a four‑step implementation guide.

Automation · IT infrastructure · Operations
11 min read