Tagged articles
15 articles
dbaplus Community
Jul 24, 2025 · Operations

How Bilibili Scales Server Fault Management with Automated Detection and Repair

This article details how Bilibili handles explosive growth in server count: it classifies faults, identifies the shortcomings of manual processes, and builds an automated end‑to‑end workflow for detection, rule‑based alerting, and repair that combines in‑band and out‑of‑band data collection to achieve near‑perfect coverage and accuracy.

Data center · fault detection · in‑band
17 min read
DataFunSummit
May 22, 2025 · Operations

Automated Fault Detection and Repair System for Grab's Data Pipelines (Hugo) – Architecture, Implementation, and Impact

This article presents Grab's Hugo platform, an automated fault‑detection and self‑healing system for over 4,000 data pipelines that combines multi‑source signal collection, intelligent diagnosis, layered auto‑repair, and a health API to dramatically improve data visibility, reduce manual intervention, and boost operational efficiency across the company.

Automation · Big Data · DataOps
12 min read
Qunar Tech Salon
Nov 22, 2023 · Operations

Optimizing Qunar's Monitoring System for Faster Fault Detection and Root‑Cause Analysis

This article details Qunar's overhaul of its monitoring platform: introducing second‑granularity metrics, redesigning storage around VictoriaMetrics, optimizing client and server data collection, and building a root‑cause analysis tool. Together these changes cut order‑related fault discovery time from minutes to under one minute.

Microservices · Operations · TSDB
22 min read
Didi Tech
Sep 5, 2023 · Operations

Observability and Stability Engineering in Didi Ride‑Hailing Platform

At Didi, observability and stability engineering combine automated, AI‑driven alarm generation, distributed tracing, and ChatOps‑based fault handling to manage micro‑service complexity, massive traffic spikes, and cross‑region operations. The article emphasizes systematic investment and the evolution toward AIOps, and closes with a recruitment call for backend and test engineers.

Didi · Distributed Systems · Observability
16 min read
JD Tech Talk
Jan 8, 2021 · Artificial Intelligence

AIOps: Background, Scenarios, Capability Building, and Practical Implementation by JD Digital Operations Team

This article explains the evolution of IT operations toward AIOps, outlines its key scenarios, describes the team roles and capability‑building roadmap, and details JD Digital Operations' practical implementations—including fault detection, localization, and automated repair—leveraging AI, big data, and knowledge‑graph technologies.

Automation · IT Operations · aiops
12 min read
High Availability Architecture
Oct 22, 2020 · Artificial Intelligence

AIOps at Meituan: Architecture, Design, and Practice of the Horae Time‑Series Anomaly Detection System

This article presents Meituan's AIOps exploration, focusing on the design and implementation of the Horae time‑series anomaly detection platform, covering background, technical roadmap, fault‑discovery workflow, time‑series classification, feature engineering, model training, real‑time detection, and future directions.

Horae · Meituan · aiops
31 min read
360 Quality & Efficiency
Feb 28, 2020 · Operations

External Network Quality Monitoring System at 360: Architecture, Features, and Alert Strategies

The article details 360's external network quality monitoring system, explaining its background, real‑time detection features, CDN‑based source host selection, three‑layer architecture, data collection and storage pipelines, fault‑diagnosis strategies, and visualization approaches for rapid network fault localization.

Alerting · CDN · Network Monitoring
10 min read
Tencent Cloud Developer
Mar 29, 2019 · Databases

Design of High‑Availability System and Fast Recovery in Tencent CynosDB

Tencent CynosDB achieves high availability and rapid recovery through an external HA service that combines a co‑located monitoring agent, a ZooKeeper‑backed scheduler for fault detection, decision making, and automated switch/rejoin/rebuild actions, and a VDL‑driven distributed storage recovery mechanism that prevents split‑brain scenarios.

Agent‑Scheduler · CynosDB · Database Architecture
19 min read
Efficient Ops
Feb 17, 2019 · Operations

How AI-Powered Intelligent Operations Transform Network Fault Detection

This talk explains how Guangdong Mobile uses AI‑driven intelligent operations, including a centroid‑based fault‑location algorithm, standardized event‑distance models, and clustering techniques such as DBSCAN and nearest‑neighbor, to automate network alarm correlation, improve fault resolution, and enable predictive maintenance across a massive 4G/VoLTE network.

AI · Intelligent Operations · fault detection
19 min read
Alibaba Cloud Infrastructure
Feb 12, 2018 · Operations

Intelligent Network Practices for Alibaba's Double 11: Automation, Fault Detection, and Traffic Optimization

Alibaba senior technical expert Houyi explains how intelligent network automation, rapid fault detection, automatic isolation, and traffic‑optimizing technologies were applied during Double 11 to dramatically improve stability, reduce costs, and enhance overall network performance across millions of devices.

Alibaba · Operations · fault detection
16 min read
Alibaba Cloud Developer
Dec 7, 2016 · Operations

How Alibaba Automates Its Network for Double 11 Traffic Surges

This article outlines Alibaba researcher Zhang Ming’s presentation on the network automation system that enables Alibaba’s infrastructure to handle the massive traffic and rapid fault recovery required during the Double 11 shopping festival, highlighting the challenges, detection methods, and automated tools used across routers, switches, and L4‑L7 devices.

Alibaba · Operations · fault detection
3 min read
Efficient Ops
Jun 14, 2016 · Operations

Automate Fault Root‑Cause Detection in Massive IT Operations

This article explains how large‑scale internet companies can reduce alarm storms and speed up incident resolution by creating an operations ecosystem centered on automated fault root‑cause localization, detailing the challenges, architecture, decision‑tree algorithms, and a four‑step implementation guide.

Automation · IT infrastructure · Operations
11 min read