Tagged articles
9 articles
Architect
Dec 27, 2024 · Big Data

Fault Self‑Healing System for Large‑Scale Big Data Clusters

This article describes the design, architecture, and technical implementation of BMR's fault self‑healing platform, which automatically collects data, analyzes failures, defines decision rules, and executes safe recovery workflows to improve reliability and efficiency of massive, heterogeneous big‑data environments.

Big Data · Cluster Management · fault self-healing
0 likes · 16 min read
Bilibili Tech
Dec 10, 2024 · Big Data

Fault Self‑Healing System for Bilibili's Large‑Scale Big Data Cluster (BMR)

Bilibili's fault‑self‑healing platform for its massive BMR big‑data cluster—over 10,000 machines and 1 EB storage—adds near‑real‑time fault discovery, intelligent diagnosis, and automated workflow handling, dramatically cutting resolution time, improving stability across services, and scaling to dozens of daily automated repairs.

BMR · Cluster Management · fault self-healing
0 likes · 16 min read
Bilibili Tech
Jul 19, 2024 · Big Data

Bilibili's One-Stop Big Data Cluster Management Platform (BMR) - Architecture and Implementation

Bilibili's one‑stop Big Data Cluster Management Platform (BMR) consolidates HDFS, Spark, Flink, ClickHouse, Kafka, and other services into a unified system. It evolved through four stages (standardization, metadata‑driven construction, containerization, and observability) to address node consistency, scaling, fault self‑healing, and resource optimization. It now delivers elastic scaling and automated start/stop, with further cost‑saving and stability enhancements planned.

Cluster Management · Observability · Resource Optimization
0 likes · 12 min read
Efficient Ops
Mar 18, 2024 · Operations

How to Implement Fault Self‑Healing for Scalable Operations

This article explains why low‑disk alerts demand automation, outlines the concept of fault self‑healing versus manual response, and provides practical guidelines—including standards, monitoring dimensions, CMDB integration, script execution tools, and notification channels—to build a reliable self‑healing system for large‑scale environments.

CMDB · fault self-healing · monitoring
0 likes · 10 min read
Efficient Ops
May 30, 2023 · Operations

Mastering Fault Self-Healing: Automate Disk Alerts and Scale Operations

Discover how to transform nightly disk‑space alerts into automated self‑healing workflows, covering prerequisite standards, multi‑dimensional monitoring, CMDB integration, script‑based remediation, and multi‑channel notifications to scale operations across thousands of servers without manual intervention.

CMDB · DevOps · Operations Automation
0 likes · 10 min read
Efficient Ops
Jan 16, 2023 · Artificial Intelligence

How China Mobile’s AIOps Platform Achieved Top‑Tier Evaluation and What It Means for Intelligent Operations

This article explains the concept of AIOps, details China Mobile Information Technology's successful comprehensive‑level assessment of its centralized operations management platform's fault‑self‑healing module, shares insights from an interview with the project director, and introduces the national AIOps capability maturity model.

AI in IT · Capability Maturity Model · China Mobile
0 likes · 9 min read
ITPUB
Nov 27, 2022 · Operations

How to Build an Automated Fault‑Self‑Healing System for Large‑Scale Operations

This article explains why nightly disk‑space alerts demand automated fault‑self‑healing, outlines the necessary process standards, describes monitoring platform dimensions, details a multi‑source self‑healing platform with CMDB integration, and provides practical options for script execution and result notification.

CMDB · DevOps · Operations Automation
0 likes · 9 min read
dbaplus Community
Nov 7, 2022 · Operations

Automating Fault Self‑Healing: A Practical Guide for Operations Teams

This article explains why disk‑space alerts demand automated handling, introduces the concept of fault self‑healing, outlines required process standards, describes monitoring platform dimensions, details a multi‑source self‑healing platform architecture, and offers practical steps for integration, notification, and continuous improvement.

CMDB · Operations · fault self-healing
0 likes · 9 min read
Efficient Ops
Aug 31, 2022 · Operations

How to Build Scalable Fault Self‑Healing for Modern Operations

This article explains why traditional manual responses to alerts are insufficient, outlines the concept of fault self‑healing, and provides a step‑by‑step guide on establishing standards, monitoring dimensions, a unified CMDB, automation tools, and notification channels to achieve automated recovery at scale.

CMDB · fault self-healing
0 likes · 9 min read