
Fault Self‑Healing System for Large‑Scale Big Data Clusters

This article describes the design, architecture, and technical implementation of BMR's fault self‑healing platform, which automatically collects data, analyzes failures, defines decision rules, and executes safe recovery workflows to improve reliability and efficiency of massive, heterogeneous big‑data environments.


1. Background

BMR manages more than 10,000 machines, over 50 service components, more than 1 EB of storage, and over one million CPU cores, forming a massive, heterogeneous big‑data cluster. The sheer scale, complex service co‑location, and diverse hardware make fault detection and remediation challenging.

2. Architecture Design

The self‑healing system consists of four stages: data collection, decision definition, fault analysis, and fault handling. Near‑real‑time data is gathered from host metrics, logs, alarm subscriptions, system‑provided fault data, and business metadata, then stored in a unified meta‑warehouse.
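The four stages can be sketched as a simple pipeline. This is an illustrative outline only: the stage names come from the article, but the event shape, thresholds, and function signatures are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class FaultEvent:
    host: str
    metric: str
    value: float
    tags: dict = field(default_factory=dict)

def collect() -> list[FaultEvent]:
    # Stage 1: near-real-time collection from metrics, logs, and alarms.
    return [FaultEvent("host-01", "disk_errors", 12.0)]

def analyze(events: list[FaultEvent]) -> list[FaultEvent]:
    # Stage 2: aggregate, filter, and enrich raw events (threshold assumed).
    return [e for e in events if e.value > 10]

def decide(events: list[FaultEvent]) -> list[str]:
    # Stage 3: map each confirmed fault to an action via decision rules
    # that were defined ahead of time in the knowledge base.
    return [f"repair:{e.host}" for e in events]

def handle(actions: list[str]) -> int:
    # Stage 4: execute safe recovery workflows; here we just count them.
    return len(actions)

handled = handle(decide(analyze(collect())))
```

In the real system each stage is a separate service wired through the meta‑warehouse rather than direct function calls, but the data flow is the same.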

3. Technical Implementation

3.1 Data Collection

Captures abnormal host and service metrics, log data, alarm subscriptions, system‑provided fault records, and business metadata (service, cluster, component, tag). Log sources include /var/log/messages, /var/log/syslog, and similar system logs.
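A minimal sketch of the log side of collection: scanning syslog‑style lines for fault evidence. The error pattern below is an illustrative assumption, not BMR's actual rule set.

```python
import re

# Keywords that commonly indicate disk or hardware trouble in syslog;
# these specific patterns are assumptions for the sake of the example.
ERROR_PATTERN = re.compile(r"(I/O error|EXT4-fs error|Hardware Error)",
                           re.IGNORECASE)

def scan_log_lines(lines):
    """Return (line_number, line) pairs that look like fault evidence."""
    hits = []
    for i, line in enumerate(lines, start=1):
        if ERROR_PATTERN.search(line):
            hits.append((i, line.rstrip()))
    return hits

sample = [
    "Jan 10 03:12:01 host-01 kernel: blk_update_request: I/O error, dev sdb",
    "Jan 10 03:12:05 host-01 systemd: Started session 42.",
]
```

In production this would run as a tailing agent on each host rather than a batch scan, feeding matches into the unified collection stream.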

3.1.2 Meta‑Warehouse

Aggregates all collected data, tags it with business and host identifiers, and writes the unified stream to Kafka, then syncs it to StarRocks for analysis.
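The tagging step can be sketched as joining a raw event with host‑level business metadata before serialization. The field names and metadata table here are illustrative assumptions; in the article's design the resulting payload is produced to Kafka and synced into StarRocks.

```python
import json
import time

# Hypothetical business metadata keyed by host, as described in 3.1:
# service, cluster, component, and tag.
HOST_META = {
    "host-01": {"service": "HDFS", "cluster": "bmr-prod-3",
                "component": "DataNode", "tag": "ssd-pool"},
}

def enrich(event: dict) -> str:
    """Attach business/host identifiers to a raw event and serialize it."""
    record = dict(event)
    record.update(HOST_META.get(event["host"], {}))
    record["collected_at"] = int(time.time())
    # In production this JSON payload would be written to a Kafka topic
    # and from there synced into StarRocks for analysis.
    return json.dumps(record)

payload = enrich({"host": "host-01", "metric": "disk_errors", "value": 12})
```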

3.2 Decision Definition

Builds a knowledge base containing fault categories, levels, descriptions, impact scopes, and judgment conditions, as well as business impact metadata, enabling precise impact assessment and appropriate remediation actions.
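One way to picture a knowledge‑base entry is a rule record pairing a judgment condition with an action. BMR's actual schema is not public, so every field name below is an illustrative assumption.

```python
# Hypothetical shape of one knowledge-base entry: category, level,
# description, impact scope, a judgment condition, and a remediation action.
FAULT_RULES = {
    "disk.readonly": {
        "category": "hardware/disk",
        "level": "major",
        "description": "Filesystem remounted read-only",
        "impact_scope": "single host",
        "condition": lambda metrics: metrics.get("ro_mounts", 0) > 0,
        "action": "isolate_node_then_repair",
    },
}

def match_rules(metrics: dict) -> list[str]:
    """Return the actions whose judgment conditions match the metrics."""
    return [rule["action"] for rule in FAULT_RULES.values()
            if rule["condition"](metrics)]
```

Keeping conditions and actions declarative like this is what lets the analysis stage assess impact and pick remediations without hard‑coding per‑fault logic.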

3.3 Fault Analysis

Reads recent data from StarRocks, aggregates and filters it, enriches fault details (e.g., disk type, affected services) using multi‑dimensional data, and performs noise reduction by de‑duplicating frequent identical faults and applying silent periods.

3.4 Fault Handling

Generates a workflow composed of sub‑tasks (defense, script, deployment, hardware repair, node isolation, notification). Each task type has specific safety checks, rate‑limiting, and timeout/retry policies. The workflow is stored in a database and scheduled for execution, with manual override options when needed.
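The retry‑and‑halt behavior can be sketched as an ordered runner with a per‑task retry budget. The sub‑task names mirror the article; the runner itself, the fail‑stop policy, and the retry count are assumptions.

```python
def run_workflow(tasks, execute, max_retries=2):
    """Run tasks in order; retry a failed task up to max_retries times.

    `execute(task)` returns True on success. The workflow halts at the
    first task that exhausts its retries, a conservative fail-stop policy
    so a broken step never cascades into later recovery actions.
    """
    completed = []
    for task in tasks:
        for _attempt in range(1 + max_retries):
            if execute(task):
                completed.append(task)
                break
        else:
            return completed, task  # failed task halts the workflow
    return completed, None

tasks = ["defense", "script", "deployment", "hardware_repair",
         "node_isolation", "notification"]
```

A real scheduler would persist each task's state to the database between attempts and enforce per‑type rate limits and timeouts, but the control flow is the same.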

4. Summary and Outlook

The self‑healing system has been applied to critical components such as NodeManager, DataNode, Flink, ClickHouse, Kafka, and Spark, handling over 20 fault cases daily and saving at least one full‑time engineer's workload. Future work includes expanding coverage to more services, adding predictive fault analysis via machine learning, and further automating the recovery process.

Tags: monitoring, big data, automation, operations, cluster management, fault self‑healing
Written by Architect

Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.