How Alibaba Automates Hardware Fault Detection and Self‑Healing at Scale
This article explains how Alibaba’s massive data‑center operations detect hardware failures early, automatically isolate faulty servers, and execute self‑healing workflows through a centralized, cloud‑native platform, detailing detection methods, convergence rules, architecture evolution, and the benefits of a closed‑loop AIOps system.
1. Background
1.1. Challenges
MaxCompute, the offline computing platform that stores and processes 95% of Alibaba Group’s data, now runs on hundreds of thousands of servers. The scale and offline‑job characteristics make hardware faults hard to detect at the software layer, and the group‑wide fault‑threshold policy often misses issues that affect applications.
1.2. Tianji Application Management
MaxCompute runs on Alibaba’s data‑center OS, Apsara. All applications are managed by Tianji, an automated data‑center management system that handles hardware lifecycle and static resources. The hardware self‑healing system integrates tightly with Tianji’s Healing mechanism to build a closed loop for fault discovery and remediation.
2. Hardware Fault Detection
2.1. How to Detect
The main hardware components monitored are disks, memory, CPU, NICs, and power supplies. Common detection methods and tools include system logs, tsar I/O metrics, and SMART values.
Disk failures account for over 50% of hardware issues. Typical symptoms are read/write errors or slow I/O, but these do not always indicate media faults.
Sep 3 13:43:22 host1.a1 kernel: : [14809594.557970] sd 6:0:11:0: [sdl] Sense Key : Medium Error [current]
Sep 3 20:39:56 host1.a1 kernel: : [61959097.553029] Buffer I/O error on device sdi1, logical block 796203507
Tsar I/O metrics such as high util (>90%) combined with low IOPS (<30) for more than 10 minutes often indicate a disk fault.
In tsar I/O terms, the rule qps = ws + rs < 100 and util > 90 (a saturated device that is serving very few requests) helps differentiate normal high load from hardware problems.
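The sustained-window check described above can be sketched as follows. This is a minimal illustration, not Alibaba's implementation: the `IoSample` type, the `suspect_disk` function, and the sampling format are assumptions made for the example. The defaults use the util > 90% and IOPS < 30 figures from the text; the looser qps < 100 variant can be passed through `iops_max`.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class IoSample:
    ts: int       # Unix timestamp in seconds (assumed sampling format)
    util: float   # device utilization in percent
    rs: float     # reads per second
    ws: float     # writes per second


def suspect_disk(samples: Iterable[IoSample],
                 util_min: float = 90.0,
                 iops_max: float = 30.0,
                 window_s: int = 600) -> bool:
    """Return True if util > util_min and rs + ws < iops_max held
    without interruption for at least window_s seconds."""
    start = None  # timestamp at which the current suspicious run began
    for s in sorted(samples, key=lambda x: x.ts):
        if s.util > util_min and (s.rs + s.ws) < iops_max:
            if start is None:
                start = s.ts
            if s.ts - start >= window_s:
                return True
        else:
            start = None  # a healthy sample resets the run
    return False
```

A single healthy sample resets the window, which keeps short bursts of legitimate load from being reported.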
SMART value jumps, especially 197(Current_Pending_Sector) and 5(Reallocated_Sector_Ct), reveal pending or reallocated sectors, confirming media degradation.
When the drive later re‑checks a pending sector, 197 decreases if the sector can be read and written again; if the write fails, 197 decreases and 5 increases because the sector has been reallocated.
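One way to act on these counter movements is to compare two SMART snapshots. The sketch below assumes the raw values of attributes 197 and 5 have already been collected into dictionaries keyed by attribute ID; the function name and the returned labels are hypothetical.

```python
def classify_smart_change(prev: dict, curr: dict) -> str:
    """Classify the change between two snapshots of SMART raw values.

    prev and curr map attribute IDs to raw counts, for example
    {197: 2, 5: 0}. The labels returned are illustrative only.
    """
    d_pending = curr[197] - prev[197]   # Current_Pending_Sector delta
    d_realloc = curr[5] - prev[5]       # Reallocated_Sector_Ct delta

    if d_realloc > 0:
        return "reallocated"      # a sector was remapped: media degradation
    if d_pending > 0:
        return "new_pending"      # a sector is waiting to be re-checked
    if d_pending < 0:
        return "pending_cleared"  # the re-check passed, nothing remapped
    return "unchanged"
```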
Effective fault diagnosis requires correlating multiple stages of evidence to distinguish hardware from software problems.
2.2. How to Converge
Metrics should be application‑agnostic: High I/O util alone is insufficient; only when util > 90% and IOPS < 30 for a sustained period do we suspect hardware.
Collect comprehensively, converge cautiously: All potential fault indicators are collected, but most serve as references; a fault is reported only when clear hardware signals (e.g., SMART failures) appear.
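The two convergence principles above could look roughly like this in code. The signal names and the contents of `HARD_SIGNALS` are assumptions for illustration; the text only names SMART failures as an example of a clear hardware signal.

```python
from typing import Dict, List

# Assumption: which signals count as clear hardware evidence is configurable.
HARD_SIGNALS = {"smart_failure"}


def converge(signals: List[str]) -> Dict[str, object]:
    """Keep every collected signal, but report a fault only when at
    least one clear hardware signal is present."""
    seen = set(signals)
    hard = sorted(seen & HARD_SIGNALS)
    return {
        "report": bool(hard),                      # open a fault or not
        "evidence": hard,                          # signals that justify it
        "reference": sorted(seen - HARD_SIGNALS),  # kept as context only
    }
```

With this shape, a machine showing only high I/O util is recorded but not reported, while the same machine with a SMART failure is reported with the util signal attached as context.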
2.3. Coverage
Excluding out‑of‑band failures, hardware fault detection covered 97.6% of incidents in the sample cluster.
3. Hardware Fault Self‑Healing
3.1. Self‑Healing Process
Two automated workflows are used: a “with‑application” process for hot‑swappable disk faults and a “without‑application” process for full‑machine repairs.
Diskless diagnosis: For a downed machine, a “no‑disk” (ramos) test is run before opening a repair ticket, reducing false positives.
Impact assessment / escalation: If a process is stuck >10 min, the fault is escalated to a full‑machine reboot; if reboot fails, the workflow automatically switches to the “without‑application” path.
Unknown‑issue fallback: When no hardware issue is detected after stress testing, the machine is re‑installed; some hardware problems are discovered during reinstall.
Crash analysis: The workflow also captures crash diagnostics, so crash analysis becomes a by‑product of self‑healing.
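The escalation step in the list above can be sketched as a small decision function. The 10-minute limit comes from the text; the function name, the `reboot` callback, and the returned labels are hypothetical, and the text does not say what happens after a successful reboot, so that label is only a placeholder.

```python
from typing import Callable

STUCK_LIMIT_MIN = 10  # from the text: escalate when stuck for more than 10 minutes


def escalate(stuck_minutes: float, reboot: Callable[[], bool]) -> str:
    """Decide the next step for a machine in the with-application flow.

    reboot is a caller-supplied function that attempts a full-machine
    reboot and returns True on success.
    """
    if stuck_minutes <= STUCK_LIMIT_MIN:
        return "with-application"     # keep waiting in the current flow
    if reboot():
        return "rebooted"             # placeholder: follow-up not specified
    return "without-application"      # reboot failed: full-machine repair
```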
3.2. Process Statistics
Repeated self‑healing of the same hardware model (e.g., Lenovo RD640) revealed persistent issues, prompting isolation of affected machines to protect cluster stability.
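A simple way to surface this kind of pattern is to count self-healing events per machine and group the repeat offenders by hardware model. The event format, the function name, and the `min_repeats` threshold below are assumptions for illustration, not values from the article.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def repeat_offenders(events: Iterable[Tuple[str, str]],
                     min_repeats: int = 3) -> Dict[str, List[str]]:
    """Group hosts that were self-healed at least min_repeats times
    by hardware model. Each event is a (model, host) tuple."""
    counts = Counter(events)
    by_model = defaultdict(list)
    for (model, host), n in sorted(counts.items()):
        if n >= min_repeats:
            by_model[model].append(host)
    return dict(by_model)
```

A model that appears with many hosts in the result is a candidate for isolation and closer investigation.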
3.3. Business Correlation Pitfalls
While the self‑healing system can address some software‑level problems, over‑reliance may mask underlying issues, so non‑hardware problems are gradually removed from the pipeline.
4. Architecture Evolution
4.1. Cloudification
The original self‑healing logic lived on each cluster’s control node. To break data silos, a centralized architecture was adopted, then further refactored into distributed services using Alibaba Cloud Log Service (SLS), Blink Stream Compute, and AnalyticDB (ADS) for large‑scale data processing.
4.2. Data‑driven
Collected metrics are aggregated into a health score for each machine, rack, and cluster, enabling operators to quickly assess hardware health at any granularity.
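The article does not give the scoring formula, so the following is only a sketch of how per-machine scores might roll up to racks and clusters. It assumes each machine already has a score between 0 and 100 and uses a plain average at each level; the function name and key layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Tuple


def rollup(machine_scores: Dict[Tuple[str, str, str], float]):
    """Aggregate machine scores keyed by (cluster, rack, host) into
    per-rack and per-cluster averages."""
    racks = defaultdict(list)
    clusters = defaultdict(list)
    for (cluster, rack, _host), score in machine_scores.items():
        racks[(cluster, rack)].append(score)
        clusters[cluster].append(score)
    rack_scores = {key: mean(vals) for key, vals in racks.items()}
    cluster_scores = {key: mean(vals) for key, vals in clusters.items()}
    return rack_scores, cluster_scores
```

An average hides a single very unhealthy machine in a large rack, so a real system might also track the minimum or a low percentile at each level.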
4.3. Service‑oriented
The self‑healing platform is offered as a standardized hardware‑lifecycle service, with configurable thresholds for different product lines, turning the internal capability into a reusable service.
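Configurable thresholds per product line could be modeled as shared defaults plus overrides. Everything named below is hypothetical: the `Thresholds` fields reuse the disk rule figures from section 2, and the product-line name and its override are invented for the example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Thresholds:
    util_min: float = 90.0   # percent
    iops_max: float = 30.0   # requests per second
    window_s: int = 600      # seconds the condition must hold


DEFAULT = Thresholds()

# Assumption: each product line overrides only the fields it needs.
OVERRIDES = {
    "hypothetical-online-service": {"window_s": 300},
}


def thresholds_for(product_line: str) -> Thresholds:
    """Return the defaults with any product-line overrides applied."""
    return replace(DEFAULT, **OVERRIDES.get(product_line, {}))
```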
5. Fault Self‑Healing Closed‑Loop
5.1. Necessity
In complex distributed systems, hardware‑software conflicts arise from information asymmetry. Explicitly modeling self‑healing as a first‑class behavior lets software components participate in remediation, turning conflicts into coordinated actions.
5.2. Universality
Automation reduces the side‑effects of manual operations.
Human‑driven maintenance gradually becomes fully automated.
Each operation is bound by a clear SLA, eliminating unpredictable script failures.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
