How Alibaba Automates Hardware Fault Detection and Self‑Healing at Scale
This article explains how Alibaba’s MaxCompute platform handles the growing challenge of hardware failures. It combines predictive detection, automated server offlining, self‑healing workflows, and cluster rebalancing to close the fault loop before the business is affected, and it outlines the underlying architecture and operating principles.
1. Challenges
MaxCompute, the offline compute platform that stores and processes 95% of Alibaba Group’s data, has grown to hundreds of thousands of servers. The scale makes hardware faults hard to detect at the software level, and the group‑wide hardware alarm thresholds often miss faults that affect applications, posing a serious stability challenge.
Two key problems are addressed: timely hardware fault detection and migration of workloads from faulty machines.
2. Tianji Application Management
MaxCompute runs on Alibaba’s Apsara operating system, and all applications are managed by the Tianji (天基) automation system, which handles hardware lifecycle and static resources. The self‑healing system tightly integrates with Tianji’s Healing mechanism to build a closed‑loop fault discovery and repair process.
3. Hardware Fault Detection
The focus is on common hardware components such as disks, memory, CPU, NICs, and power supplies. Detection methods include system log analysis, TSAR I/O metrics, and SMART value monitoring.
3.1 How to Discover
System log errors can be found in /var/log/messages.
Sep 3 13:43:22 host1.a1 kernel: : [14809594.557970] sd 6:0:11:0: [sdl] Sense Key : Medium Error [current]
Sep 3 20:39:56 host1.a1 kernel: : [61959097.553029] Buffer I/O error on device sdi1, logical block 796203507
TSAR I/O metric anomalies, such as low throughput combined with high utilization (e.g., qps = ws + rs < 100 and util > 90%), often indicate a disk problem when no kernel issue is present.
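As a minimal sketch of the two signals above, the following Python matches the sample kernel messages and applies the low-throughput, high-utilization rule. The names find_disk_errors, IoSample, and is_suspicious_io are hypothetical, the patterns cover only the two sample messages shown, and the thresholds are the example values quoted in the text rather than production settings.

```python
import re
from dataclasses import dataclass

# Patterns derived only from the two sample messages in this article;
# a real deployment would need a much broader list.
DISK_ERROR_PATTERNS = [
    re.compile(r"\[(?P<dev>sd[a-z]+)\] Sense Key : Medium Error"),
    re.compile(r"Buffer I/O error on device (?P<dev>sd[a-z]+)\d*"),
]


def find_disk_errors(lines):
    """Return the set of block devices named in disk-error log lines."""
    devices = set()
    for line in lines:
        for pattern in DISK_ERROR_PATTERNS:
            match = pattern.search(line)
            if match:
                devices.add(match.group("dev"))
                break
    return devices


@dataclass
class IoSample:
    rs: float    # reads per second
    ws: float    # writes per second
    util: float  # device utilization, in percent


def is_suspicious_io(sample, max_qps=100.0, min_util=90.0):
    """Apply the example rule: qps = ws + rs < 100 and util > 90."""
    qps = sample.ws + sample.rs
    return qps < max_qps and sample.util > min_util
```

Feeding the lines of /var/log/messages to find_disk_errors yields candidate devices, and is_suspicious_io can then be evaluated on I/O samples for the same host. Neither signal alone confirms a hardware fault.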
SMART value jumps, such as an increase in attribute 197 (Current_Pending_Sector) or attribute 5 (Reallocated_Sector_Ct), reveal pending or reallocated sectors.
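A SMART jump can be detected by comparing two snapshots of raw attribute values taken at different times. The sketch below assumes each snapshot is a dict mapping attribute ID to raw value; smart_jumps and min_increase are hypothetical names, and how the snapshots are collected is outside its scope.

```python
# The two attributes named in the text; other attributes are ignored here.
WATCHED_ATTRIBUTES = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
}


def smart_jumps(previous, current, min_increase=1):
    """Return {attribute_name: increase} for watched attributes that grew.

    previous and current are assumed to be {attribute_id: raw_value} dicts.
    Attributes missing from either snapshot are skipped.
    """
    jumps = {}
    for attr_id, name in WATCHED_ATTRIBUTES.items():
        before = previous.get(attr_id)
        after = current.get(attr_id)
        if before is None or after is None:
            continue
        increase = after - before
        if increase >= min_increase:
            jumps[name] = increase
    return jumps
```

Comparing deltas rather than absolute values matches the "jump" wording above: a disk with a stable, non-zero reallocated count is treated differently from one whose count is growing.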
Multiple stages of analysis are required to confirm hardware problems and distinguish them from software issues.
Additional detection sources include tools described in earlier articles on SSD fault prediction and memory fault prediction.
4. Hardware Fault Self‑Healing
4.1 Self‑Healing Process
For each faulty machine, a repair ticket is created and routed automatically. Two workflows exist: a “with‑application” flow for hot‑swappable disk failures, and a “without‑application” flow for other hardware repairs.
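The choice between the two workflows can be sketched as a small routing function. This is an illustration of the rule stated above, not the actual Tianji interface; Workflow and choose_workflow are hypothetical names, and the component strings are assumptions.

```python
from enum import Enum


class Workflow(Enum):
    WITH_APPLICATION = "with-application"
    WITHOUT_APPLICATION = "without-application"


def choose_workflow(component, hot_swappable):
    """Pick a self-healing workflow for a faulty component.

    Only hot-swappable disk failures use the with-application flow;
    every other hardware repair uses the without-application flow.
    """
    if component == "disk" and hot_swappable:
        return Workflow.WITH_APPLICATION
    return Workflow.WITHOUT_APPLICATION
```

For example, choose_workflow("disk", True) selects the with-application flow, while a memory fault or a non-hot-swappable disk falls through to the without-application flow.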
4.2 Process Statistics Analysis
Repeated self‑healing of the same hardware issue (e.g., Lenovo RD640 virtual serial port problem) can be identified through ticket statistics, allowing isolation of problematic machines before they affect the cluster.
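One way to surface such repeat cases is to count tickets per host and fault type. The sketch below assumes each ticket is a dict with "host" and "fault" keys; that shape, the function name repeat_offenders, and the min_repeats cutoff are all hypothetical.

```python
from collections import Counter


def repeat_offenders(tickets, min_repeats=3):
    """Return sorted (host, fault) pairs seen at least min_repeats times.

    Each ticket is assumed to be a dict with "host" and "fault" keys.
    """
    counts = Counter((t["host"], t["fault"]) for t in tickets)
    return sorted(pair for pair, n in counts.items() if n >= min_repeats)
```

Hosts returned here are candidates for isolation or deeper investigation, since another pass through the same repair flow is unlikely to resolve the issue.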
4.3 Business‑Related Pitfalls
While the self‑healing system can also handle software‑level issues, over‑reliance on it may mask underlying problems, leading to “band‑aid” solutions rather than root‑cause fixes.
5. Architecture Evolution
5.1 Cloudification
The original self‑healing architecture ran on each cluster’s control machine. It was later centralized to make fault data more openly accessible, which introduced new challenges in handling massive data volumes.
5.2 Data‑Driven Design
Distributed services were introduced, leveraging Alibaba Cloud Log Service (SLS), Blink stream computing, and AnalyticDB (ADS) to offload large‑scale data collection and analysis, leaving only core fault analysis on the server side.
5.3 Service‑Oriented
The fault‑self‑healing system is offered as a standardized hardware lifecycle service to various product lines, providing customizable perception thresholds and supporting full‑lifecycle management.
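Customizable thresholds could be modeled as shared defaults with per-product overrides. Everything in this sketch is hypothetical: the key names, the product name "example-product", and the override value. Only the 100 and 90 defaults echo the example rule from section 3.1.

```python
# Shared defaults; the qps and util values repeat the example from section 3.1.
DEFAULT_THRESHOLDS = {
    "max_qps": 100.0,
    "min_util": 90.0,
    "smart_min_increase": 1,
}

# Hypothetical per-product overrides.
PRODUCT_OVERRIDES = {
    "example-product": {"min_util": 95.0},
}


def thresholds_for(product):
    """Return the defaults merged with any overrides for the given product."""
    merged = dict(DEFAULT_THRESHOLDS)  # copy so the defaults stay untouched
    merged.update(PRODUCT_OVERRIDES.get(product, {}))
    return merged
```

A product line without overrides receives the defaults unchanged, so onboarding a new consumer requires no configuration until it needs different sensitivity.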
6. Summary
In the AIOps perception‑decision‑execution loop, software and hardware fault self‑healing are the most common use cases. A generic closed self‑healing loop is foundational for AIOps and NoOps, especially in large‑scale distributed systems where hardware‑software conflicts are a major source of instability.
