
How Alibaba Automates Hardware Fault Detection and Self‑Healing at Scale

This article explains how Alibaba’s MaxCompute platform handles hardware failures at scale. Predictive detection, automated server offlining, self‑healing workflows, and cluster rebalancing close the fault loop before the business is affected. The article also outlines the underlying architecture and operating principles.

Alibaba Cloud Big Data AI Platform

1. Challenges

MaxCompute, the offline compute platform that stores and processes 95% of Alibaba Group’s data, has grown to hundreds of thousands of servers. The scale makes hardware faults hard to detect at the software level, and the group‑wide hardware alarm thresholds often miss faults that affect applications, posing a serious stability challenge.

Two key problems are addressed: timely hardware fault detection and migration of workloads from faulty machines.

2. Tianji Application Management

MaxCompute runs on Alibaba’s Apsara operating system, and all applications are managed by the Tianji (天基) automation system, which handles hardware lifecycle and static resources. The self‑healing system tightly integrates with Tianji’s Healing mechanism to build a closed‑loop fault discovery and repair process.

3. Hardware Fault Detection

The focus is on common hardware components such as disks, memory, CPU, NICs, and power supplies. Detection methods include system log analysis, TSAR I/O metrics, and SMART value monitoring.

3.1 How to Discover

Hardware errors reported by the kernel appear in the system log at /var/log/messages, for example:

Sep 3 13:43:22 host1.a1 kernel: : [14809594.557970] sd 6:0:11:0: [sdl] Sense Key : Medium Error [current]
Sep 3 20:39:56 host1.a1 kernel: : [61959097.553029] Buffer I/O error on device sdi1, logical block 796203507
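
As an illustration of log-based detection, the following minimal Python sketch scans log lines for the two error shapes shown above and counts hits per device. The regular expressions are derived only from those two sample lines, and the function name `find_disk_errors` is hypothetical; it is not the platform's actual detector.

```python
import re

# Patterns derived from the two sample log lines only; a real detector
# would need a much broader rule set.
DISK_ERROR_PATTERNS = [
    re.compile(r"\[(?P<dev>sd[a-z]+)\] Sense Key : Medium Error"),
    re.compile(r"Buffer I/O error on device (?P<dev>sd[a-z]+)\d*"),
]


def find_disk_errors(lines):
    """Return a dict mapping a block device name (e.g. 'sdl') to its error count."""
    counts = {}
    for line in lines:
        for pattern in DISK_ERROR_PATTERNS:
            match = pattern.search(line)
            if match:
                dev = match.group("dev")  # partition suffix (sdi1 -> sdi) is dropped
                counts[dev] = counts.get(dev, 0) + 1
                break  # count each line at most once
    return counts


sample = [
    "Sep 3 13:43:22 host1.a1 kernel: : [14809594.557970] sd 6:0:11:0: [sdl] Sense Key : Medium Error [current]",
    "Sep 3 20:39:56 host1.a1 kernel: : [61959097.553029] Buffer I/O error on device sdi1, logical block 796203507",
]
print(find_disk_errors(sample))
```

Grouping by whole device rather than by partition is a design choice of this sketch, on the assumption that a repair action targets the physical disk.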

TSAR I/O metric anomalies, for example a low request rate combined with near‑saturated utilization (qps = rs + ws < 100 and util > 90), often indicate disk issues when kernel problems are absent.
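
The rule above can be sketched in Python as follows. The thresholds (100 and 90) come from the text; the `IoSample` type, the function names, and the requirement that the condition hold for several consecutive samples are assumptions added here to reduce noise, not something the source describes.

```python
from dataclasses import dataclass


@dataclass
class IoSample:
    rs: float    # read requests per second
    ws: float    # write requests per second
    util: float  # device utilization as reported by the I/O metrics


def is_suspicious(sample, max_qps=100.0, min_util=90.0):
    """Apply the rule from the text: qps = rs + ws < 100 and util > 90."""
    qps = sample.rs + sample.ws
    return qps < max_qps and sample.util > min_util


def disk_looks_stuck(samples, min_consecutive=3):
    """Return True if the rule holds for min_consecutive samples in a row.

    The consecutive-sample window is an assumption of this sketch.
    """
    streak = 0
    for sample in samples:
        streak = streak + 1 if is_suspicious(sample) else 0
        if streak >= min_consecutive:
            return True
    return False
```

A disk serving few requests while reporting high utilization is spending its time on slow or retried operations, which is why the two conditions are combined rather than checked separately.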

SMART value jumps, such as increases in attribute 197 (Current_Pending_Sector) or attribute 5 (Reallocated_Sector_Ct), reveal sectors that are pending remapping or have already been reallocated.
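
One way to detect such jumps is to compare two snapshots of the raw attribute values. The sketch below assumes the snapshots have already been collected into dictionaries keyed by attribute ID; the function name `smart_jumps` and the `min_increase` parameter are hypothetical.

```python
# SMART attribute IDs named in the text.
WATCHED = {5: "Reallocated_Sector_Ct", 197: "Current_Pending_Sector"}


def smart_jumps(previous, current, min_increase=1):
    """Compare two snapshots of raw SMART values.

    previous, current: dict mapping attribute ID -> raw value.
    Returns a list of (attribute_id, name, delta) for watched attributes
    whose raw value grew by at least min_increase.
    """
    jumps = []
    for attr_id, name in sorted(WATCHED.items()):
        delta = current.get(attr_id, 0) - previous.get(attr_id, 0)
        if delta >= min_increase:
            jumps.append((attr_id, name, delta))
    return jumps


yesterday = {5: 0, 197: 0}
today = {5: 0, 197: 8}
print(smart_jumps(yesterday, today))
```

Tracking the change between snapshots rather than the absolute value keeps a disk with a stable, old reallocation count from alarming repeatedly.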

Multiple stages of analysis are required to confirm hardware problems and distinguish them from software issues.

Additional detection sources include tools described in earlier articles on SSD fault prediction and memory fault prediction.

4. Hardware Fault Self‑Healing

4.1 Self‑Healing Process

For each faulty machine, an automatic rotating ticket is created. Two workflows exist: a “with‑application” flow for hot‑swappable disk failures, and a “without‑application” flow for other hardware repairs.
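
The choice between the two workflows can be sketched as a small routing function. The flow names come from the text; the `Flow` enum, the component strings, and the `hot_swappable` flag are illustrative assumptions rather than the real ticket schema.

```python
from enum import Enum


class Flow(Enum):
    WITH_APPLICATION = "with-application"        # repair while applications stay on the host
    WITHOUT_APPLICATION = "without-application"  # repair after applications are moved off


def choose_flow(component, hot_swappable):
    """Pick a repair workflow for a faulty component.

    Per the text, only hot-swappable disk failures use the
    with-application flow; every other repair uses the other flow.
    """
    if component == "disk" and hot_swappable:
        return Flow.WITH_APPLICATION
    return Flow.WITHOUT_APPLICATION


print(choose_flow("disk", hot_swappable=True).value)
print(choose_flow("memory", hot_swappable=False).value)
```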

4.2 Process Statistics Analysis

Repeated self‑healing of the same hardware issue (e.g., Lenovo RD640 virtual serial port problem) can be identified through ticket statistics, allowing isolation of problematic machines before they affect the cluster.
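
A minimal sketch of this kind of ticket statistic is shown below. It assumes each ticket is a dictionary with hypothetical `host` and `fault` fields, and the repeat threshold of 3 is an arbitrary illustrative value.

```python
from collections import Counter


def repeat_offenders(tickets, min_repeats=3):
    """Return sorted (host, fault) pairs that were self-healed at least min_repeats times.

    tickets: iterable of dicts with 'host' and 'fault' keys (hypothetical schema).
    """
    counts = Counter((t["host"], t["fault"]) for t in tickets)
    return sorted(key for key, n in counts.items() if n >= min_repeats)


tickets = [
    {"host": "host1.a1", "fault": "virtual_serial_port"},
    {"host": "host1.a1", "fault": "virtual_serial_port"},
    {"host": "host1.a1", "fault": "virtual_serial_port"},
    {"host": "host2.a1", "fault": "disk"},
]
print(repeat_offenders(tickets))
```

Machines returned by such a query are candidates for isolation, since repeated healing of the same fault suggests the repair is not addressing the cause.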

4.3 Business‑Related Pitfalls

While the self‑healing system can also handle software‑level issues, over‑reliance on it may mask underlying problems, leading to “band‑aid” solutions rather than root‑cause fixes.

5. Architecture Evolution

5.1 Cloudification

The original self‑healing architecture ran on each cluster’s control machine, but centralization was later adopted to improve data openness. This introduced new challenges in handling massive data volumes.

5.2 Data‑Driven Design

Distributed services were introduced, leveraging Alibaba Cloud Log Service (SLS), Blink stream computing, and AnalyticDB (ADS) to offload large‑scale data collection and analysis, leaving only core fault analysis on the server side.

5.3 Service‑Oriented

The fault‑self‑healing system is offered as a standardized hardware lifecycle service to various product lines, providing customizable perception thresholds and supporting full‑lifecycle management.

6. Summary

In the AIOps perception‑decision‑execution loop, software and hardware fault self‑healing are the most common use cases. Providing a generic self‑healing closed‑loop is foundational for AIOps and NoOps, especially for large‑scale distributed systems where hardware‑software conflicts are a major source of instability.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
