Tagged articles
3 articles
Page 1 of 1
Alibaba Cloud Big Data AI Platform
Alibaba Cloud Big Data AI Platform
Aug 5, 2025 · Operations

How Alibaba Automates Hardware Fault Detection and Self‑Healing at Scale

This article explains how Alibaba’s massive MaxCompute platform tackles the growing challenge of hardware failures by using predictive detection, automated server offline, self‑healing workflows, and cluster rebalancing to close the fault loop before business impact, while detailing the underlying architecture and operational principles.

Alibaba CloudOperations Automationaiops
0 likes · 14 min read
How Alibaba Automates Hardware Fault Detection and Self‑Healing at Scale
Efficient Ops
Efficient Ops
Nov 27, 2018 · Operations

How Alibaba Automates Server Fault Detection and Self‑Healing at Scale

Alibaba’s massive data‑center operations face growing hardware failures, so they built the DAM (Dammo) platform that integrates Tianji management, predictive fault detection, automated remediation, and self‑balancing cluster reconstruction, achieving near‑complete hardware issue coverage and reducing manual intervention across hundreds of thousands of servers.

Operationsaiopscloud computing
0 likes · 17 min read
How Alibaba Automates Server Fault Detection and Self‑Healing at Scale
Alibaba Cloud Developer
Alibaba Cloud Developer
Nov 19, 2018 · Operations

How Alibaba Automates Hardware Fault Detection and Self‑Healing at Scale

This article explains how Alibaba’s massive data‑center operations detect hardware failures early, automatically isolate faulty servers, and execute self‑healing workflows through a centralized, cloud‑native platform, detailing detection methods, convergence rules, architecture evolution, and the benefits of a closed‑loop AIOps system.

Operationsaiopscloud-native
0 likes · 15 min read
How Alibaba Automates Hardware Fault Detection and Self‑Healing at Scale