Tagged articles
56 articles
ITPUB
May 11, 2026 · Databases

What Human Evolution Teaches About IT Architecture Trade‑offs (Ahead of the 2026 SACC)

The article draws a detailed analogy between millions of years of human evolution—standing up, shedding hair, expanding the brain, and recruiting ancient bacteria—and modern IT architecture, showing how each design choice brings hidden costs, why perfect systems are impossible, and how embracing trade‑offs, extensions, and continuous iteration can lead to resilient, self‑healing databases.

Cloud Native · architecture · databases
0 likes · 22 min read
DeepHub IMBA
May 7, 2026 · Frontend Development

Self‑Healing Playwright Tests with LLM‑Driven Locator Recovery

This article shows how to combine Playwright with an LLM (Groq) to build a self‑healing test framework that detects broken selectors, extracts a trimmed DOM snapshot, asks the model for a replacement locator, validates confidence, caches results, and integrates the logic via a Playwright fixture.

Groq · JavaScript · LLM
0 likes · 17 min read
FunTester
Apr 28, 2026 · Operations

How Self‑Healing Automation Platforms Transform SRE Practices

The article explains how a self‑healing platform improves SRE reliability by reducing MTTR, preserving error‑budget, automating high‑impact incident remediation, enforcing safety guardrails, and shifting team focus from firefighting to sustainable reliability engineering.

Automation · Error Budget · MTTR
0 likes · 10 min read
FunTester
Apr 27, 2026 · Operations

Why Relying on Humans for Incident Recovery Fails and How Self‑Healing Automation Platforms Help

The article explains that large‑scale incidents overwhelm on‑call engineers who must manually piece together context from countless signals, and shows how a self‑healing automation platform can take over repetitive, known failure patterns, verify fixes, and reduce fatigue while keeping humans in the loop for oversight.

Automation · Operations · SRE
0 likes · 8 min read
Ray's Galactic Tech
Apr 24, 2026 · Backend Development

Self‑Healing Agents: Rebuilding a High‑Concurrency Travel System with Spring AI ReAct

This article details how a legacy travel‑booking service was transformed into a production‑grade, self‑healing agent system using Spring AI ReAct and multi‑tool coordination, covering architectural redesign, tool governance, error semantics, high‑concurrency safeguards, observability, security, and real‑world performance gains.

Agent · Backend · React
0 likes · 31 min read
Woodpecker Software Testing
Apr 24, 2026 · Operations

Self-Healing UI Test Scripts: Boost Performance and Reliability

The article explains how fragile UI automation scripts hinder performance testing and shows a three‑layer self‑healing approach using Playwright and Python that reduces script failures, cuts maintenance time, and integrates with monitoring to quickly detect UI performance issues.

Automation · Performance Testing · Playwright
0 likes · 7 min read
Woodpecker Software Testing
Apr 18, 2026 · Operations

Deep Dive into Performance Optimization for Self‑Healing Test Scripts

The article examines why self‑healing test scripts increase runtime overhead, breaks down the underlying mechanisms, and presents four concrete optimization tactics—layered healing, locator caching, visual/semantic throttling, and asynchronous repair—backed by real‑world case data showing up to 43% faster regressions and 52% lower maintenance cost.

DevOps · Performance Optimization · UI testing
0 likes · 8 min read
StarRocks
Apr 16, 2026 · Databases

Why Traditional Databases Stall AI Agents—and How StarRocks Overcomes the Bottleneck

Traditional databases were built for low‑frequency, human‑driven queries, but AI agents generate dozens of concurrent, sub‑second queries that expose architectural limits, and StarRocks addresses these challenges with self‑healing optimization, real‑time data pipelines, extreme concurrency handling, and seamless lakehouse access.

Database Concurrency · Lakehouse · Real-time analytics
0 likes · 13 min read
DevOps Coach
Apr 15, 2026 · Cloud Computing

How We Scaled to 6,000 AWS Accounts with a 3‑Engineer Team: A Self‑Healing Automation Blueprint

This article details how a SaaS platform transformed its AWS multi‑account management from manual, toil‑heavy processes to a fully automated, self‑healing system that now handles over 6,000 accounts with just three engineers, achieving sub‑5‑minute provisioning, 99.8% compliance, and massive cost savings.

AWS · Automation · Infrastructure as Code
0 likes · 15 min read
Woodpecker Software Testing
Apr 15, 2026 · Artificial Intelligence

How AI Testing Tools Redefine Performance Optimization: A New Paradigm

Amid exploding large‑model deployments, AI teams struggle with slow test feedback, but AI‑native testing tools—through intelligent load modeling, inference‑layer root‑cause analysis, and self‑healing loops—demonstrate concrete latency reductions, resource savings, and faster issue remediation.

AI testing · MLOps · Observability
0 likes · 6 min read
Woodpecker Software Testing
Mar 18, 2026 · Operations

How Self‑Healing UI Test Scripts Boost Performance Testing Reliability

The article explains why traditional UI automation scripts break under high‑load performance testing and presents a deterministic, three‑level self‑healing framework (locator elasticity, timing adaptation, and flexible assertions) implemented with Python + Playwright in a banking transaction system, raising script stability from 41% to 96.5% at 5k TPS.

JMeter · Performance Testing · Playwright
0 likes · 8 min read
Woodpecker Software Testing
Mar 17, 2026 · Operations

How Self‑Healing Test Scripts Make UI Automation Truly Live

The article explains why traditional UI automation scripts break on minor UI changes, introduces self‑healing test scripts that combine detection, analysis, repair and verification layers, compares commercial, framework‑enhanced and in‑house implementations, and outlines three common pitfalls to avoid for reliable, resilient test automation.

CI · Playwright · UI automation
0 likes · 9 min read
ShiZhen AI
Mar 11, 2026 · Artificial Intelligence

Build Persistent AI Agents with OpenClaw: A 40‑Day Hands‑On Guide

This article details a 40‑day workflow for creating and evolving eight continuous‑running OpenClaw AI agents using a three‑layer markdown file system—Identity, Operations, and Knowledge—showing how to give agents long‑term memory, self‑healing checks, and coordinated collaboration without databases or message queues.

AI agents · Agent Coordination · Long-term Memory
0 likes · 17 min read
AI Tech Publishing
Mar 9, 2026 · Artificial Intelligence

How to Build a 24‑7 Autonomous AI Agent Team with OpenClaw

This guide walks through setting up a continuously running AI Agent Team using OpenClaw, covering hardware choices, installation, file structure, agent roles, coordination via markdown files, scheduling, self‑healing cron jobs, security, cost, troubleshooting, and step‑by‑step recommendations for incremental deployment.

AI agents · Automation · Cron scheduling
0 likes · 20 min read
Woodpecker Software Testing
Mar 3, 2026 · Operations

Self-Healing Test Scripts: End Frequent Maintenance Hassles

The article explains how self‑healing test scripts, built on observable snapshots, strategy libraries, and lightweight decision engines, can automatically detect UI changes, diagnose locator failures, and apply semantic or visual fixes, dramatically reducing maintenance time and manual intervention in fast‑paced continuous delivery environments.

Observability · Python · Selenium
0 likes · 7 min read
Huolala Tech
Feb 4, 2026 · Artificial Intelligence

How AI Self‑Healing Transforms Mobile UI Automation Testing

This article examines the challenges of manual mobile UI testing, introduces AI‑driven self‑healing techniques that combine multimodal perception, visual models and semantic analysis, and details the architecture, diagnostic workflow, smart popup handling, change‑aware engines, practical results and future directions.

AI · Software quality · UI automation
0 likes · 15 min read
Raymond Ops
Jan 28, 2026 · Artificial Intelligence

From Alert Storms to Smart Ops: Unlocking AIOps for Modern IT Operations

This guide walks through the evolution from noisy alert storms to intelligent AIOps, covering AIOps fundamentals, why it matters now, core capabilities like anomaly detection, root‑cause analysis, capacity forecasting and self‑healing, a practical implementation roadmap, toolchain suggestions, common pitfalls, and future trends.

Capacity Prediction · Root Cause Analysis · aiops
0 likes · 22 min read
Huawei Cloud Developer Alliance
Oct 16, 2025 · Operations

How HyperRouter Enables Deterministic Operations for L4 Load Balancing

This article explains how Huawei Cloud's HyperRouter implements deterministic operations through a combination of L4/L7 load‑balancing co‑design, high‑performance data‑plane choices, self‑healing mechanisms, point‑to‑point architecture, Cell + Shuffle‑Sharding isolation, and user‑centric observability, providing a reproducible blueprint for reliable cloud services.

Cloud Native · DPDK · Observability
0 likes · 17 min read
MaGe Linux Operations
Sep 12, 2025 · Operations

From Alert Storms to Intelligent Ops: A Practical AIOps Journey

This article explores how AIOps transforms traditional IT operations by using AI for anomaly detection, root‑cause analysis, capacity forecasting, and self‑healing, offering a step‑by‑step roadmap, real‑world code examples, toolchain recommendations, common pitfalls, and future trends for building intelligent, automated operations.

Root Cause Analysis · aiops · anomaly detection
0 likes · 24 min read
Qunar Tech Salon
Sep 1, 2025 · Databases

Redesigning Database Monitoring: From Push to Pull for Smarter Alerts

This article analyzes the shortcomings of the legacy database monitoring system, explains the transition from a push‑based to a pull‑based architecture, outlines comprehensive metric collection, intelligent alert strategies, and self‑healing mechanisms, and showcases the performance improvements achieved with the new solution.

Alerting · Database Monitoring · Prometheus
0 likes · 25 min read
Advanced AI Application Practice
Aug 19, 2025 · Frontend Development

How AI Overcomes Enterprise UI Automation Testing Pain Points

The article examines the inherent drawbacks of traditional UI automation—selector dependence, fragility, extra development overhead, limited support for Canvas/SVG, unreadable reports, and steep learning curves—and shows how the AI‑driven Midscene.js framework addresses each issue with semantic element location, intelligent fault tolerance, zero‑code instrumentation, multimodal element recognition, business‑semantic reporting, and flexible development modes, outperforming conventional tools like Browser Use.

AI testing · Browser Use · Midscene.js
0 likes · 10 min read
Cognitive Technology Team
Nov 14, 2024 · Operations

Designing Self‑Healing Applications for Fault Tolerance in Distributed Systems

To ensure distributed applications can recover automatically from hardware, network, or service failures, this guide outlines three core capabilities—fault detection, graceful handling, and monitoring—plus practical strategies such as asynchronous component separation, retries, circuit breakers, isolation, load shedding, failover, compensation, checkpointing, graceful degradation, rate limiting, leader election, fault injection, chaos engineering, and use of availability zones.

Cloud Native · Distributed Systems · Operations
0 likes · 7 min read
ByteDance SYS Tech
May 9, 2024 · Operations

How Large‑Model Agents Transform AIOps: From Automation to Self‑Healing Operations

The presentation explains how large‑model agents empower AIOps by automating routine tasks, enhancing anomaly detection, fault diagnosis, and remediation, while outlining architectural components, multi‑agent collaboration, and future directions for building self‑healing, observability‑driven operations platforms.

Agent · Observability · Operations Automation
0 likes · 15 min read
Efficient Ops
Nov 8, 2023 · Operations

How Intelligent Operations (AIOps) Transforms IT Management and Self‑Healing

This article explains what intelligent operations (AIOps) are, outlines a four‑layer platform architecture, and showcases real‑world practices such as load‑balancing link repair, MySQL container self‑healing, composite service tracing, component‑based orchestration, and AI‑driven log analysis, concluding with future prospects.

Automation · IT Operations · Intelligent Operations
0 likes · 7 min read
21CTO
Jun 18, 2023 · Artificial Intelligence

Can AI Self‑Healing Code Revolutionize Software Development?

The article explores how generative AI and large language models are enabling automated code creation, self‑repair, and continuous‑integration workflows, while highlighting challenges in code quality, industry experiments at Google and Stack Overflow, and the future impact on developers and software engineering practices.

AI · Code Generation · ci/cd
0 likes · 12 min read
Baidu Geek Talk
Mar 29, 2023 · Cloud Native

Punica: A Cloud‑Native Platform for Content Understanding Inference Services

Punica provides a cloud‑native, one‑stop platform that unifies Baidu’s content‑understanding inference services, automates testing, resource provisioning, and monitoring, and enables unattended, self‑healing operations with dynamic scaling and GPU scheduling, cutting onboarding time by half and reclaiming hundreds of GPUs.

AI inference · Inference Platform · Service Orchestration
0 likes · 14 min read
Tencent Cloud Developer
Dec 26, 2022 · Cloud Native

Challenges and Optimization Strategies for Containerized Deployment of Online Services on Kubernetes

Tencent’s shift from VMs to Kubernetes for massive online services faces pod‑size rigidity, heterogeneous node balancing, elastic scaling, and massive cluster‑pool mapping, prompting optimizations such as dynamic CPU compression, custom load‑aware scheduling, collaborative HPA/VPA scaling, dynamic quota migration, unified routing‑sync, and an automated decision‑tree‑driven self‑healing workflow for container‑destruction failures.

Dynamic Scheduling · Kubernetes · Resource Optimization
0 likes · 12 min read
Baidu Geek Talk
Dec 20, 2022 · Industry Insights

How AI‑Powered Fault Localization Transforms Automated Testing at Scale

This article explores Baidu's intelligent testing practices, covering spectrum‑based root‑cause localization, error‑code driven build‑system diagnostics, revenue‑change stop‑loss decision workflows, and search UI case‑level tracing, illustrating how data, algorithms, and engineering combine to reduce manual effort and accelerate issue resolution.

Automated Testing · Fault Localization · Operations
0 likes · 10 min read
Alibaba Cloud Big Data AI Platform
Dec 9, 2022 · Operations

How Alibaba’s Flink Cluster Inspector Eliminates Hotspot Machines in Real‑Time Streaming

This article details Alibaba Cloud's Flink Cluster Inspector, explaining the business challenges of hotspot machines, the analysis of resource over‑use, and the four‑stage solution—pre‑profiling, in‑process self‑healing, post‑recovery, and observability—that reduces latency, cuts costs, and improves operational efficiency.

Cluster · Flink · HotSpot
0 likes · 19 min read
Top Architect
Nov 12, 2022 · Cloud Native

Evolution of Ant Financial Service Mesh: Link Encryption, Adaptive Rate Limiting, Fine‑Grained Traffic Steering, and Service Self‑Healing

The article reviews how Ant Financial’s Service Mesh has evolved after its double‑11 rollout, detailing the implementation of link encryption, adaptive rate limiting, fine‑grained traffic steering, and self‑healing mechanisms that improve security, performance, and reliability across large‑scale microservice deployments.

Adaptive Rate Limiting · Cloud Native · Link Encryption
0 likes · 16 min read
DevOps
Aug 23, 2022 · Artificial Intelligence

Intelligent Automation Testing: Self‑Healing and Machine‑Learning Techniques

This article reviews the evolution of automated testing toward intelligent solutions, explaining self‑healing mechanisms, machine‑learning‑driven object recognition, computer‑vision and OCR approaches, industry tools such as Healenium and Airtest, and future prospects for zero‑code AI‑powered test automation.

AI · Computer Vision · OCR
0 likes · 13 min read
Baidu Intelligent Testing
Jun 30, 2022 · Operations

Intelligent Test Execution: Risk‑Based Manual Case Recommendation, Parallel‑Coverage Traffic Selection, Smart Build, Priority‑Based Task Scheduling, and UI Automation Self‑Healing

This article presents a comprehensive overview of intelligent test execution techniques, including risk‑based manual test case recommendation, parallel‑coverage traffic filtering, dynamic smart build strategies, priority‑driven task scheduling, and UI automation self‑healing, illustrating how these methods improve testing efficiency, coverage, and stability.

ci/cd · intelligent testing · risk-based recommendation
0 likes · 11 min read
Efficient Ops
Mar 28, 2022 · Operations

Zhejiang Mobile’s AI‑Driven Self‑Healing: Pioneering Intelligent Network Operations

This article examines the challenges of intelligent telecom network operation, presents Zhejiang Mobile’s AI‑powered self‑healing practice—including process re‑design, system reconstruction, talent transformation, and measurable results—and outlines the AIOps maturity model and future outlook for digital network management.

Digital Transformation · aiops · network automation
0 likes · 11 min read
HomeTech
Dec 30, 2021 · Operations

Open-falcon in Automotive Home: Application, Architecture, and Customizations

This article describes how the open‑falcon monitoring system is applied and customized at Automotive Home, covering its architecture, component roles, a comparison with other open‑source solutions, and the enhancements made for service‑tree based dynamic monitoring, alerting, self‑healing, and high‑availability deployment.

Open-Falcon · Operations · monitoring
0 likes · 11 min read
dbaplus Community
Jul 26, 2020 · Big Data

How Prometheus Powers Scalable Monitoring for Massive Big Data Clusters

Facing thousands of nodes in expanding big‑data clusters, the author evaluates legacy monitoring stacks, selects Prometheus + Alertmanager + Grafana, and details its architecture, custom exporters, real‑time alerts, self‑healing mechanisms, and visual dashboards that now support ten large clusters and dozens of services.

Alertmanager · Big Data · Grafana
0 likes · 11 min read
Alibaba Cloud Developer
Nov 13, 2019 · Cloud Native

Scaling Kubernetes at Ant Financial: The Kube‑on‑Kube Architecture Explained

Ant Financial’s article details how its large‑scale Kubernetes management system—built on a meta‑cluster, end‑state operators, and a Kube‑on‑Kube design—ensures reliable creation, upgrade, and self‑healing of thousands of nodes, while providing gray‑scale changes, risk assessment, and fault‑tolerant automation.

Cloud Native · Cluster Management · Kubernetes
0 likes · 15 min read
360 Zhihui Cloud Developer
Nov 5, 2019 · Operations

How 360 Scaled AIOps: From Data to Self‑Healing Operations

At the 360 Internet Technology Training Camp, experts detailed how AI-driven AIOps can transform large‑scale operations, covering data collection, model‑based anomaly detection, alert correlation, self‑healing workflows, and visual dashboards, and presented a practical end‑to‑end framework that other companies can adopt quickly.

Big Data · Operations · aiops
0 likes · 15 min read
360 Tech Engineering
Oct 31, 2019 · Operations

AIOps Implementation Practice at 360: Architecture, Models, and Automation

The article details 360's AIOps deployment, covering external speaker insights, internal architecture, data collection pipelines, AI models for resource recycling, alarm reduction, and correlation, as well as visualization dashboards, labeling platforms, and self‑healing mechanisms, illustrating a comprehensive AI‑driven operations framework.

AI Monitoring · Knowledge Graph · Operations Automation
0 likes · 14 min read
Alibaba Cloud Native
Oct 27, 2019 · Cloud Native

How Ant Financial Scales Kubernetes: Inside Their Kube‑on‑Kube Management System

This article explains how Ant Financial designs and operates a large‑scale, highly available Kubernetes management platform—using end‑state driven operators, custom CRDs, self‑healing mechanisms, and risk‑mitigation strategies—to reliably run thousands of nodes and dozens of business clusters in production.

Cluster Management · Kube-on-Kube · Kubernetes
0 likes · 15 min read
360 Tech Engineering
Sep 6, 2019 · Operations

StackStorm-Based ChatOps Solution for Automated Monitoring Alert Self‑Healing

This article introduces a StackStorm‑driven ChatOps framework that consolidates monitoring alerts, applies rule‑based root‑cause analysis, and automatically executes self‑healing actions, outlining its architecture, components, workflow definitions, and practical deployment results within an enterprise operations environment.

ChatOps · Operations Automation · StackStorm
0 likes · 6 min read
AntTech
Aug 15, 2019 · Cloud Native

Design and Implementation of Ant Financial’s Large‑Scale Kubernetes Cluster Management System

This article explains how Ant Financial designs a highly reliable, end‑state‑driven Kubernetes management platform that handles lifecycle operations, node self‑healing, and risk‑controlled changes for clusters with tens of thousands of nodes, using operators, custom resources, and a meta‑cluster architecture.

Cluster Management · Kubernetes · large scale
0 likes · 9 min read
58 Tech
Mar 25, 2019 · Operations

Alarm Convergence, Merging, and Self‑Healing in the 58 Monitoring Platform

The article describes how the 58 monitoring platform reduces alarm storms through alarm convergence, intelligent merging using Gini‑based decision trees, and automated self‑healing, thereby improving alert quality, cutting noise by about 70%, and helping engineers resolve incidents faster.

Operations · alarm convergence · alert merging
0 likes · 9 min read
Efficient Ops
Nov 27, 2018 · Operations

How Alibaba Automates Server Fault Detection and Self‑Healing at Scale

Alibaba’s massive data‑center operations face growing hardware failures, so they built the DAM (Dammo) platform that integrates Tianji management, predictive fault detection, automated remediation, and self‑balancing cluster reconstruction, achieving near‑complete hardware issue coverage and reducing manual intervention across hundreds of thousands of servers.

Operations · aiops · cloud computing
0 likes · 17 min read
Alibaba Cloud Developer
Nov 19, 2018 · Operations

How Alibaba Automates Hardware Fault Detection and Self‑Healing at Scale

This article explains how Alibaba’s massive data‑center operations detect hardware failures early, automatically isolate faulty servers, and execute self‑healing workflows through a centralized, cloud‑native platform, detailing detection methods, convergence rules, architecture evolution, and the benefits of a closed‑loop AIOps system.

Operations · aiops · cloud-native
0 likes · 15 min read
MaGe Linux Operations
May 16, 2018 · Operations

How to Build an Automated Fault‑Healing System for Enterprise Ops

This article explores the end‑to‑end design of an enterprise‑grade fault‑self‑healing solution, covering the basic workflow, abstraction of alert handling, CMDB‑based resource mapping, internal gateway integration, monitoring platform adapters like Zabbix and Open‑Falcon, convergence logic, complex alarm orchestration, and the overall technical architecture.

CMDB · aiops · fault automation
0 likes · 9 min read
Suning Technology
Nov 20, 2017 · Big Data

How ZEUS Turns Monitoring Data into Automated Decisions for Enterprise Systems

ZEUS, Suning’s decision analysis platform, integrates monitoring data from tools like Baymax and HIRO, applies CEP aggregation and Drools rule evaluation, and leverages big‑data storage and machine‑learning models to automatically identify root causes, provide real‑time alerts, and enable self‑healing in large‑scale distributed systems.

Big Data · decision analysis · rule engine
0 likes · 14 min read
Qunar Tech Salon
Jun 16, 2017 · Operations

OpsRobot: Chatbot‑Based Operations Automation Platform Overview

OpsRobot integrates development tools into a chat‑based interface, using custom plugins and APIs to automate low‑efficiency, error‑prone operational tasks, thereby streamlining workflows, improving efficiency, and enabling future capabilities such as self‑healing and automated scaling.

Chatbot · Ops Automation · api-gateway
0 likes · 5 min read
Efficient Ops
Apr 19, 2016 · Operations

How Tencent’s Blue Whale Powers Unattended Ops, SaaS Automation, and DevOps Value

The talk outlines Tencent’s Blue Whale platform, describing how automated publishing tools, unattended change processes, fault‑handling strategies, alert‑driven self‑healing, low‑cost tool culture, and a thriving DevOps ecosystem together transform operations from routine maintenance to high‑value, scalable services.

DevOps · SaaS · Tool Culture
0 likes · 12 min read