Tagged articles

self‑healing

58 articles · Page 1 of 1

Jun 14, 2026 · Artificial Intelligence

Claude Code’s New Self‑Healing Feature Crushes Six Developer Nightmares

Claude Code’s latest overhaul tackles six common developer frustrations—terminal flicker, silent stalls, cryptic errors, context deadlocks, flaky connections, and session crashes—by introducing a full‑screen TUI, streaming reasoning, clearer error messages, smarter context compression, a hardened MCP layer, and automatic self‑healing.

AI programming · Claude Code · MCP robustness

0 likes · 7 min read

AI Engineering

Jun 6, 2026 · Artificial Intelligence

Introducing /supergoal: A Planning Layer for Claude Code and Codex CLI

The /supergoal plugin adds a planning and self‑healing layer to Claude Code’s /goal engine, automatically generating specifications, risk analysis, staged roadmaps, and audit checks, so developers can issue a single high‑level command and let the system handle task decomposition, execution, and verification.

Automation · Claude Code · ai programming assistant

0 likes · 7 min read

ITPUB

May 11, 2026 · Databases

What Human Evolution Teaches About IT Architecture Trade‑offs (Ahead of the 2026 SACC)

The article draws a detailed analogy between millions of years of human evolution—standing up, shedding hair, expanding the brain, and recruiting ancient bacteria—and modern IT architecture, showing how each design choice brings hidden costs, why perfect systems are impossible, and how embracing trade‑offs, extensions, and continuous iteration can lead to resilient, self‑healing databases.

Cloud Native · Databases · architecture

0 likes · 22 min read

DeepHub IMBA

May 7, 2026 · Frontend Development

Self‑Healing Playwright Tests with LLM‑Driven Locator Recovery

This article shows how to combine Playwright with an LLM (Groq) to build a self‑healing test framework that detects broken selectors, extracts a trimmed DOM snapshot, asks the model for a replacement locator, validates confidence, caches results, and integrates the logic via a Playwright fixture.

Groq · JavaScript · LLM

0 likes · 17 min read

FunTester

Apr 28, 2026 · Operations

How Self‑Healing Automation Platforms Transform SRE Practices

The article explains how a self‑healing platform improves SRE reliability by reducing MTTR, preserving error‑budget, automating high‑impact incident remediation, enforcing safety guardrails, and shifting team focus from firefighting to sustainable reliability engineering.

Automation · Error Budget · MTTR

0 likes · 10 min read

FunTester

Apr 27, 2026 · Operations

Why Relying on Humans for Incident Recovery Fails and How Self‑Healing Automation Platforms Help

The article explains that large‑scale incidents overwhelm on‑call engineers who must manually piece together context from countless signals, and shows how a self‑healing automation platform can take over repetitive, known failure patterns, verify fixes, and reduce fatigue while keeping humans in the loop for oversight.

Automation · Operations · Platform Engineering

0 likes · 8 min read

Ray's Galactic Tech

Apr 24, 2026 · Backend Development

Self‑Healing Agents: Rebuilding a High‑Concurrency Travel System with Spring AI ReAct

This article details how a legacy travel‑booking service was transformed into a production‑grade, self‑healing agent system using Spring AI ReAct and multi‑tool coordination, covering architectural redesign, tool governance, error semantics, high‑concurrency safeguards, observability, security, and real‑world performance gains.

Agent · High concurrency · ReAct

0 likes · 31 min read

Woodpecker Software Testing

Apr 24, 2026 · Operations

Self-Healing UI Test Scripts: Boost Performance and Reliability

The article explains how fragile UI automation scripts hinder performance testing and shows a three‑layer self‑healing approach using Playwright and Python that reduces script failures, cuts maintenance time, and integrates with monitoring to quickly detect UI performance issues.

Automation · Monitoring · Playwright

0 likes · 7 min read

Woodpecker Software Testing

Apr 18, 2026 · Operations

Deep Dive into Performance Optimization for Self‑Healing Test Scripts

The article examines why self‑healing test scripts increase runtime overhead, breaks down the underlying mechanisms, and presents four concrete optimization tactics—layered healing, locator caching, visual/semantic throttling, and asynchronous repair—backed by real‑world case data showing up to 43% faster regressions and 52% lower maintenance cost.

CI/CD · Performance Optimization · UI testing

0 likes · 8 min read

StarRocks

Apr 16, 2026 · Databases

Why Traditional Databases Stall AI Agents—and How StarRocks Overcomes the Bottleneck

Traditional databases were built for low‑frequency, human‑driven queries, but AI agents generate dozens of concurrent, sub‑second queries that expose architectural limits, and StarRocks addresses these challenges with self‑healing optimization, real‑time data pipelines, extreme concurrency handling, and seamless lakehouse access.

Database Concurrency · Lakehouse · Query Optimization

0 likes · 13 min read

DevOps Coach

Apr 15, 2026 · Cloud Computing

How We Scaled to 6,000 AWS Accounts with a 3‑Engineer Team: A Self‑Healing Automation Blueprint

This article details how a SaaS platform transformed its AWS multi‑account management from manual, toil‑heavy processes to a fully automated, self‑healing system that now handles over 6,000 accounts with just three engineers, achieving sub‑5‑minute provisioning, 99.8% compliance, and massive cost savings.

AWS · Automation · Multi-Account

0 likes · 15 min read

Woodpecker Software Testing

Apr 15, 2026 · Artificial Intelligence

How AI Testing Tools Redefine Performance Optimization: A New Paradigm

Amid exploding large‑model deployments, AI teams struggle with slow test feedback, but AI‑native testing tools—through intelligent load modeling, inference‑layer root‑cause analysis, and self‑healing loops—demonstrate concrete latency reductions, resource savings, and faster issue remediation.

AI testing · MLOps · Observability

0 likes · 6 min read

Woodpecker Software Testing

Mar 18, 2026 · Operations

How Self‑Healing UI Test Scripts Boost Performance Testing Reliability

The article explains why traditional UI automation scripts break under high‑load performance testing and presents a deterministic, three‑level self‑healing framework (locator elasticity, timing adaptation, and flexible assertions) implemented with Python and Playwright in a banking transaction system, raising script stability from 41% to 96.5% at 5k TPS.

JMeter · Playwright · Python

0 likes · 8 min read

Woodpecker Software Testing

Mar 17, 2026 · Operations

How Self‑Healing Test Scripts Make UI Automation Truly Live

The article explains why traditional UI automation scripts break on minor UI changes, introduces self‑healing test scripts that combine detection, analysis, repair and verification layers, compares commercial, framework‑enhanced and in‑house implementations, and outlines three common pitfalls to avoid for reliable, resilient test automation.

CI · Playwright · UI automation

0 likes · 9 min read

ShiZhen AI

Mar 11, 2026 · Artificial Intelligence

Build Persistent AI Agents with OpenClaw: A 40‑Day Hands‑On Guide

This article details a 40‑day workflow for creating and evolving eight continuous‑running OpenClaw AI agents using a three‑layer markdown file system—Identity, Operations, and Knowledge—showing how to give agents long‑term memory, self‑healing checks, and coordinated collaboration without databases or message queues.

AI agents · Agent Coordination · OpenClaw

0 likes · 17 min read

AI Tech Publishing

Mar 9, 2026 · Artificial Intelligence

How to Build a 24‑7 Autonomous AI Agent Team with OpenClaw

This guide walks through setting up a continuously running AI Agent Team using OpenClaw, covering hardware choices, installation, file structure, agent roles, coordination via markdown files, scheduling, self‑healing cron jobs, security, cost, troubleshooting, and step‑by‑step recommendations for incremental deployment.

AI agents · Automation · Cron scheduling

0 likes · 20 min read

Woodpecker Software Testing

Mar 3, 2026 · Operations

Self-Healing Test Scripts: End Frequent Maintenance Hassles

The article explains how self‑healing test scripts, built on observable snapshots, strategy libraries, and lightweight decision engines, can automatically detect UI changes, diagnose locator failures, and apply semantic or visual fixes, dramatically reducing maintenance time and manual intervention in fast‑paced continuous delivery environments.

Observability · Python · Selenium

0 likes · 7 min read

Huolala Tech

Feb 4, 2026 · Artificial Intelligence

How AI Self‑Healing Transforms Mobile UI Automation Testing

This article examines the challenges of manual mobile UI testing, introduces AI‑driven self‑healing techniques that combine multimodal perception, visual models and semantic analysis, and details the architecture, diagnostic workflow, smart popup handling, change‑aware engines, practical results and future directions.

AI · Multimodal · UI automation

0 likes · 15 min read

Raymond Ops

Jan 28, 2026 · Artificial Intelligence

From Alert Storms to Smart Ops: Unlocking AIOps for Modern IT Operations

This guide walks through the evolution from noisy alert storms to intelligent AIOps, covering AIOps fundamentals, why it matters now, core capabilities like anomaly detection, root‑cause analysis, capacity forecasting and self‑healing, a practical implementation roadmap, toolchain suggestions, common pitfalls, and future trends.

AIOps · Anomaly Detection · Root Cause Analysis

0 likes · 22 min read

Huawei Cloud Developer Alliance

Oct 16, 2025 · Operations

How HyperRouter Enables Deterministic Operations for L4 Load Balancing

This article explains how Huawei Cloud's HyperRouter implements deterministic operations through a combination of L4/L7 load‑balancing co‑design, high‑performance data‑plane choices, self‑healing mechanisms, point‑to‑point architecture, Cell + Shuffle‑Sharding isolation, and user‑centric observability, providing a reproducible blueprint for reliable cloud services.

Cloud Native · DPDK · Observability

0 likes · 17 min read

Linux Ops Smart Journey

Oct 10, 2025 · Operations

How to Detect and Auto‑Heal Node Failures in Kubernetes with Node Problem Detector

This article explains why Kubernetes nodes need deeper health monitoring, introduces the Node Problem Detector (NPD) component, outlines its detection methods, and provides step‑by‑step instructions to deploy, configure, and verify NPD for automatic alerts and self‑healing in a cluster.

Cluster Monitoring · Kubernetes · Node Problem Detector

0 likes · 8 min read

MaGe Linux Operations

Sep 12, 2025 · Operations

From Alert Storms to Intelligent Ops: A Practical AIOps Journey

This article explores how AIOps transforms traditional IT operations by using AI for anomaly detection, root‑cause analysis, capacity forecasting, and self‑healing, offering a step‑by‑step roadmap, real‑world code examples, toolchain recommendations, common pitfalls, and future trends for building intelligent, automated operations.

AIOps · Anomaly Detection · Root Cause Analysis

0 likes · 24 min read

Qunar Tech Salon

Sep 1, 2025 · Databases

Redesigning Database Monitoring: From Push to Pull for Smarter Alerts

This article analyzes the shortcomings of the legacy database monitoring system, explains the transition from a push‑based to a pull‑based architecture, outlines comprehensive metric collection, intelligent alert strategies, and self‑healing mechanisms, and showcases the performance improvements achieved with the new solution.

Alerting · Database Monitoring · metric collection

0 likes · 25 min read

Advanced AI Application Practice

Aug 19, 2025 · Frontend Development

How AI Overcomes Enterprise UI Automation Testing Pain Points

The article examines the inherent drawbacks of traditional UI automation—selector dependence, fragility, extra development overhead, limited support for Canvas/SVG, unreadable reports, and steep learning curves—and shows how the AI‑driven Midscene.js framework addresses each issue with semantic element location, intelligent fault tolerance, zero‑code instrumentation, multimodal element recognition, business‑semantic reporting, and flexible development modes, outperforming conventional tools like Browser Use.

AI testing · Browser Use · Midscene.js

0 likes · 10 min read

MaGe Linux Operations

Jul 30, 2025 · Operations

How to Achieve Zero‑Downtime Self‑Healing on 10,000 Servers with ansible‑pull

Discover how to use Ansible’s ansible‑pull mode to let thousands of servers autonomously detect and fix configuration drift, achieve zero‑downtime repairs, and scale self‑healing automation with Git‑based playbooks, smart execution strategies, monitoring integration, and performance optimizations.

Ansible · Pull Mode · configuration management

0 likes · 15 min read

php Courses

Jul 3, 2025 · Backend Development

Can PHP Code Self‑Repair? Building a Self‑Healing Error Recovery System

This article explains how PHP can automatically rewrite its own code at runtime to monitor, analyze, and fix errors, offering a proactive alternative to traditional try‑catch error handling and improving system robustness.

Error handling · runtime code modification · self‑healing

0 likes · 6 min read

Cognitive Technology Team

Nov 14, 2024 · Operations

Designing Self‑Healing Applications for Fault Tolerance in Distributed Systems

To ensure distributed applications can recover automatically from hardware, network, or service failures, this guide outlines three core capabilities—fault detection, graceful handling, and monitoring—plus practical strategies such as asynchronous component separation, retries, circuit breakers, isolation, load shedding, failover, compensation, checkpointing, graceful degradation, rate limiting, leader election, fault injection, chaos engineering, and use of availability zones.

Cloud Native · Operations · distributed systems

0 likes · 7 min read

ByteDance SYS Tech

May 9, 2024 · Operations

How Large‑Model Agents Transform AIOps: From Automation to Self‑Healing Operations

The presentation explains how large‑model agents empower AIOps by automating routine tasks, enhancing anomaly detection, fault diagnosis, and remediation, while outlining architectural components, multi‑agent collaboration, and future directions for building self‑healing, observability‑driven operations platforms.

AIOps · Agent · Observability

0 likes · 15 min read

Efficient Ops

Nov 8, 2023 · Operations

How Intelligent Operations (AIOps) Transforms IT Management and Self‑Healing

This article explains what intelligent operations (AIOps) are, outlines a four‑layer platform architecture, and showcases real‑world practices such as load‑balancing link repair, MySQL container self‑healing, composite service tracing, component‑based orchestration, and AI‑driven log analysis, concluding with future prospects.

AIOps · Automation · IT Operations

0 likes · 7 min read

21CTO

Jun 18, 2023 · Artificial Intelligence

Can AI Self‑Healing Code Revolutionize Software Development?

The article explores how generative AI and large language models are enabling automated code creation, self‑repair, and continuous‑integration workflows, while highlighting challenges in code quality, industry experiments at Google and Stack Overflow, and the future impact on developers and software engineering practices.

AI · CI/CD · code generation

0 likes · 12 min read

Ops Development Stories

Apr 19, 2023 · Operations

Mastering Alert Management with Nightingale: Rules, Silencing, Escalation, and Self‑Healing

Learn how to efficiently configure Nightingale’s alert rules, silence unwanted alerts, set up escalation policies, and implement self‑healing scripts using ibex, with step‑by‑step guidance, screenshots, and practical tips for robust monitoring in cloud‑native environments.

Nightingale · Operations · ibex

0 likes · 11 min read

Baidu Geek Talk

Mar 29, 2023 · Cloud Native

Punica: A Cloud‑Native Platform for Content Understanding Inference Services

Punica provides a cloud‑native, one‑stop platform that unifies Baidu’s content‑understanding inference services, automates testing, resource provisioning, and monitoring, and enables unattended, self‑healing operations with dynamic scaling and GPU scheduling, cutting onboarding time by half and reclaiming hundreds of GPUs.

AI inference · Inference Platform · Resource Scheduling

0 likes · 14 min read

Tencent Cloud Developer

Dec 26, 2022 · Cloud Native

Challenges and Optimization Strategies for Containerized Deployment of Online Services on Kubernetes

Tencent’s shift from VMs to Kubernetes for massive online services faces pod‑size rigidity, heterogeneous node balancing, elastic scaling, and massive cluster‑pool mapping, prompting optimizations such as dynamic CPU compression, custom load‑aware scheduling, collaborative HPA/VPA scaling, dynamic quota migration, unified routing‑sync, and an automated decision‑tree‑driven self‑healing workflow for container‑destruction failures.

Dynamic Scheduling · Kubernetes · containerization

0 likes · 12 min read

Baidu Geek Talk

Dec 20, 2022 · Industry Insights

How AI‑Powered Fault Localization Transforms Automated Testing at Scale

This article explores Baidu's intelligent testing practices, covering spectrum‑based root‑cause localization, error‑code driven build‑system diagnostics, revenue‑change stop‑loss decision workflows, and search UI case‑level tracing, illustrating how data, algorithms, and engineering combine to reduce manual effort and accelerate issue resolution.

Fault Localization · Operations · automated testing

0 likes · 10 min read

Alibaba Cloud Big Data AI Platform

Dec 9, 2022 · Operations

How Alibaba’s Flink Cluster Inspector Eliminates Hotspot Machines in Real‑Time Streaming

This article details Alibaba Cloud's Flink Cluster Inspector, explaining the business challenges of hotspot machines, the analysis of resource over‑use, and the four‑stage solution—pre‑profiling, in‑process self‑healing, post‑recovery, and observability—that reduces latency, cuts costs, and improves operational efficiency.

Flink · HotSpot · Operations

0 likes · 19 min read

Top Architect

Nov 12, 2022 · Cloud Native

Evolution of Ant Financial Service Mesh: Link Encryption, Adaptive Rate Limiting, Fine‑Grained Traffic Steering, and Service Self‑Healing

The article reviews how Ant Financial’s Service Mesh has evolved after its double‑11 rollout, detailing the implementation of link encryption, adaptive rate limiting, fine‑grained traffic steering, and self‑healing mechanisms that improve security, performance, and reliability across large‑scale microservice deployments.

Adaptive Rate Limiting · Cloud Native · Link Encryption

0 likes · 16 min read

DevOps

Aug 23, 2022 · Artificial Intelligence

Intelligent Automation Testing: Self‑Healing and Machine‑Learning Techniques

This article reviews the evolution of automated testing toward intelligent solutions, explaining self‑healing mechanisms, machine‑learning‑driven object recognition, computer‑vision and OCR approaches, industry tools such as Healenium and Airtest, and future prospects for zero‑code AI‑powered test automation.

AI · Automation testing · OCR

0 likes · 13 min read

Baidu Intelligent Testing

Jun 30, 2022 · Operations

Intelligent Test Execution: Risk‑Based Manual Case Recommendation, Parallel‑Coverage Traffic Selection, Smart Build, Priority‑Based Task Scheduling, and UI Automation Self‑Healing

This article presents a comprehensive overview of intelligent test execution techniques, including risk‑based manual test case recommendation, parallel‑coverage traffic filtering, dynamic smart build strategies, priority‑driven task scheduling, and UI automation self‑healing, illustrating how these methods improve testing efficiency, coverage, and stability.

CI/CD · Intelligent Testing · risk-based recommendation

0 likes · 11 min read

Efficient Ops

Mar 28, 2022 · Operations

Zhejiang Mobile’s AI‑Driven Self‑Healing: Pioneering Intelligent Network Operations

This article examines the challenges of intelligent telecom network operation, presents Zhejiang Mobile’s AI‑powered self‑healing practice—including process re‑design, system reconstruction, talent transformation, and measurable results—and outlines the AIOps maturity model and future outlook for digital network management.

AIOps · Telecom · digital transformation

0 likes · 11 min read

HomeTech

Dec 30, 2021 · Operations

Open-falcon in Automotive Home: Application, Architecture, and Customizations

This article describes how the open‑falcon monitoring system is applied and customized at Automotive Home, covering its architecture, component roles, a comparison with other open‑source solutions, and the enhancements made for service‑tree based dynamic monitoring, alerting, self‑healing, and high‑availability deployment.

Monitoring · Open-Falcon · Operations

0 likes · 11 min read

Alibaba Cloud Native

Mar 10, 2021 · Cloud Native

How Alibaba’s KubeNode Transforms Massive Node Operations with Cloud‑Native Operators

Alibaba’s KubeNode platform tackles the challenges of massive, heterogeneous node fleets by using Kubernetes CRDs and custom operators to provide declarative lifecycle management, automated component upgrades, and rapid fault self‑healing across hundreds of clusters and millions of containers.

Alibaba Cloud · KubeNode · Kubernetes Operators

0 likes · 13 min read

dbaplus Community

Jul 26, 2020 · Big Data

How Prometheus Powers Scalable Monitoring for Massive Big Data Clusters

Facing thousands of nodes in expanding big‑data clusters, the author evaluates legacy monitoring stacks, selects Prometheus + Alertmanager + Grafana, and details its architecture, custom exporters, real‑time alerts, self‑healing mechanisms, and visual dashboards that now support ten large clusters and dozens of services.

Alertmanager · Big Data · grafana

0 likes · 11 min read

Alibaba Cloud Developer

Nov 13, 2019 · Cloud Native

Scaling Kubernetes at Ant Financial: The Kube‑on‑Kube Architecture Explained

Ant Financial’s article details how its large‑scale Kubernetes management system—built on a meta‑cluster, end‑state operators, and a Kube‑on‑Kube design—ensures reliable creation, upgrade, and self‑healing of thousands of nodes, while providing gray‑scale changes, risk assessment, and fault‑tolerant automation.

Cloud Native · Kubernetes · Large Scale

0 likes · 15 min read

360 Zhihui Cloud Developer

Nov 5, 2019 · Operations

How 360 Scaled AIOps: From Data to Self‑Healing Operations

At the 360 Internet Technology Training Camp, experts detailed how AI-driven AIOps can transform large‑scale operations, covering data collection, model‑based anomaly detection, alert correlation, self‑healing workflows, and visual dashboards, and presented a practical end‑to‑end framework that other companies can adopt quickly.

AIOps · Big Data · Operations

0 likes · 15 min read

dbaplus Community

Nov 4, 2019 · Cloud Native

How Ant Financial Scales Kubernetes: Inside Their Cloud‑Native Cluster Management System

This article explains how Ant Financial designs and operates a large‑scale, highly available Kubernetes management platform—detailing its architecture, core operators, desired‑state controllers, fault‑self‑healing mechanisms, risk mitigation, and practical Q&A for production environments.

Automation · Cloud Native · Kubernetes

0 likes · 16 min read

360 Tech Engineering

Oct 31, 2019 · Operations

AIOps Implementation Practice at 360: Architecture, Models, and Automation

The article details 360's AIOps deployment, covering external speaker insights, internal architecture, data collection pipelines, AI models for resource recycling, alarm reduction, and correlation, as well as visualization dashboards, labeling platforms, and self‑healing mechanisms, illustrating a comprehensive AI‑driven operations framework.

AI monitoring · AIOps · Incident Management

0 likes · 14 min read

Alibaba Cloud Native

Oct 27, 2019 · Cloud Native

How Ant Financial Scales Kubernetes: Inside Their Kube‑on‑Kube Management System

This article explains how Ant Financial designs and operates a large‑scale, highly available Kubernetes management platform—using end‑state driven operators, custom CRDs, self‑healing mechanisms, and risk‑mitigation strategies—to reliably run thousands of nodes and dozens of business clusters in production.

Kube-on-Kube · Kubernetes · Large Scale

0 likes · 15 min read

360 Tech Engineering

Sep 6, 2019 · Operations

StackStorm-Based ChatOps Solution for Automated Monitoring Alert Self‑Healing

This article introduces a StackStorm‑driven ChatOps framework that consolidates monitoring alerts, applies rule‑based root‑cause analysis, and automatically executes self‑healing actions, outlining its architecture, components, workflow definitions, and practical deployment results within an enterprise operations environment.

ChatOps · Monitoring · Operations Automation

0 likes · 6 min read

AntTech

Aug 15, 2019 · Cloud Native

Design and Implementation of Ant Financial’s Large‑Scale Kubernetes Cluster Management System

This article explains how Ant Financial designs a highly reliable, end‑state‑driven Kubernetes management platform that handles lifecycle operations, node self‑healing, and risk‑controlled changes for clusters with tens of thousands of nodes, using operators, custom resources, and a meta‑cluster architecture.

Kubernetes · Large Scale · cluster management

0 likes · 9 min read

58 Tech

Mar 25, 2019 · Operations

Alarm Convergence, Merging, and Self‑Healing in the 58 Monitoring Platform

The article describes how the 58 monitoring platform reduces alarm storms through alarm convergence, intelligent merging using Gini‑based decision trees, and automated self‑healing, thereby improving alert quality, cutting noise by about 70%, and helping engineers resolve incidents faster.

Monitoring · Operations · alarm convergence

0 likes · 9 min read

Efficient Ops

Nov 27, 2018 · Operations

How Alibaba Automates Server Fault Detection and Self‑Healing at Scale

Alibaba’s massive data‑center operations face growing hardware failures, so they built the DAM (Dammo) platform that integrates Tianji management, predictive fault detection, automated remediation, and self‑balancing cluster reconstruction, achieving near‑complete hardware issue coverage and reducing manual intervention across hundreds of thousands of servers.

AIOps · Cloud Computing · Operations

0 likes · 17 min read

Alibaba Cloud Developer

Nov 19, 2018 · Operations

How Alibaba Automates Hardware Fault Detection and Self‑Healing at Scale

This article explains how Alibaba’s massive data‑center operations detect hardware failures early, automatically isolate faulty servers, and execute self‑healing workflows through a centralized, cloud‑native platform, detailing detection methods, convergence rules, architecture evolution, and the benefits of a closed‑loop AIOps system.

AIOps · Operations · cloud-native

0 likes · 15 min read

Efficient Ops

Jun 13, 2018 · Operations

Designing an Effective CMDB: Boost Ops Efficiency, Alert Convergence & Self‑Healing

This article explains how a well‑designed CMDB abstracts and models operational objects, categorizes business, hardware, application and custom data, and enables alert convergence and automated fault‑healing, dramatically improving DevOps efficiency and reliability.

CMDB · Infrastructure Automation · alert convergence

0 likes · 7 min read

MaGe Linux Operations

May 16, 2018 · Operations

How to Build an Automated Fault‑Healing System for Enterprise Ops

This article explores the end‑to‑end design of an enterprise‑grade fault‑self‑healing solution, covering the basic workflow, abstraction of alert handling, CMDB‑based resource mapping, internal gateway integration, monitoring platform adapters like Zabbix and Open‑Falcon, convergence logic, complex alarm orchestration, and the overall technical architecture.

AIOps · CMDB · Monitoring

0 likes · 9 min read

Suning Technology

Nov 20, 2017 · Big Data

How ZEUS Turns Monitoring Data into Automated Decisions for Enterprise Systems

ZEUS, Suning’s decision analysis platform, integrates monitoring data from tools like Baymax and HIRO, applies CEP aggregation and Drools rule evaluation, and leverages big‑data storage and machine‑learning models to automatically identify root causes, provide real‑time alerts, and enable self‑healing in large‑scale distributed systems.

Big Data · Rule Engine · decision analysis

0 likes · 14 min read

MaGe Linux Operations

Nov 18, 2017 · Operations

Automate Incident Response with BlueKing Fault Self‑Healing and Zabbix

This article shares a hands‑on guide to using BlueKing's Fault Self‑Healing (FTA) platform with Zabbix, detailing benefits, integration steps, package creation, convergence rules, job‑script linking, and real‑world case studies that dramatically reduce manual alert handling time.

BlueKing · Operations · Zabbix

0 likes · 8 min read

Qunar Tech Salon

Jun 16, 2017 · Operations

OpsRobot: Chatbot‑Based Operations Automation Platform Overview

OpsRobot integrates development tools into a chat‑based interface, using custom plugins and APIs to automate low‑efficiency, error‑prone operational tasks, thereby streamlining workflows, improving efficiency, and enabling future capabilities such as self‑healing and automated scaling.

API Gateway · Chatbot · Ops Automation

0 likes · 5 min read

Efficient Ops

Apr 19, 2016 · Operations

How Tencent’s Blue Whale Powers Unattended Ops, SaaS Automation, and DevOps Value

The talk outlines Tencent’s Blue Whale platform, describing how automated publishing tools, unattended change processes, fault‑handling strategies, alert‑driven self‑healing, low‑cost tool culture, and a thriving DevOps ecosystem together transform operations from routine maintenance to high‑value, scalable services.

SaaS · Tool Culture · devops

0 likes · 12 min read
