Tagged articles

fault management

13 articles

Jul 21, 2024 · Operations

Mastering Backend Stability: 7 Essential Practices for High Availability

This comprehensive guide outlines the seven key pillars—operations, high‑availability architecture, capacity governance, change management, risk governance, fault management, and chaos engineering—that together form a systematic approach to building and maintaining a reliable, 24‑hour backend system.

Change Management · High Availability · Operations

0 likes · 40 min read

Efficient Ops

Jul 7, 2024 · Operations

Boost Business Continuity and IT System Stability: Practical Strategies

This article explains business continuity concepts, outlines the risks to IT system stability, and provides actionable steps—such as expanding monitoring coverage, improving fault detection, enhancing architecture resilience, and strengthening emergency coordination—to ensure continuous operation despite inevitable failures.

Disaster Recovery · Monitoring · business continuity

0 likes · 7 min read

dbaplus Community

Jan 14, 2024 · Operations

How AI-Driven Event Intelligence Transforms Data Center Fault Management

The article explains the design and functionality of an AI‑enhanced intelligent event analysis system that automates fault identification, analysis, and remediation in data‑center operations. It details the system's architecture; its integration with monitoring, CMDB, ITSM, and big‑data platforms; and the AI techniques that enable automatic modeling, clustering, and knowledge‑base retrieval.

AI · Automation · Big Data

0 likes · 18 min read

Code Ape Tech Column

Jul 26, 2023 · Operations

Service Governance: Monitoring, Fault Management, Release and Capacity Planning

This article explains how to achieve 24/7 service availability through comprehensive monitoring, fault handling, release management, and capacity planning, covering alarm types, batch processing, traffic and resource metrics, fault causes and mitigation, deployment strategies, scaling commands, and service degradation techniques.

Service Governance · capacity planning · fault management

0 likes · 20 min read

Alibaba Cloud Developer

Aug 30, 2022 · Operations

Mastering Cloud Business Stability: Proven Methods & Real-World Cases

This whitepaper presents a comprehensive methodology for ensuring cloud‑based business stability, covering conceptual frameworks, fault‑management processes, change‑control standards, and detailed industry case studies such as new‑game launches, container deployments, live‑event streaming, and high‑availability architecture design.

Case Studies · High Availability · change control

0 likes · 2 min read

NetEase Game Operations Platform

Apr 23, 2022 · Artificial Intelligence

Design and Implementation of an AI‑Driven Intelligent Operations Platform for Game Services

The article presents a comprehensive overview of an AI‑ops platform for game operations, covering its background, roadmap, team structure, business scenarios, anomaly‑detection techniques, platform architecture, detection workflow, model deployment, and intelligent fault‑management strategies.

Intelligent Operations · fault management · platform architecture

0 likes · 20 min read

Efficient Ops

Dec 25, 2021 · Artificial Intelligence

How Zhejiang Mobile’s AIOps Achieved National‑Level Excellence in Fault Management

The article explains AIOps fundamentals, details Zhejiang Mobile’s successful assessment in the national AIOps capability maturity model, shares insights from an interview with the company’s network‑management deputy director, and outlines future plans and industry recommendations for AI‑driven IT operations.

AIOps · Artificial Intelligence · Capability Maturity Model

0 likes · 9 min read

dbaplus Community

Nov 23, 2020 · Operations

Mastering Fault Management: Building a Robust SRE Stability Framework

This article outlines a comprehensive SRE fault‑management framework, covering core responsibilities, stability metrics such as MTBF and MTTR, detailed pre‑, during‑, and post‑incident processes, monitoring, capacity planning, disaster‑recovery, error budgeting, organizational support, and future trends like AIOps and chaos engineering.

Error Budget · MTBF · MTTR

0 likes · 30 min read

dbaplus Community

Jul 13, 2020 · Operations

14 Expert Q&A on Building an Effective SRE System for Fault Management

In this detailed Q&A, a Meitu SRE leader explains the relationship between DevOps and SRE, shares practical advice on team composition, monitoring, alerting, fault‑prevention design, and provides step‑by‑step guidance using Grafana, draw.io, and other tools to help organizations build reliable services.

DevOps · SRE · Site Reliability Engineering

0 likes · 10 min read

Meituan Technology Team

Oct 26, 2017 · Operations

Evolution of Payment Channel Automation Management at Meituan-Dianping

Meituan‑Dianping’s payment team progressed from manual fault alerts to a fully automated channel management system that detects failures, disables affected banks, conducts controlled ramp‑up tests, and restores service, dramatically cutting response times, manpower costs, and secondary‑failure risks while boosting overall availability.

Monitoring · Operations · Routing

0 likes · 14 min read

Architecture Digest

Sep 17, 2017 · R&D Management

Comprehensive R&D Management Practices: Task Management, Documentation, Code Collaboration, QA, Deployment, and Fault Handling

This article presents a detailed, experience‑driven guide to building an efficient R&D management system covering the product lifecycle, task management, documentation, code collaboration, quality assurance, automated deployment, fault management, instant communication, and techniques for continuous technical improvement.

Code Collaboration · Task Management · deployment

0 likes · 23 min read

Efficient Ops

Aug 16, 2017 · Operations

How Qunar Built an Automated Hardware Operations Platform to Boost Efficiency

This article details Qunar's end‑to‑end hardware automation system, covering background challenges, lifecycle management, automated testing, data collection, fault detection, and visualized monitoring, and explains how the integrated platform reduces manual effort, improves reliability, and cuts operational costs.

CMDB · Monitoring · Operations

0 likes · 22 min read

MaGe Linux Operations

Apr 24, 2015 · Operations

10 Proven Fault Management Practices Every Ops Team Should Master

This guide shares ten practical fault‑management techniques—ranging from proactive attitude and prioritizing incidents to continuous follow‑up and team collaboration—to help operations teams reduce damage, maintain service reliability, and keep users engaged during outages.

Operations · best practices · fault management

0 likes · 8 min read
