Tagged articles
13 articles
Architecture and Beyond
Jul 21, 2024 · Operations

Mastering Backend Stability: 7 Essential Practices for High Availability

This comprehensive guide outlines seven key pillars (operations, high‑availability architecture, capacity governance, change management, risk governance, fault management, and chaos engineering) that together form a systematic approach to building and maintaining a reliable backend system that runs around the clock.

Operations · backend stability · capacity planning
40 min read
Efficient Ops
Jul 7, 2024 · Operations

Boost Business Continuity and IT System Stability: Practical Strategies

This article explains business continuity concepts, outlines the risks to IT system stability, and provides actionable steps—such as expanding monitoring coverage, improving fault detection, enhancing architecture resilience, and strengthening emergency coordination—to ensure continuous operation despite inevitable failures.

business continuity · disaster recovery · fault management
7 min read
dbaplus Community
Jan 14, 2024 · Operations

How AI-Driven Event Intelligence Transforms Data Center Fault Management

The article explains the design and functionality of an AI‑enhanced intelligent event analysis system that automates fault identification, analysis, and remediation in data‑center operations. It details the system's architecture, its integration with monitoring, CMDB, ITSM, and big‑data platforms, and the AI techniques that enable automatic modeling, clustering, and knowledge‑base retrieval.

AI · Automation · Big Data
18 min read
Code Ape Tech Column
Jul 26, 2023 · Operations

Service Governance: Monitoring, Fault Management, Release and Capacity Planning

This article explains how to achieve 24/7 service availability through comprehensive monitoring, fault handling, release management, and capacity planning, covering alarm types, batch processing, traffic and resource metrics, fault causes and mitigation, deployment strategies, scaling commands, and service degradation techniques.

capacity planning · fault management · release-management
20 min read
Alibaba Cloud Developer
Aug 30, 2022 · Operations

Mastering Cloud Business Stability: Proven Methods & Real-World Cases

This whitepaper presents a comprehensive methodology for ensuring cloud‑based business stability, covering conceptual frameworks, fault‑management processes, change‑control standards, and detailed industry case studies such as new‑game launches, container deployments, live‑event streaming, and high‑availability architecture design.

Case Studies · change control · cloud stability
2 min read
NetEase Game Operations Platform
Apr 23, 2022 · Artificial Intelligence

Design and Implementation of an AI‑Driven Intelligent Operations Platform for Game Services

The article presents a comprehensive overview of an AI‑ops platform for game operations, covering its background, roadmap, team structure, business scenarios, anomaly‑detection techniques, platform architecture, detection workflow, model deployment, and intelligent fault‑management strategies.

Intelligent Operations · fault management · platform architecture
20 min read
Efficient Ops
Dec 25, 2021 · Artificial Intelligence

How Zhejiang Mobile’s AIOps Achieved National‑Level Excellence in Fault Management

The article explains AIOps fundamentals, details Zhejiang Mobile’s successful assessment in the national AIOps capability maturity model, shares insights from an interview with the company’s network‑management deputy director, and outlines future plans and industry recommendations for AI‑driven IT operations.

Capability Maturity Model · IT Operations · Zhejiang Mobile
9 min read
dbaplus Community
Nov 23, 2020 · Operations

Mastering Fault Management: Building a Robust SRE Stability Framework

This article outlines a comprehensive SRE fault‑management framework, covering core responsibilities, stability metrics such as MTBF and MTTR, detailed pre‑, during‑, and post‑incident processes, monitoring, capacity planning, disaster‑recovery, error budgeting, organizational support, and future trends like AIOps and chaos engineering.

Error Budget · MTBF · MTTR
30 min read
dbaplus Community
Jul 13, 2020 · Operations

14 Expert Q&A on Building an Effective SRE System for Fault Management

In this detailed Q&A, a Meitu SRE leader explains the relationship between DevOps and SRE, shares practical advice on team composition, monitoring, alerting, and fault‑prevention design, and provides step‑by‑step guidance using Grafana, draw.io, and other tools to help organizations build reliable services.

DevOps · Grafana · SRE
10 min read
Meituan Technology Team
Oct 26, 2017 · Operations

Evolution of Payment Channel Automation Management at Meituan-Dianping

Meituan‑Dianping’s payment team progressed from manual fault alerts to a fully automated channel management system that detects failures, disables affected banks, conducts controlled ramp‑up tests, and restores service, dramatically cutting response times, manpower costs, and secondary‑failure risks while boosting overall availability.

Operations · System Design · fault management
14 min read
Architecture Digest
Sep 17, 2017 · R&D Management

Comprehensive R&D Management Practices: Task Management, Documentation, Code Collaboration, QA, Deployment, and Fault Handling

This article presents a detailed, experience‑driven guide to building an efficient R&D management system covering the product lifecycle, task management, documentation, code collaboration, quality assurance, automated deployment, fault management, instant communication, and techniques for continuous technical improvement.

Code Collaboration · Deployment · Documentation
23 min read
Efficient Ops
Aug 16, 2017 · Operations

How Qunar Built an Automated Hardware Operations Platform to Boost Efficiency

This article details Qunar's end‑to‑end hardware automation system, covering background challenges, lifecycle management, automated testing, data collection, fault detection, and visualized monitoring, and explains how the integrated platform reduces manual effort, improves reliability, and cuts operational costs.

CMDB · Operations · fault management
22 min read
MaGe Linux Operations
Apr 24, 2015 · Operations

10 Proven Fault Management Practices Every Ops Team Should Master

This guide shares ten practical fault‑management techniques—ranging from proactive attitude and prioritizing incidents to continuous follow‑up and team collaboration—to help operations teams reduce damage, maintain service reliability, and keep users engaged during outages.

Operations · best practices · fault management
8 min read