Tagged articles
131 articles
DevOps Coach
Mar 31, 2026 · Operations

How AI‑Driven Observability Can Cut MTTR: A 12‑Step Investigation Framework

This article explains how modern SRE teams can combine AI‑assisted observability with structured critical thinking to build a 12‑step investigation model that accelerates fault detection, hypothesis generation, telemetry validation, root‑cause analysis, and automated remediation, ultimately reducing MTTR and improving reliability.

AI · Observability · Operations
0 likes · 9 min read
Code Wrench
Dec 16, 2025 · Operations

Rescuing a Failing System: 3‑Step Triage Playbook Every Go Engineer Needs

This article explains how to demonstrate real‑world system‑engineering expertise in Go interviews by mastering incident triage, diagnosing CPU, memory, GC, and goroutine problems, and applying a three‑step "stop‑bleed, diagnose, cure" strategy to keep services alive.

Go · Operations · incident management
0 likes · 11 min read
Liangxu Linux
Nov 20, 2025 · Operations

Avoid the ‘Delete‑Database‑and‑Run’ Nightmare: 10 Fatal Ops Pitfalls Revealed

A real 2018 incident where an ops engineer used rm ‑rf to wipe a production database sparked a deep dive into the high‑risk nature of operations, presenting Gartner statistics, psychological error factors, ten deadly pitfalls with concrete examples, and a comprehensive fault‑tolerance framework to prevent future catastrophes.

Backup · DevOps · Security
0 likes · 23 min read
DevOps Coach
Nov 11, 2025 · Cloud Computing

Why the US‑East‑1 AWS Outage Happened and How to Guard Against It

On October 19‑20 a massive AWS failure in the US‑East‑1 region crippled a large portion of the internet, exposing how a faulty internal monitoring tool, DynamoDB’s lack of cross‑region replication, and unchecked retry storms can cascade into a widespread outage, and offering concrete operational lessons for cloud teams.

AWS · DynamoDB · Outage
0 likes · 7 min read
Alibaba Cloud Developer
Oct 9, 2025 · Operations

How AI‑Powered Multi‑Agent Systems Turn Fault Postmortems into Proactive Risk Prevention

This article explains how an AI‑driven multi‑agent platform automates fault postmortem generation, enriches analysis with memory management, prompt engineering, and RAG techniques, and delivers actionable insights for SREs, developers, and non‑technical stakeholders, ultimately shifting incident handling from reactive to proactive.

AI · LLM · SRE
0 likes · 44 min read
Programmer DD
Oct 3, 2025 · Operations

How Netflix Turned Incident Management into a Scalable Engineer‑Owned Process

This article explains how Netflix’s engineering teams shifted incident handling from a centralized SRE function to a company‑wide, engineer‑driven practice by selecting the right tooling, standardizing processes, and reshaping culture, enabling rapid, reliable responses for hundreds of millions of viewers.

Netflix · SRE · Tool Selection
0 likes · 10 min read
MaGe Linux Operations
Sep 28, 2025 · Operations

What Core Skills Do SRE Engineers Need to Master?

This article outlines the essential technical, incident‑response, reliability‑management, collaboration, and systemic‑thinking abilities that Site Reliability Engineering (SRE) professionals must develop to ensure high‑availability, stable services in modern internet environments.

Collaboration · SRE · Site Reliability Engineering
0 likes · 5 min read
Ops Community
Sep 16, 2025 · Operations

Mastering SRE: Fast Incident Response and Prevention Strategies

This guide walks SRE engineers through a complete incident lifecycle—preventive multi‑layer monitoring, chaos‑testing drills, rapid 10‑minute response tactics, systematic root‑cause analysis, effective communication roles, post‑mortem reviews, and practical case studies—helping teams minimize downtime and business loss.

Root Cause Analysis · SRE · incident management
0 likes · 11 min read
Java Web Project
Jul 26, 2025 · Backend Development

How a Simple Pagination Change Triggered a P0 Outage and What We Learned

A seemingly trivial pagination update in a Java order service caused a P0 outage, leading to a 73‑minute disruption, 156 user complaints, and an estimated 650,000 CNY GMV loss; the post details the root cause, impact analysis, emergency response, and concrete process improvements to prevent recurrence.

Backend · Java · Microservices
0 likes · 14 min read
Ops Development & AI Practice
Jul 18, 2025 · Operations

Mastering Modern Software Operations: The Six Essential Steps for Success

Modern software operations have shifted from a post‑launch checklist to an ongoing, automated discipline, and this article outlines the six core phases—requirement planning, CI/CD automation, comprehensive monitoring, incident response, performance tuning, and security compliance—providing concrete examples and practical advice for building a resilient DevOps culture.

DevOps · Operations · Performance Optimization
0 likes · 9 min read
Volcano Engine Developer Services
May 22, 2025 · Artificial Intelligence

How LLMs Can Automate Ticket Escalation: Inside ByteBrain’s TickIt System

This article introduces TickIt, a ByteBrain system that leverages large language models to automatically identify and escalate critical Oncall tickets, detailing its multi‑class escalation, deduplication, and category‑guided fine‑tuning modules, experimental results, and the operational impact on cloud services.

LLM · Oncall analysis · Supervised Fine‑Tuning
0 likes · 13 min read
Architecture and Beyond
May 10, 2025 · Operations

What Heinrich’s 1:29:300 Rule Reveals About Preventing Online Outages

The article explains Heinrich's Law, its 1:29:300 accident pyramid, and how applying its principles—tracking minor incidents, hidden hazards, and systemic risks—can help software teams anticipate, diagnose, and prevent major online failures through systematic safety management and data‑driven practices.

Heinrich's Law · Operations · incident management
0 likes · 15 min read
Liangxu Linux
Apr 6, 2025 · Operations

How to Define SLIs, SLOs, and SLAs for Effective SRE Practices

This guide explains how SRE teams should collaborate early in the software development lifecycle to define Service Level Indicators (SLIs), set realistic Service Level Objectives (SLOs) and Service Level Agreements (SLAs), and integrate observability signals, error budgeting, risk management, and incident handling into reliable operations.

Error Budget · Observability · Operations
0 likes · 13 min read
Open Source Linux
Mar 27, 2025 · Operations

10 Critical Server Ops Mistakes to Avoid: Real-World Lessons

This article outlines ten critical server operation mistakes—ranging from forced power cuts to neglecting updates—illustrated with real-world incidents and practical advice, helping engineers adopt safer practices, proper backups, secure configurations, and effective monitoring to prevent costly outages.

best practices · incident management · server operations
0 likes · 6 min read
Efficient Ops
Mar 24, 2025 · Operations

15 Common Ops Mistakes That Can Crash Your System – How to Avoid Them

This article outlines fifteen common operational and management mistakes—such as frequent incidents, excessive new hires, lack of automation, and missing rollback plans—that can trigger system outages, and offers guidance on how teams can strengthen testing, processes, and team capabilities to prevent downtime.

DevOps · Operations · SRE
0 likes · 6 min read
JD Tech Talk
Feb 26, 2025 · Operations

Business Monitoring: Importance, Metric System Design, and Practical Implementation

This article explains the significance of business monitoring, distinguishes technical and business metrics, outlines a step‑by‑step process for building a business metric system, and shares practical experiences, tools, and common pitfalls to help teams improve operational reliability and decision‑making.

Operations · business monitoring · incident management
0 likes · 13 min read
Alibaba Cloud Observability
Feb 17, 2025 · Cloud Native

Mastering Enterprise Alerting: Build a Robust Cloud‑Native Monitoring System

This article explores how to design and implement a comprehensive, enterprise‑grade alerting system—covering monitoring fundamentals, MTTF/MTTR concepts, multi‑layer metric collection, alert rule best practices, severity levels, notification channels, false‑positive reduction, and real‑world case studies—to ensure reliable cloud‑native operations.

Alerting · MTTR · Operations
0 likes · 35 min read
JD Cloud Developers
Feb 6, 2025 · Operations

How to Build a Robust Stability Framework: Key Mechanisms for SRE Success

This article outlines a comprehensive stability framework for SRE teams, detailing essential mechanisms such as review processes, coding standards, incident management, on‑call responsibilities, and daily operational practices, while also highlighting the cultural shift needed to achieve reliable, high‑availability systems.

Operations · SRE · incident management
0 likes · 11 min read
dbaplus Community
Sep 13, 2024 · Operations

How Bilibili Built an Emergency Response Center to Slash MTTR and Boost System Stability

This article explains how Bilibili designed and implemented an Emergency Response Center (ERC) to manage the full fault lifecycle—detection, response, delimitation, localization, mitigation and recovery—using alert rules, automated recall, integrated customer feedback, clear role assignments, mobile support, and data‑driven post‑mortems, ultimately reducing MTTR and improving service reliability.

MTTR · SRE · automation
0 likes · 23 min read
Cognitive Technology Team
Aug 17, 2024 · Operations

GitHub Outage on August 14, 2024: Causes, Impact, and Recovery

On August 14, 2024, GitHub experienced a massive site-wide outage caused by a database infrastructure configuration change that disrupted traffic routing, leading to loss of database connections and affecting core services such as Pull Requests, Pages, Copilot, and the API, with full restoration confirmed later that evening.

GitHub · Outage · database
0 likes · 2 min read
Bilibili Tech
Aug 16, 2024 · Operations

Design and Implementation of Bilibili's Emergency Response Center for Incident Management

Bilibili’s Emergency Response Center (ERC) unifies incident detection, rapid response, precise scoping, and coordinated recovery through multi‑dimensional alerts, automated collaboration, standardized updates, and post‑mortem analysis, targeting 1‑minute detection, 3‑minute response, 5‑minute scoping, and 10‑minute recovery, which has cut MTTR, achieved over 80% automatic recall accuracy, and met more than 60% of its 1‑3‑5‑10 performance goals.

MTTR · SRE · automation
0 likes · 22 min read
21CTO
Aug 15, 2024 · Operations

Why GitHub’s Massive Outage Happened: Database Infrastructure Rollback Explained

A detailed account of GitHub’s recent worldwide outage reveals that a rollback of database infrastructure changes caused widespread service failures across GitHub.com, Pages, Copilot, and the API, highlighting the challenges of stateful database reliability in large platforms.

GitHub · Operations · Outage
0 likes · 4 min read
Continuous Delivery 2.0
Aug 13, 2024 · R&D Management

Is Amazon's COE Process Really Effective? Insights from SDEs

The article examines Amazon's Correction of Errors (COE) process, presenting both supportive and critical SDE perspectives, and discusses whether detailed post‑incident documentation truly improves engineering practices or merely adds bureaucratic overhead.

COE · Engineering Culture · SDE
0 likes · 10 min read
Open Source Linux
Jul 24, 2024 · Operations

Linux Emergency Handbook v1.2: Key Updates & New Incident Response Practices

Version 1.2 of the Linux Emergency Handbook introduces critical updates such as SSH key backdoor checks, detailed command timestamp logs, new journalctl log viewing techniques, enhanced password checks, added data USB guidance, and revamped post‑incident stages including routine security checks, loss assessment, and targeted investigations.

Linux · Security · emergency response
0 likes · 3 min read
DevOps Coach
Jun 30, 2024 · Operations

Effective Incident Mitigation and Recovery: Practical SRE Strategies

The article outlines SRE‑based incident mitigation and recovery practices, covering urgent mitigations, impact reduction, key metrics such as TTD, TTR, TBF, and detailed strategies for shortening detection and repair times, preventing fatigue, improving observability, and designing resilient systems.

Mitigation · Operations · Reliability
0 likes · 23 min read
Efficient Ops
May 20, 2024 · Operations

Mastering Incident Blame: Proven Tactics to Navigate Fault Responsibility

This guide outlines practical principles and communication techniques for assigning responsibility during system incidents, helping operations teams stay calm, choose allies wisely, and protect themselves while ensuring effective fault resolution and continuous improvement.

blame assignment · communication tactics · incident management
0 likes · 9 min read
Efficient Ops
May 12, 2024 · Operations

From Firefighting to Fire‑Starting: Mastering Operations for System Reliability

The article outlines a three‑stage evolution of operations—from rapid incident response to proactive fault‑injection—while offering practical guidance on improving availability, visualizing changes, and aligning technical metrics with business value to elevate the role of operations engineers.

Availability · Fault Injection · SRE
0 likes · 7 min read
ITPUB
May 10, 2024 · Databases

Choosing Low‑Risk Strategies for Critical DBA Outages

When a major operations incident strikes, the safest approach is to prioritize simple, low‑risk actions and accept limited responsibility, as illustrated by real DBA lessons from Oracle RAC failures and a data‑center power‑loss disaster.

DBA · Operations · Oracle RAC
0 likes · 7 min read
Efficient Ops
May 7, 2024 · Operations

11 Hard‑Earned Lessons from Two Decades of Google Site Reliability

Drawing on twenty years of Google’s SRE experience, this article shares eleven practical lessons—from proportional incident mitigation and pre‑tested recovery mechanisms to canary releases, disaster‑resilience testing, and frequent deployments—aimed at improving reliability and operational efficiency.

Google · SRE · incident management
0 likes · 12 min read
DevOps Cloud Academy
Apr 22, 2024 · Cloud Native

Understanding Platform Engineering: Principles, Tools, and Emerging Trends

This article explains how platform engineering formalizes internal processes and tools to give developers a self‑service, automated "golden path," outlines its six core categories—including internal developer portals, infrastructure as code, and incident management—and discusses its growing impact on modern cloud‑native development.

Internal Developer Portal · incident management · platform engineering
0 likes · 9 min read
Efficient Ops
Jan 14, 2024 · Operations

Mastering Incident Command: A Practical Guide for SRE Fault Handling

This article outlines a comprehensive, step‑by‑step approach for SRE incident commanders, covering fault perception, grading, team organization, remediation tactics, transparent communication, and post‑mortem practices to efficiently resolve service disruptions.

SRE · fault handling · incident management
0 likes · 14 min read
High Availability Architecture
Jan 9, 2024 · Operations

AIOps Practices for Incident Management at Meituan: From Risk Prevention to Post‑Operation

This article presents Meituan's two‑year exploration of AIOps in incident management, detailing risk‑prevention change detection, real‑time anomaly discovery, automated root‑cause diagnosis, multi‑dimensional KPI analysis, and similar‑event recommendation, while sharing architectural designs, algorithmic techniques, performance results, and future directions.

NLP · Operations · Root Cause Analysis
0 likes · 24 min read
dbaplus Community
Jan 8, 2024 · Operations

How a Simple Time Adjustment Sparked a Massive Outage: Real Ops Incident Stories

Three real-world operations mishaps are recounted—a mistaken system‑time change that logged out thousands of users, an accidental bulk delete of database accounts, and a failed glibc downgrade that stalled a software release—illustrating the cascading impact of small errors and the urgent remediation steps taken.

Linux · Operations · Sysadmin
0 likes · 8 min read
Architect
Dec 22, 2023 · Operations

How Tencent Search Built a Multi‑Layered Stability Architecture to Slash MTTD and MTTR

The article details Tencent Search’s end‑to‑end stability engineering practice, covering a ten‑step architecture that combines redundancy, proactive detection, rapid emergency response, automated cut‑over, defensive caching, and continuous drills, and shows how these measures collectively reduced mean‑time‑to‑detect and mean‑time‑to‑recover by an order of magnitude while keeping service availability high.

Observability · Resilience · architecture
0 likes · 32 min read
Meituan Technology Team
Dec 21, 2023 · Operations

AIOps for Incident Management: Practices and Insights from Meituan

Meituan’s service‑operations team applies AIOps across prevention, detection, and post‑incident stages—using change‑risk analysis, real‑time graph‑based anomaly detection, similarity‑driven root‑cause diagnosis, and NLP‑powered incident recommendation—to achieve sub‑second detection, high precision, 28% faster fault handling, and plans for intelligent log and change recognition.

Operations · Root Cause Analysis · aiops
0 likes · 24 min read
Bilibili Tech
Dec 15, 2023 · Operations

Bilibili Alert Monitoring System: Design, Optimization, and Root‑Cause Analysis

Bilibili revamped its alert monitoring platform to meet rapid growth, focusing on effectiveness, timeliness, and coverage; it introduced a closed‑loop design and governance that cut weekly alerts by 90%, built a knowledge‑graph root‑cause system achieving 87.9% accuracy with sub‑minute latency, and integrated AIOps for ongoing refinement.

Alert Monitoring · Bilibili · Root Cause Analysis
0 likes · 21 min read
Architect
Dec 13, 2023 · Industry Insights

How Bilibili Engineered a 1.2 B‑Viewer Live Stream for the LoL World Championship

This article details Bilibili's end‑to‑end technical planning, traffic‑estimation models, and concrete optimizations—including hotspot caching, traffic dispersion, long‑connection isolation, and automated fault‑injection—that enabled the S13 League of Legends finals to serve over 1.2 billion viewers with stable, low‑latency streaming.

Observability · Performance Optimization · Traffic Engineering
0 likes · 22 min read
Efficient Ops
Dec 11, 2023 · Operations

How a Simple System‑Time Change Sparked a Massive Outage

A junior ops engineer mistakenly set the production server clock ahead by a year, causing thousands of user accounts to expire, triggering a large‑scale outage, emergency fixes, financial loss, and harsh career consequences, while highlighting the need for proper permission and change management.

Permissions · database · incident management
0 likes · 7 min read
dbaplus Community
Nov 23, 2023 · Operations

How to Cut Alert Noise in Monitoring: Proven Strategies and Code Samples

This article explains why monitoring alert noise harms efficiency, presents metrics such as recall and accuracy, details rule‑based, blacklist/whitelist, ratio‑based, and intelligent noise‑reduction techniques, shares Java code examples, and shows measurable results after applying the governance process.

Alert Noise Reduction · Operations · incident management
0 likes · 13 min read
FunTester
Nov 21, 2023 · Industry Insights

What Alibaba’s Recent Outages Reveal About Testing and Team Safety

The article examines three major Alibaba service disruptions, analyzes how insufficient testing and a lack of psychological safety among engineers may have contributed to the failures, and suggests ways to improve testing practices and workplace transparency.

Alibaba · Cloud Services · Psychological Safety
0 likes · 7 min read
Architecture and Beyond
Oct 29, 2023 · Operations

Postmortem of the October 23 Yuque Service Outage: Lessons on Complex Systems and the KISS Principle

The October 23 Yuque outage, caused by a buggy upgrade tool and outdated storage hardware, highlighted the importance of thorough testing, robust disaster‑recovery, high‑availability architecture, clear communication, continuous learning, and applying the KISS principle to simplify complex systems and improve operational stability.

Complex Systems · KISS principle · Operations
0 likes · 10 min read
Qunhe Technology Quality Tech
Oct 13, 2023 · Operations

How KuJiaLe Built a Scalable Stability System: Real‑World SRE Lessons

This article shares KuJiaLe's experience tackling stability challenges caused by rapid user growth and system complexity, detailing their organizational, process, cultural, and technical approaches—including goal setting, a stability committee, monitoring, incident response, change control, and regular drills—to achieve measurable improvements in reliability and performance.

DevOps · SRE · incident management
0 likes · 20 min read
Architect
Sep 16, 2023 · Operations

Common Production Failures and Their Handling Procedures

This article outlines the most common production failures—including network, server, database, software bugs, security vulnerabilities, storage, configuration errors, and third‑party service issues—and provides detailed steps for detection, investigation, and resolution to ensure system stability and reliability.

Operations · incident management · production
0 likes · 28 min read
MaGe Linux Operations
Jun 30, 2023 · Operations

What Went Wrong When Vipshop Crashed? Lessons on High‑Concurrency Failures

The article examines the March 29 Vipshop data‑center outage, which caused losses of more than a billion yuan, explains the cooling‑system failure that triggered the 12‑hour P0 incident, discusses its impact on Tencent services, and analyzes why high‑concurrency crashes remain common, offering insights on availability tiers and mitigation strategies.

Availability · Operations · high concurrency
0 likes · 7 min read
Test Development Learning Exchange
May 25, 2023 · Operations

Online Incident Severity Level Definition Rules

This document defines the online incident severity grading system, outlining fault categories, influencing factors such as business metrics, capital loss, user impact, and public opinion, and presents detailed P0‑P3 grading rules with tables for capital‑based, C‑end, and B‑end user classifications.

fault classification · incident management · service reliability
0 likes · 8 min read
Efficient Ops
May 16, 2023 · Operations

How China Mobile Built a Scalable AIOps Platform to Cut Incident Resolution Time

This article shares China Mobile IT Center's four‑year journey of designing, deploying, and refining a centralized AIOps platform that automates anomaly detection, fault diagnosis, and remediation, cutting complaint‑ticket handling time from ten hours to six while scaling to billions of AI model calls per month.

AI · aiops · incident management
0 likes · 18 min read
DataFunSummit
Apr 15, 2023 · Operations

Observability and Intelligent Alert Management Practices

This presentation outlines the observability ecosystem, the role and value of alerts within it, core functionalities of an intelligent alarm management platform, best‑practice recommendations, and a real‑world case study of deploying a unified observability solution for a large state‑owned investment group.

Alert Management · IT Operations · aiops
0 likes · 11 min read
MaGe Linux Operations
Mar 24, 2023 · Operations

Why Most Monitoring Strategies Fail and How the CAR Framework Fixes Them

This article explains why typical monitoring approaches miss the mark, outlines four root causes of persistent incidents, and introduces the CAR framework—Customer, Application, Resource—to build user‑centric observability that reduces noise, restores trust, and improves reliability.

CAR framework · Operations · incident management
0 likes · 11 min read
Architect's Guide
Mar 14, 2023 · Operations

Incident Handling and Fault Recovery Practices for Call Center Systems

This article outlines a call‑center outage scenario, explains how operators diagnose and resolve the issue, and presents a comprehensive set of fault‑handling methods, monitoring enhancements, and emergency‑plan recommendations aimed at faster recovery and eventual self‑healing of services.

call center · fault-recovery · incident management
0 likes · 12 min read
DeWu Technology
Feb 8, 2023 · Operations

Container SRE Practices and Incident Management at DeWu

DeWu’s container SRE team combines software‑engineered reliability with routine operations, using defined on‑call roles, SLO/SLA targets, progressive change management, capacity forecasting, four‑metric monitoring, MTTR/MTTF tracking, kernel‑parameter tuning, and namespace‑protected security policies to swiftly resolve incidents such as Redis latency spikes.

Container · Performance Optimization · SRE
0 likes · 23 min read
HelloTech
Jan 31, 2023 · Operations

Stability Assurance Practices for Large‑Scale Promotional Events

The article outlines a comprehensive stability‑assurance framework for large‑scale promotional events—detailing planning, capacity and pressure‑test rehearsals, strict change‑freeze, internal gray releases, coordinated on‑call response, thorough link and capacity analysis, monitoring, emergency procedures, cross‑team collaboration, external partner coordination, and post‑event review to ensure resilient system performance.

Large-Scale Events · Performance Testing · capacity planning
0 likes · 17 min read
Alibaba Cloud Developer
Sep 14, 2022 · Operations

Mastering System Stability: From Fault Prevention to Emergency Response

This article outlines a comprehensive safety‑production framework that covers pre‑incident fault prevention, incident response, and post‑mortem improvement, detailing design‑for‑failure principles such as redundancy, isolation, idempotence, monitoring, automation, disaster recovery, scaling, rate‑limiting, and continuous testing to ensure reliable, resilient services.

Reliability · disaster recovery · incident management
0 likes · 16 min read
DevOps
Aug 15, 2022 · R&D Management

Case Study: Unintended Data Upload Incident and Process Improvement Lessons

This article recounts a real-world incident where a junior engineer mistakenly uploaded production data to a pre‑release environment, analyzes the root causes, outlines concrete process improvements, and highlights broader lessons on risk‑aware development and the importance of holistic business‑logic security.

R&D management · incident management · process improvement
0 likes · 8 min read
Sanyou's Java Diary
Aug 11, 2022 · Operations

Rapidly Diagnose Production Bugs with Linux Tools, Performance Tricks & Design Patterns

This article guides developers through classifying system‑level and business‑level bugs, using Linux utilities like perf, ps, and vmstat for quick root‑cause analysis, and outlines effective code‑design patterns and architectural strategies—caching, rate‑limiting, and high‑availability—to prevent and resolve production incidents.

Linux performance · backend operations · bug troubleshooting
0 likes · 13 min read
Top Architect
Aug 2, 2022 · Operations

Effective Fault Handling, Monitoring, and Emergency Response for Call‑Center Systems

This article presents a comprehensive guide on diagnosing, monitoring, and quickly resolving call‑center system failures, covering common troubleshooting steps, monitoring enhancements, emergency‑plan design, and intelligent event‑handling techniques to improve operational reliability and response speed.

Operations · emergency response · fault handling
0 likes · 15 min read
dbaplus Community
Jul 12, 2022 · Operations

How a Lua Bug Crashed Bilibili’s Load Balancer and What We Learned

In July 2021, a sudden spike to 100% CPU in Bilibili's OpenResty‑based SLB caused widespread service outages. The emergency response rebuilt the load‑balancer clusters, traced the fault to a bug in a Lua _gcd function triggered by a weight of "0" passed as a string, and led to extensive operational and architectural improvements.

Cloud Native · Lua · OpenResty
0 likes · 17 min read
Ops Development Stories
Jun 16, 2022 · Operations

How to Streamline Call Center Incident Management: From Rapid Diagnosis to Automated Recovery

This article outlines a comprehensive approach to handling call‑center incidents, covering fault boundary definition, emergency recovery actions, rapid root‑cause localization, enhanced monitoring strategies, clear alerting, proactive automation, and the creation of concise, regularly exercised emergency response plans.

Operations · call center · fault-recovery
0 likes · 14 min read
Top Architect
Jun 11, 2022 · Operations

Comprehensive Fault Handling and Emergency Response Guide for Call Center Systems

This guide details a call‑center system fault scenario and provides a step‑by‑step approach for operations teams to identify symptoms, assess impact, implement rapid recovery actions, improve monitoring, and maintain an effective emergency response plan, ensuring faster resolution and long‑term fault self‑healing.

Operations · call center · emergency plan
0 likes · 12 min read
Architecture Digest
Jun 2, 2022 · Operations

Incident Handling and Fault Recovery Practices for Call Center Systems

The article outlines a comprehensive approach to diagnosing, responding to, and preventing call‑center system failures by describing typical fault scenarios, step‑by‑step recovery actions, monitoring enhancements, emergency plan components, and continuous improvement strategies for operations teams.

Operations · call center · emergency procedures
0 likes · 13 min read
dbaplus Community
Apr 10, 2022 · Operations

How to Build a Practical SRE Operations Framework for Large‑Scale Systems

This article presents a hands‑on SRE framework covering the full product lifecycle—code development, resource planning, deployment, operational reliability, and decommissioning—derived from real‑world practices at Xiaomi and Sina to help teams manage massive internet services efficiently and cost‑effectively.

Resource Management · SRE · System Lifecycle
0 likes · 16 min read
Big Data Technology & Architecture
Apr 6, 2022 · Big Data

Data Quality Issues, Causes, and Practices in Big Data Platforms

This article explains the harms and root causes of data quality problems—such as integrity, latency, accuracy, and consistency issues—then outlines systematic prevention methods, baseline monitoring, and concrete NetEase YouShu platform practices, illustrated with real incidents, code snippets, and tag‑monitoring strategies.

data engineering · incident management
0 likes · 10 min read
Open Source Linux
Apr 2, 2022 · Operations

How to Speed Up Call Center Incident Recovery with Proven Ops Strategies

This article walks through a real call‑center outage scenario, outlines systematic fault‑identification steps, practical emergency recovery actions, monitoring enhancements, concise emergency‑plan design, and introduces intelligent event‑handling to help operations teams resolve incidents faster and more reliably.

Operations · automation · call center
0 likes · 13 min read
Open Source Linux
Mar 8, 2022 · Operations

Master Kubernetes Troubleshooting: The Three Pillars Every Engineer Needs

This article breaks down Kubernetes troubleshooting into three essential steps—understanding the failure, managing the response, and preventing recurrence—while mapping key monitoring, observability, and incident‑response tools to each phase for reliable cloud‑native operations.

Kubernetes · Observability · Operations
0 likes · 8 min read
Efficient Ops
Dec 5, 2021 · Operations

Mastering ITIL Event Management: Strategies for Efficient IT Operations

This article explores the fundamentals of ITIL-based event management, detailing its relationship with ITSM, the challenges of unmanaged services, key processes, priority definitions, and three management models—centralized, self‑managed, and collaborative—to help organizations improve service stability and response efficiency.

ITIL · ITSM · Incident Prioritization
0 likes · 14 min read
TAL Education Technology
Aug 19, 2021 · Operations

Comprehensive SRE Guide for Summer and Winter High‑Load Periods in an Online Education Platform

This document outlines a comprehensive SRE‑driven operational framework for ensuring stable, high‑availability online education services during peak summer and winter periods, detailing pre‑, during‑, and post‑maintenance phases, architectural principles, load testing, monitoring, capacity management, safety hardening, chaos engineering, incident response, and post‑mortem practices.

Load Testing · SRE · capacity planning
0 likes · 17 min read
ByteDance SE Lab
Jul 30, 2021 · Operations

Inside Salesforce’s Global Outage: What Went Wrong and How to Prevent It

The article examines Salesforce’s five‑hour global outage caused by a shortcut DNS deployment and the subsequent recovery challenges, then explores a viral experiment where twenty smartphones generated artificial traffic congestion, illustrating how real‑time data feeds and operational safeguards can prevent large‑scale service disruptions.

Big Data · Operations · SaaS
0 likes · 7 min read
DevOps
Jul 28, 2021 · Operations

Improving System Availability: Stages, Influencing Factors, and Practical Measures

This article explains system availability, outlines three stages of incident handling, identifies key factors that degrade availability such as human error, avalanche effects, untested releases and infrastructure failures, and proposes technical and team‑oriented practices to enhance reliability and achieve higher "nines" of uptime.

Operations · Reliability · incident management
0 likes · 11 min read
Full-Stack Internet Architecture
Jun 19, 2021 · Operations

Solving Monitoring Pain Points: Unified Framework, Alert Prioritization, and Classification

The article discusses common monitoring challenges such as fragmented tooling and noisy alerts, and proposes solutions including consolidating to a single monitoring framework, prioritizing runtime exceptions, and classifying business alerts with codes and trace information to improve incident response.

Alerting · Observability · best-practices
0 likes · 6 min read
Alibaba Cloud Native
May 24, 2021 · Operations

How to Build a Data‑Driven Stability Assurance System for Kubernetes Clusters

This article presents a systematic, data‑model‑driven approach to Kubernetes stability assurance, detailing the sources of complexity, a four‑diagram and three‑table data model, insight and pre‑plan structures, global visualisation concepts, deployment patterns, operational workflows, and competitive analysis to enable effective, iterative, and sustainable cluster stability management.

Kubernetes · data modeling · incident management
0 likes · 15 min read
dbaplus Community
May 18, 2021 · Operations

Mastering End‑to‑End Monitoring: From Purpose to Prometheus Implementation

This guide explains why monitoring is essential throughout a product lifecycle, outlines monitoring modes and methods, compares health checks, logs, tracing and metric solutions, and provides a detailed Prometheus‑based monitoring architecture with concrete metric definitions, alerting rules, and incident‑response procedures.

Alerting · Operations · Prometheus
0 likes · 25 min read
macrozheng
Apr 24, 2021 · Operations

How a Single Code Change Caused Million-Dollar Loss and What It Taught Me About Release Discipline

A routine release introduced a tiny code change that triggered a massive production outage, causing millions in losses; the team’s swift rollback, post‑mortem analysis, and reflections on code discipline, testing, and process compliance highlight essential lessons for reliable backend operations.

code quality · incident management · release process
0 likes · 9 min read
Alibaba Cloud Developer
Jan 27, 2021 · Operations

How to Build Sustainable System Stability: Architecture, Ops, and Team Practices

This article shares practical insights from a technical leader on designing robust system architecture, implementing comprehensive capacity planning, establishing reliable operations processes, strengthening security, and cultivating team awareness to achieve long‑term stability for large‑scale internet services.

Operations · Software Engineering · architecture design
0 likes · 24 min read
MaGe Linux Operations
Jan 24, 2021 · Operations

How to Speed Up Call Center Incident Resolution with Proven Ops Strategies

This article walks through a real call‑center outage, outlines why traditional ad‑hoc debugging fails, and presents a structured approach—including symptom identification, rapid root‑cause isolation, enhanced monitoring, concise emergency playbooks, and intelligent automation—to dramatically reduce recovery time and move toward self‑healing operations.

automation · call center · emergency plan
0 likes · 13 min read
ITPUB
Oct 9, 2020 · Operations

How to Streamline Call Center Incident Management: Practical Steps and Best Practices

This guide walks through a real‑world call‑center slowdown incident, outlines common fault‑handling techniques, proposes monitoring enhancements, details a comprehensive emergency‑response plan, and introduces intelligent event‑processing concepts to help operations teams resolve outages faster and more reliably.

Operations · automation · call center
0 likes · 15 min read
Open Source Linux
Sep 12, 2020 · Operations

Mastering Incident Response: Core Principles and Practical Methods

This guide outlines essential incident‑response principles—prioritizing business restoration and timely escalation—while detailing practical methods such as restart, isolation, and degradation, and explains how to organize response teams and conduct thorough post‑incident reviews.

Isolation · Restart · degradation
0 likes · 11 min read
Efficient Ops
Sep 9, 2020 · Operations

Mastering Incident Management: Core Principles and Practical Methods

This guide outlines essential incident management principles—prioritizing business restoration and timely escalation—followed by detailed methodologies such as restart, isolation, and degradation, and explains role responsibilities, user impact handling, and post‑incident summarization for continuous improvement.

Operations · fault handling · incident management
0 likes · 10 min read
Efficient Ops
Sep 8, 2020 · Operations

From Firefighting to Arson: Mastering Ops Availability in Three Stages

The article outlines a three‑stage ops maturity model—firefighting, fire prevention, and arson—explains how proactive fault‑injection drills, continuous availability improvements, and aligning technical metrics with business value can transform operations from reactive responders into strategic value creators.

Availability · Fault Injection · Operations
0 likes · 8 min read
Didi Tech
Jun 3, 2020 · Backend Development

Stability Guidelines and Anti‑Patterns for Backend Services

Drawing on five years of incident reviews, the article defines a comprehensive stability framework for backend services—mandating timeout hierarchies, weak dependencies, service-discovery integration, staged gray releases, robust monitoring, capacity planning, and strict change management—while cataloguing common anti-patterns such as over-aggressive circuit breaking, static retries, improper timeouts, tight coupling, and insufficient isolation, and urging regular rehearsal of these practices.

backend stability · deployment best practices · incident management
0 likes · 21 min read
Efficient Ops
Apr 12, 2020 · Operations

Master Incident Management: Definitions, Processes, and Best Practices

This guide explains fault management fundamentals—from ITIL‑based definitions and why it matters, to fault level classification, monitoring, emergency response, recovery, post‑mortem analysis, continuous improvement, and practical advice for practitioners—providing a comprehensive, actionable framework for reliable operations.

Continuous Improvement · ITIL · Operations
0 likes · 11 min read
Continuous Delivery 2.0
Mar 6, 2020 · Operations

Google Incident Postmortem Checklist

The article presents a detailed Google‑derived post‑mortem checklist covering event data collection, root‑cause analysis, lessons learned, actionable improvement items, and review procedures to ensure systematic, non‑blame‑focused incident handling.

Operations · Root Cause Analysis · action items
0 likes · 5 min read
360 Tech Engineering
Oct 31, 2019 · Operations

AIOps Implementation Practice at 360: Architecture, Models, and Automation

The article details 360's AIOps deployment, covering external speaker insights, internal architecture, data collection pipelines, AI models for resource recycling, alarm reduction, and correlation, as well as visualization dashboards, labeling platforms, and self‑healing mechanisms, illustrating a comprehensive AI‑driven operations framework.

AI Monitoring · Knowledge Graph · Operations Automation
0 likes · 14 min read
dbaplus Community
Oct 16, 2019 · Operations

How to Cut Alert Noise: Practical SRE Strategies for Ops Teams

This article shares concrete SRE‑inspired techniques—duty‑roster scheduling, tiered alert handling, automation safeguards, dashboard focus on top‑3 alerts, time‑based filtering, and systematic code review—to dramatically reduce daily alarm volume while keeping on‑call teams motivated and effective.

On-Call · SRE · alert optimization
0 likes · 15 min read