Tagged articles

fault handling

21 articles · Page 1 of 1

Sep 13, 2024 · Operations

How Bilibili Built an Emergency Response Center to Slash MTTR and Boost System Stability

This article explains how Bilibili designed and implemented an Emergency Response Center (ERC) to manage the full fault lifecycle (detection, response, scoping, localization, mitigation, and recovery) using alert rules, automated recall, integrated customer feedback, clear role assignments, mobile support, and data‑driven post‑mortems, ultimately reducing MTTR and improving service reliability.

Automation · Incident Management · MTTR

0 likes · 23 min read

Bilibili Tech

Aug 16, 2024 · Operations

Design and Implementation of Bilibili's Emergency Response Center for Incident Management

Bilibili’s Emergency Response Center (ERC) unifies incident detection, rapid response, precise scoping, and coordinated recovery through multi‑dimensional alerts, automated collaboration, standardized updates, and post‑mortem analysis, targeting 1‑minute detection, 3‑minute response, 5‑minute scoping, and 10‑minute recovery, which has cut MTTR, achieved over 80% automatic recall accuracy, and met more than 60% of its 1‑3‑5‑10 performance goals.

Automation · Incident Management · MTTR

0 likes · 22 min read

Open Source Linux

Jun 28, 2024 · Operations

Mastering Incident Responsibility: Proven Tactics to Navigate Fault Discussions

This article outlines practical principles and communication techniques for assigning responsibility during system failures, emphasizing strategic questioning, ally‑building, moral positioning, and nuanced response methods to protect oneself while ensuring effective incident resolution.

Operations · communication · fault handling

0 likes · 8 min read

Efficient Ops

Jan 29, 2024 · Operations

Mastering Incident Response: A Practical Guide to Faster Service Recovery

This guide walks ops teams through real‑world incident scenarios, from quick symptom identification and emergency recovery to improving monitoring, crafting concise emergency plans, and leveraging automation for smarter fault handling, helping organizations restore services faster and reduce downtime.

Monitoring · emergency planning · fault handling

0 likes · 14 min read

Efficient Ops

Jan 14, 2024 · Operations

Mastering Incident Command: A Practical Guide for SRE Fault Handling

This article outlines a comprehensive, step‑by‑step approach for SRE incident commanders, covering fault perception, grading, team organization, remediation tactics, transparent communication, and post‑mortem practices to efficiently resolve service disruptions.

Incident Management · Monitoring · SRE

0 likes · 14 min read

Baidu Geek Talk

Nov 22, 2023 · Operations

Stability Assurance for Baidu Search Aladdin during Large-Scale Events

Baidu’s Aladdin search service safeguards stability during massive traffic spikes, such as the Gaokao and the Tokyo and Beijing Olympics, by mapping dependencies, deploying multi‑dimensional monitoring, adding scaling layers like multi‑region Redis, and establishing rapid‑response on‑call teams, achieving over 99.99% uptime and near‑real‑time data updates.

Monitoring · backend operations · fault handling

0 likes · 9 min read

dbaplus Community

Jun 5, 2023 · Operations

Mastering Production Faults: Diagnose and Fix Network, Server, Database Issues

This guide outlines the most common production failures—including network, server, database, software, security, storage, configuration, and third‑party service issues—and provides step‑by‑step methods to detect, troubleshoot, and resolve each problem, helping maintain system stability and reliability.

Operations · Production · Server

0 likes · 30 min read

Wukong Talks Architecture

May 17, 2023 · Operations

Common Production Faults and Their Handling Guide

This guide outlines the most common production failures—including network, server, database, software, security, storage, configuration, and third‑party service issues—and provides detailed steps for detecting, diagnosing, and resolving each type to maintain system stability and reliability.

Operations · Production · fault handling

0 likes · 30 min read

Top Architect

Nov 6, 2022 · Backend Development

Comprehensive Guide to Backend Development: System Design, Architecture, Network Communication, Fault Handling, Monitoring, Service Governance and Deployment

This article provides a thorough overview of backend development, covering system development principles, architectural design patterns, network communication techniques, common faults and exceptions, monitoring and alerting strategies, service governance practices, and deployment workflows, all illustrated with clear explanations and practical examples.

Monitoring · System Design · backend

0 likes · 33 min read

Efficient Ops

Aug 2, 2022 · Operations

Mastering Incident Response: Principles and Methods for Effective Operations

This guide outlines essential incident‑response principles—prioritizing business recovery and timely escalation—and presents practical methods such as restart, isolation, and downgrade, while also detailing stakeholder roles and post‑incident review practices for reliable operations.

Service Restart · downgrade · escalation

0 likes · 10 min read

Top Architect

Aug 2, 2022 · Operations

Effective Fault Handling, Monitoring, and Emergency Response for Call‑Center Systems

This article presents a comprehensive guide on diagnosing, monitoring, and quickly resolving call‑center system failures, covering common troubleshooting steps, monitoring enhancements, emergency‑plan design, and intelligent event‑handling techniques to improve operational reliability and response speed.

Incident Management · Operations · emergency response

0 likes · 15 min read

Java Interview Crash Guide

Mar 23, 2022 · Operations

How to Streamline Call Center Incident Management: Proven Steps and Monitoring Strategies

This article walks through a real‑world call‑center outage scenario, outlines practical fault‑handling methods, shows how to improve monitoring and alerting, and presents a comprehensive emergency response plan that helps operations teams resolve incidents faster and prevent future failures.

Automation · Incident Management · call center

0 likes · 13 min read

Alibaba Cloud Developer

May 20, 2021 · Operations

Mastering Production Incident Response: Structured Problem Solving and Key Roles

This guide explains how to design and practice a structured incident‑response process—defining problems, applying quick‑recovery steps, analyzing root causes, standardizing solutions, and assigning critical roles—to dramatically reduce production outage duration.

Operations · SRE · fault handling

0 likes · 11 min read

MaGe Linux Operations

Jan 24, 2021 · Operations

How to Speed Up Call Center Incident Resolution with Proven Ops Strategies

This article walks through a real call‑center outage, outlines why traditional ad‑hoc debugging fails, and presents a structured approach—including symptom identification, rapid root‑cause isolation, enhanced monitoring, concise emergency playbooks, and intelligent automation—to dramatically reduce recovery time and move toward self‑healing operations.

Automation · Incident Management · call center

0 likes · 13 min read

Efficient Ops

Dec 1, 2020 · Operations

Zero‑Downtime Ops: Inside Tencent’s Panshi High‑Availability Platform

At the 2020 GOPS Global Operations Conference, Tencent’s senior operations engineer Xie Hailin detailed the design and implementation of the Panshi platform—a comprehensive, high‑availability solution that unifies change management, fault handling, continuous operation, and disaster recovery to ensure uninterrupted payment services for billions of daily transactions.

AIOps · Change Management · High Availability

0 likes · 24 min read

ITPUB

Oct 9, 2020 · Operations

How to Streamline Call Center Incident Management: Practical Steps and Best Practices

This guide walks through a real‑world call‑center slowdown incident, outlines common fault‑handling techniques, proposes monitoring enhancements, details a comprehensive emergency‑response plan, and introduces intelligent event‑processing concepts to help operations teams resolve outages faster and more reliably.

Automation · Incident Management · Monitoring

0 likes · 15 min read

Open Source Linux

Sep 12, 2020 · Operations

Mastering Incident Response: Core Principles and Practical Methods

This guide outlines essential incident‑response principles—prioritizing business restoration and timely escalation—while detailing practical methods such as restart, isolation, and degradation, and explains how to organize response teams and conduct thorough post‑incident reviews.

Incident Management · Restart · degradation

0 likes · 11 min read

Efficient Ops

Sep 9, 2020 · Operations

Mastering Incident Management: Core Principles and Practical Methods

This guide outlines essential incident management principles—prioritizing business restoration and timely escalation—followed by detailed methodologies such as restart, isolation, and degradation, and explains role responsibilities, user impact handling, and post‑incident summarization for continuous improvement.

Incident Management · Operations · fault handling

0 likes · 10 min read

Efficient Ops

Apr 12, 2020 · Operations

Master Incident Management: Definitions, Processes, and Best Practices

This guide explains fault management fundamentals—from ITIL‑based definitions and why it matters, to fault level classification, monitoring, emergency response, recovery, post‑mortem analysis, continuous improvement, and practical advice for practitioners—providing a comprehensive, actionable framework for reliable operations.

ITIL · Incident Management · Operations

0 likes · 11 min read

Efficient Ops

Oct 29, 2018 · Operations

How Youzan Manages Online Incidents: A Step‑by‑Step Guide

This article outlines Youzan's end‑to‑end online incident management process—from fault detection and coordination through root‑cause analysis, recovery, review, and actionable JIRA tracking—highlighting practical workflows, data analysis, and continuous improvement practices for reliable service delivery.

Incident Management · JIRA workflow · Operations

0 likes · 10 min read

Qunar Tech Salon

Aug 18, 2017 · Operations

Hardware Automation Operations System at Qunar: Design, Implementation, and Lessons Learned

This article details Qunar's hardware automation operations platform, covering the hardware scope, pain points of manual processes, a five‑stage lifecycle, automated testing, data collection, fault handling, and the underlying Mesos‑Marathon‑Docker infrastructure that together improve efficiency, reliability, and cost control.

Monitoring · data collection · fault handling

0 likes · 21 min read
