Tagged articles
21 articles
dbaplus Community
Sep 13, 2024 · Operations

How Bilibili Built an Emergency Response Center to Slash MTTR and Boost System Stability

This article explains how Bilibili designed and implemented an Emergency Response Center (ERC) to manage the full fault lifecycle (detection, response, scoping, localization, mitigation, and recovery) using alert rules, automated recall, integrated customer feedback, clear role assignments, mobile support, and data‑driven post‑mortems, ultimately reducing MTTR and improving service reliability.

Automation · MTTR · SRE
0 likes · 23 min read
Bilibili Tech
Aug 16, 2024 · Operations

Design and Implementation of Bilibili's Emergency Response Center for Incident Management

Bilibili’s Emergency Response Center (ERC) unifies incident detection, rapid response, precise scoping, and coordinated recovery through multi‑dimensional alerts, automated collaboration, standardized updates, and post‑mortem analysis, targeting 1‑minute detection, 3‑minute response, 5‑minute scoping, and 10‑minute recovery, which has cut MTTR, achieved over 80% automatic recall accuracy, and met more than 60% of its 1‑3‑5‑10 performance goals.

Automation · MTTR · SRE
0 likes · 22 min read
Efficient Ops
Jan 29, 2024 · Operations

Mastering Incident Response: A Practical Guide to Faster Service Recovery

This guide walks ops teams through real‑world incident scenarios, from quick symptom identification and emergency recovery to improving monitoring, crafting concise emergency plans, and leveraging automation for smarter fault handling, helping organizations restore services faster and reduce downtime.

emergency planning · fault handling · incident response
0 likes · 14 min read
Efficient Ops
Jan 14, 2024 · Operations

Mastering Incident Command: A Practical Guide for SRE Fault Handling

This article outlines a comprehensive, step‑by‑step approach for SRE incident commanders, covering fault perception, grading, team organization, remediation tactics, transparent communication, and post‑mortem practices to efficiently resolve service disruptions.

SRE · fault handling · incident management
0 likes · 14 min read
Baidu Geek Talk
Nov 22, 2023 · Operations

Stability Assurance for Baidu Search Aladdin during Large-Scale Events

Baidu’s Aladdin search service safeguards stability during massive traffic spikes, such as the Gaokao and the Tokyo and Beijing Olympics, by mapping dependencies, deploying multi‑dimensional monitoring, adding scaling layers like multi‑region Redis, and establishing rapid‑response on‑call teams, achieving over 99.99% uptime and near‑real‑time data updates.

backend operations · fault handling · large-scale traffic
0 likes · 9 min read
dbaplus Community
Jun 5, 2023 · Operations

Mastering Production Faults: Diagnose and Fix Network, Server, Database Issues

This guide outlines the most common production failures—including network, server, database, software, security, storage, configuration, and third‑party service issues—and provides step‑by‑step methods to detect, troubleshoot, and resolve each problem, helping maintain system stability and reliability.

Operations · Server · database
0 likes · 30 min read
Wukong Talks Architecture
May 17, 2023 · Operations

Common Production Faults and Their Handling Guide

This guide outlines the most common production failures—including network, server, database, software, security, storage, configuration, and third‑party service issues—and provides detailed steps for detecting, diagnosing, and resolving each type to maintain system stability and reliability.

Operations · fault handling · production
0 likes · 30 min read
Top Architect
Nov 6, 2022 · Backend Development

Comprehensive Guide to Backend Development: System Design, Architecture, Network Communication, Fault Handling, Monitoring, Service Governance and Deployment

This article provides a thorough overview of backend development, covering system development principles, architectural design patterns, network communication techniques, common faults and exceptions, monitoring and alerting strategies, service governance practices, and deployment workflows, all illustrated with clear explanations and practical examples.

Backend · Deployment · System Design
0 likes · 33 min read
Efficient Ops
Aug 2, 2022 · Operations

Mastering Incident Response: Principles and Methods for Effective Operations

This guide outlines essential incident‑response principles—prioritizing business recovery and timely escalation—and presents practical methods such as restart, isolation, and downgrade, while also detailing stakeholder roles and post‑incident review practices for reliable operations.

Isolation · Service Restart · downgrade
0 likes · 10 min read
Top Architect
Aug 2, 2022 · Operations

Effective Fault Handling, Monitoring, and Emergency Response for Call‑Center Systems

This article presents a comprehensive guide on diagnosing, monitoring, and quickly resolving call‑center system failures, covering common troubleshooting steps, monitoring enhancements, emergency‑plan design, and intelligent event‑handling techniques to improve operational reliability and response speed.

Operations · emergency response · fault handling
0 likes · 15 min read
MaGe Linux Operations
Jan 24, 2021 · Operations

How to Speed Up Call Center Incident Resolution with Proven Ops Strategies

This article walks through a real call‑center outage, outlines why traditional ad‑hoc debugging fails, and presents a structured approach—including symptom identification, rapid root‑cause isolation, enhanced monitoring, concise emergency playbooks, and intelligent automation—to dramatically reduce recovery time and move toward self‑healing operations.

Automation · call center · emergency plan
0 likes · 13 min read
Efficient Ops
Dec 1, 2020 · Operations

Zero‑Downtime Ops: Inside Tencent’s Panshi High‑Availability Platform

At the 2020 GOPS Global Operations Conference, Tencent’s senior operations engineer Xie Hailin detailed the design and implementation of the Panshi platform—a comprehensive, high‑availability solution that unifies change management, fault handling, continuous operation, and disaster recovery to ensure uninterrupted payment services for billions of daily transactions.

Operations · aiops · change management
0 likes · 24 min read
ITPUB
Oct 9, 2020 · Operations

How to Streamline Call Center Incident Management: Practical Steps and Best Practices

This guide walks through a real‑world call‑center slowdown incident, outlines common fault‑handling techniques, proposes monitoring enhancements, details a comprehensive emergency‑response plan, and introduces intelligent event‑processing concepts to help operations teams resolve outages faster and more reliably.

Automation · Operations · call center
0 likes · 15 min read
Open Source Linux
Sep 12, 2020 · Operations

Mastering Incident Response: Core Principles and Practical Methods

This guide outlines essential incident‑response principles—prioritizing business restoration and timely escalation—while detailing practical methods such as restart, isolation, and degradation, and explains how to organize response teams and conduct thorough post‑incident reviews.

Isolation · Restart · degradation
0 likes · 11 min read
Efficient Ops
Sep 9, 2020 · Operations

Mastering Incident Management: Core Principles and Practical Methods

This guide outlines essential incident management principles—prioritizing business restoration and timely escalation—followed by detailed methodologies such as restart, isolation, and degradation, and explains role responsibilities, user impact handling, and post‑incident summarization for continuous improvement.

Operations · fault handling · incident management
0 likes · 10 min read
Efficient Ops
Apr 12, 2020 · Operations

Master Incident Management: Definitions, Processes, and Best Practices

This guide explains fault management fundamentals—from ITIL‑based definitions and why it matters, to fault level classification, monitoring, emergency response, recovery, post‑mortem analysis, continuous improvement, and practical advice for practitioners—providing a comprehensive, actionable framework for reliable operations.

Continuous Improvement · ITIL · Operations
0 likes · 11 min read
Efficient Ops
Oct 29, 2018 · Operations

How Youzan Manages Online Incidents: A Step‑by‑Step Guide

This article outlines Youzan's end‑to‑end online incident management process—from fault detection and coordination through root‑cause analysis, recovery, review, and actionable JIRA tracking—highlighting practical workflows, data analysis, and continuous improvement practices for reliable service delivery.

JIRA workflow · Operations · fault handling
0 likes · 10 min read
Qunar Tech Salon
Aug 18, 2017 · Operations

Hardware Automation Operations System at Qunar: Design, Implementation, and Lessons Learned

This article details Qunar's hardware automation operations platform, covering the hardware scope, pain points of manual processes, a five‑stage lifecycle, automated testing, data collection, fault handling, and the underlying Mesos‑Marathon‑Docker infrastructure that together improve efficiency, reliability, and cost control.

data collection · fault handling · hardware automation
0 likes · 21 min read