Tag

fault handling

Bilibili Tech
Aug 16, 2024 · Operations

Design and Implementation of Bilibili's Emergency Response Center for Incident Management

Bilibili’s Emergency Response Center (ERC) unifies incident detection, rapid response, precise scoping, and coordinated recovery through multi-dimensional alerts, automated collaboration, standardized updates, and post-mortem analysis. It targets 1-minute detection, 3-minute response, 5-minute scoping, and 10-minute recovery (the 1-3-5-10 goals). The ERC has cut MTTR, achieved over 80% automatic recall accuracy, and met more than 60% of its 1-3-5-10 performance goals.

Incident Management · MTTR · SRE
22 min read
Efficient Ops
Jan 29, 2024 · Operations

Mastering Incident Response: A Practical Guide to Faster Service Recovery

This guide walks ops teams through real‑world incident scenarios, from quick symptom identification and emergency recovery to improving monitoring, crafting concise emergency plans, and leveraging automation for smarter fault handling, helping organizations restore services faster and reduce downtime.

emergency planning · fault handling · incident response
14 min read
Efficient Ops
Jan 14, 2024 · Operations

Mastering Incident Command: A Practical Guide for SRE Fault Handling

This article outlines a comprehensive, step‑by‑step approach for SRE incident commanders, covering fault perception, grading, team organization, remediation tactics, transparent communication, and post‑mortem practices to efficiently resolve service disruptions.

Incident Management · SRE · fault handling
14 min read
Baidu Geek Talk
Nov 22, 2023 · Operations

Stability Assurance for Baidu Search Aladdin during Large-Scale Events

Baidu’s Aladdin search service safeguards stability during massive traffic spikes (such as the Gaokao and the Tokyo and Beijing Olympics) by mapping dependencies, deploying multi-dimensional monitoring, adding scaling layers like multi-region Redis, and establishing rapid-response on-call teams, achieving over 99.99% uptime and near-real-time data updates.

backend operations · fault handling · large-scale traffic
9 min read
Wukong Talks Architecture
May 17, 2023 · Operations

Common Production Faults and Their Handling Guide

This guide outlines the most common production failures—including network, server, database, software, security, storage, configuration, and third‑party service issues—and provides detailed steps for detecting, diagnosing, and resolving each type to maintain system stability and reliability.

Production · Troubleshooting · fault handling
30 min read
Top Architect
Nov 6, 2022 · Backend Development

Comprehensive Guide to Backend Development: System Design, Architecture, Network Communication, Fault Handling, Monitoring, Service Governance and Deployment

This article provides a thorough overview of backend development, covering system development principles, architectural design patterns, network communication techniques, common faults and exceptions, monitoring and alerting strategies, service governance practices, and deployment workflows, all illustrated with clear explanations and practical examples.

architecture · deployment · fault handling
33 min read
Efficient Ops
Aug 2, 2022 · Operations

Mastering Incident Response: Principles and Methods for Effective Operations

This guide outlines essential incident‑response principles—prioritizing business recovery and timely escalation—and presents practical methods such as restart, isolation, and downgrade, while also detailing stakeholder roles and post‑incident review practices for reliable operations.

downgrade · escalation · fault handling
10 min read
Top Architect
Aug 2, 2022 · Operations

Effective Fault Handling, Monitoring, and Emergency Response for Call‑Center Systems

This article presents a comprehensive guide on diagnosing, monitoring, and quickly resolving call‑center system failures, covering common troubleshooting steps, monitoring enhancements, emergency‑plan design, and intelligent event‑handling techniques to improve operational reliability and response speed.

Incident Management · emergency response · fault handling
15 min read
Efficient Ops
Jul 11, 2021 · Operations

Mastering Incident Management: Principles and Methods for Effective Fault Handling

This guide outlines essential incident management principles—prioritizing business recovery and timely escalation—and presents practical methodologies such as restart, isolation, and downgrade, while detailing user impact handling, organizational roles, and post‑mortem best practices.

Incident Management · escalation · fault handling
10 min read
Efficient Ops
Dec 1, 2020 · Operations

Zero‑Downtime Ops: Inside Tencent’s Panshi High‑Availability Platform

At the 2020 GOPS Global Operations Conference, Tencent’s senior operations engineer Xie Hailin detailed the design and implementation of the Panshi platform—a comprehensive, high‑availability solution that unifies change management, fault handling, continuous operation, and disaster recovery to ensure uninterrupted payment services for billions of daily transactions.

AIOps · High Availability · change management
24 min read
Efficient Ops
Sep 9, 2020 · Operations

Mastering Incident Management: Core Principles and Practical Methods

This guide outlines essential incident management principles—prioritizing business restoration and timely escalation—followed by detailed methodologies such as restart, isolation, and degradation, and explains role responsibilities, user impact handling, and post‑incident summarization for continuous improvement.

Incident Management · fault handling · operations
10 min read
Efficient Ops
Apr 12, 2020 · Operations

Master Incident Management: Definitions, Processes, and Best Practices

This guide explains fault management fundamentals—from ITIL‑based definitions and why it matters, to fault level classification, monitoring, emergency response, recovery, post‑mortem analysis, continuous improvement, and practical advice for practitioners—providing a comprehensive, actionable framework for reliable operations.

ITIL · Incident Management · continuous improvement
11 min read
Efficient Ops
Oct 29, 2018 · Operations

How Youzan Manages Online Incidents: A Step‑by‑Step Guide

This article outlines Youzan's end‑to‑end online incident management process—from fault detection and coordination through root‑cause analysis, recovery, review, and actionable JIRA tracking—highlighting practical workflows, data analysis, and continuous improvement practices for reliable service delivery.

Incident Management · JIRA workflow · fault handling
10 min read
Qunar Tech Salon
Aug 18, 2017 · Operations

Hardware Automation Operations System at Qunar: Design, Implementation, and Lessons Learned

This article details Qunar's hardware automation operations platform, covering the hardware scope, pain points of manual processes, a five‑stage lifecycle, automated testing, data collection, fault handling, and the underlying Mesos‑Marathon‑Docker infrastructure that together improve efficiency, reliability, and cost control.

data collection · fault handling · hardware automation
21 min read