Tag

postmortem


Test Development Learning Exchange
Oct 10, 2024 · R&D Management

Comprehensive Guide to Bug Incident Management, Post‑mortem, and Process Improvement

This guide outlines a complete workflow for handling software bugs—from immediate reporting and triage through impact assessment, resolution strategies, post‑incident analysis, and long‑term process, testing, and organizational improvements—to ensure stable releases and continuous quality enhancement.

Software Development · bug management · ci/cd
14 min read
Continuous Delivery 2.0
Aug 13, 2024 · R&D Management

Is Amazon's COE Process Really Effective? Insights from SDEs

The article examines Amazon's Correction of Errors (COE) process, presenting both supportive and critical SDE perspectives, and discusses whether detailed post‑incident documentation truly improves engineering practices or merely adds bureaucratic overhead.

CoE · Engineering Culture · SDE
10 min read
Architecture and Beyond
Dec 2, 2023 · Operations

Postmortem Analysis of the Yuque Service Outage and Lessons on Complex Systems and the KISS Principle

The article reviews the October 23 Yuque service outage, analyzes root causes such as a buggy upgrade tool and outdated storage, extracts operational lessons on testing, disaster recovery, high availability, and communication, and advocates the KISS principle to simplify complex systems for improved reliability.

KISS principle · complex systems · operations
10 min read
HelloTech
Nov 22, 2022 · Operations

Guidelines for Incident Postmortem and Fault Review

The incident postmortem guideline advocates a dialectical view of failures, rapid low‑severity recovery, and a structured process—covering background, impact scope, timeline replay, deep root‑cause analysis, SMART improvement actions, responsibility assignment, and PDCA‑validated closure—to enhance system resilience, team anti‑fragility, and knowledge sharing.

MTBF · MTTR · high availability
15 min read
Big Data Technology Architecture
Jul 14, 2022 · Operations

Postmortem of Bilibili SLB Outage on July 13, 2021

This postmortem details the July 13, 2021 Bilibili outage, caused by a Lua‑induced bug that drove CPU usage to 100% in the OpenResty‑based SLB. It describes the incident timeline, root‑cause analysis, mitigation steps, and the subsequent technical and process improvements to enhance reliability and multi‑active deployment.

Incident · SLB · SRE
16 min read
Efficient Ops
Jul 11, 2021 · Operations

Mastering Incident Management: Principles and Methods for Effective Fault Handling

This guide outlines essential incident management principles—prioritizing business recovery and timely escalation—and presents practical methodologies such as restart, isolation, and downgrade, while detailing user impact handling, organizational roles, and post‑mortem best practices.

escalation · fault handling · incident management
10 min read
macrozheng
Jan 12, 2021 · R&D Management

Mastering Internet Project Workflow: Roles, Timelines, and Postmortems

This article outlines the typical roles in an internet company, explains how cross‑department collaboration, timeline definition, resource allocation, development‑testing‑deployment phases, and comprehensive project retrospectives work together to ensure successful project delivery.

Deployment · Software Development · postmortem
10 min read
Efficient Ops
Apr 12, 2020 · Operations

Master Incident Management: Definitions, Processes, and Best Practices

This guide explains fault management fundamentals—from ITIL‑based definitions and why it matters, to fault level classification, monitoring, emergency response, recovery, post‑mortem analysis, continuous improvement, and practical advice for practitioners—providing a comprehensive, actionable framework for reliable operations.

ITIL · Monitoring · continuous improvement
11 min read
Continuous Delivery 2.0
Mar 6, 2020 · Operations

Google Incident Postmortem Checklist

The article presents a detailed Google‑derived post‑mortem checklist covering event data collection, root‑cause analysis, lessons learned, actionable improvement items, and review procedures to ensure systematic, non‑blame‑focused incident handling.

action items · checklist · incident management
5 min read
Qunar Tech Salon
Feb 4, 2017 · Operations

GitLab Database Deletion Incident: Lessons on Backup, Operations, and High‑Availability Design

The article recounts a GitLab production database deletion caused by a mistaken command, analyzes why the backup mechanisms failed, and offers technical and cultural recommendations—including automation, proper replication, and transparent post‑mortems—to build more reliable, high‑availability systems.

GitLab · PostgreSQL · backup
15 min read
Efficient Ops
Feb 2, 2017 · Operations

GitLab.com Database Disaster: How a Mistyped rm Command Wiped 300GB and What We Learned

GitLab.com suffered a catastrophic database outage on February 1, 2017 when an exhausted operator mistakenly ran a destructive rm command on the wrong server, wiping most production data; the incident’s timeline, root causes, recovery steps, and lessons learned are detailed in this post‑mortem.

DevOps · GitLab · PostgreSQL
12 min read
Efficient Ops
Oct 23, 2016 · Operations

How Google’s SRE Postmortems Drive System Reliability

This article explains Google’s SRE postmortem philosophy, the criteria for writing postmortems, best practices for a blame‑free culture, and how collaborative knowledge‑sharing and incentives improve incident handling and overall system reliability.

SRE · incident management · operations
14 min read
Efficient Ops
Sep 13, 2016 · Operations

How Google SRE Principles Compare Across Industries

This article, excerpted from the upcoming Chinese edition of “SRE: Google Site Reliability Engineering”, examines how Google’s SRE guiding philosophies—disaster planning, post‑mortem culture, automation, and data‑driven decision‑making—are adopted, adapted, or contrasted in sectors such as manufacturing, aerospace, nuclear, telecommunications, healthcare, and finance, highlighting key similarities, differences, and lessons for Google and the broader tech industry.

Automation · SRE · incident response
21 min read
Efficient Ops
Sep 10, 2015 · Operations

How Google’s DevOps Practices Enable Instant Issue Detection and Swarming Resolution

This article explores Randy Shoup’s interview on Google’s DevOps culture, detailing how high‑efficiency organizations instantly detect problems, use swarming to resolve them, document lessons as new knowledge, and foster a blameless post‑mortem culture that drives continuous improvement.

Automation Testing · DevOps · Google
14 min read