Tagged articles

postmortem

37 articles · Page 1 of 1

Apr 24, 2026 · Industry Insights

Anthropic Postmortem: Claude Code Decline Due to Product‑Layer Changes

Anthropic’s detailed postmortem explains that recent user‑perceived declines in Claude Code’s reasoning depth, context retention, and response length stemmed from three product‑layer adjustments—a lowered default reasoning effort, a caching bug that repeatedly cleared thinking, and an overly restrictive system prompt—rather than any degradation of the underlying model itself.

AI product engineering · Anthropic · Claude Code

0 likes · 15 min read

Raymond Ops

Apr 20, 2026 · Operations

How to Build a Standardized SRE On‑Call Process: From Alert Grading to Handoff Templates

This article presents a complete SRE on‑call handbook that defines alert severity levels, provides concrete Prometheus Alertmanager configurations, outlines a step‑by‑step response flow, details war‑room roles, escalation paths, handoff checklists, post‑mortem procedures, and dozens of ready‑to‑use templates to reduce MTTR and improve reliability.

Alert Management · On-Call · Operations

0 likes · 27 min read

MaGe Linux Operations

Dec 10, 2025 · Operations

Standardized SRE On‑Call Handbook: Alert Grading, Response Flow, and Handoff Templates

This handbook presents a complete, two‑year‑tested SRE on‑call process that defines alert severity tiers, response requirements, escalation paths, War‑Room roles, handoff schedules, post‑mortem procedures, and provides ready‑to‑use configuration snippets, checklists and templates to reduce MTTR and repeat incidents.

Alert Management · On-Call · Operations

0 likes · 26 min read

Alibaba Cloud Developer

Oct 9, 2025 · Operations

How AI‑Powered Multi‑Agent Systems Turn Fault Postmortems into Proactive Risk Prevention

This article explains how an AI‑driven multi‑agent platform automates fault postmortem generation, enriches analysis with memory management, prompt engineering, and RAG techniques, and delivers actionable insights for SREs, developers, and non‑technical stakeholders, ultimately shifting incident handling from reactive to proactive.

AI · Automation · Incident Management

0 likes · 44 min read

ITPUB

Sep 30, 2025 · Operations

Turning Ops Chaos into Order: Postmortems, Tools, and AI‑Powered Assistants

This article explains why the chaotic nature of modern operations—spanning mixed‑technology stacks, cross‑domain tasks, and legacy‑new architecture battles—creates value, outlines a fair post‑mortem process, and introduces practical tools and AI agents such as LinuxMirrors, kubectl‑ai, Zread AI, and Lerwee that help turn disorder into reliable, automated workflows.

AI assistant · Linux · devops tools

0 likes · 11 min read

Ops Community

Sep 16, 2025 · Operations

Mastering SRE: Fast Incident Response and Prevention Strategies

This guide walks SRE engineers through a complete incident lifecycle—preventive multi‑layer monitoring, chaos‑testing drills, rapid 10‑minute response tactics, systematic root‑cause analysis, effective communication roles, post‑mortem reviews, and practical case studies—helping teams minimize downtime and business loss.

Incident Management · Root Cause Analysis · SRE

0 likes · 11 min read

转转QA

Sep 11, 2025 · Operations

How to Turn QA Testing into a Proactive Defense System

This article explains how to make QA proactive by mastering pre‑release collective testing, daily monitoring of core interfaces, immersive online inspections, and a closed‑loop post‑incident review, turning routine checks into a powerful, collaborative quality shield.

QA · postmortem · process improvement

0 likes · 11 min read

Java Web Project

Jul 26, 2025 · Backend Development

How a Simple Pagination Change Triggered a P0 Outage and What We Learned

A seemingly trivial pagination update in a Java order service caused a P0 outage, leading to a 73‑minute disruption, 156 user complaints, and an estimated 650,000 CNY GMV loss; the post details the root cause, impact analysis, emergency response, and concrete process improvements to prevent recurrence.

Incident Management · Java · Microservices

0 likes · 14 min read

Test Development Learning Exchange

Oct 10, 2024 · R&D Management

Comprehensive Guide to Bug Incident Management, Post‑mortem, and Process Improvement

This guide outlines a complete workflow for handling software bugs—from immediate reporting and triage through impact assessment, resolution strategies, post‑incident analysis, and long‑term process, testing, and organizational improvements—to ensure stable releases and continuous quality enhancement.

CI/CD · Risk Management · bug management

0 likes · 14 min read

ITPUB

Aug 13, 2024 · Databases

Who Should Own Database Change Failures? A Deep Dive into Roles, Risks, and Responsibility

An in‑depth analysis of database change workflows reveals three key players—business developers, DBAs, and change tools—outlines the full lifecycle of a change, illustrates a MySQL VARCHAR length mishap, and argues why the business team should bear primary responsibility while DBAs assume secondary accountability.

DBA · database change management · mysql

0 likes · 9 min read

Continuous Delivery 2.0

Aug 13, 2024 · R&D Management

Is Amazon's COE Process Really Effective? Insights from SDEs

The article examines Amazon's Correction of Errors (COE) process, presenting both supportive and critical SDE perspectives, and discusses whether detailed post‑incident documentation truly improves engineering practices or merely adds bureaucratic overhead.

COE · Engineering Culture · Incident Management

0 likes · 10 min read

ITPUB

May 18, 2024 · Operations

How One Mistyped SQL Wiped All Orders—and the 45‑Minute Recovery That Followed

A quiet Saturday turned into a disaster when a simple UPDATE query accidentally deleted every order in production, prompting a rapid, step‑by‑step recovery, a post‑mortem analysis of the root causes, and a set of hard‑won operational lessons for any engineering team.

SQL · incident response · postmortem

0 likes · 8 min read

dbaplus Community

May 18, 2024 · Databases

How Linear’s Biggest Outage Happened: A PostgreSQL Truncate Disaster and Recovery Walkthrough

The article details Linear's most severe five‑year outage caused by an accidental TRUNCATE CASCADE on a PostgreSQL table, walks through the minute‑by‑minute timeline, explains why CI checks missed the error, and outlines the recovery steps and post‑mortem lessons for future database safety.

CI · Data Recovery · Database Incident

0 likes · 10 min read

ITPUB

Feb 19, 2024 · Databases

What Caused Linear’s Massive Data Loss and How They Recovered It

Linear, the SaaS project‑management tool, suffered a catastrophic data loss when a TRUNCATE CASCADE command unintentionally wiped production tables, prompting a detailed post‑mortem that outlines the timeline, root cause, recovery steps, impact, and a set of concrete preventive measures.

CI/CD · DataRecovery · IncidentManagement

0 likes · 10 min read

Architecture and Beyond

Dec 2, 2023 · Operations

Postmortem Analysis of the Yuque Service Outage and Lessons on Complex Systems and the KISS Principle

The article reviews the October 23 Yuque service outage, analyzes root causes such as a buggy upgrade tool and outdated storage, extracts operational lessons on testing, disaster recovery, high‑availability, communication, and advocates the KISS principle to simplify complex systems for improved reliability.

KISS principle · Operations · complex systems

0 likes · 10 min read

Su San Talks Tech

Oct 27, 2023 · Operations

What We Learned from Yuque’s October 23 Outage: A Detailed Incident Review

This article walks through Yuque’s October 23 service disruption, detailing each timeline milestone, analyzing the root causes, highlighting the importance of monitoring and data integrity checks, and offering concrete post‑mortem recommendations to improve future incident handling.

Cloud Services · Disaster Recovery · Monitoring

0 likes · 12 min read

Tech Architecture Stories

Aug 8, 2023 · Operations

Mastering Fault Postmortems: Proven Methods to Boost System Reliability

This comprehensive guide explains the origins, methodologies, and practical steps of fault postmortems—including PDCA, GRIA, aviation safety lessons, industrial accident theory, and software reliability metrics—to help teams systematically investigate incidents, derive actionable improvements, and continuously enhance system availability.

GRIA · PDCA · Reliability

0 likes · 22 min read

Tech Architecture Stories

Aug 7, 2023 · Operations

Mastering Fault Postmortems: Proven Methods to Boost System Reliability

This article explains the essence, purpose, and step‑by‑step process of fault postmortems—including preparation, root‑cause analysis, improvement actions, and decision making—while covering PDCA and GRIA methodologies, industry examples, MTTR/MTBF metrics, and practical templates for lasting reliability.

GRIA · Incident Management · MTTR

0 likes · 24 min read

HelloTech

Nov 22, 2022 · Operations

Guidelines for Incident Postmortem and Fault Review

The incident postmortem guideline advocates a dialectical view of failures, rapid low‑severity recovery, and a structured process—covering background, impact scope, timeline replay, deep root‑cause analysis, SMART improvement actions, responsibility assignment, and PDCA‑validated closure—to enhance system resilience, team anti‑fragility, and knowledge sharing.

High Availability · MTBF · MTTR

0 likes · 15 min read

Xiaohe Frontend Team

Nov 15, 2022 · Operations

Mastering Incident Postmortems: Turn Failures into Learning Opportunities

This article explains why thorough, blameless incident postmortems are essential, outlines when to initiate them, describes the key components of an effective review, and offers practical steps to transform each outage into a continuous‑improvement opportunity for engineering teams.

Blameless Culture · Incident Management · Root Cause Analysis

0 likes · 6 min read

Big Data Technology Architecture

Jul 14, 2022 · Operations

Postmortem of Bilibili SLB Outage on July 13, 2021

This postmortem details the July 13, 2021 Bilibili outage caused by a Lua‑induced CPU 100% bug in the OpenResty‑based SLB, describing the incident timeline, root‑cause analysis, mitigation steps, and the subsequent technical and process improvements to enhance reliability and multi‑active deployment.

Load Balancer · Lua · Operations

0 likes · 16 min read

21CTO

Feb 9, 2022 · Operations

Why Roblox’s Three‑Day Outage Happened: Consul Streaming Bug and BoltDB Design Flaw

Roblox’s detailed post‑mortem reveals that a three‑day outage was caused by a Consul streaming bug and a design flaw in BoltDB’s freelist, which together created CPU contention and latency spikes on its massive on‑premises infrastructure, leading the team to disable streaming, add a second data‑center, and redesign their architecture.

BoltDB · Consul · Outage

0 likes · 9 min read

Efficient Ops

Jul 11, 2021 · Operations

Mastering Incident Management: Principles and Methods for Effective Fault Handling

This guide outlines essential incident management principles—prioritizing business recovery and timely escalation—and presents practical methodologies such as restart, isolation, and downgrade, while detailing user impact handling, organizational roles, and post‑mortem best practices.

Incident Management · Operations · escalation

0 likes · 10 min read

58UXD

Jun 16, 2021 · Fundamentals

Why and How Designers Should Do Project Post‑Mortems

This guide explains why project post‑mortems are essential for designers, outlines a simple review framework, and shares a step‑by‑step case study of creating sticker designs, offering practical tips and reflections to improve future design work.

Reflection · UX · design

0 likes · 6 min read

macrozheng

Jan 12, 2021 · R&D Management

Mastering Internet Project Workflow: Roles, Timelines, and Postmortems

This article outlines the typical roles in an internet company, explains how cross‑department collaboration, timeline definition, resource allocation, development‑testing‑deployment phases, and comprehensive project retrospectives work together to ensure successful project delivery.

postmortem · project management · resource allocation

0 likes · 10 min read

dbaplus Community

Nov 21, 2020 · Operations

What Google’s Debugging Playbook Can Teach Distributed Storage Teams

Drawing on Google’s SRE experience and the author’s work with Filecoin, this article outlines practical strategies for debugging large‑scale distributed systems, covering organizational culture, measurement, blameless postmortems, engineer mindsets, incident response steps, and tooling recommendations.

Filecoin · Google SRE · postmortem

0 likes · 15 min read

JD Retail Technology

Apr 27, 2020 · R&D Management

How to Build a Sustainable Post‑Mortem Culture for Tech Projects

This guide outlines a complete post‑mortem (retrospective) framework—from defining types and preparing meetings to conducting effective sessions and tracking outcomes—helping teams continuously improve project execution and embed a learning‑focused culture across the organization.

R&D process · continuous improvement · postmortem

0 likes · 9 min read

Efficient Ops

Apr 12, 2020 · Operations

Master Incident Management: Definitions, Processes, and Best Practices

This guide explains fault management fundamentals—from ITIL‑based definitions and why it matters, to fault level classification, monitoring, emergency response, recovery, post‑mortem analysis, continuous improvement, and practical advice for practitioners—providing a comprehensive, actionable framework for reliable operations.

ITIL · Incident Management · Operations

0 likes · 11 min read

Continuous Delivery 2.0

Mar 6, 2020 · Operations

Google Incident Postmortem Checklist

The article presents a detailed Google‑derived post‑mortem checklist covering event data collection, root‑cause analysis, lessons learned, actionable improvement items, and review procedures to ensure systematic, blameless incident handling.

Incident Management · Operations · Root Cause Analysis

0 likes · 5 min read

21CTO

Jun 18, 2019 · Operations

Why Embracing Failure Accelerates Growth: Lessons from Intuit and PayPal

The article explains how organizations can achieve rapid growth by openly acknowledging failures, creating lightweight post‑mortem processes, and continuously learning from mistakes, illustrated through Intuit’s SaaS transition, PayPal’s rollback challenges, and practical rules for QA and architecture.

QA · Rollback · SaaS

0 likes · 31 min read

ITPUB

May 15, 2017 · Operations

Mastering Online Incident Management: From Detection to Prevention

This article outlines a comprehensive methodology for handling large‑scale online service incidents, covering goals, the "jump‑fill‑avoid" framework, step‑by‑step processes for detection, diagnosis, remediation, and post‑mortem analysis, as well as essential monitoring, logging, and escalation infrastructure.

Incident Management · Monitoring · Operations

0 likes · 18 min read

Qunar Tech Salon

Feb 4, 2017 · Operations

GitLab Database Deletion Incident: Lessons on Backup, Operations, and High‑Availability Design

The article recounts a GitLab production database deletion caused by a mistaken command, analyzes why the backup mechanisms failed, and offers technical and cultural recommendations—including automation, proper replication, and transparent post‑mortems—to build more reliable, high‑availability systems.

GitLab · Operations · PostgreSQL

0 likes · 15 min read

Efficient Ops

Feb 2, 2017 · Operations

GitLab.com Database Disaster: How a Mistyped rm Command Wiped 300GB and What We Learned

GitLab.com suffered a catastrophic database outage on February 1, 2017 when an exhausted operator mistakenly ran a destructive rm command on the wrong server, wiping most production data; the incident’s timeline, root causes, recovery steps, and lessons learned are detailed in this post‑mortem.

Database Incident · GitLab · Operations

0 likes · 12 min read

Efficient Ops

Oct 23, 2016 · Operations

How Google’s SRE Postmortems Drive System Reliability

This article explains Google’s SRE postmortem philosophy, the criteria for writing postmortems, best practices for a blame‑free culture, and how collaborative knowledge‑sharing and incentives improve incident handling and overall system reliability.

Incident Management · Operations · SRE

0 likes · 14 min read

Efficient Ops

Sep 13, 2016 · Operations

How Google SRE Principles Compare Across Industries

This article, excerpted from the upcoming Chinese edition of “SRE: Google Site Reliability Engineering”, examines how Google’s SRE guiding philosophies—disaster planning, post‑mortem culture, automation, and data‑driven decision‑making—are adopted, adapted, or contrasted in sectors such as manufacturing, aerospace, nuclear, telecommunications, healthcare, and finance, highlighting key similarities, differences, and lessons for Google and the broader tech industry.

Automation · Operations · Risk Management

0 likes · 21 min read

Efficient Ops

Sep 10, 2015 · Operations

How Google’s DevOps Practices Enable Instant Issue Detection and Swarming Resolution

This article explores Randy Shoup’s interview on Google’s DevOps culture, detailing how high‑efficiency organizations instantly detect problems, use swarming to resolve them, document lessons as new knowledge, and foster a blameless post‑mortem culture that drives continuous improvement.

Automation testing · Google · Incident Management

0 likes · 14 min read

Qunar Tech Salon

Mar 24, 2015 · Operations

Knight Capital's $460 Million Trading Bug: A Post‑mortem of Deployment and Operational Failures

The article recounts how a long‑unused code path was unintentionally reactivated during a rushed deployment for the Retail Liquidity Program, leading Knight Capital to send erroneous orders that caused a $460 million loss and pushed the firm to the brink of bankruptcy.

Risk Management · Trading Systems · postmortem

0 likes · 9 min read
