Tagged articles

incident

15 articles · Page 1 of 1

Jun 1, 2026 · Operations

One Nginx Config Change Triggered a P0 Outage on Promotion Day – 5 Hard‑Earned Lessons

A single missing keepalive setting in Nginx caused a P0 outage during a sales promotion. The article walks through five real incidents (logging, WebSocket timeouts, Docker worker counts, reload pitfalls, and SSL expiry) and offers concrete configuration fixes and preventive practices.

Docker · Kubernetes · NGINX

0 likes · 12 min read

dbaplus Community

Mar 23, 2026 · Operations

How a Single AI‑Driven Command Wiped 2.5 Years of Production Data

In this detailed post‑mortem, Alexey Grigorev recounts how using Claude Code to automate a Terraform deployment unintentionally erased his entire production environment and two‑and‑a‑half years of data, exposing the risks of over‑reliance on AI‑driven automation and highlighting essential safeguards.

AI · AWS · Automation

0 likes · 11 min read

dbaplus Community

Nov 15, 2025 · Operations

What 20 Years of Ops Mishaps Reveal About Infrastructure Resilience

A chronicle of real‑world operations incidents from 2003 to 2024 shows how simple mistakes—mis‑configured passwords, unplugged cables, hardware mix‑ups, and rushed cloud migrations—can cascade into massive outages, offering hard‑earned lessons for anyone managing production systems.

Case Study · Operations · incident

0 likes · 11 min read

Efficient Ops

Jun 18, 2025 · Operations

Bizarre Ops Nightmares: Real-World Failures That Shocked Engineers

A collection of startling operational mishaps—from a disastrous database expansion during a sales event to a Kubernetes storage blunder, a misconfigured ESXi host, a company‑wide Excel crash, and a power‑maintenance disaster that fried servers—illustrates the critical importance of proper procedures, backups, and infrastructure monitoring.

Operations · UPS · failure

0 likes · 7 min read

ITPUB

Apr 25, 2025 · Operations

Bizarre Ops Disasters: From K8s Data Loss to Accidental Process Killing

This article recounts a series of shocking operational mishaps—including a Kubernetes PV/PVC deletion that erased an entire codebase, a careless shell script that killed the wrong processes, a rookie’s risky server formatting, and a mysterious Excel crash—highlighting the importance of proper backups, testing, and change control.

Resource Monitoring · Shell script · incident

0 likes · 7 min read

dbaplus Community

Mar 16, 2025 · Operations

Bizarre Ops Disasters: Real‑World Kubernetes, Script, and Server Mishaps

This article compiles six shocking operations incidents, including a Kubernetes PV/PVC deletion that erased an entire codebase, a careless kill script that terminated critical services, a rookie admin formatting servers without backup, ESXi CPU saturation causing stock-exchange timeouts, and a production DB expansion that wiped transaction data. Together they highlight the consequences of inadequate safeguards and the importance of rigorous operational practices.

Scripting · incident

0 likes · 9 min read

Selected Java Interview Questions

Mar 10, 2025 · Backend Development

Postmortem of a Server Crash Caused by a Mis‑managed Scheduled Task in a Backend Module

The article analyzes a server outage triggered by a module that repeatedly created a scheduled task without proper lifecycle control, examines the problematic Java code, lists four key issues, presents a corrected implementation, and reflects on development, testing, review, and logging practices to prevent similar incidents.

Logging · ScheduledExecutorService · ThreadPool

0 likes · 5 min read

Java Web Project

Jan 21, 2025 · Industry Insights

When an Intern Deleted ByteDance’s Lite Models: Lessons on AI Ops and Culture

An intern at ByteDance accidentally removed all sub‑GB machine‑learning models by deleting a parent directory with skip‑trash, prompting a P0 incident that sparked a massive internal discussion about model impact, flat‑management permissions, and the broader implications for AI operations.

AI · ByteDance · Model Management

0 likes · 7 min read

Ximalaya Technology Team

Sep 13, 2023 · Operations

Cache Instance Failure Incident Analysis and Root Cause Investigation

During a night-time outage, an XCache (Codis + Pika) instance hung when a massive write load triggered low-level protection, causing Sentinel to switch masters. The proxy's accept queue then filled with timed-out sockets and blocked new connections. Scaling the proxy layer and expanding capacity restored service, and the incident prompted work on automation, health checks, and queue-overflow alerts.

Cache · Operations · Root Cause Analysis

0 likes · 7 min read

Big Data Technology Architecture

Jul 14, 2022 · Operations

Postmortem of Bilibili SLB Outage on July 13, 2021

This postmortem details the July 13, 2021 Bilibili outage, caused by a Lua bug that drove CPU usage to 100% in the OpenResty-based SLB. It describes the incident timeline, root-cause analysis, mitigation steps, and the subsequent technical and process improvements to enhance reliability and multi-active deployment.

Load Balancer · Lua · Operations

0 likes · 16 min read

Liangxu Linux

Apr 25, 2022 · Operations

Why HTTPie Lost 54,000 Stars: A Private Repo Mistake and What It Teaches

An accidental change to make the HTTPie repository private caused GitHub to delete all its 54,000 stars and watches, and despite the author's attempts to restore them, GitHub refused, highlighting risks of repository mismanagement and prompting recommendations for clearer warnings and soft‑delete mechanisms.

GitHub · httpie · incident

0 likes · 6 min read

IT Services Circle

Mar 22, 2022 · Backend Development

Cache Avalanche Incident: Root Cause, Response, and Prevention Strategies

A recent flash-sale failure caused by a cache avalanche was analyzed, revealing that setting the same two-hour expiration for all items flooded the database when they expired together. The post outlines detection steps, emergency mitigation, and three proven techniques to prevent recurrence: evenly spread (staggered) expiration times, mutex locking, and never-expire caches.

Cache · incident

0 likes · 4 min read

Java Architect Essentials

Jun 30, 2021 · Operations

Recovering Accidentally Deleted Production Server Data Using ext3grep, extundelete, and MySQL Binlog

After a junior staff member mistakenly ran an unchecked rm -rf command that erased an entire production server, the author details a step-by-step recovery using ext3grep, custom shell scripts, extundelete, and MySQL binlog replay, and concludes with lessons on backup, monitoring, and change management.

Data Recovery · MySQL · backup

0 likes · 8 min read

21CTO

Dec 15, 2018 · Information Security

When Deleting Databases Becomes Revenge: Real‑World Cases and What You Must Do

This article recounts several real incidents where disgruntled engineers or admins deleted critical databases as retaliation, highlighting the severe consequences and stressing that proper backups and cautious use of destructive commands are essential for any organization.

Operations · incident · rm

0 likes · 5 min read

MaGe Linux Operations

Dec 28, 2017 · Operations

Top 12 Linux Ops Disasters of 2017 and What They Teach Us

From Hearthstone’s dual‑database crash to Uber’s massive data breach, this 2017 Linux operations roundup chronicles twelve critical incidents—highlighting backup failures, Docker rebranding, ransomware, BGP hijacking, and more—offering key lessons for sysadmins and DevOps professionals.

BGP · Docker · backup

0 likes · 14 min read
