Tagged articles
14 articles
dbaplus Community
Mar 23, 2026 · Operations

How a Single AI‑Driven Command Wiped 2.5 Years of Production Data

In this detailed post‑mortem, Alexey Grigorev recounts how using Claude Code to automate a Terraform deployment unintentionally erased his entire production environment and two‑and‑a‑half years of data, exposing the risks of over‑reliance on AI‑driven automation and highlighting essential safeguards.

AI · AWS · Automation
0 likes · 11 min read
dbaplus Community
Nov 15, 2025 · Operations

What 20 Years of Ops Mishaps Reveal About Infrastructure Resilience

A chronicle of real‑world operations incidents from 2003 to 2024 shows how simple mistakes—mis‑configured passwords, unplugged cables, hardware mix‑ups, and rushed cloud migrations—can cascade into massive outages, offering hard‑earned lessons for anyone managing production systems.

Case Study · Incident · Infrastructure
0 likes · 11 min read
Efficient Ops
Jun 18, 2025 · Operations

Bizarre Ops Nightmares: Real-World Failures That Shocked Engineers

A collection of startling operational mishaps—from a disastrous database expansion during a sales event to a Kubernetes storage blunder, a misconfigured ESXi host, a company‑wide Excel crash, and a power‑maintenance disaster that fried servers—illustrates the critical importance of proper procedures, backups, and infrastructure monitoring.

Incident · Operations · UPS
0 likes · 7 min read
ITPUB
Apr 25, 2025 · Operations

Bizarre Ops Disasters: From K8s Data Loss to Accidental Process Killing

This article recounts a series of shocking operational mishaps—including a Kubernetes PV/PVC deletion that erased an entire codebase, a careless shell script that killed the wrong processes, a rookie’s risky server formatting, and a mysterious Excel crash—highlighting the importance of proper backups, testing, and change control.

Incident · Resource Monitoring · shell script
0 likes · 7 min read
dbaplus Community
Mar 16, 2025 · Operations

Bizarre Ops Disasters: Real‑World Kubernetes, Script, and Server Mishaps

This article compiles six shocking operations incidents—from a Kubernetes PV/PVC deletion that erased an entire codebase, to a careless kill‑script that terminated critical services, a rookie admin formatting servers without backup, ESXi CPU saturation causing stock‑exchange timeouts, and a production DB expansion that wiped transaction data—highlighting the dire consequences of inadequate safeguards and the importance of rigorous operational practices.

Incident · Scripting
0 likes · 9 min read
Selected Java Interview Questions
Mar 10, 2025 · Backend Development

Postmortem of a Server Crash Caused by a Mis‑managed Scheduled Task in a Backend Module

The article analyzes a server outage triggered by a module that repeatedly created a scheduled task without proper lifecycle control, examines the problematic Java code, lists four key issues, presents a corrected implementation, and reflects on development, testing, review, and logging practices to prevent similar incidents.

Backend · Debugging · Incident
0 likes · 5 min read
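
The failure mode named in the summary above, a task that is created again on every call with nothing that stops it, can be illustrated with a minimal sketch. This is not the article's corrected Java code, which is not reproduced in this listing; the class `PeriodicTask` and its methods are hypothetical names, and Python is used purely for illustration. The idea is that `start()` refuses to create a second task while one is running, and `stop()` ends the task explicitly.

```python
import threading


class PeriodicTask:
    """Runs `job` every `interval` seconds (hypothetical helper, not from the article)."""

    def __init__(self, job, interval):
        self._job = job
        self._interval = interval
        self._stop = threading.Event()
        self._thread = None
        self._lock = threading.Lock()

    def start(self):
        """Start the task once; return False if it is already running."""
        with self._lock:
            if self._thread is not None and self._thread.is_alive():
                return False  # already running: do not create a second task
            self._stop.clear()
            self._thread = threading.Thread(target=self._run, daemon=True)
            self._thread.start()
            return True

    def _run(self):
        # Event.wait returns True once stop() sets the event, which ends the loop.
        while not self._stop.wait(self._interval):
            self._job()

    def stop(self):
        """Signal the worker to exit and wait for it to finish."""
        with self._lock:
            self._stop.set()
            if self._thread is not None:
                self._thread.join()
                self._thread = None
```

Calling `start()` repeatedly is then harmless, which is the property the summary says the original module lacked.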
Ximalaya Technology Team
Sep 13, 2023 · Operations

Cache Instance Failure Incident Analysis and Root Cause Investigation

During a night-time outage, an XCache (Codis + Pika) instance hung after a massive write load triggered low-level protection, causing Sentinel to switch masters. The proxy's accept queue then filled with timed-out sockets and blocked new connections. Scaling the proxy layer and expanding capacity restored service, and the incident prompted follow-up work on automation, health checks, and queue-overflow alerts.

Cache · Incident · Operations
0 likes · 7 min read
Big Data Technology Architecture
Jul 14, 2022 · Operations

Postmortem of Bilibili SLB Outage on July 13, 2021

This postmortem details the July 13, 2021 Bilibili outage, caused by a Lua-induced bug that drove CPU usage to 100% in the OpenResty-based SLB. It describes the incident timeline, root-cause analysis, mitigation steps, and the subsequent technical and process improvements to enhance reliability and multi-active deployment.

Incident · Load Balancer · Lua
0 likes · 16 min read
Liangxu Linux
Apr 25, 2022 · Operations

Why HTTPie Lost 54,000 Stars: A Private Repo Mistake and What It Teaches

Accidentally switching the HTTPie repository to private caused GitHub to delete all of its 54,000 stars and its watchers. Despite the author's attempts to have them restored, GitHub refused. The incident highlights the risks of repository mismanagement and prompts recommendations for clearer warnings and soft-delete mechanisms.

GitHub · Incident · httpie
0 likes · 6 min read
IT Services Circle
Mar 22, 2022 · Backend Development

Cache Avalanche Incident: Root Cause, Response, and Prevention Strategies

This post analyzes a recent flash-sale failure caused by a cache avalanche: every item was cached with the same two-hour expiration, so all keys expired at once and the resulting misses flooded the database. It outlines detection steps, emergency mitigation, and three techniques to prevent recurrence: evenly distributed (staggered) expiration times, mutex locking, and never-expire caches.

Cache · Incident
0 likes · 4 min read
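
Two of the prevention techniques named in the summary above, spreading expiration times and guarding cache rebuilds with a mutex, can be sketched as follows. This is a minimal in-process illustration under stated assumptions, not the article's implementation: `MutexCache`, `jittered_ttl`, the `loader` callback standing in for the database query, and the ten-minute jitter window are all hypothetical.

```python
import random
import threading
import time

BASE_TTL_SECONDS = 2 * 60 * 60  # the two-hour expiration mentioned in the summary
JITTER_SECONDS = 10 * 60        # assumed spread; tune per workload


def jittered_ttl(base=BASE_TTL_SECONDS, jitter=JITTER_SECONDS, rng=random):
    """Return a TTL in [base, base + jitter] so keys do not all expire together."""
    return base + rng.randint(0, jitter)


class MutexCache:
    """Tiny cache with staggered expiry and one rebuild per key at a time."""

    def __init__(self, loader, clock=time.monotonic):
        self._loader = loader           # hypothetical function that queries the database
        self._clock = clock
        self._data = {}                 # key -> (value, expires_at)
        self._locks = {}                # key -> per-key lock
        self._guard = threading.Lock()  # protects the _locks dict

    def _lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def _fresh(self, key):
        entry = self._data.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry
        return None

    def get(self, key):
        entry = self._fresh(key)
        if entry is not None:
            return entry[0]
        # Miss: only one thread per key calls the loader; the rest wait and re-check.
        with self._lock_for(key):
            entry = self._fresh(key)
            if entry is not None:
                return entry[0]
            value = self._loader(key)
            self._data[key] = (value, self._clock() + jittered_ttl())
            return value
```

With a shared cache such as Redis the same two ideas would apply, but the lock would need to be a distributed one; that variant is outside this sketch.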
Java Architect Essentials
Jun 30, 2021 · Operations

Recovering Accidentally Deleted Production Server Data Using ext3grep, extundelete, and MySQL Binlog

After a junior staff member mistakenly ran an unchecked rm -rf command that erased an entire production server, the author details a step-by-step recovery using ext3grep, custom shell scripts, extundelete, and MySQL binlog replay, and concludes with lessons on backup, monitoring, and change management.

Backup · Data Recovery · Incident
0 likes · 8 min read
21CTO
Dec 15, 2018 · Information Security

When Deleting Databases Becomes Revenge: Real‑World Cases and What You Must Do

This article recounts several real incidents where disgruntled engineers or admins deleted critical databases as retaliation, highlighting the severe consequences and stressing that proper backups and cautious use of destructive commands are essential for any organization.

Incident · Operations · Security
0 likes · 5 min read
MaGe Linux Operations
Dec 28, 2017 · Operations

Top 12 Linux Ops Disasters of 2017 and What They Teach Us

From Hearthstone’s dual‑database crash to Uber’s massive data breach, this 2017 Linux operations roundup chronicles twelve critical incidents—highlighting backup failures, Docker rebranding, ransomware, BGP hijacking, and more—offering key lessons for sysadmins and DevOps professionals.

BGP · Backup · Docker
0 likes · 14 min read