
Bizarre Ops Disasters: Real‑World Kubernetes, Script, and Server Mishaps

This article compiles five operations incidents: a Kubernetes PV/PVC deletion that erased an entire codebase, a careless kill script that terminated critical services, a rookie admin who formatted a server without taking a backup, ESXi CPU saturation that caused service timeouts at a securities firm during trading hours, and a production database expansion that wiped transaction data. Together they show the consequences of inadequate safeguards and the importance of rigorous operational practices.

dbaplus Community

1. Kubernetes PV/PVC Deletion Disaster

A newcomer to Kubernetes joined a small company that ran everything on K8s, including its databases and GitLab. All services shared a single NFS-backed PV and PVC, with one subdirectory per service. While deleting one service's PVC, the admin cancelled the operation because it seemed stuck, only to discover that K8s had already cleared the entire shared directory, erasing all the code. A month-old backup allowed a rollback, but the incident underscored the risk of sharing a single persistent volume across multiple workloads without proper isolation.
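One way to get the isolation this story calls for is to give every service its own PV/PVC pair and to set the volume's reclaim policy to Retain, so that deleting a claim does not trigger a cleanup of the underlying data. The sketch below only generates the manifests. The source does not say which reclaim policy or provisioner the company used, so static NFS provisioning is an assumption here, and the server name, export root, service names, and sizes are hypothetical.

```python
import json


def build_volume_manifests(service, nfs_server, export_root, size="10Gi"):
    """Return a (pv, pvc) pair of manifests dedicated to one service.

    Assumption: volumes are statically provisioned on an NFS export,
    with one subdirectory per service (names here are illustrative).
    """
    pv_name = f"{service}-pv"
    pv = {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": pv_name},
        "spec": {
            "capacity": {"storage": size},
            "accessModes": ["ReadWriteMany"],
            # Retain keeps the data on the NFS export when the claim is deleted.
            "persistentVolumeReclaimPolicy": "Retain",
            # An empty storage class disables dynamic provisioning for this pair.
            "storageClassName": "",
            "nfs": {"server": nfs_server, "path": f"{export_root}/{service}"},
        },
    }
    pvc = {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": f"{service}-pvc"},
        "spec": {
            "accessModes": ["ReadWriteMany"],
            "storageClassName": "",
            # Bind explicitly to this service's volume and to nothing else.
            "volumeName": pv_name,
            "resources": {"requests": {"storage": size}},
        },
    }
    return pv, pvc


if __name__ == "__main__":
    # Hypothetical services and NFS server, for illustration only.
    for name in ("gitlab", "mysql"):
        for manifest in build_volume_manifests(name, "nfs.example.internal", "/exports"):
            print(json.dumps(manifest, indent=2))
```

The printed JSON can be reviewed before it is applied. With one pair per service, removing the claim of a single service no longer touches the directory of any other.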

2. Dangerous Kill‑Script Accident

A developer wrote a shell script that used a ps | grep | xargs kill pipeline to terminate processes matching certain criteria. The script worked in the developer's own container, but once deployed to production it killed a critical process. The cause was an awk command that extracted the entire line instead of just the PID, so the process's arguments (including resource flags) ended up matching a vital middleware container. The resulting cascade of failures showed how a seemingly harmless script can cause a massive outage when it runs against many containers.
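The failure mode above comes from matching on whole lines of process output. A minimal sketch of a stricter approach follows: parse each line into a numeric PID and a command, require the pattern to match the entire command rather than any substring, and return a list that can be reviewed before anything is killed. The function name and the sample process names are hypothetical, and the input is assumed to come from something like ps -eo pid=,comm=.

```python
import re


def find_target_pids(ps_output, name_pattern, exclude_pids=()):
    """Return PIDs whose command matches name_pattern in full.

    ps_output is expected to hold one "PID COMMAND" pair per line.
    Lines without a numeric first field are skipped, so headers and
    malformed rows can never be turned into kill targets.
    """
    matcher = re.compile(name_pattern)
    excluded = set(exclude_pids)
    pids = []
    for line in ps_output.splitlines():
        fields = line.split(None, 1)
        if len(fields) != 2 or not fields[0].isdigit():
            continue
        pid, command = int(fields[0]), fields[1].strip()
        if pid in excluded:
            continue
        # fullmatch: a pattern that merely appears inside another
        # process's arguments does not count as a match.
        if matcher.fullmatch(command):
            pids.append(pid)
    return pids


if __name__ == "__main__":
    sample = (
        "  101 worker-sync\n"
        "  202 middleware --name worker-sync --cpus 2\n"
        "  303 worker-sync\n"
    )
    # PID 202 only mentions the name in its arguments, so it is not selected.
    print(find_target_pids(sample, r"worker-sync", exclude_pids=[303]))
```

Printing the selected PIDs as a dry run, and only then passing them to a kill step, gives the operator a chance to spot an unexpected target.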

3. Rookie Admin Formats Server Without Backup

At a company with a four-node Dell server cluster, a newly hired operations engineer was tasked with formatting the server that housed the code repository. Despite repeated warnings to take a backup first, the engineer went ahead, and the server became unreachable. By the time senior staff intervened, it had already been formatted. Fortunately, backups existed, but the incident highlighted the importance of enforcing backup policies and of not letting inexperienced staff perform high-risk actions without supervision.
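A backup policy is easier to enforce when the destructive step itself refuses to run without a recent backup. The following is a minimal sketch of such a gate, not a description of what the company in the story used. The function name, the exception type, and the 24-hour limit are assumptions chosen for illustration, and how the timestamp of the last backup is obtained is left to the caller.

```python
from datetime import datetime, timedelta


class BackupCheckError(RuntimeError):
    """Raised when a destructive action must not proceed."""


def require_recent_backup(last_backup_at, now, max_age=timedelta(hours=24)):
    """Return the backup age, or raise if no sufficiently fresh backup exists."""
    if last_backup_at is None:
        raise BackupCheckError("no backup recorded; refusing to continue")
    age = now - last_backup_at
    if age < timedelta(0):
        raise BackupCheckError("backup timestamp is in the future; check clocks")
    if age > max_age:
        raise BackupCheckError(f"latest backup is {age} old; limit is {max_age}")
    return age


if __name__ == "__main__":
    now = datetime(2024, 5, 1, 12, 0)
    # A backup taken two hours earlier passes the gate.
    print(require_recent_backup(datetime(2024, 5, 1, 10, 0), now))
    try:
        # A missing backup blocks the action outright.
        require_recent_backup(None, now)
    except BackupCheckError as err:
        print(f"blocked: {err}")
```

Calling the gate as the first line of a format or wipe procedure turns "remember to back up" from a verbal warning into a hard precondition.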

4. (No specific story provided for this entry)

The original collection lists a fourth contributor but does not include a detailed incident.

5. Stock‑Exchange Service Timeout Due to ESXi Overload

During peak trading hours, a securities firm experienced frequent service timeouts despite low pod and node resource usage. Investigation revealed that the underlying ESXi hosts were hitting 100% CPU utilization because a recent BIOS update and VM migrations had created an uneven load across the hypervisor cluster. After rebalancing the VMs, the timeouts disappeared, illustrating how infrastructure‑level bottlenecks can surface as application‑level performance issues.
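Because pod and node metrics looked healthy in this incident, the imbalance was only visible one layer down. A small check over per-host CPU figures can surface that kind of skew early. The sketch below assumes the utilization numbers have already been collected from the hypervisor layer by some other means. The host names, the 90% saturation threshold, and the function name are all hypothetical.

```python
def find_overloaded_hosts(host_cpu_percent, saturation=90.0):
    """Return (overloaded_hosts, spread) for a mapping of host -> CPU percent.

    overloaded_hosts lists hosts at or above the saturation threshold,
    sorted by name. spread is the gap between the busiest and the idlest
    host; a large gap suggests the VMs are unevenly placed.
    """
    if not host_cpu_percent:
        raise ValueError("no host metrics supplied")
    overloaded = sorted(
        host for host, cpu in host_cpu_percent.items() if cpu >= saturation
    )
    values = list(host_cpu_percent.values())
    spread = max(values) - min(values)
    return overloaded, spread


if __name__ == "__main__":
    # Illustrative numbers only, not taken from the incident.
    metrics = {"esxi-01": 100.0, "esxi-02": 35.0, "esxi-03": 98.5}
    hosts, spread = find_overloaded_hosts(metrics)
    print(f"overloaded: {hosts}, spread: {spread} points")
```

A result with saturated hosts next to mostly idle ones points toward rebalancing VMs rather than scaling the application.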

6. Production Database Expansion Catastrophe

During a major e-commerce promotion, a senior DBA was pressured into expanding the primary database in production without prior testing. During the expansion, a script intended for deleting test-environment data was mistakenly run with root privileges, instantly wiping core transaction data. The backup strategy also fell short: the most recent usable backup was three days old. The result was several hours of downtime and multi-million-dollar losses. The episode emphasizes the critical need for change control, proper testing, and reliable backup windows.
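A script written for a test environment can be made to refuse to run anywhere else. The sketch below shows one such guard under stated assumptions: every destructive script declares the environment it targets, every host knows its own environment label, and production runs additionally require the operator to retype the hostname. The function name, the environment labels, and the hostnames are hypothetical, and none of this is described in the original account.

```python
def destructive_run_blockers(script_env, host_env, hostname, typed_hostname):
    """Return the reasons a destructive script must not run; empty means allowed."""
    blockers = []
    # A script declared for "test" must never execute on a "production" host.
    if script_env != host_env:
        blockers.append(
            f"script targets '{script_env}' but this host is '{host_env}'"
        )
    # On production, require the operator to retype the exact hostname.
    if host_env == "production" and typed_hostname != hostname:
        blockers.append("operator did not retype the production hostname")
    return blockers


if __name__ == "__main__":
    # The scenario from the story: a test cleanup script on a production host.
    for reason in destructive_run_blockers("test", "production", "db-primary-01", ""):
        print(f"blocked: {reason}")
```

Returning the full list of reasons, instead of stopping at the first one, gives the operator a complete picture of why the run was refused.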

Collectively, these anecdotes demonstrate that operations is far from a trivial support role; inadequate safeguards, insufficient testing, and poor change management can lead to severe outages, data loss, and financial impact.

Source: Zhihu discussion (https://www.zhihu.com/question/653030041)
