How a Simple Time Adjustment Sparked a Massive Outage: Real Ops Incident Stories

Three real-world operations mishaps are recounted: a mistaken system-time change that logged out thousands of users, an accidental bulk delete against a user table that only foreign-key constraints stopped, and a failed glibc downgrade that stalled a software release. Together they show how small errors cascade and what urgent remediation looked like in practice.

dbaplus Community

Jan 8, 2024

Incident 1: Time-shift outage

A new employee noticed that the billing system’s clock was exactly one year behind and, trying to “fix” it, changed the Linux system time. With the clock moved forward, every account whose expiration date fell within the past year was suddenly treated as expired, and thousands of users were logged out at once. Customer support received hundreds of outage calls, and the monitoring system raised a large-scale disconnection alert.

The on-call leader, “Tao”, coordinated a rapid response. DBAs queried the affected accounts (over 3,000) and, rather than reverting the system clock, updated each account’s expiration date to the end of the year. After about 40 minutes of work, users could log in again, but the billing records were off by over 400,000 CNY, and the incident was classified as a Level-1 (severe) incident caused by human error.
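The mechanism behind this outage, and the remediation the DBAs chose, can be illustrated with a minimal sketch. Everything here is hypothetical: the account IDs, the dates, and the helper names `expired` and `extend_to_year_end` are made up for illustration and are not taken from the billing system in the story.

```python
from datetime import date

# Hypothetical account records: (account_id, expiration_date).
ACCOUNTS = [
    ("a001", date(2023, 3, 31)),
    ("a002", date(2023, 9, 30)),
    ("a003", date(2024, 6, 30)),
]

def expired(accounts, today):
    """Return IDs of accounts whose expiration date is earlier than `today`."""
    return [acc_id for acc_id, expires in accounts if expires < today]

def extend_to_year_end(accounts, affected_ids, today):
    """Return a copy in which affected accounts expire on 31 December of today's year."""
    year_end = date(today.year, 12, 31)
    return [
        (acc_id, year_end if acc_id in affected_ids else expires)
        for acc_id, expires in accounts
    ]

# Assumed dates: the clock ran one year behind the real date.
skewed_today = date(2023, 1, 8)
real_today = date(2024, 1, 8)

# Dry run: compare what expires under each clock before touching the system time.
before = expired(ACCOUNTS, skewed_today)   # []
after = expired(ACCOUNTS, real_today)      # ["a001", "a002"]
newly_cut_off = [acc_id for acc_id in after if acc_id not in before]

# Remediation as described in the story: extend the affected accounts.
repaired = extend_to_year_end(ACCOUNTS, set(newly_cut_off), real_today)
```

Running the comparison as a dry run before changing the clock would have shown how many accounts the change was about to cut off.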

Incident 2: Accidental bulk delete

While cleaning up a single test user in a client-facing database, a developer executed “delete from users” against the whole table. The database ran on a phpStudy stack with no binlog enabled, so a completed delete could not have been recovered from the binary log, and it held hundreds of thousands of users acquired with millions of yuan in promotional spend. The developer had not realized that foreign-key constraints on the table would block the deletion. The statement failed, causing panic but no data loss.
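The safeguard that saved this database, plus a stricter one that does not depend on luck, can be sketched as follows. This is an illustration only: it uses an in-memory SQLite database from Python's standard library as a stand-in for the MySQL setup in the story, and the schema, sample rows, and the `guarded_delete` helper are all hypothetical.

```python
import sqlite3

def make_demo_db():
    """Build a tiny in-memory database with a foreign key from orders to users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE orders ("
        "id INTEGER PRIMARY KEY, "
        "user_id INTEGER NOT NULL REFERENCES users(id))"
    )
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [(1, "alice"), (2, "bob"), (3, "test_user")],
    )
    conn.execute("INSERT INTO orders VALUES (1, 1)")
    conn.commit()
    return conn

def guarded_delete(conn, sql, params, expected_rows):
    """Run a DELETE and commit only if it affected exactly `expected_rows` rows."""
    try:
        cur = conn.execute(sql, params)
    except sqlite3.IntegrityError:
        # A constraint (here, the foreign key) rejected the statement.
        conn.rollback()
        return False
    if cur.rowcount != expected_rows:
        # More or fewer rows than intended: undo instead of committing.
        conn.rollback()
        return False
    conn.commit()
    return True

def user_count(conn):
    return conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

With this helper, `guarded_delete(conn, "DELETE FROM users", (), 1)` is rejected, while `guarded_delete(conn, "DELETE FROM users WHERE name = ?", ("test_user",), 1)` removes only the intended row. The row-count check matters because a foreign key will not always be there to block a statement that matches too many rows.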

Incident 3: Glibc version mismatch

A request to install software on a dedicated server was handled with an apt install that pulled in unexpected packages. Later, the development team discovered that the server’s glibc version was incompatible with their release, which halted the deployment. Attempts to downgrade glibc failed despite extensive searching, and the problem dragged on through the National Day holiday until the team rebuilt the environment from scratch. The incident highlighted the need for a strict approval process for software installations and for keeping environments consistent.
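One way to catch this kind of drift before a release is a small pre-deployment check that compares the detected C library version with a recorded baseline. The sketch below is an assumption about how such a check might look, not the team's actual process: `check_libc` and the baseline value passed to it are hypothetical, while `platform.libc_ver()` is the standard-library call that reports the libc name and version of the running interpreter.

```python
import platform

def parse_version(text):
    """Turn a dotted version string such as '2.31' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))

def check_libc(expected_version, detected=None):
    """Compare the detected libc against an expected glibc version.

    `detected` is a (lib, version) pair; when omitted it is read from
    platform.libc_ver(), which returns empty strings if detection fails.
    Returns "ok", "drift", or "unknown".
    """
    lib, version = detected if detected is not None else platform.libc_ver()
    if lib != "glibc" or not version:
        return "unknown"
    if parse_version(version) == parse_version(expected_version):
        return "ok"
    return "drift"
```

A deployment script could call `check_libc("2.31")` (the version string is a placeholder) and stop when the result is anything other than "ok", so that an unexpected package change surfaces before the release rather than during it.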

Across all three cases, the common lessons are the same: manage changes carefully, verify every command before it runs against a production system, and respond to incidents quickly and in a coordinated way to limit the impact.


