
How a Simple System‑Time Change Sparked a Massive Outage

A junior ops engineer mistakenly set a production server's clock ahead by a year. Thousands of user accounts were treated as expired, which led to a large-scale outage, an emergency fix, financial loss, and harsh career consequences. The incident highlights the need for proper permission and change management.

Efficient Ops

Dec 11, 2023


A new operations engineer at an internet company was asked by a receptionist to correct a billing system that was running a year behind. Without thinking it through, he moved the Linux system clock forward by one year, assuming the change would not affect the billing system.

Shortly after the change, all online users disappeared. Customer support received hundreds of calls about network outages, and the monitoring system raised large‑scale disconnection alerts. The engineer realized a serious incident was occurring.

The team leader, "Tao", was alerted and rushed to the site with other ops engineers and DBAs. They found that, with the clock a year ahead, every account due to expire within the next year was treated as already expired. More than 3,000 users were kicked offline and could not log back in.
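The failure mode can be illustrated with a minimal sketch, assuming the billing system compares each account's expiry timestamp against the server clock. The account names, dates, and function names below are hypothetical and are not taken from the actual system.

```python
from datetime import datetime, timedelta


def is_expired(expires_at: datetime, now: datetime) -> bool:
    """Treat an account as expired once the clock reaches its expiry time."""
    return now >= expires_at


def expired_accounts(accounts: dict[str, datetime], now: datetime) -> list[str]:
    """Return the names of all accounts that look expired at time `now`."""
    return sorted(name for name, exp in accounts.items() if is_expired(exp, now))


# Hypothetical data: three accounts with different expiry dates.
accounts = {
    "alice": datetime(2023, 6, 30),
    "bob": datetime(2023, 12, 31),
    "carol": datetime(2024, 3, 1),
}

real_now = datetime(2023, 1, 15)             # what the clock should read
skewed_now = real_now + timedelta(days=365)  # clock pushed one year ahead

print(expired_accounts(accounts, real_now))    # []
print(expired_accounts(accounts, skewed_now))  # ['alice', 'bob']
```

With the correct clock nobody is expired. With the clock a year ahead, every account whose expiry falls inside that one-year window is rejected, while accounts expiring later are unaffected.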

The DBA wrote a SQL statement to extend the expiration dates of those accounts to the end of the year, backed up the relevant tables, and executed the fix. The operation took over 40 minutes, after which the affected users could log in again and service was restored. The billing records, however, no longer matched the financial statements, resulting in a loss of over 400,000 yuan.
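The article does not show the DBA's actual statement. The following is only a sketch of the back-up-then-update pattern, using an in-memory SQLite database. The `accounts` table, its columns, the date window, and the new expiry date are all assumptions made for illustration.

```python
import sqlite3


def backup_and_extend(conn, real_now, skewed_now, new_expiry):
    """Copy the accounts table, then extend only the wrongly expired accounts.

    Dates are ISO-8601 strings, so plain string comparison orders them correctly.
    """
    # Keep a copy of the table before touching any row.
    conn.execute("DROP TABLE IF EXISTS accounts_backup")
    conn.execute("CREATE TABLE accounts_backup AS SELECT * FROM accounts")
    # `with conn` commits the UPDATE on success and rolls it back on an exception.
    with conn:
        cur = conn.execute(
            "UPDATE accounts SET expires_at = ? "
            "WHERE expires_at > ? AND expires_at <= ?",
            (new_expiry, real_now, skewed_now),
        )
    return cur.rowcount


# Hypothetical data set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, expires_at TEXT)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [
        ("alice", "2023-06-30"),
        ("bob", "2023-12-31"),
        ("carol", "2024-03-01"),
        ("dave", "2022-11-01"),  # genuinely expired before the incident
    ],
)
conn.commit()

changed = backup_and_extend(conn, "2023-01-15", "2024-01-15", "2024-12-31")
print(changed)  # 2
```

Restricting the `WHERE` clause to the window between the real time and the skewed time leaves genuinely expired accounts (like `dave` here) untouched, and the backup table preserves the original expiry dates for later reconciliation with billing.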

The incident was classified as a Level‑1 severe accident caused by human error. The engineer was demoted, lost his bonus, and underwent a three‑month reassessment. His manager also received a serious disciplinary record.

Comments from other readers highlighted poor permission management, the danger of giving front‑desk staff direct access to production systems, and similar mishaps such as accidental database deletions, network debug commands that took a core switch offline, and a short‑circuit that blacked out an entire base station.

The article concludes by urging discussion on how to improve process, policy, and permission management to avoid such incidents.
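As one possible starting point for that discussion, here is a sketch of a pre-change guard that refuses large clock jumps unless someone has signed off. It is not something the article describes. The threshold, the `approved_by` parameter, and the exception name are hypothetical.

```python
from datetime import datetime, timedelta


class ClockChangeRejected(Exception):
    """Raised when a proposed clock change needs approval it does not have."""


def check_clock_change(current, proposed, max_jump=timedelta(minutes=5), approved_by=None):
    """Return the size of the jump, or raise if it is large and unapproved."""
    jump = abs(proposed - current)
    if jump > max_jump and not approved_by:
        raise ClockChangeRejected(
            f"clock change of {jump} exceeds {max_jump} and has no approver"
        )
    return jump


now = datetime(2023, 1, 15, 12, 0, 0)

# A small correction passes without approval.
small = check_clock_change(now, now + timedelta(seconds=30))

# A one-year jump is rejected unless an approver is named.
try:
    check_clock_change(now, now + timedelta(days=365))
    rejected = False
except ClockChangeRejected:
    rejected = True

approved = check_clock_change(now, now + timedelta(days=365), approved_by="team-lead")
```

A check like this only helps if the people who can run it are the people who should. It complements permission management rather than replacing it.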


