Why Human Errors Still Plague Modern Ops and How to Prevent Them
This article examines recent high-profile internet outages caused by human error, explores why operations teams remain especially prone to mistakes despite automation and standards, and offers practical strategies to reduce future incidents: hiring the right people, fostering safety awareness, and turning professionalism into habit.
Introduction
This is the seventh article in the "Efficient Operations Best Practices" series, originally written by Xiao Tianguo and reposted with permission. Unauthorized reproduction is prohibited.
Unstable Internet
Recent major incidents from companies such as Ctrip and Alibaba Cloud, as well as smaller yet equally avoidable human‑error incidents, illustrate the precarious state of the internet.
Alibaba Cloud Incident 901
Alibaba Cloud reported that a bug triggered by a Cloud Shield upgrade mistakenly isolated some files on certain servers. According to the company's statement, the upgrade was rolled back immediately and the isolated files were being restored.
Many Alibaba Cloud users reported being affected.
From a technical perspective, the incident appears to be primarily a human error.
Ctrip Incident 528
The late-May Ctrip outage took 17 hours to recover from, and the company explicitly blamed "employee operational mistakes".
Other Shocking Incidents
Examples include:
A technician replaced the power supply of a billing database server during business hours.
A bank's mainframe had a critical cable connected in reverse.
A bank's transaction system executed an UPDATE statement without a WHERE clause, resetting all branch information.
In the author's experience of more than a decade in the internet industry, at least 60% of major failures stem from low-level human errors; failures caused by genuinely complex system problems are rare.
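Of the incidents listed above, the UPDATE without a WHERE clause is the easiest to guard against mechanically. The following is a minimal sketch of such a guard, not a description of any bank's tooling: the function names `is_unbounded_write` and `check_statement` are hypothetical, and the check is a naive text match that a WHERE inside a string literal or subquery could fool. It illustrates the idea of a pre-execution check rather than a complete SQL parser.

```python
import re

# Hypothetical guard; names and rules are illustrative, not from the article.
_RISKY = re.compile(r"^\s*(update|delete)\b", re.IGNORECASE)
_HAS_WHERE = re.compile(r"\bwhere\b", re.IGNORECASE)


def is_unbounded_write(sql: str) -> bool:
    """Return True for an UPDATE or DELETE statement that has no WHERE keyword."""
    return bool(_RISKY.match(sql)) and not _HAS_WHERE.search(sql)


def check_statement(sql: str) -> str:
    """Return the statement unchanged, or raise if it would touch every row."""
    if is_unbounded_write(sql):
        raise ValueError("refusing UPDATE/DELETE without a WHERE clause")
    return sql
```

A wrapper like this would sit in front of whatever client the operator uses, so that an unbounded write needs a deliberate override instead of a single keystroke.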
Why So Many Human Errors?
Despite decades of internet growth, many practices remain primitive. Technologically we have moved from "small tools" to semi‑automated, open‑source‑based systems, yet human management has lagged far behind.
Technical articles are popular, while management‑oriented pieces receive little attention, reflecting a "craftsman" mindset among engineers.
People start with Linux and Shell, enjoy the thrill of a few commands that deploy dozens of servers, and mistakenly believe that is all they need.
Many engineers view themselves as flawless, resisting oversight and believing that technology alone can solve everything. However, as systems grow larger and more automated, the gap between human skill and system complexity widens.
Why Ops Are More Prone to Accidents
Developers focus on delivering features quickly; bugs are mitigated by testing. Ops, on the other hand, operate directly on production environments with little oversight, making them vulnerable to mistakes.
This is why aircraft have two pilots; the co‑pilot may seem idle, but can save the flight in critical moments.
Why Standards Alone Don't Prevent Accidents
Standards constrain behavior, but they cannot replace human vigilance. Over‑reliance on policies without active management leads to complacency.
Why Automation Still Leads to Errors
Automation reduces routine manual actions, yet it can amplify the impact of a single mistake. When an automated platform executes a faulty operation, the damage spreads instantly across many machines.
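One common way to limit this amplification is to have the platform act on a few machines at a time and stop as soon as something fails. The sketch below assumes a hypothetical `action` callback that returns True on success; `run_in_batches` and its parameters are illustrative names, not part of any platform mentioned in this article.

```python
from typing import Callable, List


def run_in_batches(
    hosts: List[str],
    action: Callable[[str], bool],
    batch_size: int = 2,
) -> List[str]:
    """Apply action to hosts batch by batch and stop after the first failing batch.

    Returns the hosts that were actually touched, which is the blast radius.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    touched: List[str] = []
    for start in range(0, len(hosts), batch_size):
        batch = hosts[start:start + batch_size]
        results = []
        for host in batch:
            touched.append(host)
            results.append(action(host))
        if not all(results):
            break  # halt here: later batches are never touched
    return touched
```

With five hosts, a batch size of two, and a failure on the third host, only the first four hosts are touched and the fifth is left alone. A faulty operation still does damage, but the damage is bounded by the batch size rather than by the size of the fleet.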
Why Gray‑Release Strategies Fail
Even with sophisticated gray‑release mechanisms, if testing environments are insufficient or operators misuse the tools, incidents still occur. Poorly designed gray‑release strategies can turn a safety net into a weapon.
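A poorly designed plan can sometimes be caught before it runs. The sketch below checks a gray-release plan expressed as a list of traffic percentages. The function name `validate_stages` and the 5% cap on the first stage are assumptions chosen for illustration; a real team would set its own limits.

```python
from typing import List


def validate_stages(stages: List[int], max_first_stage: int = 5) -> List[str]:
    """Return the problems found in a gray-release plan given as traffic percentages."""
    if not stages:
        return ["plan has no stages"]
    problems: List[str] = []
    if stages[0] > max_first_stage:
        problems.append(f"first stage {stages[0]}% exceeds {max_first_stage}%")
    if any(later <= earlier for earlier, later in zip(stages, stages[1:])):
        problems.append("stages must strictly increase")
    if stages[-1] != 100:
        problems.append("last stage must reach 100%")
    return problems
```

A plan such as `[1, 10, 50, 100]` passes, while `[50, 100]` is flagged because half of the traffic would be exposed in the very first step, which defeats the purpose of a gray release.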
How to Avoid Human Errors
Updating production systems without an equivalent staging environment is as dangerous as changing a tire on a highway. Human errors are systemic problems, not isolated incidents.
Choose the Right People
Hiring individuals whose temperament suits production responsibilities (cautious, detail-oriented) dramatically reduces risk. A reckless temperament is ill-suited to high-stakes ops.
Foster Safety Awareness
Instill a reverence for operational tasks. Safety consciousness outweighs rote memorization of rules; the latter must support, not replace, an ingrained safety mindset.
Make Professionalism a Habit
As the saying often attributed to Aristotle goes, "We are what we repeatedly do." (The wording is Will Durant's summary of Aristotle's view.) Professional habits develop through consistent practice, mentorship, and regular incident reviews.
Managers should lead by example, select capable personnel, and enforce regular drills and peer‑review mechanisms (e.g., paired ops) to maintain high vigilance.
Without a checking role, a lone operator may make poor decisions under pressure, leading to severe mistakes.
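The checking role can also be enforced in tooling. The following sketch assumes a hypothetical `ChangeRequest` record in which a production command cannot run until someone other than its author has approved it. The class, its fields, and the one-approver threshold are illustrative choices, not a description of any specific system.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class ChangeRequest:
    """A production change that needs a second person before it may run."""

    command: str
    author: str
    approvers: Set[str] = field(default_factory=set)

    def approve(self, reviewer: str) -> None:
        # The author cannot act as their own checker.
        if reviewer == self.author:
            raise PermissionError("author cannot approve their own change")
        self.approvers.add(reviewer)

    def can_execute(self) -> bool:
        # At least one independent approval is required.
        return len(self.approvers) >= 1
```

For example, a request created by `alice` stays blocked until a second person such as `bob` approves it, and an attempt by `alice` to approve her own request raises `PermissionError`.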
For public-cloud users, buying cloud insurance can provide additional protection against large-scale losses.
May we all pursue continuous improvement in operations.
Efficient Ops
This public account is maintained by Xiao Tianguo and friends and regularly publishes widely read original technical articles. It focuses on operations transformation and aims to accompany readers throughout their operations careers.