
Why Human Errors Still Plague Modern Ops and How to Prevent Them

This article examines recent high‑profile internet outages caused by human error, explores why operations teams are especially prone to mistakes despite automation and standards, and offers practical strategies—such as hiring the right people, fostering safety awareness, and turning professionalism into habit—to reduce future incidents.

Efficient Ops

Introduction

This is the seventh article in the "Efficient Operations Best Practices" series, originally written by Xiao Tianguo and reposted with permission. Unauthorized reproduction is prohibited.

Unstable Internet

Recent major incidents from companies such as Ctrip and Alibaba Cloud, as well as smaller yet equally avoidable human‑error incidents, illustrate the precarious state of the internet.

Alibaba Cloud Incident 901

Alibaba Cloud reported a bug triggered by a Cloud Shield upgrade that mistakenly isolated some files on certain servers. The issue was rolled back immediately, and the isolated files are being restored.

Many Alibaba Cloud users reported being affected.

From a technical perspective, the incident appears to be primarily a human error.

Ctrip Incident 528

Recovery from the late-May Ctrip outage took 17 hours, and the company explicitly blamed "employee operational mistakes".

Other Shocking Incidents

Examples include:

A technician replaced the power supply of a billing database server during business hours.

A bank's mainframe had a critical cable connected in reverse.

A bank's transaction system executed an UPDATE statement without a WHERE clause, resetting all branch information.
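The last incident above, an UPDATE with no WHERE clause, is the kind of mistake a thin wrapper can refuse before it reaches the database. The following is a minimal sketch, not anything the bank in question used: `is_unbounded_write` and `guarded_execute` are hypothetical names, `sqlite3` stands in for whatever database is in play, and the check is a crude keyword test that a WHERE inside a string literal or subquery would fool.

```python
import re
import sqlite3

# Matches statements that start with UPDATE or DELETE (case-insensitive).
_WRITE = re.compile(r"^\s*(update|delete)\b", re.IGNORECASE)
# Matches the WHERE keyword anywhere in the statement.
_WHERE = re.compile(r"\bwhere\b", re.IGNORECASE)


def is_unbounded_write(sql: str) -> bool:
    """Return True if sql is an UPDATE or DELETE that contains no WHERE keyword."""
    return bool(_WRITE.match(sql)) and not _WHERE.search(sql)


def guarded_execute(cursor: sqlite3.Cursor, sql: str, params=()):
    """Execute sql, but refuse UPDATE/DELETE statements that lack a WHERE clause."""
    if is_unbounded_write(sql):
        raise ValueError("refusing UPDATE/DELETE without WHERE: " + sql.strip())
    return cursor.execute(sql, params)
```

Some databases offer a comparable built-in guard; the MySQL client, for example, has a `--safe-updates` option. A wrapper like this only complements review by a second person, it does not replace it.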

In the author's experience of more than a decade in the internet industry, at least 60% of major failures stem from low-level human errors; failures caused by genuinely complex system behavior are rare.

Why So Many Human Errors?

Despite decades of internet growth, many practices remain primitive. Technologically we have moved from "small tools" to semi‑automated, open‑source‑based systems, yet human management has lagged far behind.

Technical articles are popular, while management‑oriented pieces receive little attention, reflecting a "craftsman" mindset among engineers.

People start with Linux and Shell, enjoy the thrill of a few commands that deploy dozens of servers, and mistakenly believe that is all they need.

Many engineers view themselves as flawless, resisting oversight and believing that technology alone can solve everything. However, as systems grow larger and more automated, the gap between human skill and system complexity widens.

Why Ops Are More Prone to Accidents

Developers focus on delivering features quickly, and testing catches most of their bugs before release. Ops engineers, by contrast, act directly on production environments with little oversight, so a mistake takes effect immediately.

This is why aircraft have two pilots; the co‑pilot may seem idle, but can save the flight in critical moments.

Why Standards Alone Don't Prevent Accidents

Standards constrain behavior, but they cannot replace human vigilance. Over‑reliance on policies without active management leads to complacency.

Why Automation Still Leads to Errors

Automation reduces routine manual actions, yet it can amplify the impact of a single mistake. When an automated platform executes a faulty operation, the damage spreads instantly across many machines.

Why Gray‑Release Strategies Fail

Even with sophisticated gray‑release mechanisms, if testing environments are insufficient or operators misuse the tools, incidents still occur. Poorly designed gray‑release strategies can turn a safety net into a weapon.
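To make the idea concrete, a gray release only works as a safety net if it stops on its own when something looks wrong. The sketch below is one possible shape under stated assumptions: `staged_rollout`, `RolloutHalted`, the `apply_change` and `is_healthy` callbacks, and the stage fractions are all hypothetical, and a real platform would also need rollback and a soak time between stages.

```python
from typing import Callable, Iterable, List, Sequence


class RolloutHalted(RuntimeError):
    """Raised when a health check fails; carries the hosts changed so far."""

    def __init__(self, host: str, changed: List[str]):
        super().__init__(f"health check failed on {host}; rollout halted")
        self.changed = changed


def staged_rollout(
    hosts: Sequence[str],
    apply_change: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    stage_fractions: Iterable[float] = (0.01, 0.1, 0.5, 1.0),
) -> List[str]:
    """Apply a change in growing stages and stop at the first unhealthy host."""
    changed: List[str] = []
    for fraction in stage_fractions:
        # Each stage covers at least one host, so small fleets still get a canary.
        target = max(1, int(len(hosts) * fraction))
        for host in hosts[len(changed):target]:
            apply_change(host)
            changed.append(host)
            if not is_healthy(host):
                raise RolloutHalted(host, changed)
    return changed
```

The point of the design is that the faulty change never reaches the hosts beyond the failing stage, and the exception tells the operator exactly which hosts need to be rolled back.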

How to Avoid Human Errors

Updating production systems without an equivalent staging environment is like changing a tire on a highway: the danger is built into the procedure itself. Human errors are therefore systemic, not isolated incidents.

Choose the Right People

Hiring individuals whose temperament aligns with production responsibilities (cautious, detail‑oriented) dramatically reduces risk. Traits like recklessness are unsuitable for high‑stakes ops.

Foster Safety Awareness

Instill a reverence for operational tasks. Safety consciousness outweighs rote memorization of rules; the latter must support, not replace, an ingrained safety mindset.

Make Professionalism a Habit

As the line often attributed to Aristotle (it is Will Durant's summary of his ethics) puts it, "We are what we repeatedly do." Professional habits develop through consistent practice, mentorship, and regular incident reviews.

Managers should lead by example, select capable personnel, and enforce regular drills and peer‑review mechanisms (e.g., paired ops) to maintain high vigilance.

Without a checking role, a lone operator may make poor decisions under pressure, leading to severe mistakes.
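The checking role can also be enforced in tooling rather than left to convention. Here is a minimal sketch of a two-person rule, assuming a hypothetical `ChangeRequest` object that wraps a production action; the class, its fields, and the operator names are illustrative only and do not come from any particular platform.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ChangeRequest:
    """A production action that cannot run until a second operator approves it."""

    description: str
    author: str
    action: Callable[[], object]
    approver: Optional[str] = None

    def approve(self, reviewer: str) -> None:
        # The reviewer must be a different person from the author.
        if reviewer == self.author:
            raise PermissionError("author cannot approve their own change")
        self.approver = reviewer

    def execute(self) -> object:
        # Refuse to run anything that nobody else has looked at.
        if self.approver is None:
            raise PermissionError("change has not been reviewed by a second operator")
        return self.action()


# Usage: alice proposes the change, bob reviews it, and only then does it run.
request = ChangeRequest("restart billing service", "alice", lambda: "restarted")
request.approve("bob")
result = request.execute()
```

Like the co-pilot, the reviewer adds little on a routine day; the value shows up on the day the author is tired or under pressure.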

For public-cloud users, purchasing cloud insurance can provide additional protection against large-scale losses.

May we all pursue continuous improvement in operations.

Written by

Efficient Ops

This public account is maintained by Xiao Tianguo and friends and regularly publishes widely read original technical articles. It focuses on operations transformation and aims to accompany readers throughout their operations careers.
