Why You Should Never Trust Any Component in Your System—and How to Protect It

In programming and operations, any element can fail without warning: services, dependencies, requests, machines, data centers, power, networks, and people. Assume that none of them can be trusted, and build defenses such as monitoring, redundancy, rate limiting, fallback strategies, backups, and automated deployment.

21CTO

Sep 26, 2017

1 The Programming World: Ambushed on All Sides

No point in a chain of upstream and downstream systems can be guaranteed to be fully reliable; any of them may fail unexpectedly.

So trust no point in the chain, and put a defense in place at each one.

01 Distrust of the Service Itself

Main measures:

(1) Service Monitoring – track request volume, success and failure counts, success rate, and the health of key nodes; add automated tests that simulate real usage scenarios.

(2) Process Rapid Restart – code is written by people and will contain bugs, so a process may crash (core dump) at any time; restart it quickly and automatically so that service continues.
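
The rapid-restart measure can be sketched as a small supervisor loop. This is a minimal illustration under stated assumptions, not the article's implementation: the function name `supervise`, the restart budget, and the backoff value are all hypothetical choices made for the example.

```python
import subprocess
import time


def supervise(cmd, max_restarts=3, backoff_seconds=1.0,
              run=subprocess.call, sleep=time.sleep):
    """Run cmd and restart it whenever it exits abnormally.

    Returns the number of restarts performed once the process exits
    cleanly (exit code 0). Raises RuntimeError when the restart budget
    is used up, so a crash loop is surfaced instead of hidden.
    `run` and `sleep` are injectable so the loop can be tested.
    """
    restarts = 0
    while True:
        exit_code = run(cmd)
        if exit_code == 0:
            return restarts
        if restarts >= max_restarts:
            raise RuntimeError(
                f"gave up after {restarts} restarts "
                f"(last exit code {exit_code})"
            )
        restarts += 1
        sleep(backoff_seconds)  # brief pause before the next attempt
```

The restart budget is a deliberate choice in this sketch: an unbounded restart loop keeps the service up, but it can also mask a process that crashes on every start, which is exactly what monitoring should report.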

02 Distrust of Dependent Systems

Adopt flexible availability strategies, distinguishing critical and non‑critical paths.

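(1) Non-critical paths – retry a limited number of times, and skip the call entirely once the dependency's response time exceeds a threshold.

(2) Critical paths – provide a degraded service instead of failing; e.g., when ticket storage is unavailable, issue tickets computed by an algorithm rather than stored ones, and give them a shorter validity period.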
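
As a rough sketch of both strategies, the code below retries a storage call a limited number of times and, if it still fails, falls back to a ticket computed from an HMAC with a short lifetime. Everything named here is an assumption for illustration: `SECRET`, the ticket layout `user:expiry:signature`, the 300-second lifetime, and the `store_ticket` callback are hypothetical and do not come from the original article.

```python
import hashlib
import hmac
import time

SECRET = b"demo-secret"  # hypothetical key; load from secure config in practice


def call_with_retries(func, retries=2):
    """Call func up to retries + 1 times; return (ok, result)."""
    for _ in range(retries + 1):
        try:
            return True, func()
        except OSError:  # covers TimeoutError and ConnectionError
            continue
    return False, None


def _sign(payload):
    return hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()


def algorithmic_ticket(user_id, now, ttl_seconds=300):
    """Build a self-verifying ticket that needs no storage lookup."""
    payload = f"{user_id}:{int(now) + ttl_seconds}"
    return f"{payload}:{_sign(payload)}"


def verify_ticket(ticket, now):
    """Check the signature and the expiry time of an algorithmic ticket."""
    try:
        user_id, expires, sig = ticket.rsplit(":", 2)
        expires_at = int(expires)
    except ValueError:
        return False
    expected = _sign(f"{user_id}:{expires}")
    return hmac.compare_digest(sig.encode(), expected.encode()) and expires_at > now


def issue_ticket(user_id, store_ticket, now=None):
    """Prefer a stored ticket; degrade to an algorithmic one on failure."""
    now = time.time() if now is None else now
    ok, ticket = call_with_retries(lambda: store_ticket(user_id))
    if ok:
        return ticket, "stored"
    return algorithmic_ticket(user_id, now), "degraded"
```

The short lifetime is what makes the degraded ticket acceptable: it cannot be revoked through storage, so limiting how long it stays valid limits the exposure.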

03 Distrust of Requests

(1) Distrust of request source

Permission control: IP authentication, module authentication, whitelist, login verification.

Security audit: detect abnormal machine behavior, replay attacks, hijacking.

(2) Distrust of request volume

Rate limiting – cap maximum requests per service.

Overload protection – drop excess requests during spikes to keep partial availability.
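
One common way to implement both measures is a token bucket, sketched below. The article does not name a specific algorithm, so this is a swapped-in technique chosen for illustration; the class name `TokenBucket` and its parameters are assumptions.

```python
import time


class TokenBucket:
    """Allow at most `rate_per_second` requests on average, with bursts
    of up to `burst` requests. Requests beyond that are rejected."""

    def __init__(self, rate_per_second, burst, clock=time.monotonic):
        self.rate = rate_per_second
        self.capacity = burst
        self.tokens = float(burst)  # start with a full bucket
        self.clock = clock          # injectable for testing
        self.last = clock()

    def allow(self):
        """Return True if the request may proceed, False if it is dropped."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A caller would check `bucket.allow()` before doing any work and return an error immediately when it is False. Dropping the excess cheaply is what keeps the service partially available during a spike.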

3 The Operational World Is Unpredictable

01 Distrust of Machines

(1) Disaster‑recovery deployment – have at least two machines ready to serve.

(2) Heartbeat detection – monitor machine health, auto‑switch or disable faulty nodes.
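
Heartbeat detection can be sketched as a table of last-seen timestamps. This is a minimal in-process illustration; the class name `HeartbeatMonitor`, the 5-second timeout, and the node names in the usage note are assumptions, and a real deployment would receive heartbeats over the network.

```python
import time


class HeartbeatMonitor:
    """Track the last heartbeat per node and treat silent nodes as faulty."""

    def __init__(self, timeout_seconds=5.0, clock=time.monotonic):
        self.timeout = timeout_seconds
        self.clock = clock  # injectable for testing
        self.last_seen = {}

    def beat(self, node):
        """Record a heartbeat received from `node`."""
        self.last_seen[node] = self.clock()

    def healthy_nodes(self):
        """Return nodes whose last heartbeat is within the timeout."""
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen <= self.timeout
        )

    def pick(self):
        """Choose a node to serve traffic, skipping faulty ones."""
        healthy = self.healthy_nodes()
        if not healthy:
            raise RuntimeError("no healthy node available")
        return healthy[0]
```

With two machines `a` and `b` reporting, traffic switches to `b` automatically once `a` has been silent for longer than the timeout, which is why at least two machines must be ready to serve.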

02 Distrust of Data Centers

Example: the 2015 Tianjin explosion made multiple data centers unavailable.

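(1) Geographic dispersion – deploy across different IDCs, cities, or countries.

(2) Capacity redundancy – for entry-point services such as QQ login, provision more than twice the required capacity, so that the remaining sites can absorb the full load when one data center is lost.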

03 Distrust of Power

(1) Disk backup – persist data to disk so it can be restored once power returns, accepting that a small amount of recent data may be lost.

(2) Remote backup – store critical data off-site so that it survives a local disk failure.

04 Distrust of Network

(1) Varying latency – use proximity routing (e.g., CMLB) or network probing (e.g., Q‑调) to choose optimal paths.

(2) Network instability – automatically disable unreliable nodes based on locally collected response statistics, and probe them periodically so they can be re-enabled once they recover.
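
The disable-and-probe cycle for unstable nodes might look like the sketch below. It is an illustration under assumptions, not the behavior of CMLB or any other tool named above: the class name `NodeHealth`, the window size, the 50% success threshold, and the 10-second probe interval are all hypothetical.

```python
import time
from collections import deque


class NodeHealth:
    """Disable a node when its recent success rate drops too low, then
    re-enable it after a successful periodic probe."""

    def __init__(self, window=20, min_success_rate=0.5,
                 probe_interval=10.0, clock=time.monotonic):
        self.results = deque(maxlen=window)  # local response statistics
        self.min_success_rate = min_success_rate
        self.probe_interval = probe_interval
        self.clock = clock                   # injectable for testing
        self.disabled_at = None

    @property
    def enabled(self):
        return self.disabled_at is None

    def record(self, success):
        """Record the outcome of one real request to this node."""
        self.results.append(bool(success))
        window_full = len(self.results) == self.results.maxlen
        rate = sum(self.results) / len(self.results)
        if self.enabled and window_full and rate < self.min_success_rate:
            self.disabled_at = self.clock()

    def should_probe(self):
        """True when a disabled node is due for a recovery probe."""
        return (not self.enabled
                and self.clock() - self.disabled_at >= self.probe_interval)

    def probe_result(self, success):
        """Apply the outcome of a probe sent to a disabled node."""
        if success:
            self.disabled_at = None
            self.results.clear()  # start statistics afresh
        else:
            self.disabled_at = self.clock()  # wait another interval
```

Waiting for a full window before disabling is a design choice in this sketch: it avoids removing a node because of one or two early failures.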

05 Distrust of Humans

(1) Operation records – log every step of an operation and have critical actions reviewed.

(2) Effect verification – after a change goes out, verify in production that it had the intended effect.

(3) Rollback capability – back up the old programs and configurations so the previous state can be restored quickly.

(4) Automated deployment – automate complex deployment processes to avoid human omission.

(5) Consistency checks – monitor version numbers or process/config consistency across machines.
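
A consistency check can be as simple as comparing a fingerprint of the version and configuration reported by each machine. The sketch below is a minimal illustration; the function names, the fingerprint format, and the host names are assumptions, and how each machine reports its fingerprint is left out.

```python
import hashlib
from collections import Counter


def fingerprint(version, config_text):
    """Combine a version string with a short hash of the configuration."""
    digest = hashlib.sha256(config_text.encode("utf-8")).hexdigest()[:12]
    return f"{version}/{digest}"


def find_inconsistent(reports):
    """Given {host: fingerprint}, return (majority_fingerprint, outliers).

    Outliers are the hosts whose fingerprint differs from the most
    common one; they are candidates for a missed or partial deployment.
    """
    if not reports:
        return None, {}
    majority, _count = Counter(reports.values()).most_common(1)[0]
    outliers = {host: fp for host, fp in reports.items() if fp != majority}
    return majority, outliers


reports = {
    "web-1": fingerprint("1.4.2", "timeout=3\n"),
    "web-2": fingerprint("1.4.2", "timeout=3\n"),
    "web-3": fingerprint("1.4.1", "timeout=3\n"),  # missed the last deploy
}
majority, outliers = find_inconsistent(reports)
```

Here `outliers` contains only `web-3`. Treating the majority as correct is a simplification; comparing against the fingerprint of the intended release is stricter when that value is known.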

Note: The distrust strategies often need to be combined with other measures.
