Why You Should Never Trust Any Component in Your System—and How to Protect It
In programming and operations, any element can fail unexpectedly: services, dependencies, requests, machines, data centers, power, networks, and people. Assume that nothing is fully reliable and build in defenses such as monitoring, redundancy, rate limiting, fallback strategies, backups, and automated deployment.
1 The Programming World Is Full of Ambushes
No point in the chain of upstream and downstream systems is guaranteed to be reliable; any of them may fail unexpectedly.
Treat every point in the chain as untrusted and put defenses around it.
01 Distrust of the Service Itself
Main measures:
(1) Service Monitoring – track request volume, success and failure counts, success rate, and the health of key nodes; add automated tests that simulate real usage scenarios.
(2) Process Rapid Restart – manual intervention is slow and error-prone, so restart crashed processes automatically; after a core dump the service then resumes without waiting for a person.
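The monitoring item above could be sketched roughly as follows. This is a minimal illustration, not the author's implementation: the class name ServiceStats, the 99% success-rate threshold, and the 100-request minimum sample are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class ServiceStats:
    """Request counters for one service (hypothetical name and fields)."""
    success: int = 0
    failure: int = 0

    def record(self, ok: bool) -> None:
        # Count each finished request as a success or a failure.
        if ok:
            self.success += 1
        else:
            self.failure += 1

    @property
    def total(self) -> int:
        return self.success + self.failure

    @property
    def success_rate(self) -> float:
        # With no traffic yet, report 1.0 so an idle service does not alert.
        return self.success / self.total if self.total else 1.0

    def should_alert(self, min_rate: float = 0.99, min_requests: int = 100) -> bool:
        # Require a minimum sample so one early failure does not trigger an alert.
        return self.total >= min_requests and self.success_rate < min_rate
```

In practice these counters would be exported to whatever monitoring system is already in place; the point is only that request volume, failures, and success rate are cheap to track at the service boundary.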
02 Distrust of Dependent Systems
Adopt flexible availability strategies, distinguishing critical and non‑critical paths.
(1) Non‑critical paths – retry a limited number of times, or skip the call entirely, once the dependency exceeds its timeout threshold.
(2) Critical paths – provide a degraded service; e.g., when ticket storage is unavailable, generate tickets algorithmically and give them a shorter validity period.
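The two strategies above can be sketched as a pair of small helpers. The function names call_non_critical and call_critical, the retry count, and the choice of TimeoutError and ConnectionError as the failure signals are assumptions for illustration; a real client library may raise different exceptions.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def call_non_critical(fn: Callable[[], T], default: T, max_attempts: int = 2) -> T:
    """Try a non-critical dependency a bounded number of times, then skip it."""
    for _ in range(max_attempts):
        try:
            return fn()
        except (TimeoutError, ConnectionError):
            continue  # bounded retry; never loop forever on a sick dependency
    return default  # skip the dependency and carry on with a default value


def call_critical(primary: Callable[[], T], degraded: Callable[[], T]) -> T:
    """Use the primary dependency; on failure fall back to a degraded path."""
    try:
        return primary()
    except (TimeoutError, ConnectionError):
        return degraded()
```

For the ticket example, primary would read from ticket storage and degraded would produce an algorithmically generated ticket with a short validity period; both callables are hypothetical here.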
03 Distrust of Requests
(1) Distrust of request source
Permission control: IP authentication, module authentication, whitelist, login verification.
Security audit: detect abnormal machine behavior, replay attacks, hijacking.
(2) Distrust of request volume
Rate limiting – cap maximum requests per service.
Overload protection – drop excess requests during spikes to keep partial availability.
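One common way to implement rate limiting with overload protection is a token bucket. The article does not name a specific algorithm, so this is a substitute technique shown as a minimal sketch; the class name TokenBucket and its parameters are assumptions.

```python
import time


class TokenBucket:
    """Allow about `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock  # injectable so the behavior can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller drops or rejects the excess request
```

Requests for which allow() returns False are dropped, so the service keeps serving the traffic it can handle instead of collapsing under a spike.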
3 The Operational World Is Unpredictable
01 Distrust of Machines
(1) Disaster‑recovery deployment – have at least two machines ready to serve.
(2) Heartbeat detection – monitor machine health, auto‑switch or disable faulty nodes.
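Heartbeat detection with automatic switching can be sketched as below. The class HeartbeatMonitor, the 3-second timeout, and the machine names are hypothetical; the sketch assumes each machine reports in periodically and that a machine is treated as faulty once its last heartbeat is older than the timeout.

```python
import time
from typing import Dict, List, Optional


class HeartbeatMonitor:
    """Tracks the last heartbeat per machine and picks a healthy one."""

    def __init__(self, timeout: float = 3.0, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_seen: Dict[str, float] = {}

    def beat(self, machine: str) -> None:
        # Called whenever a heartbeat arrives from `machine`.
        self.last_seen[machine] = self.clock()

    def healthy(self) -> List[str]:
        now = self.clock()
        return sorted(m for m, ts in self.last_seen.items() if now - ts <= self.timeout)

    def pick(self) -> Optional[str]:
        # Route to the first healthy machine; None means nothing is available.
        alive = self.healthy()
        return alive[0] if alive else None
```

With at least two machines deployed, pick() keeps returning a live machine after one of them stops sending heartbeats.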
02 Distrust of Data Centers
Example: the 2015 Tianjin explosion made multiple data centers unavailable.
(1) Geographic dispersion – deploy across different IDCs, cities, or countries.
(2) Capacity redundancy – provision more than twice the required capacity for entry‑point services such as QQ login, so the remaining sites can absorb the load if one is lost.
03 Distrust of Power
(1) Disk backup – persist data to disk so it can be restored once power returns, accepting a possible minor loss of the most recent data.
(2) Remote backup – store critical data off‑site to survive disk failure.
04 Distrust of Network
(1) Varying latency – use proximity routing (e.g., CMLB) or network probing (e.g., Q‑调) to choose optimal paths.
(2) Network instability – automatically disable unreliable nodes based on locally collected response statistics, and probe them periodically so they can be re‑enabled once they recover.
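The disable-and-probe idea can be sketched per node as follows. The class NodeHealth, the failure threshold, and the probe interval are assumptions for illustration; this is not how CMLB or Q‑调 work internally, which the article does not describe.

```python
import time
from typing import Optional


class NodeHealth:
    """Tracks one node: disable after repeated failures, probe again later."""

    def __init__(self, max_failures: int = 3, probe_interval: float = 10.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.probe_interval = probe_interval
        self.clock = clock
        self.failures = 0  # consecutive failures observed locally
        self.disabled_at: Optional[float] = None

    def record(self, ok: bool) -> None:
        if ok:
            # Any success, including a successful probe, re-enables the node.
            self.failures = 0
            self.disabled_at = None
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.disabled_at = self.clock()  # disable and (re)start the probe timer

    def available(self) -> bool:
        if self.disabled_at is None:
            return True
        # While disabled, allow traffic again only after the probe interval;
        # the next recorded result either re-enables or re-disables the node.
        return self.clock() - self.disabled_at >= self.probe_interval
```

The caller checks available() before sending a request and reports the outcome with record(); a failed probe restarts the timer, so a node that is still down is only retried once per interval.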
05 Distrust of Humans
(1) Operation backup – record every step, review critical actions.
(2) Effect verification – review and validate changes in production.
(3) Rollback capability – back up the old programs and configurations so a bad change can be reverted quickly.
(4) Automated deployment – automate complex deployment processes to avoid human omission.
(5) Consistency checks – monitor version numbers or process/config consistency across machines.
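A consistency check over version numbers can be as simple as the sketch below. The function name find_version_drift and the host names are hypothetical, and the sketch assumes the most common version is the intended one; a real check would more likely compare against the version recorded in the release plan.

```python
from collections import Counter
from typing import Dict, List


def find_version_drift(versions: Dict[str, str]) -> List[str]:
    """Return machines whose reported version differs from the most common one."""
    if not versions:
        return []
    # Assumption: the majority version is the one that should be deployed.
    expected, _ = Counter(versions.values()).most_common(1)[0]
    return sorted(host for host, v in versions.items() if v != expected)
```

For example, find_version_drift({"web-1": "1.4.2", "web-2": "1.4.2", "web-3": "1.4.1"}) returns ["web-3"], flagging the machine that a deployment missed.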
Note: The distrust strategies often need to be combined with other measures.
21CTO
