What Ancient Medicine Teaches About Modern IT Risk Management
Using the classic tale of Bian Que, this article explains how proactive, mid‑stage, and reactive risk controls in IT operations prevent small issues from becoming catastrophic failures, illustrated with real‑world storage, cloud, and equipment‑selection case studies.
Introduction
The well‑known story of Bian Que and Duke Huan of Cai serves as an analogy for risk‑management failure in IT operations. An engineer reports a system risk early, but the company ignores it. The risk grows and is still ignored, and the system finally collapses after the engineer has left.
A popular reading of Murphy’s law says that the more you worry about something going wrong, the more likely it is to happen.
In practice, things do go wrong: devices fail, lines break, and people make mistakes. Operations engineers deal with these ever‑present failures every day.
Ensuring system availability is the core responsibility of operations. Rapid incident response showcases engineering skill, but by the time a response begins the business impact has already been incurred, however fast the recovery is.
For that reason, strengthening only the fast‑response phase cannot achieve the broader goals set for operations.
Extended Bian Que Story
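A second story extends the analogy. King Wen of Wei asks Bian Que which of the three brothers in his family is the best doctor. Bian Que answers that the eldest treats disease before it appears, the second treats it at onset, and he himself treats it only once it has become severe. The three have very different reputations as a result: the earlier the treatment, the less visible the doctor's skill.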
Treating Before Illness (Pre‑emptive Control)
In operations, post‑incident control is less effective than mid‑stage control, which in turn is less effective than pre‑emptive control. Post‑incident control means fault handling; mid‑stage control means architecture and process design; pre‑emptive control means company policies and principles. All three layers are essential, and none can substitute for another.
Case 1: Dimensions and Stages of Risk Control
High‑end storage designs keep spare disks as hot spares. When a disk fails, the system automatically switches to a spare, which addresses the risk at the technical and architecture level.
However, if operational processes are lax and failed disks are not replaced, the hot spares are eventually exhausted and the next disk failure causes a severe business interruption. This shows that risk controls at different dimensions and stages cannot replace each other.
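The interaction between the architectural control (hot spares) and the process control (timely disk replacement) can be sketched as a simple check. This is a minimal illustration, not a real storage API: `ArrayStatus`, its fields, and the alert levels are hypothetical names, and it assumes each failed, unreplaced disk has consumed exactly one hot spare.

```python
from dataclasses import dataclass


@dataclass
class ArrayStatus:
    """Hypothetical snapshot of one storage array (illustrative fields only)."""
    name: str
    hot_spares_total: int    # spares the array was configured with
    failed_unreplaced: int   # failed disks not yet physically replaced


def spares_remaining(status: ArrayStatus) -> int:
    # Assumption: every failed, unreplaced disk is occupying one hot spare.
    return max(status.hot_spares_total - status.failed_unreplaced, 0)


def replacement_alert(status: ArrayStatus) -> str:
    """Return 'ok', 'warning', or 'critical' for the replacement process."""
    if status.failed_unreplaced == 0:
        return "ok"
    if spares_remaining(status) == 0:
        # The architecture can no longer absorb another disk failure.
        return "critical"
    # Spares are still available, but the replacement process is behind.
    return "warning"


if __name__ == "__main__":
    for s in (ArrayStatus("array-a", 2, 0),
              ArrayStatus("array-b", 2, 1),
              ArrayStatus("array-c", 2, 2)):
        print(s.name, spares_remaining(s), replacement_alert(s))
```

The point of the sketch is that the "warning" state is invisible to the business: the array keeps serving traffic, so only a process that tracks unreplaced disks will act before the "critical" state is reached.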
Case 2: Effective Pre‑emptive Control
In the public IaaS cloud, over‑commitment (overselling) is common. When platform load spikes, customers experience performance degradation, which cannot be solved by mid‑stage or post‑incident measures.
Capital Online set a policy of “no overselling” at the principle level and has seen very few performance‑related incidents over the years as a result, an example of successful pre‑emptive control.
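A "no overselling" principle only works if it is enforced at placement time. The article does not describe how Capital Online implements its policy, so the following is only a hypothetical sketch: it assumes over‑commitment is measured as allocated vCPUs divided by physical cores on a host, and the function names and the `max_ratio` parameter are invented for illustration.

```python
def overcommit_ratio(allocated_vcpus: int, physical_cores: int) -> float:
    """Ratio of vCPUs promised to customers to physical cores on the host."""
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return allocated_vcpus / physical_cores


def can_place(requested_vcpus: int, allocated_vcpus: int,
              physical_cores: int, max_ratio: float = 1.0) -> bool:
    """Admit a new instance only if the host stays within max_ratio.

    max_ratio=1.0 models a strict "no overselling" policy; a value above
    1.0 models a platform that accepts some over-commitment.
    """
    projected = overcommit_ratio(allocated_vcpus + requested_vcpus,
                                 physical_cores)
    return projected <= max_ratio


if __name__ == "__main__":
    # Host with 32 physical cores and 28 vCPUs already allocated.
    print(can_place(4, 28, 32))                 # fits exactly at ratio 1.0
    print(can_place(8, 28, 32))                 # rejected under no-overselling
    print(can_place(8, 28, 32, max_ratio=2.0))  # accepted if overselling is allowed
```

Because the check runs before resources are sold, it prevents the load spike rather than reacting to it, which is what makes it a pre‑emptive control.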
Case 3: Hidden Costs in Risk Control
Choosing equipment solely on price leads to a fragmented vendor landscape and subtle compatibility issues that surface only under failure, causing major business impact.
If not addressed pre‑emptively, these issues increase hidden costs in staff expertise, equipment maintenance, and automation.
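Vendor fragmentation is easier to control once it is visible. The sketch below counts how many distinct vendor and model combinations are in use per equipment category. The inventory format, the vendor names, and the `max_variants` threshold are assumptions made for illustration, not something the article specifies.

```python
from collections import defaultdict


def vendor_spread(inventory):
    """Map each equipment category to the set of (vendor, model) pairs in use.

    inventory: iterable of (category, vendor, model) tuples (assumed format).
    """
    spread = defaultdict(set)
    for category, vendor, model in inventory:
        spread[category].add((vendor, model))
    return dict(spread)


def fragmented_categories(inventory, max_variants=2):
    """Categories using more distinct vendor/model pairs than max_variants."""
    return sorted(
        category
        for category, variants in vendor_spread(inventory).items()
        if len(variants) > max_variants
    )


if __name__ == "__main__":
    # Hypothetical inventory; vendor and model names are placeholders.
    inventory = [
        ("switch", "VendorA", "S1"),
        ("switch", "VendorB", "X9"),
        ("switch", "VendorC", "T4"),
        ("server", "VendorA", "R2"),
        ("server", "VendorA", "R2"),
    ]
    print(fragmented_categories(inventory))
```

A report like this could feed a purchasing policy that caps the number of variants per category, so that price alone does not drive selection.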
Conclusion
Risk management consists of risk identification, control, and response, with control being the most challenging part.
Returning to the Bian Que analogy, the ruler’s attitude ("I am not ill; doctors like to treat the healthy and take credit for it") mirrors a common management response to early warnings. Senior management’s understanding of operations determines how far a company can go.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.