Designing for Failure: Principles, Organizational Practices, and Technical Solutions
This article examines why failure is inevitable in software systems, proposes a mindset of failure‑oriented design, outlines organizational roles and processes to mitigate incidents, and presents concrete technical techniques such as distributed locking and traffic shaping to build resilient, high‑availability services.
1. Introduction
The author begins with two real‑world incidents, Alipay’s outage caused by a cut fiber cable and Bilibili’s platform crash, to illustrate how insufficient failure planning can damage a company’s reputation and a programmer’s career.
1.1 Motivation
Failure scenarios, if not anticipated, can lead to severe personal and organizational consequences; learning from past accidents is essential for growth.
1.2 Scope
The article targets backend engineers and technical leaders who work in high‑traffic, high‑stakes environments, and it shares practical experience with failure‑oriented design.
2. Philosophy
2.1 Failure Is Everywhere
Hardware ages, software becomes outdated, traffic spikes unexpectedly, and requirements change constantly; systems must be built assuming inevitable failures.
2.2 Change Is Constant
Never hard‑code assumptions; use configuration, isolate variability, and adopt design patterns that encapsulate change.
2.2.1 Avoid Hard‑Coding
Prefer flexible configurations so that product changes or incident responses can be handled without code modifications.
2.2.2 Isolate Variability
Design patterns (creational, structural, behavioral) serve to lock variability behind abstractions.
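One way to picture this is a small strategy-style interface: the caller depends only on the abstraction, and each variant lives in its own type. The `Notifier`, `EmailNotifier`, `SMSNotifier`, and `Alert` names are hypothetical and chosen only for illustration.

```go
package main

import "fmt"

// Notifier is the stable abstraction; callers depend only on it.
type Notifier interface {
	Notify(msg string) string
}

// EmailNotifier is one variant hidden behind the abstraction.
type EmailNotifier struct{ Addr string }

func (e EmailNotifier) Notify(msg string) string {
	return "email to " + e.Addr + ": " + msg
}

// SMSNotifier is another variant; adding it required no change to Alert.
type SMSNotifier struct{ Phone string }

func (s SMSNotifier) Notify(msg string) string {
	return "sms to " + s.Phone + ": " + msg
}

// Alert formats an incident and hands it to whichever channel is supplied.
func Alert(n Notifier, incident string) string {
	return n.Notify("[ALERT] " + incident)
}

func main() {
	channels := []Notifier{
		EmailNotifier{Addr: "oncall@example.com"},
		SMSNotifier{Phone: "555-0100"},
	}
	for _, c := range channels {
		fmt.Println(Alert(c, "db latency high"))
	}
}
```

The part that varies (the delivery channel) can grow without touching `Alert`, which is the effect the patterns are meant to achieve.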
2.2.3 Regular Refactoring
Frequent refactoring, backed by regression testing, prevents the code decay caused by rapid iteration.
2.3 Vigilance Over Code
Never trust third‑party interfaces, comments, or function inputs blindly; always validate and verify.
2.4 Design Principles
2.4.1 Simplicity
Simple solutions reduce cognitive load and maintenance cost and improve extensibility, though a critical path may justify a more complex solution when that is the most appropriate fit.
2.4.2 Open‑Closed Principle
Software entities should be open for extension but closed for modification, enabling stable evolution.
2.4.3 Laziness as a Virtue
Automation, tooling, and platformization free engineers from repetitive toil, increasing productivity and reliability.
3. Organization and Process
3.1 Roles
Test engineers, test‑development engineers, risk‑control engineers, and compliance engineers are essential partners in failure‑oriented design.
3.2 Development Process
Typical stages include requirement review (with compliance checks), design (including fail‑over, degradation, rollback), testing (unit, integration, security), staged release (gray rollout), verification, monitoring, incident response, and post‑mortem retrospectives.
3.3 Key Views
3.3.1 Importance of Test Engineers
Design comprehensive test cases covering all scenarios.
Develop data reconciliation scripts and automated testing tools.
Build monitoring and anti‑fraud utilities.
3.3.2 Unit Testing Saves Time
Well‑written unit tests verify expected behavior early and reduce downstream debugging effort.
3.3.3 Retrospectives Align Standards
Applying the PDCA cycle turns lessons learned into continuous improvement.
3.3.4 R&D Red Line as a Safety Net
Mandated processes and standards protect engineers from low‑level mistakes while enforcing quality.
4. Technical Practices
4.1 Failure as Part of System Design
Apply rate limiting, overload protection, adaptive scaling, timeout handling, graceful degradation, multi‑region active‑active deployment, and automation to mitigate diverse failure modes.
4.2 Distributed Lock – Six Levels
Each level adds guarantees such as atomicity, deadlock avoidance, and consistency.
Level 1 – Basic SetNX:
redis.SetNX(ctx, key, "1")
defer redis.Del(ctx, key)
Level 2 – SetNX with expiration (value and TTL set atomically, in a single command or via Lua):
redis.SetNX(ctx, key, "1", expiration)
defer redis.Del(ctx, key)
Level 3 – Random value + Lua compare‑and‑delete, so a client releases only its own lock:
redis.SetNX(ctx, key, randomValue, expiration)
defer redis.Del(ctx, key, randomValue)
// Lua script for safe delete: remove the key only if it still holds our value
if redis.call("get", KEYS[1]) == ARGV[1] then
	return redis.call("del", KEYS[1])
else
	return 0
end
Level 4 – Full lock acquisition with error handling (Go example):
func myFunc() (errCode *constant.ErrorCode) {
	// errCode is the named return value.
	errCode = DistributedLock(ctx, key, randomValue, LockTime)
	if errCode != nil {
		return errCode
	}
	// Schedule the release only once the lock is actually held.
	defer DelDistributedLock(ctx, key, randomValue)
	// doSomething
	return nil
}
func DistributedLock(ctx context.Context, key, value string, expiration time.Duration) (errCode *constant.ErrorCode) {
	ok, err := redis.SetNX(ctx, key, value, expiration)
	if err == nil {
		if !ok {
			// Another holder already owns the lock.
			return constant.ERR_MISSION_GOT_LOCK
		}
		return nil
	}
	// handle timeout, check existing lock, retry, etc.
	// ...
	return nil
}
Level 5 – Lease renewal (watchdog) to avoid lock expiration while work is in progress:
// Lua script for lease renewal: extend the TTL only if we still hold the lock
if redis.call("get", KEYS[1]) == ARGV[1] then
	return redis.call("expire", KEYS[1], ARGV[2])
else
	return 0
end
// In Go: redis.Cas(ctx, key, value, value)
Level 6 – Handling master‑slave failover and consistency (RedLock or WAIT command):
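// RedLock concept – acquire the lock on a majority of N independent Redis masters
// WAIT command example
redis.Wait(ctx, 1, 2) // assuming the wrapper mirrors WAIT numreplicas timeout: wait for 1 replica to acknowledge the write, with a timeout of 2
4.3 Hot‑Item Inventory Deduction (Seckill)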
Two common patterns: bucketed inventory (prone to uneven consumption) and small‑batch random allocation to smooth load; both benefit from proactive scheduling and traffic shaping.
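The bucketed pattern can be sketched as follows. This is an in-memory illustration under stated assumptions, not the article's implementation: `BucketedInventory` is a hypothetical type, each bucket's mutex stands in for a per-bucket row or key, and the scan over the remaining buckets is one possible way to cope with uneven consumption. The bucket count is assumed to be positive.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"sync/atomic"
)

// bucket holds one slice of the total stock behind its own lock, so that
// concurrent buyers contend on different locks instead of a single hot row.
type bucket struct {
	mu    sync.Mutex
	stock int
}

// BucketedInventory splits the stock of one hot item across several buckets.
type BucketedInventory struct {
	buckets []*bucket
}

// NewBucketedInventory spreads total units as evenly as possible over n buckets.
func NewBucketedInventory(total, n int) *BucketedInventory {
	inv := &BucketedInventory{buckets: make([]*bucket, n)}
	for i := range inv.buckets {
		share := total / n
		if i < total%n {
			share++ // distribute the remainder over the first buckets
		}
		inv.buckets[i] = &bucket{stock: share}
	}
	return inv
}

// Deduct takes one unit. It starts at a random bucket to spread load, then
// scans the others so that stock left in less popular buckets is still sold.
func (inv *BucketedInventory) Deduct() bool {
	n := len(inv.buckets)
	start := rand.Intn(n)
	for i := 0; i < n; i++ {
		b := inv.buckets[(start+i)%n]
		b.mu.Lock()
		if b.stock > 0 {
			b.stock--
			b.mu.Unlock()
			return true
		}
		b.mu.Unlock()
	}
	return false
}

func main() {
	inv := NewBucketedInventory(100, 4)
	var sold int64
	var wg sync.WaitGroup
	for i := 0; i < 250; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if inv.Deduct() {
				atomic.AddInt64(&sold, 1)
			}
		}()
	}
	wg.Wait()
	fmt.Println("sold:", sold) // never exceeds the initial stock of 100
}
```

Without the fallback scan, a request that lands on an empty bucket would fail even though other buckets still hold stock, which is the uneven-consumption problem noted above.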
5. Conclusion
Failure‑oriented design combines a philosophical mindset, disciplined organization, rigorous processes, and concrete technical tactics to reduce the impact of inevitable incidents, ultimately protecting both the system and the engineers who build it.
