
Designing for Failure: Principles, Organizational Practices, and Technical Solutions

This article examines why failure is inevitable in software systems, proposes a mindset of failure‑oriented design, outlines organizational roles and processes to mitigate incidents, and presents concrete technical techniques such as distributed locking and traffic shaping to build resilient, high‑availability services.

ByteDance ADFE Team

Oct 12, 2021


1. Introduction

The author opens with two real-world incidents, Alipay's outage caused by a cut fiber cable and Bilibili's platform crash, to illustrate how insufficient failure planning can damage a company's reputation and a programmer's career.

1.1 Motivation

Failure scenarios, if not anticipated, can lead to severe personal and organizational consequences; learning from past accidents is essential for growth.

1.2 Scope

The article targets backend engineers and technical leaders who have experienced high‑traffic, high‑stakes environments and seeks to share practical experience on failure‑oriented design.

2. Philosophy

2.1 Failure Is Everywhere

Hardware ages, software becomes outdated, traffic spikes unexpectedly, and requirements change constantly; systems must be built assuming inevitable failures.

2.2 Change Is Constant

Never hard‑code assumptions; use configuration, isolate variability, and adopt design patterns that encapsulate change.

2.2.1 Avoid Hard‑Coding

Prefer flexible configurations so that product changes or incident responses can be handled without code modifications.

2.2.2 Isolate Variability

Design patterns (creational, structural, behavioral) serve to lock variability behind abstractions.

2.2.3 Regular Refactoring

Refactor regularly, and back each refactoring with regression tests, to prevent the code decay caused by rapid iteration.

2.3 Vigilance Over Code

Never trust third‑party interfaces, comments, or function inputs blindly; always validate and verify.
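A small sketch of what validating function inputs can look like. The Order type, its fields, and the maxQuantity limit are hypothetical and exist only for this example.

```go
package main

import (
	"errors"
	"fmt"
)

// Order is a hypothetical request payload received from an upstream caller.
type Order struct {
	UserID   int64
	SKU      string
	Quantity int
}

// maxQuantity is an assumed business limit, chosen only for this example.
const maxQuantity = 10

// ValidateOrder rejects obviously bad input before any business logic runs.
func ValidateOrder(o Order) error {
	if o.UserID <= 0 {
		return errors.New("user id must be positive")
	}
	if o.SKU == "" {
		return errors.New("sku must not be empty")
	}
	if o.Quantity < 1 || o.Quantity > maxQuantity {
		return fmt.Errorf("quantity %d outside 1..%d", o.Quantity, maxQuantity)
	}
	return nil
}

func main() {
	fmt.Println(ValidateOrder(Order{UserID: 42, SKU: "A-1", Quantity: 2}))
	fmt.Println(ValidateOrder(Order{UserID: 42, SKU: "A-1", Quantity: 500}))
}
```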

2.4 Design Principles

2.4.1 Simplicity

Simple solutions reduce cognitive load and maintenance cost and improve extensibility, although on critical paths the most appropriate solution may be more complex.

2.4.2 Open‑Closed Principle

Software entities should be open for extension but closed for modification, enabling stable evolution.
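One common way to get this property in Go is to put the varying behavior behind an interface. The sketch below is illustrative only: DiscountPolicy, NoDiscount, PercentOff, and Checkout are hypothetical names, not taken from the original article.

```go
package main

import "fmt"

// DiscountPolicy is the extension point: new pricing rules implement it.
type DiscountPolicy interface {
	Apply(cents int) int
}

// NoDiscount leaves the price unchanged.
type NoDiscount struct{}

func (NoDiscount) Apply(cents int) int { return cents }

// PercentOff reduces the price by a whole-number percentage.
type PercentOff struct{ Percent int }

func (p PercentOff) Apply(cents int) int { return cents * (100 - p.Percent) / 100 }

// Checkout is closed for modification: it never needs to change when a
// new DiscountPolicy implementation is added.
func Checkout(cents int, policy DiscountPolicy) int {
	return policy.Apply(cents)
}

func main() {
	fmt.Println(Checkout(1000, NoDiscount{}))
	fmt.Println(Checkout(1000, PercentOff{Percent: 20}))
}
```

Adding a new rule means adding a new type; Checkout and its existing callers stay untouched.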

2.4.3 Laziness as a Virtue

Automation, tooling, and platformization free engineers from repetitive toil, increasing productivity and reliability.

3. Organization and Process

3.1 Roles

Test engineers, test-development engineers, risk-control engineers, and compliance engineers are essential partners for failure-oriented design.

3.2 Development Process

Typical stages include requirement review (with compliance checks), design (including fail‑over, degradation, rollback), testing (unit, integration, security), staged release (gray rollout), verification, monitoring, incident response, and post‑mortem retrospectives.
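The staged (gray) release step is often implemented by routing a fixed percentage of users to the new version. The following is a minimal sketch under that assumption; the helper inRollout and the choice of FNV hashing are illustrative, not the article's stated method.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// inRollout reports whether userID falls inside the first `percent`
// of 100 hash buckets. The same user always lands in the same bucket,
// so raising percent only ever adds users to the rollout.
func inRollout(userID string, percent uint32) bool {
	h := fnv.New32a()
	h.Write([]byte(userID)) // Write on an FNV hash never returns an error
	return h.Sum32()%100 < percent
}

func main() {
	for _, id := range []string{"user-1", "user-2", "user-3"} {
		fmt.Println(id, "sees new version at 10%:", inRollout(id, 10))
	}
}
```

Because bucketing is deterministic, a user does not flip between old and new versions across requests, which keeps verification and rollback easier to reason about.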

3.3 Key Views

3.3.1 Importance of Test Engineers

Design comprehensive test cases covering all scenarios.

Develop data reconciliation scripts and automated testing tools.

Build monitoring and anti‑fraud utilities.

3.3.2 Unit Testing Saves Time

Well-written unit tests verify expected behavior close to where the code is written and reduce downstream debugging effort.

3.3.3 Retrospectives Align Standards

Applying the PDCA cycle turns lessons learned into continuous improvement.

3.3.4 R&D Red Line as a Safety Net

Mandated processes and standards protect engineers from low‑level mistakes while enforcing quality.

4. Technical Practices

4.1 Failure as Part of System Design

Apply rate limiting, overload protection, adaptive scaling, timeout handling, graceful degradation, multi‑region active‑active deployment, and automation to mitigate diverse failure modes.

4.2 Distributed Lock – Six Levels

Each level adds guarantees such as atomicity, deadlock avoidance, and consistency.

Level 1 – Basic SetNX:

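// Acquire succeeds only if the key does not exist yet.
// There is no TTL, so a crash before Del leaves the lock held forever.
redis.SetNX(ctx, key, "1")
defer redis.Del(ctx, key)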

Level 2 – SetNX with expiration (value and TTL must be set atomically, either with a single SET ... NX EX command or with a Lua script):

redis.SetNX(ctx, key, "1", expiration)
defer redis.Del(ctx, key)

Level 3 – Random value + Lua delete for consistency:

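redis.SetNX(ctx, key, randomValue, expiration)
// Release through the Lua script below so that only the holder of randomValue can delete the key.
defer redis.Del(ctx, key, randomValue)
-- Lua script for safe delete: KEYS[1] is the lock key, ARGV[1] is the caller's random value
if redis.call("get", KEYS[1]) == ARGV[1] then
  return redis.call("del", KEYS[1])
else
  return 0
end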
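To see why the random value matters, the sketch below replays the failure it prevents: client A's lock expires mid-work, client B acquires it, and A then tries to release. It uses an in-memory stand-in for Redis (fakeStore and its methods are hypothetical, written only for this illustration).

```go
package main

import (
	"fmt"
	"sync"
)

// fakeStore is an in-memory stand-in for Redis, used only for illustration.
type fakeStore struct {
	mu   sync.Mutex
	data map[string]string
}

func newFakeStore() *fakeStore { return &fakeStore{data: map[string]string{}} }

// SetNX stores value only if key is absent and reports whether it did.
func (s *fakeStore) SetNX(key, value string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, exists := s.data[key]; exists {
		return false
	}
	s.data[key] = value
	return true
}

// ForceExpire simulates the TTL running out by dropping the key.
func (s *fakeStore) ForceExpire(key string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.data, key)
}

// DelIfMatch mirrors the Lua script: delete only if the stored value equals token.
func (s *fakeStore) DelIfMatch(key, token string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.data[key] != token {
		return false
	}
	delete(s.data, key)
	return true
}

func main() {
	s := newFakeStore()
	s.SetNX("lock", "token-A")  // A acquires the lock
	s.ForceExpire("lock")       // A's TTL runs out while A is still working
	s.SetNX("lock", "token-B")  // B acquires the lock
	fmt.Println("A released B's lock:", s.DelIfMatch("lock", "token-A"))
	fmt.Println("B released its own lock:", s.DelIfMatch("lock", "token-B"))
}
```

With a plain delete, A's late release would have removed B's lock and let a third client in; the value comparison turns that release into a no-op.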

Level 4 – Full lock acquisition with error handling (Go example):

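func myFunc() (errCode *constant.ErrorCode) {
    // Assign to the named return value; ":=" would redeclare it and fail to compile.
    errCode = DistributedLock(ctx, key, randomValue, LockTime)
    if errCode != nil {
        return errCode
    }
    // Register the release only after a successful acquisition.
    defer DelDistributedLock(ctx, key, randomValue)
    // doSomething
    return nil
}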

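func DistributedLock(ctx context.Context, key, value string, expiration time.Duration) (errCode *constant.ErrorCode) {
    ok, err := redis.SetNX(ctx, key, value, expiration)
    if err == nil {
        // The command reached Redis; ok reports whether this caller set the key.
        if !ok {
            return constant.ERR_MISSION_GOT_LOCK // another caller holds the lock
        }
        return nil
    }
    // err != nil (for example a timeout): the write may or may not have landed, so
    // handle timeout, check existing lock, retry, etc.
    // ...
    return nil
}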
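The randomValue passed to the lock functions must be unique per holder, otherwise the value comparison on release protects nothing. A minimal sketch of generating one follows; the helper name newLockToken and the 16-byte length are assumptions for illustration.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// newLockToken returns 16 random bytes encoded as 32 hex characters.
// crypto/rand is used so that two holders cannot realistically collide.
func newLockToken() (string, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

func main() {
	token, err := newLockToken()
	if err != nil {
		fmt.Println("could not generate token:", err)
		return
	}
	fmt.Println("lock token:", token)
}
```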

Level 5 – Lease renewal (watchdog) to avoid lock expiration while work is in progress:

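-- Lua script for lease renewal: extend the TTL only if the caller still holds the lock
-- KEYS[1] is the lock key, ARGV[1] is the caller's random value, ARGV[2] is the new TTL in seconds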
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("expire", KEYS[1], ARGV[2])
else
    return 0
end
// In Go: redis.Cas(ctx, key, value, value)

Level 6 – Handling master‑slave failover and consistency (RedLock or WAIT command):

// RedLock concept – acquire N independent locks
// WAIT command example
redis.Wait(ctx, 1, 2) // Redis WAIT takes (numreplicas, timeout in ms): block until 1 replica acknowledges the write or 2 ms elapse

4.3 Hot‑Item Inventory Deduction (Seckill)

Two common patterns: bucketed inventory (prone to uneven consumption) and small‑batch random allocation to smooth load; both benefit from proactive scheduling and traffic shaping.
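A minimal in-memory sketch of the bucketed pattern follows. To soften the uneven-consumption problem, a buyer whose bucket is empty falls through to the remaining buckets; that fallback, along with the names bucketedStock, newBucketedStock, and Deduct, is an assumption of this example rather than the article's stated design.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// bucketedStock splits one hot item's inventory across several buckets,
// each guarded by its own mutex, so buyers contend on different locks.
type bucketedStock struct {
	mu   []sync.Mutex
	left []int
}

// newBucketedStock spreads total units as evenly as possible over buckets.
func newBucketedStock(total, buckets int) *bucketedStock {
	s := &bucketedStock{mu: make([]sync.Mutex, buckets), left: make([]int, buckets)}
	for i := 0; i < total; i++ {
		s.left[i%buckets]++
	}
	return s
}

// Deduct takes one unit, starting at the bucket selected by hint (assumed
// non-negative, e.g. a user ID) and moving on when a bucket is empty.
func (s *bucketedStock) Deduct(hint int) bool {
	n := len(s.left)
	for i := 0; i < n; i++ {
		b := (hint + i) % n
		s.mu[b].Lock()
		if s.left[b] > 0 {
			s.left[b]--
			s.mu[b].Unlock()
			return true
		}
		s.mu[b].Unlock()
	}
	return false // every bucket is empty: sold out
}

func main() {
	stock := newBucketedStock(10, 4)
	var sold int64
	var wg sync.WaitGroup
	for buyer := 0; buyer < 25; buyer++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			if stock.Deduct(id) {
				atomic.AddInt64(&sold, 1)
			}
		}(buyer)
	}
	wg.Wait()
	fmt.Println("units sold:", sold)
}
```

Scanning all buckets costs extra lock acquisitions near sell-out, which is the trade-off for never rejecting a buyer while stock remains in another bucket.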

5. Conclusion

Failure‑oriented design combines a philosophical mindset, disciplined organization, rigorous processes, and concrete technical tactics to reduce the impact of inevitable incidents, ultimately protecting both the system and the engineers who build it.


Written by

ByteDance ADFE Team

Official account of ByteDance Advertising Frontend Team
