Understanding Faults and Fault Isolation Strategies in Distributed Systems
The article explains what constitutes a fault, introduces key metrics such as RPO and RTO, and describes various fault isolation principles, patterns, and practical examples—including dependency degradation, failover, dynamic adjustment, fast‑fail, caching, rate limiting, and resource isolation—to improve system reliability.
In simple terms, a fault occurs when a system's functionality or performance fails to meet expectations.
Two important fault metrics are:
RPO (Recovery Point Objective): the maximum tolerable data loss. It is especially critical for financial services, where the RPO must be zero.
RTO (Recovery Time Objective): the maximum tolerable service downtime.
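As a rough illustration of how the two metrics are checked after an incident, the sketch below assumes that data loss is measured as the time between the last replicated write and the fault, which is a common convention but not one the article states. The function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta


def meets_objectives(last_replicated, failed_at, restored_at, rpo, rto):
    """Return (rpo_met, rto_met) for one incident.

    Assumption: data loss is the window between the last replicated
    write and the moment of failure; downtime runs from failure to restore.
    """
    data_loss_window = failed_at - last_replicated
    downtime = restored_at - failed_at
    return data_loss_window <= rpo, downtime <= rto


last_ok = datetime(2024, 1, 1, 12, 0, 0)
rpo_met, rto_met = meets_objectives(
    last_replicated=last_ok,
    failed_at=last_ok + timedelta(seconds=30),
    restored_at=last_ok + timedelta(minutes=10),
    rpo=timedelta(0),            # zero tolerated data loss
    rto=timedelta(minutes=15),   # hypothetical downtime budget
)
print(rpo_met, rto_met)
```

With an RPO of zero, any gap between the last replicated write and the fault counts as a violation, which is why such systems usually need synchronous replication.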
Fault Isolation from a Single‑System Perspective
A distributed system must assume that faults can happen at any time and design for isolation.
Purpose of Fault Isolation
Fault isolation reduces impact by limiting the scope of a fault, protects key business functions and customers, and makes it faster to locate the source of the fault and recover.
Basic Principles of Fault Isolation
Cut off dependencies when a fault occurs.
Isolate services or resources to avoid sharing.
Avoid synchronous calls.
Common Fault Isolation Patterns
1. Dependency Degradation
Default Degradation
When a dependent component fails, apply a default handling strategy instead of propagating the error.
Example 1: If a cache fails, fall back to database reads.
Example 2: In payment, if the quota service fails, allow small withdrawals without quota checks and later reconcile.
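Example 1 can be sketched as follows. This is a minimal illustration, not a production client: `CacheUnavailable`, `read_user`, and the injected `cache_get` and `db_get` callables are hypothetical names.

```python
class CacheUnavailable(Exception):
    """Raised by the (hypothetical) cache client when the cache is down."""


def read_user(user_id, cache_get, db_get):
    """Read through the cache, degrading to the database if the cache fails."""
    try:
        value = cache_get(user_id)
        if value is not None:
            return value
    except CacheUnavailable:
        # Default degradation: absorb the cache error instead of propagating it.
        pass
    return db_get(user_id)


def broken_cache(_user_id):
    raise CacheUnavailable("cache cluster unreachable")


database = {42: "alice"}
print(read_user(42, broken_cache, database.get))
```

The caller never sees the cache failure; the cost is extra load on the database, so this fallback is usually paired with rate limiting.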
Dynamic Switch (Failover)
Switch to a standby solution when a fault occurs.
Example 1: Database master‑slave failover using HA heartbeat.
Example 2: For streaming data, switch writes to a fresh fail‑over (FO) database so that writes continue while the old data is preserved.
Example 3: For message‑type data, use active‑active nodes; with two nodes, the failure of one affects only half of the data.
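The heartbeat-driven switch in Example 1 can be sketched as a small state machine. The class name, the threshold of three missed heartbeats, and the decision to make failback a manual step are all assumptions for illustration.

```python
class FailoverMonitor:
    """Promote the standby after several consecutive missed heartbeats."""

    def __init__(self, max_missed=3):
        self.max_missed = max_missed
        self.missed = 0
        self.active = "primary"

    def record_heartbeat(self, ok):
        """Record one heartbeat result and return the node that should serve."""
        if self.active != "primary":
            # Already failed over; in this sketch failback is a manual step.
            return self.active
        # A successful heartbeat resets the counter, so only consecutive
        # misses trigger the switch.
        self.missed = 0 if ok else self.missed + 1
        if self.missed >= self.max_missed:
            self.active = "standby"
        return self.active


monitor = FailoverMonitor(max_missed=3)
for heartbeat_ok in [True, False, False, True, False, False, False]:
    active = monitor.record_heartbeat(heartbeat_ok)
print(active)
```

Requiring consecutive misses avoids switching on a single dropped heartbeat, at the cost of a slightly longer detection time.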
2. Dynamic Request Adjustment
Automatically adjust call frequency or drop unhealthy nodes based on latency or errors.
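One way to drop unhealthy nodes is to track recent call results per node and exclude nodes whose error rate is too high. The sketch below assumes a sliding window of the last 10 calls and a 50% error threshold; `NodeHealth` and `pick_nodes` are hypothetical names.

```python
from collections import deque


class NodeHealth:
    """Sliding window of recent call outcomes for one node."""

    def __init__(self, window=10, max_error_rate=0.5):
        self.results = deque(maxlen=window)  # True = success, False = error
        self.max_error_rate = max_error_rate

    def record(self, success):
        self.results.append(success)

    def healthy(self):
        if not self.results:
            return True  # no evidence yet, so keep the node in rotation
        errors = self.results.count(False)
        return errors / len(self.results) <= self.max_error_rate


def pick_nodes(health_by_node):
    """Return the nodes that should receive traffic."""
    healthy = [name for name, h in health_by_node.items() if h.healthy()]
    # If every node looks unhealthy, fall back to all of them rather than none.
    return healthy or list(health_by_node)


nodes = {"node-a": NodeHealth(), "node-b": NodeHealth()}
for _ in range(10):
    nodes["node-a"].record(True)
    nodes["node-b"].record(False)
print(pick_nodes(nodes))
```

Because the window is bounded, a node that recovers is readmitted once enough successful probes push the old errors out of the window.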
3. Fast Fail
When a dependency is unavailable, quickly fail the request to avoid exhausting resources.
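A circuit breaker is one common way to implement fast fail; the article does not name a specific mechanism, so the sketch below is an assumption. The thresholds and the class and exception names are hypothetical, and the clock is injectable so that the behaviour can be tested without waiting.

```python
import time


class CircuitOpenError(Exception):
    """Raised immediately while the dependency is considered unavailable."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fast fail: do not touch the dependency at all.
                raise CircuitOpenError("dependency marked unavailable")
            # Cool-down elapsed: let this one call through as a trial.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get an error in microseconds instead of holding a thread or connection until a timeout expires.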
4. Cache Dependent Data
Caching critical data locally provides a fallback when the source system is down. The cache needs a strategy for staying consistent with the source, such as expiring or periodically refreshing entries.
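A minimal sketch of this pattern, assuming a time-to-live for freshness and a policy of serving the stale copy when the source fails. `StaleOnErrorCache` and the `loader` callable are hypothetical.

```python
import time


class StaleOnErrorCache:
    """Local cache that serves stale entries when the source is down."""

    def __init__(self, loader, ttl=60.0, clock=time.monotonic):
        self.loader = loader      # function that reads from the source system
        self.ttl = ttl            # seconds an entry counts as fresh
        self.clock = clock
        self.entries = {}         # key -> (value, stored_at)

    def get(self, key):
        entry = self.entries.get(key)
        if entry is not None and self.clock() - entry[1] < self.ttl:
            return entry[0]       # fresh hit, no call to the source
        try:
            value = self.loader(key)
        except Exception:
            if entry is not None:
                return entry[0]   # source is down: fall back to the stale copy
            raise                 # nothing cached, so the error must surface
        self.entries[key] = (value, self.clock())
        return value
```

The TTL bounds how inconsistent the data can be in normal operation; during an outage the cache deliberately trades consistency for availability.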
5. Reduce or Eliminate Low‑Level Dependencies
Avoid relying on lower‑level systems whose availability could drag down higher‑level services.
6. Log Level Degradation
Lower logging verbosity (e.g., from INFO to WARN) during high load to reduce I/O overhead.
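With Python's standard `logging` module, where the WARN level is named `WARNING`, the switch could look like the sketch below. The logger name, the `load` input, and the 0.8 threshold are hypothetical; how load is measured is left to the caller.

```python
import logging

logger = logging.getLogger("orders")  # hypothetical service logger
logger.setLevel(logging.INFO)


def adjust_log_level(logger, load, high_load_threshold=0.8):
    """Raise the threshold to WARNING under high load, restore INFO otherwise.

    `load` is assumed to be a utilisation figure between 0.0 and 1.0.
    """
    level = logging.WARNING if load >= high_load_threshold else logging.INFO
    logger.setLevel(level)
    return level


adjust_log_level(logger, load=0.95)
logger.info("order accepted")           # suppressed while load is high
logger.warning("payment gateway slow")  # still emitted
```

Because the level check happens before a record is formatted or written, suppressed INFO calls cost very little.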
7. Service or Resource Isolation
Isolate resources at various levels (user, business function, system) to prevent a fault in one area from affecting others.
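Isolation at the business-function level can be sketched as a bulkhead: each function gets its own fixed number of concurrency slots, so exhausting one pool leaves the others untouched. The names and limits below are hypothetical.

```python
import threading


class BulkheadFull(Exception):
    """Raised when a compartment has no free slot."""


class Bulkhead:
    def __init__(self, limits):
        # One independent semaphore per business function.
        self.slots = {
            name: threading.BoundedSemaphore(limit)
            for name, limit in limits.items()
        }

    def run(self, name, func, *args):
        slot = self.slots[name]
        if not slot.acquire(blocking=False):
            # Reject instead of queueing, so callers are not left waiting.
            raise BulkheadFull(name)
        try:
            return func(*args)
        finally:
            slot.release()


bulkhead = Bulkhead({"payments": 20, "reports": 2})
print(bulkhead.run("payments", lambda amount: f"charged {amount}", 100))
```

If slow report queries occupy both `reports` slots, further report requests are rejected, but `payments` still has all of its own slots available.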
8. Asynchronous Processing
Convert synchronous calls to asynchronous workflows to avoid tight coupling.
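A minimal in-process sketch, assuming a bounded queue and a single worker thread; a real system would more likely use a message broker. `send_notification` stands in for a hypothetical downstream call that sometimes fails.

```python
import queue
import threading


def worker(tasks, handle, failures):
    """Drain the queue; a failing task is recorded, not propagated."""
    while True:
        item = tasks.get()
        try:
            if item is None:      # sentinel: stop the worker
                return
            handle(item)
        except Exception as exc:
            failures.append((item, exc))
        finally:
            tasks.task_done()     # runs for every item, including the sentinel


sent, failures = [], []


def send_notification(order_id):
    if order_id == 2:
        raise RuntimeError("notification service down")
    sent.append(order_id)


tasks = queue.Queue(maxsize=100)
thread = threading.Thread(
    target=worker, args=(tasks, send_notification, failures), daemon=True
)
thread.start()

for order_id in (1, 2, 3):
    tasks.put(order_id)           # the caller does not wait for the handler
tasks.put(None)
tasks.join()                      # only so the example can inspect the results
print(sent, len(failures))
```

The producer's latency no longer depends on the downstream service, and a downstream failure is captured for retry instead of failing the original request.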
9. Staged Processing
Break processing into independent stages (e.g., payment acceptance, processing, callback) to contain failures within a stage.
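The payment example can be sketched by recording which stages have completed, so that a failure stops at its own stage and a retry resumes from there. The stage names, the dictionary layout, and the handlers are assumptions for illustration.

```python
STAGES = ["accepted", "processed", "callback_sent"]


def advance(payment, handlers):
    """Run the remaining stages in order, stopping at the first failure."""
    for stage in STAGES:
        if stage in payment["completed"]:
            continue                      # already done on an earlier attempt
        try:
            handlers[stage](payment)
        except Exception as exc:
            # The failure is contained: earlier stages stay completed.
            payment["last_error"] = f"{stage}: {exc}"
            return payment
        payment["completed"].append(stage)
    payment["last_error"] = None
    return payment


def failing_callback(payment):
    raise ConnectionError("merchant endpoint unreachable")


handlers = {
    "accepted": lambda p: None,
    "processed": lambda p: None,
    "callback_sent": failing_callback,
}
payment = advance({"id": "p-1", "completed": []}, handlers)
print(payment["completed"], payment["last_error"])

handlers["callback_sent"] = lambda p: None   # endpoint recovers
payment = advance(payment, handlers)         # resumes at the failed stage only
print(payment["completed"], payment["last_error"])
```

A callback outage therefore never rolls back or repeats acceptance and processing; only the callback stage is retried.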
UC Tech Team