Microservice Fault Tolerance: Timeout, Retry, Circuit Breaker, Rate Limiting, and Service Degradation
This article explains microservice fault-tolerance techniques (timeout settings, retry strategies, circuit breaking, rate limiting, resource isolation, and service degradation) from both a micro and a macro perspective, and shows how to design resilient service chains and avoid cascading failures.
1. Introduction
In microservice architectures, the terms consumer and provider refer to the service caller and the service offering an interface, respectively. The diagram shows services (letters) and their multiple instances (numbers).
2. Microscopic View
2.1 Timeout
When a consumer calls a provider, the provider may respond slowly; if the response exceeds a configured threshold, the consumer aborts the call to protect its own performance. The timeout value is typically set based on the provider's normal response time plus a buffer.
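As a minimal sketch of the consumer-side timeout described above, the snippet below runs a call on a worker thread and stops waiting after a threshold. The names `call_with_timeout`, `slow_provider`, and `pick_timeout` are hypothetical, and the thread pool stands in for whatever RPC client is actually in use.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Hypothetical pool used only for outbound provider calls.
_pool = ThreadPoolExecutor(max_workers=4)

def pick_timeout(normal_latency_s, buffer_s):
    """Timeout = the provider's normal response time plus a buffer."""
    return normal_latency_s + buffer_s

def call_with_timeout(fn, timeout_s, *args):
    """Run fn(*args) on a worker thread and stop waiting after timeout_s."""
    future = _pool.submit(fn, *args)
    try:
        return ("ok", future.result(timeout=timeout_s))
    except FutureTimeout:
        # The consumer stops waiting, but the worker keeps running until fn
        # returns, so a real client should also set a socket-level timeout.
        return ("timeout", None)

def slow_provider(delay_s):
    """Stand-in for a remote call that takes delay_s seconds."""
    time.sleep(delay_s)
    return "payload"
```

For example, `call_with_timeout(slow_provider, pick_timeout(0.03, 0.02), 0.3)` gives up after roughly 50 ms instead of waiting the full 300 ms.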
2.2 Retry
Retries mitigate occasional provider glitches. After a timeout, a retry can be attempted, possibly on a different instance, to recover the request. For retries to be safe, the provider must support idempotent operations, ensuring repeated calls have the same effect.
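The sketch below illustrates both halves of this: a consumer that retries on the next instance after a timeout, and a provider that stays safe under repeated calls by deduplicating on a request ID. `ProviderTimeout`, `call_with_retry`, and `IdempotentLedger` are hypothetical names, and the request-ID scheme is one assumed way to achieve idempotency, not the only one.

```python
class ProviderTimeout(Exception):
    """Raised by a (hypothetical) client when one attempt times out."""

def call_with_retry(instances, request, max_attempts=3):
    """Try the request on successive instances until one answers.

    Assumes max_attempts >= 1 and at least one instance.
    """
    last_error = None
    for attempt in range(max_attempts):
        # Rotate through instances so a retry lands on a different one.
        instance = instances[attempt % len(instances)]
        try:
            return instance(request)
        except ProviderTimeout as exc:
            last_error = exc
    raise last_error

class IdempotentLedger:
    """Provider-side sketch: a repeated request ID is applied only once."""

    def __init__(self):
        self.balance = 0
        self._results = {}  # request_id -> result of the first application

    def credit(self, request_id, amount):
        if request_id not in self._results:
            self.balance += amount
            self._results[request_id] = self.balance
        return self._results[request_id]
```

If the first attempt actually reached the provider and only the reply was lost, the retry returns the stored result instead of crediting the amount twice.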
2.3 Circuit Breaker
If a provider consistently exceeds timeout thresholds, the consumer may short‑circuit the call and return a mock response, preventing further load on the failing service. Once the provider stabilizes, normal calls resume.
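One common way to implement this is a small state machine with closed, open, and half-open states. The sketch below is an illustration under assumptions: the class name, the consecutive-failure threshold, and the fixed recovery window are hypothetical choices, and the class is not thread-safe.

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock          # injectable so tests can control time
        self.failures = 0           # consecutive failures seen so far
        self.opened_at = None       # None means the breaker is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                # Open: short-circuit and return the mock response.
                return fallback()
            # Half-open: the window has elapsed, let one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            tripped = self.failures >= self.failure_threshold
            if self.opened_at is not None or tripped:
                # Trip (or re-trip after a failed trial) and restart the window.
                self.opened_at = self.clock()
            return fallback()
        # Success: the provider looks stable again, resume normal calls.
        self.failures = 0
        self.opened_at = None
        return result
```

While the breaker is open, `fn` is never invoked, which is what removes load from the failing provider.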
2.4 Rate Limiting (Current Limiting)
Providers may limit incoming traffic from each consumer based on importance and typical QPS, preventing a single consumer from overwhelming the service. Both provider and consumer should isolate resources to avoid exhausting thread pools.
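A per-consumer limit can be sketched with a token bucket, where each consumer gets its own refill rate (QPS) and burst size. The token-bucket algorithm and the names `TokenBucket` and `PerConsumerLimiter` are assumptions made for illustration; the text above does not prescribe a particular algorithm.

```python
import time

class TokenBucket:
    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self.rate = rate_per_s      # tokens added per second (the QPS quota)
        self.capacity = burst       # maximum tokens that can accumulate
        self.tokens = float(burst)
        self.clock = clock
        self.last = clock()

    def allow(self):
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

class PerConsumerLimiter:
    """Provider-side limiter: quotas maps consumer name -> (qps, burst)."""

    def __init__(self, quotas, clock=time.monotonic):
        self.buckets = {name: TokenBucket(qps, burst, clock)
                        for name, (qps, burst) in quotas.items()}

    def allow(self, consumer):
        bucket = self.buckets.get(consumer)
        # Unknown consumers are rejected in this sketch.
        return bucket.allow() if bucket is not None else False
```

An important consumer can be given a larger quota than a batch job, so a burst from the batch job is rejected without affecting the important caller.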
2.4.1 Resource Isolation
Providers enforce limits on consumer traffic; consumers also isolate threads used for outbound calls to protect their own pools.
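On the consumer side, isolation is often done with a bulkhead: a cap on how many outbound calls to one provider may be in flight at once. The sketch below uses a semaphore for this; the `Bulkhead` name and the reject-immediately policy are assumptions for illustration.

```python
import threading

class Bulkhead:
    """Caps concurrent calls to one provider so they cannot drain the pool."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, fallback):
        # Do not queue: if every slot is busy, fail fast with the fallback.
        if not self._slots.acquire(blocking=False):
            return fallback()
        try:
            return fn()
        finally:
            self._slots.release()
```

Giving each provider its own `Bulkhead` means a slow provider can occupy at most its own slots, leaving threads free for calls to healthy providers.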
2.4.2 Service Degradation
When a provider experiences anomalies (e.g., frequent timeouts), the consumer can degrade the service by returning fixed data or caching writes for asynchronous processing.
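Both degradation modes can be sketched in a few lines: reads return fixed data, and writes are queued for later replay. `DegradableClient` and `InMemoryProvider` are hypothetical names, the `degraded` flag is assumed to be flipped by some external signal (for example a circuit breaker or an operator switch), and the in-memory queue would be replaced by durable storage in practice.

```python
from collections import deque

class InMemoryProvider:
    """Stand-in for the real provider, used only for illustration."""

    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)

    def write(self, key, value):
        self.data[key] = value

class DegradableClient:
    def __init__(self, provider, default_read):
        self.provider = provider
        self.default_read = default_read   # fixed data served while degraded
        self.degraded = False
        self.pending_writes = deque()      # writes held for async processing

    def read(self, key):
        if self.degraded:
            return self.default_read
        return self.provider.read(key)

    def write(self, key, value):
        if self.degraded:
            self.pending_writes.append((key, value))
            return "queued"
        self.provider.write(key, value)
        return "written"

    def replay(self):
        """Apply queued writes in order once the provider has recovered."""
        replayed = 0
        while self.pending_writes:
            key, value = self.pending_writes.popleft()
            self.provider.write(key, value)
            replayed += 1
        return replayed
```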
3. Macroscopic View
In longer call chains (A → B → C → D), timeout, retry, circuit-breaker, and rate-limiting settings must be coordinated across services. Misaligned timeouts waste work: if A's timeout is shorter than B's, A gives up while B is still waiting on its own downstream call, and B's eventual result is discarded. Retry counts and intervals should take the cost of downstream services into account.
3.1 Timeout Coordination
Timeouts should satisfy TAB > RB + TBC, where TAB is the timeout A sets when calling B, RB is B's own processing time, and TBC is the timeout B sets when calling C. Otherwise A can time out while B is still legitimately waiting on C.
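The rule TAB > RB + TBC can be checked mechanically for every hop of a chain. The sketch below is one way to do that; the function name, the dictionary layout, and the millisecond values in the usage note are hypothetical, and it assumes each hop makes a single attempt with no retries.

```python
def timeout_violations(chain, timeouts_ms, processing_ms):
    """Return the hops where the caller's timeout is too short.

    chain:         services in call order, e.g. ["A", "B", "C", "D"]
    timeouts_ms:   (caller, callee) -> timeout the caller sets, in ms
    processing_ms: service -> its own processing time, in ms
    """
    violations = []
    # Slide over consecutive triples: caller -> middle -> callee.
    for caller, middle, callee in zip(chain, chain[1:], chain[2:]):
        outer = timeouts_ms[(caller, middle)]
        needed = processing_ms[middle] + timeouts_ms[(middle, callee)]
        if outer <= needed:
            violations.append((caller, middle, outer, needed))
    return violations
```

With timeouts of 1000 ms (A→B), 600 ms (B→C), and 300 ms (C→D), and processing times of 200 ms for B and 100 ms for C, both hops pass, since 1000 > 200 + 600 and 600 > 100 + 300. Lowering A's timeout to 500 ms would be reported as a violation.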
3.2 Retry Coordination
Retry logic must account for downstream latency; excessive retries on costly services waste resources.
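A short calculation shows why. If every hop in a chain retries independently, the worst-case number of calls multiplies at each level. The sketch below assumes each caller makes the full number of attempts for every request it receives; `worst_case_load` is a hypothetical helper name.

```python
from itertools import accumulate
from operator import mul

def worst_case_load(attempts_per_hop):
    """Worst-case calls reaching each downstream service for one request.

    attempts_per_hop[i] is the total number of attempts (first try plus
    retries) made on hop i, ordered from the outermost hop inward.
    """
    return list(accumulate(attempts_per_hop, mul))
```

For A → B → C → D with three attempts on every hop, `worst_case_load([3, 3, 3])` returns `[3, 9, 27]`: a single user request can hit D up to 27 times. Retrying only at one layer, or lowering attempts on hops that lead to costly services, keeps this product small.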
3.3 Circuit Breaker Propagation
If C fails and B trips its circuit breaker on calls to C, A does not need to trip its own breaker on calls to B: B keeps answering (with fallback data), so the failure is contained at B.
3.4 Rate Limiting Alignment
Upstream request limits should respect downstream capacities; dynamic cluster‑wide limits help adapt to scaling.
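As a rough sketch of both ideas, the helpers below split a cluster-wide limit across the current number of instances and cap an upstream limit by what the downstream service can absorb. The function names and the even-split policy are assumptions for illustration, not a prescribed design.

```python
def per_instance_limit(cluster_qps, instance_count):
    """Split a cluster-wide QPS limit evenly across live instances."""
    if instance_count <= 0:
        raise ValueError("instance_count must be positive")
    return cluster_qps // instance_count

def aligned_upstream_limit(desired_qps, downstream_capacity_qps,
                           calls_per_request):
    """Cap the upstream limit so downstream capacity is not exceeded.

    calls_per_request is how many downstream calls one upstream request
    triggers (including any retries).
    """
    return min(desired_qps, downstream_capacity_qps / calls_per_request)
```

With a cluster limit of 1000 QPS, four instances each accept 250 QPS; after scaling out to five, recomputing gives 200 QPS each while the cluster total stays the same. If the downstream service can absorb 600 QPS and each upstream request makes two downstream calls, the upstream limit should not exceed 300 QPS.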
3.5 Service Degradation Strategy
Prioritize degrading low‑priority interfaces first; if the entire chain shows performance issues, degrade from outer to inner services, or let the overloaded service self‑degrade.
3.6 Ripple Effect
Transient glitches in a downstream service can propagate upward, causing temporary instability across the chain.
3.7 Cascading Failure
Severe failures in one service can cause widespread unavailability; unlike ripples, cascading failures demand immediate attention.
3.8 Critical Path
The critical path comprises essential downstream services (e.g., databases). Reducing dependencies on the critical path improves overall stability.
3.9 Longest Path Optimization
Optimizing the longest call path in the service graph yields the greatest performance gains.
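Finding that path is a longest-path search over the call graph. The sketch below assumes the graph is acyclic, that latency is attributed to each service as a fixed cost, and that end-to-end time is governed by the slowest branch; `longest_path` and the sample numbers in the usage note are hypothetical.

```python
from functools import lru_cache

def longest_path(graph, latency_ms, start):
    """Return (total_latency_ms, path) for the slowest chain from start.

    graph:      service -> list of services it calls (must be acyclic)
    latency_ms: service -> its own latency contribution in ms
    """
    @lru_cache(maxsize=None)
    def best(node):
        children = graph.get(node, [])
        if not children:
            return latency_ms[node], (node,)
        # Follow the child whose own slowest chain is the most expensive.
        cost, path = max(best(child) for child in children)
        return latency_ms[node] + cost, (node,) + path

    return best(start)
```

For a graph where A calls B and E, B calls C, and C calls D, with latencies A=5, B=10, C=20, D=30, E=40, the result is `(65, ("A", "B", "C", "D"))`. Speeding up E would not change the end-to-end time, whereas any saving on B, C, or D would.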
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.