Designing Resilient Microservices: Fault Tolerance, Health Checks, and Reliability Patterns
This article explains how to build highly available microservice systems by addressing the risks of distributed architectures, employing graceful degradation, change management, health checks, self‑healing, failover caching, retry and rate‑limiting strategies, bulkhead and circuit‑breaker patterns, and continuous failure testing.
Hello everyone, I’m Chen.
Microservice architecture isolates faults through well‑defined service boundaries, but network, hardware, and application errors are common; any component may become temporarily unavailable, so fault‑tolerant services are needed to handle interruptions gracefully.
This article introduces the most common techniques and architectural patterns for building and operating highly available microservice systems based on RisingStack’s Node.js consulting and development experience.
Not knowing these patterns does not mean you are doing something wrong; building reliable systems always incurs additional cost.
Risks of Microservice Architecture
Moving application logic into services and communicating over the network adds latency and system complexity, increasing the likelihood of network failures.
One major advantage of microservices is that teams can independently design, develop, and deploy their services, owning the full lifecycle; however, they cannot control dependent services, which may become temporarily unavailable due to errors, configuration changes, or new releases.
Graceful Service Degradation
Microservices allow fault isolation, enabling graceful degradation; for example, during an outage a photo‑sharing app might prevent new uploads but still allow browsing, editing, and sharing existing photos.
Figure: fault isolation in a microservice architecture
Implementing graceful degradation usually requires multiple failover mechanisms, which the rest of this article describes.
Without such failover logic, the failure of one dependency can cascade to every service that depends on it.
Change Management
Google’s Site Reliability Engineering team found that about 70% of incidents are caused by changes to existing systems; deploying new code or configuration can introduce failures or bugs.
To mitigate change‑related problems, implement change‑management strategies and automatic rollback mechanisms.
For example, when deploying new code or changing configuration, perform a small‑scale, progressive replacement of service instances (canary deployment); monitor key metrics and roll back immediately if they degrade.
Figure: rollback deployment
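The canary gate described above can be sketched as a simple metric comparison; the function and metric names here are hypothetical, not from any specific deployment tooling:

```python
# Hypothetical canary gate: roll back when the canary group's key metric
# degrades beyond a tolerance relative to the stable baseline.
def should_rollback(baseline_error_rate: float,
                    canary_error_rate: float,
                    tolerance: float = 0.01) -> bool:
    return canary_error_rate > baseline_error_rate + tolerance

# Example: a canary erroring at 5% against a 0.2% baseline triggers rollback.
if should_rollback(baseline_error_rate=0.002, canary_error_rate=0.05):
    print("metrics degraded: rolling back canary instances")
```

In practice the same comparison would run continuously against latency, error rate, and saturation metrics while the canary receives a small slice of traffic.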
Another solution is to run two production environments (blue‑green or red‑black deployment); only one is active, and traffic is switched to the new version after verification.
Rolling back code early is not a bad practice; the sooner you revert problematic code, the better.
Health Checks and Load Balancing
Instances may be started, restarted, or stopped due to failures, deployments, or autoscaling, making them temporarily or permanently unavailable. Load balancers should skip unhealthy instances to avoid routing traffic to them.
Application health can be observed externally by repeatedly calling GET /health or via self‑reporting. Modern service‑discovery solutions continuously collect health information and configure load balancers to route traffic only to healthy components.
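A minimal sketch of load-balancer-side health filtering follows; the `Instance` type, probe counter, and threshold are illustrative assumptions, not a particular service-discovery product's API:

```python
from dataclasses import dataclass

FAILURE_THRESHOLD = 3  # consecutive failed probes before marking unhealthy

@dataclass
class Instance:
    host: str
    failed_probes: int = 0

    def record_probe(self, ok: bool) -> None:
        # A successful GET /health resets the counter; a failure increments it.
        self.failed_probes = 0 if ok else self.failed_probes + 1

    @property
    def healthy(self) -> bool:
        return self.failed_probes < FAILURE_THRESHOLD

def routable(instances):
    """The load balancer routes traffic only to healthy instances."""
    return [i for i in instances if i.healthy]
```

Requiring several consecutive failures avoids ejecting an instance over a single dropped probe, while a single success brings it back into rotation.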
Self‑Healing
Self‑healing helps applications recover from errors, typically via an external system that monitors instance health and restarts unhealthy instances after a prolonged failure. However, frequent restarts can be problematic when the root cause is overload or a lost database connection.
For special cases like a lost DB connection, additional logic is needed to prevent immediate restarts and inform the external system that the instance should not be restarted right away.
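One way to sketch that restart logic (all names hypothetical) is a supervisor that backs off between restart attempts instead of restarting in a tight loop, and eventually escalates instead of restarting forever:

```python
import time

def supervise(is_healthy, restart, max_restarts=5, base_delay=1.0):
    """Restart an unhealthy instance with exponential backoff between
    attempts, so a persistent root cause (overload, lost DB connection)
    does not trigger a restart storm."""
    delay = base_delay
    for _ in range(max_restarts):
        if is_healthy():
            return True              # instance recovered
        restart()
        time.sleep(delay)
        delay *= 2                   # back off before the next attempt
    return False                     # escalate: stop restarting automatically
```

Returning `False` is the signal to the external system that restarting is not helping and a human (or a different remediation) is needed.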
Failover Cache
Failover caching provides data when a service fails, using two expiration times: a short TTL for normal operation and a longer TTL that keeps cached data usable during service outages.
Figure: failover cache with dual expiration times
Use failover cache only when stale data is preferable to no data.
Standard HTTP response headers can configure both the cache and the failover cache: the `max-age` directive defines how long a response is considered fresh, while `stale-if-error` allows serving stale content for a defined period when the origin returns an error.
Modern CDNs and load balancers provide various caching and failover behaviors; alternatively, you can build a shared in-house library that packages these reliability solutions for your company.
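A minimal in-process failover cache with the two expiration times described above might look like this; the class and method names are assumptions for illustration, not a particular library's API:

```python
import time

class FailoverCache:
    """Serve fresh data while the service is up; fall back to stale (but
    not yet hard-expired) data when the service fails."""

    def __init__(self, fresh_ttl: float, stale_ttl: float):
        assert stale_ttl >= fresh_ttl
        self.fresh_ttl = fresh_ttl    # normal expiration
        self.stale_ttl = stale_ttl    # outage-only expiration
        self._store = {}              # key -> (value, stored_at)

    def put(self, key, value):
        self._store[key] = (value, time.monotonic())

    def get(self, key, fetch):
        entry = self._store.get(key)
        now = time.monotonic()
        if entry and now - entry[1] < self.fresh_ttl:
            return entry[0]                      # still fresh
        try:
            value = fetch()                      # try the real service
            self.put(key, value)
            return value
        except Exception:
            if entry and now - entry[1] < self.stale_ttl:
                return entry[0]                  # stale but usable
            raise                                # no usable data at all
```

Here `fetch` is whatever call normally retrieves the data; the stale copy is served only until the longer TTL expires, matching the rule that stale data must be preferable to no data.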
Retry Logic
When operations eventually succeed, retrying can be useful, but excessive retries may worsen overload or prevent recovery. Limit the number of retries and use exponential backoff to increase the delay between attempts up to a maximum.
Ensure idempotency for retryable operations (e.g., use a unique idempotency key for purchase requests) to avoid duplicate side effects.
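The two ideas above, capped exponential backoff with jitter plus a stable idempotency key, can be sketched as follows; the idempotency key shown is an illustrative convention, not a specific API's parameter:

```python
import random
import time
import uuid

def retry(operation, max_attempts=5, base_delay=0.1, max_delay=2.0):
    """Retry a failing operation with capped exponential backoff + jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise                                    # retries exhausted
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay + random.uniform(0, delay / 2))  # add jitter

# Reusing one idempotency key across all retries lets the server deduplicate,
# so a retried purchase is charged at most once (illustrative convention).
purchase = {"amount": 100, "idempotency_key": str(uuid.uuid4())}
```

The jitter matters: without it, many clients retrying after the same failure would hit the recovering service in synchronized waves.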
Rate Limiter and Load Shedding
Rate limiting defines how many requests a client or application can receive or process within a time window, helping filter traffic spikes and prevent overload.
It can also block lower‑priority traffic, reserving resources for critical transactions.
Figure: a rate limiter filtering traffic spikes
A concurrent request limiter is useful for expensive endpoints that should not be executed more than a specified number of times simultaneously, while other traffic is still served.
Load shedding reserves resources for high‑priority requests and denies low‑priority ones based on overall system state, helping the system stay functional during sudden spikes.
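One common rate-limiter implementation is a token bucket, sketched below; in practice you would keep one bucket per client or per priority class, and the names here are illustrative:

```python
import time

class TokenBucket:
    """Allow up to `capacity` burst requests, refilled at `rate` tokens/sec."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate                  # tokens added per second
        self.capacity = capacity          # maximum burst size
        self.tokens = float(capacity)
        self.updated = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens for the time elapsed since the last request.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                      # shed or reject this request
```

`allow()` runs once per incoming request; requests that find the bucket empty are rejected (typically with HTTP 429) instead of being allowed to overload the service.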
For more details, see the Stripe article on rate limiting and load shedding.
Fast‑Fail Principle and Independence
Microservices should fail fast and remain independent; the bulkhead pattern isolates resources, preventing a failure in one service from exhausting shared resources.
Avoid hanging requests and static timeout values, which are anti‑patterns in highly dynamic environments; instead, use circuit breakers to handle errors.
Bulkhead Pattern
Applying bulkheads isolates limited resources; for example, using a separate connection pool per dependency prevents one workload from exhausting the database connections needed by others.
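A bulkhead can be sketched as a bounded semaphore wrapped around calls to one dependency, so that dependency can never occupy more than its allotted slots; the class name and limits are assumptions:

```python
import threading

class Bulkhead:
    """Cap concurrent calls to one dependency; reject excess calls fast
    rather than letting a slow dependency consume every worker."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, operation):
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: rejecting call")  # fail fast
        try:
            return operation()
        finally:
            self._slots.release()
```

Rejecting immediately when the compartment is full is deliberate: queuing the call would tie up the caller and defeat the isolation the bulkhead is meant to provide.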
The Titanic is a famous example of a failed bulkhead design: its watertight compartments were open at the top, so as the bow sank, water spilled over each bulkhead into the next compartment and eventually flooded the hull.
Figure: the Titanic’s bulkhead design, which failed to contain flooding
Circuit Breaker
Static timeouts are an anti-pattern; use circuit breakers to protect resources instead. When the number of errors within a short window crosses a threshold, the breaker opens and subsequent requests fail immediately; after a cooldown period, it lets traffic through again once the underlying service has had time to recover.
Not all errors should trigger a breaker; for example, client‑side 4xx errors may be ignored while server‑side 5xx errors trigger it. Some breakers have a half‑open state that sends a test request to check availability before fully closing.
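A minimal circuit breaker with closed, open, and half-open behavior could look like the following; the threshold, cooldown, and class names are illustrative, not a specific library's API:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown=5.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None    # None means the breaker is closed

    def call(self, operation):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast")
            # Cooldown elapsed: half-open, let one trial request through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()   # trip the breaker
            raise
        self.failures = 0
        self.opened_at = None    # trial succeeded: close the breaker again
        return result
```

A production breaker would also filter which exceptions count as failures, so that client-side 4xx errors do not trip it while server-side 5xx errors do.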
Testing Failures
Continuously test common failure scenarios to ensure services can withstand them; use tools like Netflix’s Chaos Monkey to terminate random instances or even whole zones to simulate cloud provider failures.
Conclusion
Implementing and operating reliable services is not easy; it requires significant effort and budget.
Reliability has many layers; choose solutions that fit your team, make reliability a factor in business decisions, and allocate sufficient budget and time.
Key Takeaways
Dynamic environments and distributed systems increase failure probability.
Services should isolate faults and provide graceful degradation to improve user experience.
About 70% of incidents are caused by changes; rolling back code is not a bad practice.
Achieve fast‑fail and independence; teams cannot control the health of dependent services.
Patterns and technologies such as caching, bulkheads, circuit breakers, and rate limiters help build reliable microservice architectures.
Code Ape Tech Column
Former Ant Group P8 engineer and dedicated technologist, sharing full-stack Java content plus interview and career advice through this column. Site: java-family.cn