Designing Fault‑Tolerant Microservices Architecture

Microservice architectures increase system complexity and failure rates, so this article explains key reliability patterns—such as graceful degradation, change management, health checks, self‑healing, fallback caches, retry logic, rate limiting, circuit breakers, and testing—to help engineers design resilient, high‑availability services.

IT Architects Alliance

Oct 13, 2020

Microservice architecture can isolate failures by clearly defining service boundaries. However, as with any distributed system, network, hardware, and application‑level errors are common. Because of service dependencies, any component may become temporarily unavailable. To minimize the impact of partial outages, we need to build fault‑tolerant services that handle these interruptions gracefully.

This article introduces the most common techniques and architectural patterns for building and operating highly available microservice systems, based on RisingStack’s Node.js consulting and development experience.

If you are not familiar with the patterns described here, it does not necessarily mean you have done something wrong. Building reliable systems always incurs additional cost.

Risks of Microservice Architecture

Microservice architecture moves application logic into services and uses the network layer for communication between them. Replacing in‑process calls with network calls adds latency and increases system complexity, because multiple physical and logical components must cooperate. The added complexity of a distributed system also raises the likelihood of network failures.

One of the biggest advantages of microservice architecture is that teams can design, develop, and deploy their services independently, owning the full lifecycle of each service. This also means a team cannot control the services it depends on, which are often managed by different teams. Provider services may become temporarily unavailable due to faulty releases, configuration changes, or other modifications.

Graceful Service Degradation

One of the greatest benefits of microservice architecture is the ability to isolate failures and perform graceful degradation when a component fails. For example, during an outage, a photo‑sharing app’s customers might be unable to upload new images but can still browse, edit, and share existing photos.

Microservice fault tolerance isolation

In most cases, achieving graceful degradation is difficult because applications in a distributed system depend on each other; you need to apply several kinds of failover logic (some of which are introduced later in this article) to prepare for temporary failures and outages.

Services depend on each other; without fail‑over logic, all services fail.

Change Management

Google’s Site Reliability Engineering team found that roughly 70% of outages are caused by changes to a live system. When you modify something in a service, whether by deploying a new version of the code or changing a configuration, there is always a risk of failure or of introducing new bugs.

In a microservice architecture, services depend on each other, so you should minimize failures and limit their negative impact. To handle problems introduced by changes, you can implement change‑management strategies and automatic rollback mechanisms.

For example, when you deploy new code or change a configuration, you should perform a small‑scale, incremental rollout (canary deployment) that replaces a subset of service instances. During this period, monitor the instances, and if you observe a negative impact on key metrics, roll back the service immediately.
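The rollback decision during a canary rollout can be sketched as a comparison of a key metric between the old and new instances. This is a minimal illustration, not a production policy: the `Metrics` type, the `should_roll_back` function, and both threshold values are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Treat "no traffic yet" as a 0% error rate.
        return self.errors / self.requests if self.requests else 0.0


def should_roll_back(baseline: Metrics, canary: Metrics,
                     max_increase: float = 0.01,
                     min_requests: int = 100) -> bool:
    """Return True when the canary's error rate exceeds the baseline's
    by more than max_increase (absolute), once enough traffic was seen.
    Both thresholds are illustrative assumptions."""
    if canary.requests < min_requests:
        return False  # not enough data to judge the canary yet
    return canary.error_rate - baseline.error_rate > max_increase
```

In practice you would likely watch several key metrics (latency, saturation, business KPIs) rather than the error rate alone.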

Change Management – Rollback Deployment

Another solution is to run two production environments. You always deploy to only one, and after verifying that the new version behaves as expected, you point the load balancer to the new environment. This is known as blue‑green or red‑black deployment.
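A rough sketch of the blue‑green switch follows. The `BlueGreenRouter` class and the `deploy` and `verify` callbacks are hypothetical stand‑ins for your deployment tooling and load‑balancer API; the point is only that the live environment is never touched until the idle one has been verified.

```python
from typing import Callable


class BlueGreenRouter:
    """Tracks which of two environments receives production traffic."""

    def __init__(self, live: str = "blue") -> None:
        self.live = live

    @property
    def idle(self) -> str:
        return "green" if self.live == "blue" else "blue"

    def release(self, version: str,
                deploy: Callable[[str, str], None],
                verify: Callable[[str], bool]) -> bool:
        """Deploy to the idle environment and switch traffic only if it verifies."""
        target = self.idle
        deploy(target, version)   # hypothetical: install `version` on `target`
        if not verify(target):    # hypothetical: smoke tests, metric checks
            return False          # live environment keeps serving traffic
        self.live = target        # "point the load balancer" at the new environment
        return True
```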

Rolling back code is not a bad thing. You should not leave broken code in production and then wonder what went wrong. If necessary, roll back your code as early as possible.

Health Checks and Load Balancing

Instances may become temporarily or permanently unavailable due to failures, deployments, or automatic scaling. To avoid problems, your load balancer should skip unhealthy instances when routing traffic, because they cannot currently serve customers or downstream systems.

Instance health can be determined externally by repeatedly calling the GET /health endpoint or by self‑reporting. Modern service‑discovery solutions continuously collect health information from instances and configure the load balancer to route traffic only to healthy components.
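As an illustration of routing only to healthy instances, here is a small round‑robin picker that skips unhealthy ones. The `HealthAwareBalancer` class is hypothetical, and `is_healthy` stands in for whatever your service discovery reports (for example, the cached result of polling GET /health).

```python
from itertools import count
from typing import Callable, Sequence


class HealthAwareBalancer:
    """Round-robin over instances, skipping any that report unhealthy."""

    def __init__(self, instances: Sequence[str],
                 is_healthy: Callable[[str], bool]) -> None:
        self._instances = list(instances)
        self._is_healthy = is_healthy  # assumption: cheap lookup of last known health
        self._counter = count()

    def pick(self) -> str:
        start = next(self._counter)
        n = len(self._instances)
        # Try each instance at most once, starting at the round-robin position.
        for offset in range(n):
            candidate = self._instances[(start + offset) % n]
            if self._is_healthy(candidate):
                return candidate
        raise RuntimeError("no healthy instances available")
```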

Self‑Healing

Self‑healing helps applications recover from errors. When an application can take the necessary steps to recover from a failure state, we say it is self‑healing. In most cases this is implemented by an external system that monitors instance health and restarts them if they remain in a failed state for a prolonged period. Self‑healing is useful in many scenarios, but frequent restarts can cause trouble when an application cannot provide a healthy status due to overload or a lost database connection.

For special cases such as a lost database connection, implementing an advanced self‑healing solution can be tricky. You need to add extra logic to handle edge cases and inform the external system that the instance should not be restarted immediately.
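One way to express that extra logic is to let the instance report why it is unhealthy, so the supervisor restarts it only when a restart can help. The sketch below is an assumption‑laden illustration: the `Health` states, the `RestartPolicy` class, and the grace period are all invented for this example.

```python
from enum import Enum
from typing import Optional


class Health(Enum):
    OK = "ok"
    FAILED = "failed"                    # the instance itself is broken
    DEPENDENCY_DOWN = "dependency_down"  # e.g. lost database connection; a restart will not help


class RestartPolicy:
    """Restart only instances that stay FAILED for longer than a grace period."""

    def __init__(self, grace_seconds: float) -> None:
        self.grace_seconds = grace_seconds
        self._failed_since: Optional[float] = None

    def should_restart(self, status: Health, now: float) -> bool:
        if status is not Health.FAILED:
            self._failed_since = None  # healthy or waiting on a dependency
            return False
        if self._failed_since is None:
            self._failed_since = now   # first observation of the failure
        return now - self._failed_since >= self.grace_seconds
```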

Fallback Cache

Because of network issues and system changes, services often fail. However, thanks to self‑healing and load balancing, most interruptions are temporary. We need a solution that allows our services to continue operating during these failures—this is the purpose of a fallback cache, which provides necessary data when a service is down.

Fallback caches typically use two different expiration times: a short time indicating the normal cache expiry, and a longer time indicating how long the cache remains usable during a service failure.
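The two expiration times can be sketched as follows. This is a minimal in‑memory illustration under stated assumptions: the `FallbackCache` class, the `fetch` callback, and the choice to measure both TTLs from the moment a value was stored are ours, not a specific library's behavior.

```python
import time
from typing import Any, Callable, Dict, Tuple


class FallbackCache:
    """Serve fresh values normally; serve stale values only when the fetch fails."""

    def __init__(self, fetch: Callable[[str], Any], fresh_ttl: float,
                 stale_ttl: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._fresh_ttl = fresh_ttl  # short: normal cache expiry
        self._stale_ttl = stale_ttl  # long: how long an entry stays usable during a failure
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._fresh_ttl:
            return entry[1]          # still fresh, no call to the provider
        try:
            value = self._fetch(key)
        except Exception:
            if entry is not None and now - entry[0] < self._stale_ttl:
                return entry[1]      # provider is down: fall back to stale data
            raise                    # nothing usable, surface the error
        self._entries[key] = (now, value)
        return value
```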

Only use a fallback cache when serving stale data is better than serving no data at all.

You can set cache and fallback cache behavior using standard HTTP response headers.

For example, the max-age directive specifies the maximum time a resource is considered fresh. The stale-if-error directive allows you to specify how long a stale resource may be served when an error occurs.
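As a small illustration, both directives go in the Cache-Control response header. The helper name and the values below (600 and 1200 seconds) are arbitrary examples; how you attach the header depends on your web framework.

```python
def cache_control(max_age: int, stale_if_error: int) -> str:
    """Build a Cache-Control value: fresh for max_age seconds, then usable
    for a further stale_if_error seconds if the origin returns an error."""
    return f"max-age={max_age}, stale-if-error={stale_if_error}"


# Example response headers: fresh for 10 minutes, stale-on-error for 20 more.
headers = {"Cache-Control": cache_control(600, 1200)}
```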

Modern CDNs and load balancers provide various caching and fallback behaviors, but you can also create a shared library for your company that contains standard reliability solutions.

Retry Logic

In some cases we cannot cache the data, or we want to modify it, and the operation fails. We can then retry the operation, because the resource is expected to recover after a short period or the load balancer will route the retried request to a healthy instance.

You should add retry logic carefully, because excessive retries can make things worse and even prevent the application from recovering—for example, when a service is overloaded, many retries only exacerbate the situation.

In distributed systems, retries can trigger multiple additional requests and cause a cascade effect. To minimize the impact, limit the number of retries and use an exponential back‑off algorithm that gradually increases the delay between attempts until a maximum limit is reached.
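A bounded retry with exponential back‑off might look like the sketch below. The function name, default values, and the injectable `sleep` parameter (useful for testing) are assumptions made for this example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry(operation: Callable[[], T], max_attempts: int = 4,
          base_delay: float = 0.1, max_delay: float = 2.0,
          sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation until it succeeds or max_attempts is reached.
    The delay doubles after each failure and is capped at max_delay."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 1
    while True:
        try:
            return operation()
        except Exception:
            if attempt >= max_attempts:
                raise  # retry budget exhausted: give up and surface the error
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))
            attempt += 1
```

Production implementations often add random jitter to the delay so that many clients do not retry in lockstep; that is left out here for brevity.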

Retries are initiated by the client (a browser, another microservice, etc.), and the client cannot know whether the operation failed before or after the request was processed, so your application must handle retried requests idempotently. For example, when a purchase operation is retried, you should not charge the customer twice. Using a unique idempotency key for each transaction helps handle retries safely.
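The idempotency‑key idea can be sketched like this. The `PaymentService` class is hypothetical and keeps results in an in‑memory dictionary purely for illustration; a real service would need a durable store with an atomic "insert if absent" operation.

```python
from typing import Dict


class PaymentService:
    """Charges a customer at most once per idempotency key."""

    def __init__(self) -> None:
        self._results: Dict[str, dict] = {}  # assumption: stands in for a durable store
        self.charges_made = 0

    def charge(self, idempotency_key: str, customer_id: str, amount_cents: int) -> dict:
        # A retried request carries the same key, so return the stored result
        # instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1  # the real side effect would happen here
        result = {"customer_id": customer_id,
                  "amount_cents": amount_cents,
                  "status": "charged"}
        self._results[idempotency_key] = result
        return result
```

The client generates the key once per logical transaction (for example with `uuid.uuid4()`) and reuses it on every retry of that transaction.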

Rate Limiting and Load Shedding

Rate limiting defines how many requests may be received from a particular client, or processed by an application, within a given time window. For example, rate limiting can filter out the clients responsible for traffic spikes, or ensure your application does not become overloaded before automatic scaling can catch up.

You can also block lower‑priority traffic to reserve sufficient resources for critical transactions.
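A per‑client limit within a time window can be sketched with a fixed‑window counter. This is one of several possible algorithms (token bucket and sliding window are common alternatives); the class name and parameters are assumptions for this example.

```python
import time
from typing import Callable, Dict, Tuple


class FixedWindowRateLimiter:
    """Allow at most `limit` requests per client in each window of `window_seconds`."""

    def __init__(self, limit: int, window_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._limit = limit
        self._window_seconds = window_seconds
        self._clock = clock
        self._usage: Dict[str, Tuple[int, int]] = {}  # client -> (window index, requests used)

    def allow(self, client_id: str) -> bool:
        window = int(self._clock() // self._window_seconds)
        stored_window, used = self._usage.get(client_id, (window, 0))
        if stored_window != window:
            used = 0  # a new window started, so the counter resets
        if used >= self._limit:
            return False  # over the limit: reject (e.g. with HTTP 429)
        self._usage[client_id] = (window, used + 1)
        return True
```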

Rate limiter can prevent traffic spikes

A different type of limiter is the concurrent‑request limiter, which caps how many requests are in flight at the same time rather than how many arrive per time window. It is useful when you have an expensive endpoint that should not be called more than a specified number of times at once, but you still want to keep the service available.
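A concurrent‑request limiter can be sketched with a semaphore that is acquired without blocking, so excess requests are rejected immediately instead of queuing. The class and exception names are hypothetical.

```python
import threading
from contextlib import contextmanager
from typing import Iterator


class TooManyConcurrentRequests(Exception):
    """Raised when every slot is already in use."""


class ConcurrencyLimiter:
    """Cap the number of requests being processed at the same time."""

    def __init__(self, max_in_flight: int) -> None:
        self._slots = threading.BoundedSemaphore(max_in_flight)

    @contextmanager
    def slot(self) -> Iterator[None]:
        # Non-blocking acquire: reject right away rather than wait for a slot.
        if not self._slots.acquire(blocking=False):
            raise TooManyConcurrentRequests("endpoint is at capacity")
        try:
            yield
        finally:
            self._slots.release()
```

A handler would wrap the expensive work in `with limiter.slot():` and translate the exception into a "try again later" response.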

Load shedding is a set of techniques that ensures enough resources are always available for critical transactions. It reserves resources for high‑priority requests and prevents low‑priority transactions from consuming them. A load shedder makes decisions based on overall system health rather than per‑user request volume, which helps the system recover during occasional spikes (e.g., a hot event) while keeping core functionality alive.
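A priority‑based shedding decision might be sketched as below. The priority levels, the single `utilization` figure used as a proxy for system health, and both thresholds are illustrative assumptions.

```python
from enum import IntEnum


class Priority(IntEnum):
    CRITICAL = 0
    NORMAL = 1
    LOW = 2


def should_shed(priority: Priority, utilization: float,
                shed_low_at: float = 0.80,
                shed_normal_at: float = 0.95) -> bool:
    """Decide from overall system utilization (0.0 to 1.0), not per-user volume.
    Critical traffic is never shed; lower priorities are dropped first."""
    if priority is Priority.CRITICAL:
        return False
    if priority is Priority.NORMAL:
        return utilization >= shed_normal_at
    return utilization >= shed_low_at
```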

For more information on rate limiting and load shedding, see the Stripe article.

Fail‑Fast Principle and Independence

In microservice architecture we aim for services to fail fast and be independent. To achieve service‑level fault isolation, we can use the bulkhead pattern. You can read more about bulkheads later in this article.

We also want components to fail fast because we do not want a request to hang until a timeout occurs before the failure is detected. Hanging requests and unresponsive UI waste resources and degrade user experience. Since services call each other, we must prevent hanging operations before latency accumulates.

The first idea is to set an explicit timeout for each service call. The problem is that you cannot really know what a good timeout value is: network glitches and other transient issues sometimes affect only one or two operations, and you probably do not want to reject requests when only a few of them time out.

Using static timeouts in microservices is an anti‑pattern; instead, you can apply the circuit‑breaker pattern, which decides whether to allow or block calls based on success‑failure statistics.

Bulkhead Pattern

In shipbuilding, bulkheads divide a ship into separate compartments that can be sealed off, so that damage to one part of the hull does not sink the entire vessel.

The bulkhead concept can be applied in software to isolate resources.

By applying bulkheads, we can protect limited resources from being exhausted. For example, for a database instance with a connection limit, we can use two separate connection pools for two different types of operations instead of a single shared pool. This isolation prevents one set of operations from causing timeouts or over‑use that would affect the other.
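The two‑pool example can be sketched with one bounded compartment per type of operation. The `Bulkhead` class and the pool names used in the usage note are hypothetical, and a semaphore stands in for a real connection pool.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFull(Exception):
    """Raised when a compartment has no free capacity."""


class Bulkhead:
    """A compartment with its own fixed capacity, independent of all others."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, operation: Callable[[], T]) -> T:
        # Exhausting this compartment never consumes another one's capacity.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull(f"{self.name} pool exhausted")
        try:
            return operation()
        finally:
            self._slots.release()
```

For example, with `reporting = Bulkhead("reporting", 2)` and `checkout = Bulkhead("checkout", 8)`, slow reporting queries can exhaust only their own two slots while checkout operations keep their eight.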

The Titanic sank partly because its bulkhead design failed; water could flow over the top of the bulkheads, flooding the entire ship.

Titanic bulkhead design (ineffective)

Circuit Breaker

To limit the duration of operations we can use timeouts, which prevent hanging operations and keep the system responsive. However, using static, fine‑grained timeouts in microservices is an anti‑pattern because the environment is highly dynamic and it is almost impossible to define a timeout that works for every case.

An alternative to static timeouts is a circuit breaker. Named after the electrical component, a circuit breaker protects resources and helps them recover. It is especially useful in distributed systems where repeated failures can cause a cascade effect that brings down the entire system.

When a particular type of error occurs repeatedly in a short period, the circuit breaker opens and blocks further requests, much like a tripped electrical circuit. While it is open, calls fail immediately, which gives the underlying service time to recover. After a cooldown period the breaker closes again and traffic resumes.

Not all errors should trigger a circuit breaker; for example, you may want to ignore client‑side 4xx errors but not 5xx server errors. Some circuit breakers also have a half‑open state where a single request is sent to test system health; if it succeeds, the breaker closes, otherwise it remains open.
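The closed, open, and half‑open states can be sketched as a small state machine. This is a simplified, single‑threaded illustration: it counts consecutive failures instead of failures within a time window, treats every exception as a failure, and the class name and defaults are assumptions.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, cooldown_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._failure_threshold = failure_threshold
        self._cooldown_seconds = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the breaker is closed

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self._cooldown_seconds:
            return "half-open"  # cooldown elapsed: the next call is a trial
        return "open"

    def call(self, operation: Callable[[], T]) -> T:
        state = self.state
        if state == "open":
            raise CircuitOpenError("circuit is open; failing fast")
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if state == "half-open" or self._failures >= self._failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        self._failures = 0      # success: close the breaker and reset the count
        self._opened_at = None
        return result
```

A fuller implementation would let the caller decide which errors count (for example, ignoring 4xx responses) and would guard the half‑open trial against concurrent callers.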

Testing Failures

You should continuously test your system for common failure scenarios to ensure your services can withstand various faults. Regularly testing failures equips your team with the ability to handle incidents.

For testing, you can use an external service that identifies groups of instances and randomly terminates one instance in a group, simulating a single‑instance failure. You can even shut down an entire region to emulate a cloud‑provider outage.

One of the most popular testing tools is Netflix’s Chaos Monkey, a resiliency tool that randomly terminates instances to verify that the system can withstand their loss.
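The single‑instance failure test described above can be sketched in a few lines. This is not how Chaos Monkey is implemented; the function, the `groups` mapping, and the `terminate` callback (which would wrap your cloud provider's API) are hypothetical.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple


def terminate_random_instance(groups: Dict[str, List[str]],
                              terminate: Callable[[str], None],
                              rng: Optional[random.Random] = None) -> Tuple[str, str]:
    """Pick one instance group at random, then terminate one of its instances."""
    if rng is None:
        rng = random.Random()
    eligible = sorted(name for name, members in groups.items() if members)
    if not eligible:
        raise ValueError("no instances to terminate")
    group = rng.choice(eligible)
    victim = rng.choice(sorted(groups[group]))
    terminate(victim)  # hypothetical hook: call the provider's terminate API here
    return group, victim
```

Running such a script on a schedule, and watching whether alerts fire and traffic fails over, is what turns it into a test rather than an outage.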

Conclusion

Implementing and operating reliable services is not easy. It requires significant effort and a budget from the organization.

Reliability has many layers and aspects, so finding the solution that best fits your team is important. Reliability should be a factor in business decision‑making, with sufficient budget and time allocated.

Key Takeaways

Dynamic environments and distributed systems (such as microservices) increase failure rates.

Services should achieve fault isolation and graceful degradation to improve user experience.

About 70% of outages are caused by changes; rolling back code is not a bad practice.

Fail fast and stay independent; teams cannot control the state of the services they depend on.

Patterns and techniques such as caching, bulkheads, circuit breakers, and rate limiting help build reliable microservice architectures.

Translator: a cool name
Source: https://github.com/jasonGeng88/blog
Original article: https://blog.risingstack.com/designing-microservices-architecture-for-failure/
