
Resilience Strategies for Cloud‑Native Distributed Systems

This article explains how cloud‑native distributed systems achieve higher availability through resilience strategies such as load balancing, timeouts with automatic retries, deadlines, and circuit breakers, describing their placement across OSI layers, implementation options via libraries or proxies, and practical algorithm choices.

Architects Research Society


A cloud‑native application composed of many cooperating microservices forms a distributed system. Ensuring the availability of a distributed system (that is, reducing its downtime) requires improving the system's resilience through strategies such as load balancing, timeouts with automatic retries, deadlines, and circuit breakers.

Resilience can be added to a distributed system in more than one way. For example, each microservice’s code can call a library that provides resilience features, or a special network proxy can handle microservice requests and responses. The ultimate goal of resilience is to ensure that the failure or degradation of a specific microservice instance does not cause cascading failures that bring down the entire distributed system.

In the context of a distributed system, resilience means that the system can automatically adapt to adverse conditions and continue to serve its purpose.

The terms “availability” and “resilience” have different meanings. Availability is the percentage of time a distributed system is up. Resilience is the use of strategies to improve the availability of a distributed system.

A central goal of resilience is to prevent a problem in one microservice instance from triggering further problems that cascade until the entire distributed system fails. This chain reaction is called a cascading failure.

Resilience Strategies

Resilience strategies for distributed systems are usually applied at multiple layers of the OSI model, as shown in the figure. For example, the physical and data‑link layers (layers 1 and 2) involve physical network components such as the Internet connection, so data‑center and cloud‑service providers are responsible for selecting and implementing resilience strategies for these layers.

The application layer is where the application resides; it is the layer that human users (and other applications) interact with directly. Application‑level (layer 7) resilience strategies are built into the microservices themselves. Developers can design and write applications so that they continue to work in a degraded state, providing essential functionality even if other functions fail because of errors, compromises, or other issues in one or more microservices.

An example can be seen in the recommendation feature of a popular video‑streaming app. Most of the time the home page shows personalized recommendations, but if the related backend components fail, a set of generic recommendations is shown. This failure does not affect the ability to search and play videos.
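The graceful-degradation pattern described above can be sketched in a few lines of Python. The function and constant names here are hypothetical illustrations, not taken from any real streaming service:

```python
# Graceful degradation at the application layer: try the personalized
# recommendation backend; fall back to a generic list if it fails.

GENERIC_RECOMMENDATIONS = ["Top 10 Today", "Trending Now", "New Releases"]

def fetch_personalized(user_id):
    # Stand-in for a call to a recommendation microservice; here it
    # always fails to simulate a backend outage.
    raise ConnectionError("recommendation backend unavailable")

def home_page_recommendations(user_id):
    try:
        return fetch_personalized(user_id)
    except (ConnectionError, TimeoutError):
        # Degraded mode: the home page still renders, and search and
        # playback are unaffected by the recommendation failure.
        return GENERIC_RECOMMENDATIONS
```

The key design point is that the caller, not the failing backend, decides what "good enough" looks like when full functionality is unavailable.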

The transport layer (layer 4) provides network communication capabilities, such as ensuring reliable transmission. Network‑level resilience works at layer 4 to monitor the network performance of each deployed microservice instance and route requests to the optimal instance. For example, if a particular microservice instance stops responding because its location suffers a network outage, new requests are automatically directed to other instances.

Organizations deploying cloud‑native applications as distributed systems should consider network‑ and/or application‑level resilience strategies. Here we will study four such strategies for cloud‑native applications:

Load balancing

Timeouts and automatic retries

Deadlines

Circuit breakers

Load balancing and timeouts/automatic retries support redundancy of distributed‑system components. Deadlines and circuit breakers help reduce the impact of degradation or failure of any part of the distributed system.

Load Balancing

Load balancing for cloud‑native applications can be performed at multiple OSI layers. As we just discussed, load balancing can be done at layer 4 (network/connection level) or layer 7 (application level).

For Kubernetes, layer‑4 load balancing is provided out of the box by kube‑proxy, which balances load at the network‑connection level. Management of pod IP addresses and traffic routing between virtual and physical network adapters is handled by a container network interface (CNI) plugin or overlay network such as Calico or Weave Net.

Consider a scenario where one network connection sends one million requests per second to an application while another connection sends only one request per second. A layer‑4 load balancer sees the two connections as equal: if it assigns the high‑traffic connection to one microservice instance and the low‑traffic connection to another, it considers the load balanced, even though one instance is handling a million times more requests than the other.

Layer‑7 load balancing is based on the request itself rather than the connection. A layer‑7 load balancer can see the requests within a connection and send each request to the optimal microservice instance, providing better balancing than a layer‑4 balancer. Generally, when we say “load balancing” we refer to layer‑7 load balancing. While layer‑7 load balancing can be applied to services or microservices, here we focus on applying it to microservices.

For cloud‑native applications, load balancing means distributing application requests among the running instances of a microservice. Load balancing assumes each microservice has multiple instances, and each instance provides redundancy. Wherever possible, instances are spread across servers and sites, so the failure of a particular server or site does not make all instances of any microservice unavailable.

Ideally, each microservice should have enough instances so that even if a failure (e.g., site outage) occurs, there are still sufficient available instances for the distributed system to continue operating for all users who need it at that time.

Load‑Balancing Algorithms

There are many algorithms for performing load balancing. Let’s look at three of them.

Round‑robin is the simplest algorithm. Each microservice instance takes turns handling requests. For example, if microservice A has three instances—1, 2, and 3—the first request goes to instance 1, the second to instance 2, the third to instance 3, then the cycle repeats.
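A minimal round‑robin selector can be sketched with Python's standard `itertools.cycle`; the instance names are illustrative:

```python
from itertools import cycle

# Three instances of microservice A, as in the example above.
instances = ["A-1", "A-2", "A-3"]
rotation = cycle(instances)

def route_request():
    # Each call hands the next request to the next instance in turn,
    # wrapping around after the last instance.
    return next(rotation)
```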

Least‑request is an algorithm that sends a new request to the instance with the fewest pending requests at that moment. For example, if microservice B has four instances and instance 4 currently has only two pending requests while the others have ten, the next request is routed to instance 4.
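Least‑request selection reduces to picking the minimum of the pending‑request counts. A sketch, with made‑up instance names and counts matching the example above:

```python
# Pending-request counts for the four instances of microservice B.
pending = {"B-1": 10, "B-2": 10, "B-3": 10, "B-4": 2}

def pick_least_loaded(pending_counts):
    # Route the next request to the instance with the fewest
    # requests currently in flight.
    return min(pending_counts, key=pending_counts.get)
```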

Session affinity (sticky sessions) tries to send all requests in a session to the same microservice instance. For example, if user Z is using an application and their requests are sent to instance 1 of microservice C, all subsequent requests in the same session are directed to instance 1.
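Session affinity is often implemented by hashing a session identifier, so the same session deterministically maps to the same instance without the balancer keeping per‑session state. A sketch, with hypothetical instance names:

```python
import hashlib

# Three instances of microservice C.
instances = ["C-1", "C-2", "C-3"]

def sticky_instance(session_id):
    # Hashing the session id yields the same instance on every request
    # for that session, as long as the instance list is unchanged.
    digest = hashlib.sha256(session_id.encode()).digest()
    return instances[int.from_bytes(digest[:4], "big") % len(instances)]
```

One trade-off worth noting: a plain modulo hash reshuffles most sessions whenever the instance list changes, which is why production balancers often use consistent hashing instead.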

These algorithms have many variants—e.g., weighted versions are added to round‑robin and least‑request so that some instances receive a larger or smaller share of requests. For example, you may want to favor instances that typically process requests faster.

In practice, a single load‑balancing algorithm usually cannot provide enough resilience on its own. For example, the algorithm may keep sending requests to an instance that has failed and no longer responds. This is why adding strategies such as timeouts and automatic retries is beneficial.

Timeouts and Automatic Retries

Timeouts are a basic concept in any distributed system. If a part of the system issues a request and another part does not process that request within a certain time, the request times out. The requester can then automatically retry the request using redundant instances of the failing part.

For microservices, timeouts should be established and enforced between two microservices. If microservice A’s instance sends a request to microservice B’s instance and B does not respond in time, the request times out. Microservice A can then automatically retry the request with a different instance of B.

A timeout does not guarantee that a retried request will succeed. If all instances of microservice B share the same problem, any request to any of them may fail. However, if only some instances are affected—e.g., a data‑center outage—retrying is likely to succeed.
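The timeout‑and‑retry behavior described above can be sketched as follows; `send_request` is a hypothetical stand‑in that simulates one failed instance, not a real RPC client:

```python
def send_request(instance, payload, timeout):
    # Simulate an outage: instance B-1 never responds in time.
    if instance == "B-1":
        raise TimeoutError(f"{instance} did not reply within {timeout}s")
    return f"handled by {instance}"

def call_with_retries(instances, payload, timeout=1.0):
    # Try each redundant instance in turn; a timeout on one instance
    # triggers an automatic retry against the next.
    last_error = None
    for instance in instances:
        try:
            return send_request(instance, payload, timeout)
        except TimeoutError as err:
            last_error = err  # remember the failure, move on
    # Every instance failed, so surface the last error to the caller.
    raise last_error
```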

Moreover, requests should not always be automatically retried. A common reason is to avoid unintentionally duplicating a transaction that has already succeeded. For example, a request from A may have been processed successfully by B, but the reply to A was delayed or lost. In some cases the request can safely be resent; in others it must not be.

Safe, idempotent transactions produce the same result when the same request is repeated. HTTP GET is the canonical example: it retrieves data without changing server state, so retrying it is harmless.

Non‑idempotent transactions can produce different results when the same request is repeated. HTTP POST is the classic example: repeating a POST may create duplicate data or trigger duplicate processing, which is unacceptable for payments or orders. (HTTP PUT, by contrast, is defined as idempotent: it modifies state, but repeating the same PUT leaves the server in the same final state.)

Deadlines

In addition to timeouts, distributed systems also use what is called a distributed timeout or, more commonly, a deadline. Deadlines involve more than two parts of the system. Consider four interdependent microservices: A sends a request to B; B processes it and sends its own request to C; C processes that and sends a request to D. Replies then flow back from D to C, from C to B, and from B to A.

The diagram below illustrates this scenario. Suppose microservice A must reply within 2.0 seconds. The remaining time budget travels along with each intermediate request, so every microservice in the chain can prioritize the request appropriately and, when it contacts the next microservice, tell it how much time is left.
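Deadline propagation can be sketched by stamping an absolute deadline at A and passing it down the chain; all function names here are illustrative, not from any real framework:

```python
import time

# Deadline propagation along the chain A -> B -> C -> D: instead of
# fixed per-hop timeouts, the absolute deadline travels with the
# request, and every hop checks how much of the budget remains.

def remaining(deadline):
    return deadline - time.monotonic()

def microservice_d(deadline):
    if remaining(deadline) <= 0:
        raise TimeoutError("deadline exceeded before D could run")
    return "reply from D"

def microservice_c(deadline):
    if remaining(deadline) <= 0:
        raise TimeoutError("deadline exceeded before C could run")
    return microservice_d(deadline)  # pass the remaining budget downstream

def microservice_b(deadline):
    if remaining(deadline) <= 0:
        raise TimeoutError("deadline exceeded before B could run")
    return microservice_c(deadline)

def microservice_a(total_budget=2.0):
    # A must reply within the total budget (2.0 s in the article's
    # example), so it stamps an absolute deadline and sends it along.
    return microservice_b(time.monotonic() + total_budget)
```

Real RPC frameworks carry this budget in request metadata (for example, gRPC propagates deadlines across hops), but the principle is the same as this sketch.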

Circuit Breaker

Timeouts and deadlines operate on individual requests and replies in a distributed system. A circuit breaker takes a more “global” view of the system: if a particular microservice instance fails to reply to requests or replies more slowly than expected, the circuit breaker can route subsequent requests to other instances.

A circuit breaker works by setting a limit on the degree of degradation or failure for a single microservice instance. When an instance exceeds that level, the circuit breaker is triggered and the instance is temporarily taken out of service.

The goal of a circuit breaker is to prevent a problem in one microservice instance from negatively affecting other microservices and potentially causing cascading failures. Once the problem is resolved, the instance can be used again.

Cascading failures often start because of automatic retries against a degraded or failing microservice instance. Suppose you have a microservice instance that is overloaded and responds slowly. If the circuit breaker detects this and temporarily blocks new requests from reaching the instance, the instance has a chance to catch up and recover.

However, if the circuit breaker does not act and new requests continue to be sent to the instance, the instance may fail completely. This forces all requests to be redirected to other instances. If those instances are already near capacity, the new requests may overload them as well, eventually causing them to fail. This loop continues until the entire distributed system fails.
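A minimal circuit breaker embodying the trip-and-recover behavior described above might look like the sketch below; the thresholds are arbitrary example values, and real implementations add a richer half-open state:

```python
import time

class CircuitBreaker:
    """Tracks one microservice instance. After `max_failures`
    consecutive failures the breaker opens and rejects calls; after
    `reset_after` seconds it permits a trial request (half-open)."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow(self):
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        if time.monotonic() - self.opened_at >= self.reset_after:
            return True  # half-open: let one trial request through
        return False     # open: shield the instance from traffic

    def record_success(self):
        # A success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = time.monotonic()  # trip the breaker
```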

Implementing Resilience Strategies with Libraries

So far we have discussed several resilience strategies: load balancing (along with three algorithms for it), timeouts with automatic retries, deadlines, and circuit breakers. Now it is time to consider how to implement these strategies.

When first deploying microservices, the most common way to implement resilience strategies is to have each microservice call a standard library that supports one or more of them. Hystrix is one example: an open‑source resilience library developed by Netflix (actively maintained until 2018), its commands can be wrapped around any call that a microservice makes to another microservice. Another example is Resilience4j, a library designed for functional programming in Java.
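The core pattern these libraries provide, wrapping every outbound call so it can degrade gracefully instead of propagating failure, can be sketched as a Python decorator. This is an illustration of the idea only, not the API of Hystrix or Resilience4j:

```python
import functools

def with_fallback(fallback_value):
    # Wrap a call so that any failure yields a fallback result
    # instead of an exception propagating to the caller.
    def decorate(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                return fallback_value  # degrade rather than fail
        return wrapper
    return decorate

@with_fallback(fallback_value=[])
def get_recommendations(user_id):
    # Hypothetical call to another microservice; simulated failure.
    raise ConnectionError("backend down")
```

The library-based drawback discussed below is visible even in this sketch: every vulnerable call site must remember to apply the wrapper.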

Implementing resilience with application libraries is certainly possible, but it does not suit every situation. Resilience libraries are language‑specific, and microservice developers often use the best language for each microservice, so a resilience library may not support all required languages. To use a resilience library, developers may have to write some microservices in a language that offers sub‑optimal performance or has other major drawbacks.

Another issue is that a library‑based approach means adding a wrapper around every vulnerable call in each microservice. Some calls may be missed, some wrappers may contain bugs—ensuring all developers across all microservices do the same thing is a challenge. There are also maintenance concerns—future developers working on microservices must understand the call wrappers.

Implementing Resilience Strategies with Proxies

Over time, library‑based resilience implementations have largely been supplanted by proxy‑based implementations.

Generally, a proxy sits in the middle of communication between two parties and provides some service for that communication. A proxy usually provides a degree of separation between the parties. For example, A sends a request to B, but the request actually goes from A to the proxy, the proxy processes the request and sends its own request to B. A and B do not communicate directly.

The diagram below shows an example of this communication flow. One session occurs between an instance of microservice A and its proxy, and a separate session occurs between A’s proxy and an instance of microservice B. Together, the A‑to‑proxy and proxy‑to‑B sessions provide the end‑to‑end communication between A and B.

In a distributed system, a proxy can implement resilience strategies between microservice instances. Continuing the previous example, when an instance of microservice A sends a request to microservice B, the request actually goes to the proxy. The proxy handles A’s request and decides which instance of B it should forward to, then it issues the request on behalf of A.

The proxy monitors the reply from B’s instance; if it does not receive a timely reply, it can automatically retry the request using a different B instance. In the diagram, microservice A’s proxy has three B instances to choose from; it selects the third. If the third instance’s response is not fast enough, the proxy can use the first or second instance instead.
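The proxy's behavior, choosing an instance of B and retrying elsewhere on failure while A stays unaware of the details, can be sketched as follows. `forward` is a hypothetical stand‑in that simulates one failed instance:

```python
def forward(instance, request):
    # Simulate a failed/slow instance of microservice B.
    if instance == "B-3":
        raise TimeoutError(instance)
    return f"{instance} -> reply"

class ResilienceProxy:
    """Sits between microservice A and the instances of B. A talks
    only to the proxy; the proxy picks the instance and retries."""

    def __init__(self, instances):
        self.instances = list(instances)

    def handle(self, request):
        # Try the preferred instance first, then fall back to the rest;
        # the caller (A) never learns which instance actually replied.
        for instance in self.instances:
            try:
                return forward(instance, request)
            except TimeoutError:
                continue
        raise RuntimeError("no instance of B available")

proxy = ResilienceProxy(["B-3", "B-1", "B-2"])
```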

The main advantage of proxy‑based resilience is that it can be used without modifying individual microservices; any microservice can be placed behind a proxy.
