
Handling Timeouts in Microservices: Strategies and Best Practices

This article explains why timeouts are inevitable in distributed systems, illustrates the challenges they create, and presents five practical strategies—including using defaults, safe retries, status checks, and user‑focused fallback—to manage slow or failed API calls in microservice architectures.

Architects Research Society

Microservices are important and can bring great wins to our architecture and teams, but they also incur many costs. As microservices, serverless, and other distributed system architectures become more common, internalizing their problems and solutions is crucial. This article examines one tricky issue that network boundaries can introduce: timeouts.

Before you fear the term "distributed system," remember that even a small React app with a Node backend or a simple iOS client talking to AWS Lambda represents a distributed system. While reading this post, you are already involved in a distributed system that includes your web browser, a CDN, and a file storage system.

For background, I assume you know how to make API calls in your language of choice and handle success and failure, whether the calls are synchronous or asynchronous, HTTP or not. If you encounter unfamiliar terms or ideas, feel free to discuss them on Twitter or elsewhere; I try to add links where appropriate.

The problem we will explore is this: when a very, very slow API call eventually times out, assuming either (a) it succeeded or (b) it failed can lead us astray, because we simply cannot know which is true. Timeouts (or worse, infinite waits) are a fundamental fact of distributed systems that we need to know how to handle.

Problem

Let's start with a thought experiment: have you ever emailed a colleague asking for something?

[Tuesday, 9:58 am] You: "Hey, can you add me to our company's potential mentor list?"

Colleague: "..."

[Friday, 2:30 pm] You: [?]

What do you do?

If you want your request to be satisfied, you eventually need to determine that there is no reply. Do you wait longer? How long are you willing to wait?

Once you decide how long to wait, what action do you take? Do you resend the email? Try a different communication channel? Do you assume they won't respond?

Now, what exactly is happening? We want the usual request-response behavior: we send a request, the recipient processes it, and a reply comes back.

But something went wrong. Several possibilities exist:

They never received the message.

They received the email, processed it successfully, and sent you a reply that you never saw (or it went to spam).

They got the message but are still thinking about it, lost it, or simply forgot.

In the end, we just don't know!

This exact problem appears in any communication across a distributed system.

We may delay our request, processing, or response for an arbitrarily long time. Therefore, just like the email example, we need to answer the question "How long should we wait?" – a duration we call a timeout.

If you take away only one lesson from this article, let it be: use timeouts. Otherwise, you risk waiting forever for operations that never complete.

But once we hit the timeout limit, what do we do?

Methods

When people encounter timeouts in remote system calls, there are several common approaches. This list is not exhaustive, but it covers many of the most common scenarios I have seen.

Method #1

Assume the operation succeeded and continue.

Do not do this. Unfortunately, this is a common unconscious choice that can lead to very bad user‑experience outcomes even in production. If we assume the operation succeeded, our poor consumers will reasonably assume everything went fine, only to be disappointed and confused when they later discover the result.

Whenever you have a network call, look for both success and failure handling. For example, if you use Promise.then(...) in JavaScript, ask yourself where the corresponding .catch(...) is. If it is missing, you almost certainly have an error.
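To make the point concrete, here is a minimal sketch of a call with both handlers in place. The remote call is simulated with a Promise so the example is self-contained; in real code it would be a `fetch` or similar, and the function name and payload are hypothetical.

```javascript
// Hypothetical remote call, simulated so the example runs anywhere.
// Stands in for fetch()/axios in real code.
function fetchMentorList({ shouldFail }) {
  return new Promise((resolve, reject) => {
    setTimeout(() => {
      shouldFail
        ? reject(new Error("request timed out"))
        : resolve(["alice", "bob"]);
    }, 10);
  });
}

fetchMentorList({ shouldFail: true })
  .then((mentors) => console.log("Loaded", mentors.length, "mentors"))
  .catch((err) => {
    // Without this handler the failure is silently swallowed --
    // the "assume it succeeded" anti-pattern.
    console.error("Could not load mentors:", err.message);
  });
```

If the `.catch` were missing, the rejection would go unobserved and the rest of the application would carry on as if the call had succeeded.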

In very special cases you might genuinely not care whether the request succeeded or failed (e.g., UDP). But do not make this your default – exhaust other options first.

Method #2

For read‑only requests, use a cache or default value.

If the request is a read‑only operation that does not affect the remote side, you can return a previously cached value. If no successful request exists or caching makes no sense, fall back to a default value. This approach is simple and adds little performance overhead, but remember that external caches (e.g., memcached, Redis) can themselves time out.
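A minimal sketch of this fallback chain, with illustrative names and values (the exchange-rate scenario and the default of 1.0 are assumptions, not from the original): on failure, serve the last good value, and only if no value was ever cached, serve a hard-coded default.

```javascript
// Illustrative default for a read-only value; pick something safe for
// your domain.
const DEFAULT_EXCHANGE_RATE = 1.0;
let cachedRate = null;

async function getExchangeRate(fetchRemoteRate) {
  try {
    const rate = await fetchRemoteRate(); // may reject on timeout
    cachedRate = rate;                    // remember the last good value
    return rate;
  } catch (err) {
    // Read-only request: serving stale or default data is acceptable here.
    return cachedRate !== null ? cachedRate : DEFAULT_EXCHANGE_RATE;
  }
}
```

Note that if the "cache" is itself a remote system like Redis, the fallback read needs its own timeout handling too.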

Method #3

Assume the remote operation failed and automatically retry.

This raises many questions:

What if retrying is unsafe? Duplicate charges or repeated side‑effects can be disastrous.

Should retries be synchronous or asynchronous?

If synchronous, will retries slow down the consumer? This may violate service‑level expectations.

If asynchronous, how do we inform the consumer about success? Batch or single attempts?

How many retries? One, two, ten, or until success?

What delay strategy? Exponential back‑off, jitter, max wait cap?

Will retries worsen an overloaded remote server?
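Several of these questions have mechanical answers. Here is a sketch of bounded retries with exponential back-off, full jitter, and a maximum delay cap; the attempt count and delays are illustrative, not recommendations.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function withRetries(
  operation,
  { maxAttempts = 3, baseDelayMs = 100, maxDelayMs = 2000 } = {}
) {
  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      // Exponential back-off with full jitter:
      // delay drawn from [0, min(cap, base * 2^attempt)).
      const cap = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
      await sleep(Math.random() * cap);
    }
  }
  throw lastError; // all attempts exhausted
}
```

The jitter matters: if many clients retry on the same schedule after an outage, their synchronized retries can hammer the recovering server all over again.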

If the remote API can be safely retried because repeating a call has the same effect as making it once, we call it idempotent. Non‑idempotent APIs can cause duplicate data (e.g., double‑charging a credit card) or race conditions.

Making automatic retries safe often requires architectural work, such as sending a request UUID and having the remote side track it. Stripe’s API is a good real‑world example.

Method #4

Check if the request succeeded, and retry only if it is safe.

The idea is to follow a timed‑out request with another request that queries the original request’s status. This requires an endpoint that can report the status. If the endpoint says the request succeeded, we do not retry.

However, this approach has a serious flaw: the status endpoint may itself time out, or the remote service may still be processing, leaving us unable to confirm success.
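A sketch of the decision logic, assuming a hypothetical status endpoint that reports "succeeded", "failed", or "processing" for a given request ID (the endpoint, statuses, and function names are all assumptions for illustration):

```javascript
// After a timeout, ask the remote service what happened to the original
// request before deciding whether a retry is safe.
async function retryIfSafe(requestId, checkStatus, retry) {
  let status;
  try {
    status = await checkStatus(requestId); // may itself time out!
  } catch (err) {
    return { action: "unknown", reason: "status check failed" };
  }
  if (status === "succeeded") {
    return { action: "none" }; // the original worked; do NOT retry
  }
  if (status === "failed") {
    return { action: "retried", result: await retry() };
  }
  return { action: "unknown", reason: `still ${status}` }; // e.g. "processing"
}
```

Notice the two "unknown" branches: they are exactly the flaw described above, and the caller still has to decide what to do when they occur.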

Method #5

Give up and let the user figure it out.

This requires the least effort and avoids making the wrong automatic decision, which makes it a reasonable choice in many cases. We must, however, ask whether the user has enough information to determine the correct action.

In some scenarios, informing the consumer about the problem may be the best approach. If we do not want unlimited retries, we may eventually fall back to this path.

Conclusion

Distributed systems are hard, and there is no single silver‑bullet solution. If you feel stuck, don’t let perfection become the enemy of good.

Use timeouts.

Even with long timeout values (5 s, 10 s, or more), every network request should have a timeout. Choosing the right timeout is tricky – you don’t want too many false failures, nor do you want to waste time and risk an unhealthy application. Analyze historical request latency distributions and your app’s risk profile to pick a good value.

In any case, you don’t want your application server’s queue, connection pool, or buffers blocked by forever‑waiting operations. You can research more advanced patterns like circuit breakers, but timeouts are cheap and well‑supported by libraries. Use them!
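Most HTTP clients and libraries have timeout support built in, but a generic wrapper is a few lines. Here is a minimal sketch using `Promise.race`; the 5-second value in the usage comment is illustrative — derive yours from your observed latency distribution.

```javascript
// Enforce a deadline on any promise. The timer is cleared either way so
// it does not keep the process alive.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`timed out after ${ms} ms`)),
      ms
    );
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Usage (URL hypothetical):
// withTimeout(fetch("https://api.example.com/data"), 5000)
//   .then(handleResponse)
//   .catch(handleErrorOrTimeout);
```

One caveat: racing does not cancel the underlying operation — it only stops you from waiting on it. For true cancellation of a `fetch`, pair this with an `AbortController`.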

Make retries safe by default.

Beyond making retries safe, idempotent operations simplify reasoning about failures throughout your system, which makes idempotency a valuable practice in its own right.

Consider delegating work differently.

Asynchronous messaging offers attractive properties because the remote service no longer needs to stay fast and available; only the message broker does. However, messaging/async is not a cure‑all – you still need to ensure the broker receives the message, which can be difficult. Message brokers have trade‑offs, and consumers may have expectations about retry timing.

Don’t forget you can sometimes eliminate the network boundary altogether – combine services, reduce the number of calls, or inline functionality. Whatever approach you choose, remember that users don’t care whether you use microservices; they just want things to work.

Thanks for reading.

Tags: backend, distributed systems, microservices, operations, retry strategies, timeouts
Written by

Architects Research Society

A daily treasure trove for architects, expanding your view and depth. We share enterprise, business, application, data, technology, and security architecture, discuss frameworks, planning, governance, standards, and implementation, and explore emerging styles such as microservices, event‑driven, micro‑frontend, big data, data warehousing, IoT, and AI architecture.
