Mastering Compensation: When to Rollback vs Retry in Distributed Systems

This article explains the purpose of compensation mechanisms in microservice architectures, compares rollback and retry approaches, outlines their implementation details, discusses idempotency concerns, and provides practical best‑practice recommendations for building resilient distributed systems.

Programmer DD

Why Compensation Mechanisms Matter

In an e‑commerce scenario, a request typically flows through multiple microservices such as shopping‑cart, order, and payment services, forming a long call chain that is prone to failures at any network or infrastructure component.

Because a distributed transaction involves many cross‑machine calls, the chance that at least one of them fails grows with the length of the chain. These exceptions must be handled automatically rather than being allowed to bring down the whole system.
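To make the compounding effect concrete, here is a minimal sketch. It assumes every call in the chain fails independently and with the same probability, which real systems only approximate; the function name and the 99.9% figure are illustrative, not measurements.

```python
def chain_success_probability(per_call_success: float, calls: int) -> float:
    """Probability that every call in a chain succeeds, assuming independence."""
    return per_call_success ** calls


# Illustrative numbers only: 99.9% per-call success over chains of growing length.
for calls in (1, 10, 50):
    p = chain_success_probability(0.999, calls)
    print(f"{calls:>2} calls -> {p:.4f} chance that the whole request succeeds")
```

Even a very reliable single call becomes a noticeably less reliable request once enough calls are chained together.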

What Is Compensation?

Compensation, which covers both transaction compensation (rollback) and retry, aims to eliminate the inconsistent state that an exception leaves behind, either by rolling the operation back or by retrying it.

Any method that resolves an error‑induced inconsistency can be considered a form of compensation.

Rollback

Rollback can be explicit (calling a reverse interface) or implicit (no reverse call needed).

Explicit rollback typically involves two steps:

Identify the failed step and the scope of rollback. If some services lack a rollback interface, place those that provide one earlier in the orchestration so they can be rolled back if later services fail.

Provide the business data needed for rollback. The more data supplied, the more robust the rollback can be. Serializing this data to JSON and storing it in a NoSQL store is recommended.
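The two steps above can be sketched roughly as follows. This is a minimal illustration, not a production orchestrator: the `Step` type, `run_with_rollback`, and the in-memory `snapshots` list (standing in for a NoSQL store) are all hypothetical names, and a real implementation would also need to handle a compensation call that itself fails.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    name: str
    action: Callable[[Dict], None]      # forward call to a service
    compensate: Callable[[Dict], None]  # reverse interface of that service


def run_with_rollback(steps: List[Step], context: Dict, saved: List[str]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action(context)
        except Exception:
            # Scope of rollback: only the steps that actually completed.
            for done in reversed(completed):
                done.compensate(context)
            return False
        completed.append(step)
        # Persist the business data a later rollback would need, as JSON.
        saved.append(json.dumps({"step": step.name, "context": context}))
    return True


# Hypothetical services: inventory can be rolled back, payment fails.
stock = {"sku-1": 5}

def reserve(ctx: Dict) -> None:
    stock[ctx["sku"]] -= ctx["qty"]

def release(ctx: Dict) -> None:
    stock[ctx["sku"]] += ctx["qty"]

def pay(ctx: Dict) -> None:
    raise RuntimeError("payment service unavailable")

def refund(ctx: Dict) -> None:
    pass

snapshots: List[str] = []
ok = run_with_rollback(
    [Step("reserve", reserve, release), Step("pay", pay, refund)],
    {"sku": "sku-1", "qty": 2},
    snapshots,
)
print(ok, stock)
```

Because the payment step fails, the inventory reservation is released again and the stock returns to its starting value.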

Implicit rollback is less common and relies on mechanisms like pre‑reserved inventory that automatically expires if payment does not occur within a timeout.
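One way to picture implicit rollback is an inventory hold with a deadline. The sketch below is an in-memory assumption, not a description of any particular inventory service: `ReservationStore` and its methods are hypothetical, and the `now` parameter exists only so that expiry can be demonstrated without waiting.

```python
import time
from typing import Dict, Optional, Tuple


class ReservationStore:
    """In-memory stand-in for an inventory service with expiring holds."""

    def __init__(self, stock: int, ttl_seconds: float) -> None:
        self.stock = stock
        self.ttl = ttl_seconds
        self.holds: Dict[str, Tuple[int, float]] = {}  # order_id -> (qty, expires_at)

    def _expire(self, now: float) -> None:
        # Lazily release every hold whose deadline has passed.
        for order_id, (qty, expires_at) in list(self.holds.items()):
            if now >= expires_at:
                self.stock += qty
                del self.holds[order_id]

    def reserve(self, order_id: str, qty: int, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        self._expire(now)
        if qty > self.stock:
            return False
        self.stock -= qty
        self.holds[order_id] = (qty, now + self.ttl)
        return True

    def confirm(self, order_id: str, now: Optional[float] = None) -> bool:
        """Payment arrived: make the hold permanent if it has not expired."""
        now = time.monotonic() if now is None else now
        self._expire(now)
        return self.holds.pop(order_id, None) is not None

    def available(self, now: Optional[float] = None) -> int:
        now = time.monotonic() if now is None else now
        self._expire(now)
        return self.stock


store = ReservationStore(stock=10, ttl_seconds=30)
store.reserve("order-1", 4, now=0)
print(store.available(now=10))  # hold is still active
print(store.available(now=31))  # hold has expired and the stock is back
```

No reverse call is ever issued: if payment never confirms the hold, the timeout alone restores the inventory.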

Retry

Retry does not require a reverse interface, reducing long‑term development cost. It should be used when the downstream service returns temporary errors such as timeouts or rate‑limit responses, but not for permanent business errors like insufficient balance.
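The distinction between temporary and permanent errors can be expressed directly in the retry loop. The following is a sketch under assumptions: the two exception classes and `call_with_retry` are hypothetical names, and how a real client maps timeouts or rate-limit responses onto them depends on the protocol in use.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Temporary failure such as a timeout or a rate-limit response."""


class PermanentError(Exception):
    """Business failure such as insufficient balance; retrying cannot help."""


def call_with_retry(operation: Callable[[], T], max_attempts: int = 3,
                    delay_seconds: float = 0.0) -> T:
    """Retry only transient failures, making at most max_attempts calls."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # attempts exhausted; the caller decides what to do next
            time.sleep(delay_seconds)
        # PermanentError is deliberately not caught: it propagates at once.
    raise ValueError("max_attempts must be at least 1")
```

A permanent error leaves the function on the first attempt, so no time is spent repeating a call that cannot succeed.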

Common retry strategies include:

Immediate retry : try once more right away.

Fixed interval : retry every N seconds.

Incremental interval : increase the wait time linearly (e.g., 0 s, 3 s, 6 s, …). Here retryCount is the 1‑based attempt number.

return (retryCount - 1) * incrementInterval;

Exponential interval : double the wait time each attempt.

return Math.pow(2, retryCount);

Full jitter : pick a random wait between zero and the exponential interval, so that clients that failed together do not all retry at the same moment.

return random(0, Math.pow(2, retryCount));

Equal jitter : combine exponential backoff with a random offset.

var baseNum = Math.pow(2, retryCount);
return baseNum + random(0, baseNum);
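The interval formulas above translate almost line for line into runnable code. This sketch assumes the intervals are in seconds and that `retry_count` starts at 1; the function names and the injected `random.Random` instance (used so the jitter is reproducible) are choices made here for illustration.

```python
import random


def incremental_interval(retry_count: int, increment: float = 3.0) -> float:
    """Linear growth: 0, increment, 2 * increment, ..."""
    return (retry_count - 1) * increment


def exponential_interval(retry_count: int) -> float:
    """Doubles on every attempt: 2, 4, 8, ..."""
    return float(2 ** retry_count)


def full_jitter_interval(retry_count: int, rng: random.Random) -> float:
    """Uniformly random between 0 and the exponential interval."""
    return rng.uniform(0, 2 ** retry_count)


def equal_jitter_interval(retry_count: int, rng: random.Random) -> float:
    """Exponential interval plus a random offset of up to the same size."""
    base = 2 ** retry_count
    return base + rng.uniform(0, base)


rng = random.Random(42)  # fixed seed only to make the demo repeatable
for n in range(1, 5):
    print(n, incremental_interval(n), exponential_interval(n),
          round(full_jitter_interval(n, rng), 2),
          round(equal_jitter_interval(n, rng), 2))
```

Full jitter spreads callers across the whole interval, while the equal jitter variant shown here always waits at least the exponential interval before adding its random part.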

Overly aggressive retry policies (too short intervals or too many attempts) can overload downstream services and should be avoided.
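One common guard against an overly aggressive policy is to bound both the number of retries and the size of each wait. The sketch below is one possible shape for such a bound; `bounded_schedule`, the one-second base, and the 30-second cap are assumptions chosen for illustration.

```python
from typing import List


def bounded_schedule(max_retries: int, base_seconds: float = 1.0,
                     cap_seconds: float = 30.0) -> List[float]:
    """Delays for retries 1..max_retries: doubling, but never above the cap."""
    return [min(cap_seconds, base_seconds * 2 ** n)
            for n in range(1, max_retries + 1)]


schedule = bounded_schedule(6)
print(schedule, "total wait:", sum(schedule))
```

Printing the whole schedule up front makes it easy to see the worst-case time a caller may spend retrying before giving up.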

Idempotency Considerations

When implementing retry, ensure the operation is idempotent: repeated calls must not change the final state compared to a single call.

Typical solution:

Assign a globally unique identifier to each request (e.g., UUID or a distributed ID generator).

Before processing, check if the request ID has already been handled; if so, return the previous result or discard the duplicate.

Example implementation:

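if (isExistLog(requestId)) {
    // Duplicate request: reuse the recorded outcome instead of running again.
    var lastResult = getLastResult(requestId);
    if (lastResult == null) {
        // The first call is still in progress; wait for it to finish.
        return waitResult(requestId);
    }
    return lastResult;
}
// First time this ID is seen. The check above and this write must be atomic
// (e.g., a unique-key insert), or two concurrent duplicates can both pass.
log(requestId);
var result = doSomething(); // the actual business operation
logResult(requestId, result);
return result;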
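For readers who want something runnable, the same flow can be sketched with an in-memory store. Everything here is an assumption made for illustration: `IdempotentExecutor` is a hypothetical class, a lock and dictionaries stand in for a shared database, and a failed first call simply forgets the ID so that a later retry can run.

```python
import threading
from typing import Any, Callable, Dict


class IdempotentExecutor:
    """Runs an operation at most once per request ID (in-memory sketch)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._done: Dict[str, threading.Event] = {}
        self._results: Dict[str, Any] = {}

    def execute(self, request_id: str, operation: Callable[[], Any]) -> Any:
        with self._lock:
            event = self._done.get(request_id)
            first = event is None
            if first:
                # Record the ID atomically with the existence check.
                event = self._done[request_id] = threading.Event()
        if not first:
            event.wait()  # the first call may still be in progress
            if request_id in self._results:
                return self._results[request_id]
            # The first call failed and forgot the ID; take over as a retry.
            return self.execute(request_id, operation)
        try:
            result = operation()
        except Exception:
            with self._lock:
                del self._done[request_id]  # allow a later retry to run
            event.set()
            raise
        self._results[request_id] = result
        event.set()
        return result
```

A duplicate call with the same ID returns the stored result without invoking the operation again, which is what makes a retry safe.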

If compensation is performed via a message queue, embed the unique ID in the message and deduplicate on the consumer side.
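Consumer-side deduplication might look like the sketch below. It is deliberately broker-agnostic: `make_message`, `DedupConsumer`, and the JSON envelope with an `id` field are hypothetical, and the in-memory set would need to be replaced by durable, shared storage in any real deployment.

```python
import json
import uuid
from typing import Callable, Dict, Set


def make_message(payload: Dict) -> str:
    """Producer side: embed a unique ID in the message envelope."""
    return json.dumps({"id": str(uuid.uuid4()), "payload": payload})


class DedupConsumer:
    """Consumer side: process each message ID at most once."""

    def __init__(self, handler: Callable[[Dict], None]) -> None:
        self._handler = handler
        self._seen: Set[str] = set()

    def on_message(self, raw: str) -> bool:
        message = json.loads(raw)
        if message["id"] in self._seen:
            return False  # duplicate delivery: acknowledge and skip
        self._handler(message["payload"])
        # Mark as seen only after the handler succeeds, so a failed
        # attempt can still be redelivered and processed.
        self._seen.add(message["id"])
        return True


refunds = []
consumer = DedupConsumer(refunds.append)
msg = make_message({"order": "A-1", "action": "refund"})
print(consumer.on_message(msg), consumer.on_message(msg), len(refunds))
```

In this sketch the handler call and the "seen" marker are separate steps, so a crash between them could still cause one reprocessing; closing that gap usually means recording the ID in the same transaction as the business change.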

Best Practices

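Combine retry with rate limiting and circuit breaking, so that retries cannot amplify the load on a downstream service that is already struggling.

Define a clear termination condition, such as a maximum number of attempts or a total time budget.

Avoid retrying non‑idempotent operations.

Consider saga patterns for long‑running compensations, and make sure they do not lock scarce resources while they run.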
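As an illustration of pairing retry with circuit breaking, here is a deliberately small breaker. It is a sketch under assumptions rather than a substitute for a mature resilience library: `CircuitBreaker`, `CircuitOpenError`, the thresholds, and the injectable `clock` are all hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised without calling downstream while the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, operation: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_seconds:
                raise CircuitOpenError("downstream is cooling off")
            self._opened_at = None  # half-open: let one trial call through
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or reopen) the circuit
            raise
        self._failures = 0  # any success closes the circuit again
        return result
```

A retry loop that wraps `breaker.call(...)` should treat `CircuitOpenError` as its termination condition, so that retries stop as soon as the breaker opens instead of adding to the load.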

Summary

The article covered why compensation is needed in distributed systems, described two main approaches—rollback and retry—their implementation details, the importance of idempotency, and practical best‑practice recommendations.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Programmer DD

A tinkering programmer and author of "Spring Cloud Microservices in Action"
