Mastering Compensation: When to Rollback vs Retry in Distributed Systems
This article explains the purpose of compensation mechanisms in microservice architectures, compares rollback and retry approaches, outlines their implementation details, discusses idempotency concerns, and provides practical best‑practice recommendations for building resilient distributed systems.
Compensation Mechanism Significance
In an e‑commerce scenario, a request typically flows through multiple microservices such as shopping‑cart, order, and payment services, forming a long call chain that is prone to failures at any network or infrastructure component.
Because a distributed transaction spans many cross‑machine calls, the chance that at least one of them fails grows with the length of the chain. These failures must be handled automatically rather than being allowed to bring down the whole system.
What Is Compensation?
Compensation (including transaction compensation and retry) aims to eliminate the inconsistent state caused by an exception, either by rolling back the operation or retrying it.
Any method that resolves an error‑induced inconsistency can be considered a form of compensation.
Rollback
Rollback can be explicit (calling a reverse interface) or implicit (no reverse call needed).
Explicit rollback typically involves two steps:
Identify the failed step and the scope of rollback. If some services lack a rollback interface, place those that provide one earlier in the orchestration so they can be rolled back if later services fail.
Provide the business data needed for rollback. The more data supplied, the more robust the rollback can be. Serializing this data to JSON and storing it in a NoSQL store is recommended.
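The two steps above can be sketched as a small orchestrator. This is a minimal illustration, not a production saga engine: the `Step` tuple, `run_with_rollback`, and the plain dict standing in for a NoSQL store are all hypothetical names introduced here.

```python
import json
from typing import Callable, Dict, List, Tuple

# Hypothetical step shape: (name, action, compensation).
# The action returns the business data its compensation will need later.
Step = Tuple[str, Callable[[], dict], Callable[[dict], None]]


def run_with_rollback(steps: List[Step], store: Dict[str, str]) -> bool:
    """Run steps in order; if one fails, undo the completed ones in reverse."""
    completed: List[Step] = []
    for step in steps:
        name, action, _ = step
        try:
            data = action()
        except Exception:
            # Scope of rollback: only the steps that actually completed,
            # newest first, each fed the data saved when it ran.
            for done_name, _, compensate in reversed(completed):
                compensate(json.loads(store[done_name]))
            return False
        # Save rollback data as JSON (the dict is a stand-in for a NoSQL store).
        store[name] = json.dumps(data)
        completed.append(step)
    return True
```

In this sketch a failing compensation would propagate its exception; a real system would likely need to retry or escalate that case.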
Implicit rollback is less common and relies on mechanisms like pre‑reserved inventory that automatically expires if payment does not occur within a timeout.
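As a rough illustration of the pre‑reserved inventory idea, the sketch below keeps holds that lapse on their own once a timeout passes, so no reverse call is ever issued. The class name `ReservationBook`, its methods, and the in‑memory storage are assumptions made for this example.

```python
import time
from typing import Dict, Optional, Tuple


class ReservationBook:
    """Stock holds that expire by themselves after `ttl` seconds."""

    def __init__(self, stock: int, ttl: float) -> None:
        self.stock = stock
        self.ttl = ttl
        # order_id -> (quantity, expiry timestamp)
        self.holds: Dict[str, Tuple[int, float]] = {}

    def _purge(self, now: float) -> None:
        # Dropping expired holds is the whole "rollback".
        expired = [oid for oid, (_, exp) in self.holds.items() if exp <= now]
        for oid in expired:
            del self.holds[oid]

    def available(self, now: Optional[float] = None) -> int:
        now = time.time() if now is None else now
        self._purge(now)
        return self.stock - sum(qty for qty, _ in self.holds.values())

    def reserve(self, order_id: str, qty: int, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        if qty > self.available(now):
            return False
        self.holds[order_id] = (qty, now + self.ttl)
        return True

    def confirm(self, order_id: str, now: Optional[float] = None) -> bool:
        """Payment succeeded: turn the hold into a permanent deduction."""
        now = time.time() if now is None else now
        self._purge(now)
        hold = self.holds.pop(order_id, None)
        if hold is None:
            return False  # the hold had already expired
        self.stock -= hold[0]
        return True
```

The `now` parameter exists only so the behavior can be exercised without waiting for real time to pass.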
Retry
Retry does not require a reverse interface, reducing long‑term development cost. It should be used when the downstream service returns temporary errors such as timeouts or rate‑limit responses, but not for permanent business errors like insufficient balance.
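One way to express the distinction between temporary and permanent errors is to retry only a dedicated exception type. The sketch below assumes the caller maps timeouts and rate‑limit responses to `TransientError` and business failures to `PermanentError`; both classes and `call_with_retry` are hypothetical names, not part of any library.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Temporary failure such as a timeout or a rate-limit response."""


class PermanentError(Exception):
    """Business failure such as insufficient balance; retrying cannot help."""


def call_with_retry(
    operation: Callable[[], T],
    max_attempts: int = 3,
    delay_seconds: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry only on TransientError; any other exception propagates at once."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # termination condition: attempts exhausted
            sleep(delay_seconds)
    raise AssertionError("unreachable")
```

Because `PermanentError` is never caught, an error like insufficient balance surfaces after a single attempt.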
Common retry strategies include:
Immediate retry: try once more right away.
Fixed interval: retry every N seconds.
Incremental interval: increase the wait time linearly (e.g., 0 s, 3 s, 6 s, …).
    return (retryCount - 1) * incrementInterval;
Exponential interval: double the wait time each attempt (here ^ denotes exponentiation, not XOR).
    return 2 ^ retryCount;
Full jitter: pick a random wait between zero and the exponential interval.
    return random(0, 2 ^ retryCount);
Equal jitter: combine exponential backoff with a random offset.
    var baseNum = 2 ^ retryCount;
    return baseNum + random(0, baseNum);
Overly aggressive retry policies (intervals that are too short or too many attempts) can overload downstream services and should be avoided.
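The interval formulas listed above translate fairly directly into Python. The function names, the injectable `rng` argument, and the `capped` helper are assumptions added for this sketch; the article itself does not specify an upper bound on the delay.

```python
import random


def incremental_interval(retry_count: int, increment: float = 3.0) -> float:
    # Linear growth: 0 s, 3 s, 6 s, ... for retry_count 1, 2, 3, ...
    return (retry_count - 1) * increment


def exponential_interval(retry_count: int) -> float:
    # Doubles on every attempt: 2 s, 4 s, 8 s, ...
    return 2.0 ** retry_count


def full_jitter(retry_count: int, rng=random) -> float:
    # Random wait anywhere between 0 and the exponential interval.
    return rng.uniform(0, 2 ** retry_count)


def equal_jitter(retry_count: int, rng=random) -> float:
    # Exponential base plus a random offset of up to the same size.
    base = 2 ** retry_count
    return base + rng.uniform(0, base)


def capped(delay: float, max_delay: float = 60.0) -> float:
    # Assumed addition: bound the delay so it cannot grow without limit.
    return min(delay, max_delay)
```

Jitter matters most when many clients fail at the same moment: without it they would all retry on the same schedule and hit the downstream service together.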
Idempotency Considerations
When implementing retry, ensure the operation is idempotent: repeated calls must not change the final state compared to a single call.
Typical solution:
Assign a globally unique identifier to each request (e.g., UUID or a distributed ID generator).
Before processing, check if the request ID has already been handled; if so, return the previous result or discard the duplicate.
Example implementation:
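if (isExistLog(requestId)) {
    // Duplicate request: reuse the stored outcome instead of executing again.
    var lastResult = getLastResult(requestId);
    if (lastResult == null) {
        // The first attempt is still running; wait for it to finish.
        return waitResult(requestId);
    }
    return lastResult;
}
// First time this ID is seen: record it before doing any work.
// The check and the record should be atomic so that two concurrent
// duplicates cannot both pass the check.
log(requestId);
// do something … (the business operation, which produces result)
logResult(requestId, result);
return result;
If compensation is performed via a message queue, embed the unique ID in the message and deduplicate on the consumer side.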
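For the message‑queue case, consumer‑side deduplication might look like the sketch below. It is not tied to any particular broker: `DedupConsumer`, the `{"id": ..., "body": ...}` message shape, and the in‑memory set (a stand‑in for a durable store) are all assumptions for illustration.

```python
from typing import Any, Callable, Dict, Set


class DedupConsumer:
    """Processes each message ID at most once per consumer instance."""

    def __init__(self, handler: Callable[[Any], None]) -> None:
        self.handler = handler
        self.seen: Set[str] = set()  # stand-in for a durable store

    def on_message(self, message: Dict[str, Any]) -> bool:
        """Return True if the message was processed, False if it was a duplicate."""
        msg_id = message["id"]
        if msg_id in self.seen:
            return False  # redelivery: discard
        self.handler(message["body"])
        # Mark as seen only after success, so a failed handler
        # leaves the message eligible for redelivery.
        self.seen.add(msg_id)
        return True
```

Marking the ID after the handler runs means a crash between the two steps would cause reprocessing, so in practice the handler's side effect and the ID record would ideally be committed together.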
Best Practices
Combine retry with rate limiting and circuit breaking; define a clear termination condition; avoid retrying non‑idempotent operations; consider saga patterns for long‑running compensations while ensuring they do not lock scarce resources.
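To make the circuit‑breaking recommendation concrete, here is a deliberately small breaker that a retry loop could wrap its calls in. The class, its thresholds, and the injectable `clock` are hypothetical; mature resilience libraries offer far more complete implementations.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when calls are rejected without reaching the downstream service."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, operation: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit is open; not calling downstream")
            # Cool-down elapsed: allow one trial call (half-open).
            # A single failure now re-opens the circuit immediately.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```

A retry loop that treats `CircuitOpenError` as a signal to stop gets a clear termination condition and avoids piling more load onto a service that is already struggling.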
Summary
The article covered why compensation is needed in distributed systems, described the two main approaches (rollback and retry) along with their implementation details, explained the importance of idempotency, and offered practical best‑practice recommendations.
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
