Business Compensation in Distributed Systems: Rollback, Retry, and Consistency
This article explains how distributed systems handle business inconsistencies through compensation mechanisms, detailing rollback versus retry approaches, their implementation patterns such as explicit/implicit rollback, various retry strategies, and the trade‑offs between ACID and BASE consistency models for achieving eventual consistency.
Introduction
In distributed applications, a single business process often involves multiple services and traverses DNS, network cards, switches, routers, and load balancers. Any failure in these components can cause errors, a problem that becomes more pronounced in micro‑service architectures where consistency must be guaranteed.
What Is Business Compensation?
Business compensation refers to the mechanisms used to eliminate inconsistent states when an operation throws an exception. It ensures that a failed step either retries until success or rolls back to a previous stable state.
Design Approaches for Business Compensation
There are two primary ways to implement compensation:
Rollback (transaction compensation) : Perform reverse operations to undo the business flow, effectively abandoning the current operation.
Retry : Re‑execute the forward operation, assuming the failure is temporary and a successful attempt is still possible.
Typically, a workflow engine orchestrates these services and applies compensation to achieve eventual consistency.
Rollback
Rollback restores a program or data to its last correct version. In distributed compensation, rollback means returning services to their state before the failed call.
Explicit vs. Implicit Rollback
Explicit rollback : Calls a reverse interface to undo the previous operation, or cancels an operation that has not yet finished (cancellation requires the affected resources to be locked).
Implicit rollback : Relies on downstream services to handle failures automatically.
Explicit rollback usually involves two steps:
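Identify the failed step and its state to determine the rollback scope. Because some services do not provide a rollback interface, place the services that do provide one earlier in the orchestration, so that a failure in a later step can still be undone.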
Provide sufficient business data for the rollback (e.g., account balances, amounts) to enable thorough validation.
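The two steps above can be sketched as a small orchestrator that remembers which steps completed and, on failure, calls their reverse operations in reverse order. This is a minimal illustration, not a workflow engine: the CompensableStep interface and the CompensatingWorkflow class are hypothetical names introduced here, and the sketch assumes compensate() itself does not fail.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical abstraction: a forward action paired with the reverse action that undoes it.
interface CompensableStep {
    String name();
    void execute() throws Exception; // forward operation
    void compensate();               // reverse operation (explicit rollback)
}

final class CompensatingWorkflow {
    /**
     * Runs the steps in order. If one fails, the steps that already completed
     * are compensated in reverse order and the method returns false.
     */
    static boolean run(List<CompensableStep> steps, List<String> log) {
        Deque<CompensableStep> completed = new ArrayDeque<>();
        for (CompensableStep step : steps) {
            try {
                step.execute();
                log.add("done:" + step.name());
                completed.push(step); // most recent step ends up on top
            } catch (Exception e) {
                log.add("failed:" + step.name());
                // Rollback scope: everything that completed before the failure.
                while (!completed.isEmpty()) {
                    CompensableStep done = completed.pop();
                    done.compensate();
                    log.add("undone:" + done.name());
                }
                return false;
            }
        }
        return true;
    }
}
```

In a real system each step would also carry the business data needed by its reverse call (for example the debited amount), and a failing compensation would itself need retrying or manual handling.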
Retry
Retry assumes a failure is temporary. It avoids the need for reverse interfaces, reducing maintenance cost, especially when business logic changes frequently.
When to Use Retry
Retry is suitable for transient errors such as timeouts, rate-limiting, or temporary unavailability (e.g., HTTP 503 or 429). It is not appropriate for permanent errors: a business failure like insufficient balance or permission denial, or a response such as HTTP 404, will fail the same way on every attempt.
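One way to encode this distinction is a small classifier that the caller consults before scheduling a retry. The sketch below is an assumption for illustration: BusinessException and RetryClassifier are hypothetical names, and treating every IOException as transient is a simplification that a real service would refine.

```java
import java.io.IOException;

// Hypothetical exception for permanent business failures (e.g., insufficient balance).
class BusinessException extends RuntimeException {
    BusinessException(String message) {
        super(message);
    }
}

final class RetryClassifier {
    /** Returns true only for failures this sketch assumes to be temporary. */
    static boolean isRetryable(Throwable error) {
        if (error instanceof BusinessException) {
            return false; // retrying cannot change a business rule violation
        }
        // Timeouts and connection failures (SocketTimeoutException,
        // ConnectException) are subclasses of IOException.
        return error instanceof IOException;
    }
}
```

Anything the classifier does not recognise is treated as non-retryable, so an unexpected bug is surfaced instead of being retried indefinitely.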
Retry Strategies
Common strategies include:
Immediate retry : One quick retry after a failure; if it fails again, switch to another strategy.
Fixed interval : Retry at a constant interval (e.g., every 5 minutes).
Incremental interval : Increase the wait time linearly (e.g., 0 s, 5 s, 10 s…).
return (retryCount - 1) * incrementInterval;
Exponential backoff : Double the interval each time.
return 1 << retryCount; // 2 to the power retryCount; in Java, ^ is XOR, not exponentiation
Full jitter : Add randomness to the exponential backoff.
return random(0, 1 << retryCount);
Equal jitter : Combine exponential backoff with a random offset.
int baseNum = 1 << retryCount;
return baseNum + random(0, baseNum);
In these snippets, random(a, b) stands for a helper that returns a random value between a and b.
Chart (not reproduced here): behavior of strategies 3-6, with retry count on the x-axis.
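The four interval formulas can be collected into one helper class for comparison. This is a sketch under stated assumptions: the Backoff class and its method names are made up for illustration, delays are expressed in seconds, the random source is passed in so the behavior is testable, and retryCount is assumed to stay well below 63 so the shift does not overflow.

```java
import java.util.Random;

final class Backoff {
    /** Linear growth: 0, increment, 2 * increment, ... for retryCount 1, 2, 3, ... */
    static long incremental(int retryCount, long incrementSeconds) {
        return (retryCount - 1) * incrementSeconds;
    }

    /** Exponential backoff: 2 to the power retryCount (bit shift, not the XOR operator ^). */
    static long exponential(int retryCount) {
        return 1L << retryCount;
    }

    /** Full jitter: a random delay in [0, 2^retryCount). */
    static long fullJitter(int retryCount, Random rng) {
        return (long) (rng.nextDouble() * exponential(retryCount));
    }

    /** Equal jitter as described in the text: exponential base plus a random offset in [0, base). */
    static long equalJitter(int retryCount, Random rng) {
        long base = exponential(retryCount);
        return base + (long) (rng.nextDouble() * base);
    }
}
```

For example, Backoff.incremental(3, 5) yields 10 and Backoff.exponential(3) yields 8. Production code would normally also cap the delay at a maximum and limit the total number of attempts.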
Idempotency Requirement
Endpoints that are retried must be idempotent; repeated calls should not cause cumulative changes. This is achieved by assigning a unique identifier to each request and discarding duplicates.
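A minimal sketch of this deduplication, assuming the caller generates the unique request ID and sends the same ID on every retry. The IdempotentAccount class is hypothetical, and the in-memory map stands in for what would normally be a durable store shared by all instances of the service.

```java
import java.util.HashMap;
import java.util.Map;

final class IdempotentAccount {
    private long balance;
    // requestId -> balance returned when that request was first applied
    private final Map<String, Long> processed = new HashMap<>();

    /** Applies the deposit once per requestId; a repeated call returns the recorded result. */
    synchronized long deposit(String requestId, long amount) {
        Long previous = processed.get(requestId);
        if (previous != null) {
            return previous; // duplicate request: no cumulative change
        }
        balance += amount;
        processed.put(requestId, balance);
        return balance;
    }
}
```

Calling deposit("r1", 100) twice leaves the balance at 100, while a later deposit("r2", 50) raises it to 150.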
Consistency Models: ACID vs. BASE
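ACID (Atomicity, Consistency, Isolation, Durability) provides strong consistency but scales poorly across services; BASE (Basically Available, Soft state, Eventually consistent) accepts weaker consistency in exchange for better availability and scalability, and is suitable for most distributed transactions. In retry or rollback scenarios, eventual consistency (BASE) is usually sufficient.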
Design Considerations for Business Compensation
All services involved in the flow must support idempotency, and their upstream callers must provide a retry mechanism.
A high‑availability workflow engine should monitor and control the entire compensation process.
Compensation logic does not have to be a strict inverse; it can be parallel or simplified.
Compensation is highly business‑specific and rarely generic.
Downstream services should offer short‑term resource reservation (e.g., reserving inventory for 15 minutes) to enable safe rollback.
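The reservation idea in the last point can be sketched as a hold with an expiry time: stock is set aside when reserved, returned when the hold is cancelled or expires, and permanently consumed when confirmed. The InventoryReservations class and its method names are hypothetical, and the current time is passed in as a parameter (an assumption made so the expiry logic is easy to test).

```java
import java.util.HashMap;
import java.util.Map;

final class InventoryReservations {
    private int available;
    private final long ttlMillis;
    // orderId -> {quantity, expiresAtMillis}
    private final Map<String, long[]> holds = new HashMap<>();

    InventoryReservations(int stock, long ttlMillis) {
        this.available = stock;
        this.ttlMillis = ttlMillis;
    }

    /** Sets stock aside for a limited time; fails if the order already holds stock or stock is short. */
    synchronized boolean reserve(String orderId, int quantity, long nowMillis) {
        releaseExpired(nowMillis);
        if (holds.containsKey(orderId) || quantity > available) {
            return false;
        }
        available -= quantity;
        holds.put(orderId, new long[] {quantity, nowMillis + ttlMillis});
        return true;
    }

    /** Consumes the held stock; returns false if the hold is missing or already expired. */
    synchronized boolean confirm(String orderId, long nowMillis) {
        releaseExpired(nowMillis);
        return holds.remove(orderId) != null;
    }

    /** Rollback path: returns the held stock immediately. */
    synchronized void cancel(String orderId) {
        long[] hold = holds.remove(orderId);
        if (hold != null) {
            available += (int) hold[0];
        }
    }

    synchronized int available(long nowMillis) {
        releaseExpired(nowMillis);
        return available;
    }

    // Implicit rollback: expired holds return their stock without any caller action.
    private void releaseExpired(long nowMillis) {
        holds.entrySet().removeIf(entry -> {
            if (entry.getValue()[1] <= nowMillis) {
                available += (int) entry.getValue()[0];
                return true;
            }
            return false;
        });
    }
}
```

With a 15-minute TTL (900,000 ms), an upstream service that crashes after reserving does not need to call cancel: the stock becomes available again once the hold expires.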
Java High-Performance Architecture
Sharing Java development articles and resources, including SSM architecture and the Spring ecosystem (Spring Boot, Spring Cloud, MyBatis, Dubbo, Docker), Zookeeper, Redis, architecture design, microservices, message queues, Git, etc.
