Understanding the Saga Pattern for Distributed Data Consistency in Microservices
This article explains the importance of data consistency in distributed microservice systems, introduces the Saga pattern as a solution, compares it with two‑phase commit and TCC, and details its architecture, recovery mechanisms, and implementation considerations within ServiceComb.
Data consistency is a crucial issue when building business systems. Traditionally the database guarantees consistency, but in microservice and distributed environments that guarantee becomes hard to maintain. ServiceComb, an open-source microservice framework, addresses this with the ServiceComb-Saga project, which aims for eventual consistency.
Data Consistency in Monolithic Applications
Imagine a large enterprise offering flight, car‑rental and hotel bookings as a one‑stop travel service. In a monolith the three service requests can be placed in a single database transaction, ensuring either all succeed or all roll back.
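A minimal sketch of the monolithic case, using Python's built-in sqlite3 module. The bookings table, the book_trip function, and the fail_on parameter are hypothetical names chosen for illustration; the point is only that one local transaction covers all three bookings.

```python
import sqlite3

def book_trip(conn, customer, fail_on=None):
    """Insert all three bookings in one local transaction (hypothetical schema)."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for service in ("flight", "car", "hotel"):
                if service == fail_on:
                    raise RuntimeError(f"{service} booking failed")
                conn.execute(
                    "INSERT INTO bookings (customer, service) VALUES (?, ?)",
                    (customer, service),
                )
        return True
    except RuntimeError:
        return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (customer TEXT, service TEXT)")

ok = book_trip(conn, "alice")                     # all three rows are committed
failed = book_trip(conn, "bob", fail_on="hotel")  # flight and car rows are rolled back
rows = conn.execute("SELECT customer, service FROM bookings").fetchall()
```

After both calls only alice's three bookings remain in the table, because bob's partial work was rolled back together.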
Data Consistency in Microservice Scenarios
As the business grows, the monolith is split into four independent services (flight, car-rental, hotel, and payment), each with its own database and communicating via HTTP. No single database can guarantee the original transaction-level consistency any longer.
Sagas
A saga is a long-lived transaction that can be decomposed into a set of interleavable sub-transactions, each of which is a real transaction that maintains database consistency.
In our travel‑planning scenario a saga consists of four sub‑transactions: flight booking, car‑rental, hotel booking, and payment.
Saga Execution Principle
Transactions in a saga are linked and must be executed as a (non‑atomic) unit. If any sub‑transaction fails, compensation must be performed. Each sub‑transaction Ti must provide a corresponding compensating transaction Ci.
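When every sub-transaction Ti has a compensating transaction Ci, the saga has two possible outcomes: either all of T1, T2, …, Tn complete (the best case), or T1, T2, …, Tj complete and a later step fails, after which the compensations Cj, …, C2, C1 run in reverse order (0 < j < n).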
In the best case all bookings succeed; otherwise the bookings that already completed are cancelled.
If the final payment fails, all previous bookings are also cancelled.
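The execution principle can be sketched as a small sequential runner. This is an illustrative sketch, not ServiceComb-Saga code: Step, run_saga, and make_step are hypothetical names, and the sketch assumes compensations themselves never fail.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    name: str
    action: Callable[[], None]       # the sub-transaction T_i
    compensate: Callable[[], None]   # its compensating transaction C_i

def run_saga(steps: List[Step]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[Step] = []
    for step in steps:
        try:
            step.action()
        except Exception:
            for done in reversed(completed):
                done.compensate()
            return False
        completed.append(step)
    return True

trace = []  # records what happened, for demonstration only

def make_step(name, fails=False):
    def action():
        if fails:
            raise RuntimeError(name + " failed")
        trace.append("do " + name)
    def compensate():
        trace.append("undo " + name)
    return Step(name, action, compensate)

steps = [
    make_step("flight"),
    make_step("car"),
    make_step("hotel"),
    make_step("payment", fails=True),  # the final payment fails
]
result = run_saga(steps)
```

With the payment step failing, the trace shows the three bookings followed by their cancellations in reverse order: hotel, car, flight.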
Saga Recovery Methods
The original paper describes two recovery strategies:
Backward recovery – compensate all completed transactions if any sub-transaction fails.
Forward recovery – retry the failed transaction, assuming each sub-transaction will eventually succeed.
Forward recovery does not require compensating transactions, which makes it suitable when failures are rare or when compensation is impossible.
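Forward recovery amounts to a retry loop around the failed sub-transaction. The sketch below is a simplified illustration: run_with_forward_recovery, max_attempts, and delay_seconds are hypothetical names, and it assumes the action is idempotent, so that repeating it after a partial failure is safe.

```python
import time

def run_with_forward_recovery(action, max_attempts=5, delay_seconds=0.0):
    """Retry `action` until it succeeds; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            action()
            return attempt
        except Exception:
            if attempt == max_attempts:
                raise  # give up; a real system would alert an operator here
            time.sleep(delay_seconds)

calls = {"count": 0}

def flaky_payment():
    # Hypothetical service call that fails twice, then succeeds.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("payment service unavailable")

attempts_used = run_with_forward_recovery(flaky_payment)
```

The cap on attempts is a design choice of this sketch; pure forward recovery as described above would keep retrying until the sub-transaction succeeds.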
Conditions for Using Saga
Only two levels of nesting are allowed: a top‑level saga and simple sub‑transactions.
Full atomicity cannot be achieved; sagas may see partial results of other sagas.
Each sub‑transaction must be an independent atomic operation.
In our scenario flight, car‑rental, hotel and payment are naturally independent.
Saga Log
To recover from crashes, the saga persists events such as SagaStarted, TransactionStarted, TransactionEnded, TransactionAborted, TransactionCompensated, and SagaEnded. By storing these events (e.g., as JSON) in a durable store (SQL, NoSQL, message queue, or file), the saga can resume from any intermediate state.
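As a rough illustration of how such a log supports recovery, the sketch below stores events as JSON strings in a list and, after a simulated crash, works out which transactions still need compensating. The field names ("type", "transaction") and both function names are assumptions made for this example, not the ServiceComb-Saga log format, and the sketch assumes backward recovery is the policy applied on restart.

```python
import json

def append_event(log, event_type, transaction=None):
    """Append one event as a JSON string; `log` stands in for a durable store."""
    log.append(json.dumps({"type": event_type, "transaction": transaction}))

def uncompensated_transactions(log):
    """Return transactions that ended but were not compensated, newest first."""
    ended, compensated = [], set()
    for line in log:
        event = json.loads(line)
        if event["type"] == "SagaEnded":
            return []  # the saga finished; nothing to recover
        if event["type"] == "TransactionEnded":
            ended.append(event["transaction"])
        elif event["type"] == "TransactionCompensated":
            compensated.add(event["transaction"])
    return [t for t in reversed(ended) if t not in compensated]

log = []
append_event(log, "SagaStarted")
append_event(log, "TransactionStarted", "flight")
append_event(log, "TransactionEnded", "flight")
append_event(log, "TransactionStarted", "car")
append_event(log, "TransactionEnded", "car")
append_event(log, "TransactionStarted", "hotel")
append_event(log, "TransactionAborted", "hotel")
append_event(log, "TransactionCompensated", "car")
# Simulated crash: the coordinator stops before compensating "flight".

to_compensate = uncompensated_transactions(log)
```

On restart, replaying the log shows that only the flight booking is still waiting for its compensation.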
Saga Request Data Structure
The travel‑planning request can be modeled as a directed acyclic graph where the root is the saga start task and leaves are saga end tasks, allowing parallel execution of independent services.
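One way to picture the request graph is as a mapping from each task to the tasks it depends on. The sketch below uses graphlib.TopologicalSorter from the Python standard library (Python 3.9+) to group tasks into batches that could run in parallel. The task names and the assumption that payment waits for all three bookings are illustrative, not taken from the ServiceComb-Saga request format.

```python
from graphlib import TopologicalSorter

# Each key lists its predecessors; "saga-start" is the root, "saga-end" the leaf.
request = {
    "saga-start": [],
    "flight": ["saga-start"],
    "car": ["saga-start"],
    "hotel": ["saga-start"],
    "payment": ["flight", "car", "hotel"],
    "saga-end": ["payment"],
}

def execution_batches(graph):
    """Group tasks into batches; tasks within one batch have no mutual dependency."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph is not acyclic
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches

batches = execution_batches(request)
```

Here the second batch contains the flight, car, and hotel tasks, which is what allows the three bookings to run side by side.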
Parallel Saga
Flight, car‑rental and hotel bookings can be processed in parallel, but if one fails while another is still running, the system must handle time‑outs and possibly trigger compensations or manual intervention.
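A simplified sketch of the parallel case using concurrent.futures from the standard library. It waits for every booking to settle and then compensates the ones that succeeded if any other failed. The run_parallel function and the shape of the bookings mapping are hypothetical, and time-outs and manual intervention are deliberately left out of the sketch.

```python
from concurrent.futures import ThreadPoolExecutor

def run_parallel(bookings):
    """bookings maps name -> (action, compensate). Returns (ok, compensated names)."""
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=len(bookings)) as pool:
        futures = {name: pool.submit(action) for name, (action, _) in bookings.items()}
        for name, future in futures.items():
            try:
                future.result()  # blocks until this booking has settled
                succeeded.append(name)
            except Exception:
                failed.append(name)
    if not failed:
        return True, []
    for name in succeeded:
        bookings[name][1]()  # run the compensation for each completed booking
    return False, succeeded

cancelled = []

def fail_hotel():
    raise RuntimeError("no rooms left")

bookings = {
    "flight": (lambda: None, lambda: cancelled.append("flight")),
    "car": (lambda: None, lambda: cancelled.append("car")),
    "hotel": (fail_hotel, lambda: cancelled.append("hotel")),
}
ok, compensated = run_parallel(bookings)
```

Because the sketch waits for all bookings before deciding, it avoids the harder situation described above, where one booking has failed while another is still in flight.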
ACID and Saga
Saga does not provide full ACID guarantees because atomicity and isolation are not satisfied. However, it can guarantee consistency and durability through the saga log.
Saga Architecture
The architecture consists of a Saga Execution Component that parses the JSON request and builds the execution graph, a TaskRunner that ensures ordered execution, and a TaskConsumer that writes events to the saga log and invokes remote services.
Two‑Phase Commit (2PC)
2PC is a blocking protocol that coordinates distributed atomic transactions in two phases: a voting phase where participants answer yes/no, and a decision phase where the coordinator sends commit or abort. It suffers from blocking, high latency, and poor scalability.
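The two phases can be sketched with an in-memory coordinator. Participant and two_phase_commit are hypothetical names; the sketch omits the durable logging, locking, and failure handling that a real 2PC implementation needs, which is where the blocking behaviour comes from.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Voting phase: answer yes (True) or no (False).
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def two_phase_commit(participants):
    """Return True if all participants voted yes and were told to commit."""
    votes = [p.prepare() for p in participants]  # phase 1: voting
    if all(votes):                               # phase 2: decision
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False

services = [Participant("flight"), Participant("car"), Participant("hotel", can_commit=False)]
committed = two_phase_commit(services)
states = [p.state for p in services]
```

A single no vote from the hotel participant causes every participant to abort.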
Try‑Confirm/Cancel (TCC)
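TCC is a compensating transaction model with a two-phase business flow. In the Try phase each service moves into a pending state. If every Try succeeds, the Confirm phase finalizes the operations; if any Try fails or times out, the Cancel phase releases the pending reservations. Because services are left pending rather than final after Try, cancellation is easy to design, but every service must support the pending state, and the two phases add latency.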
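A small sketch of the TCC flow, with state held in memory. Reservation, try_reserve, and run_tcc are hypothetical names for illustration, and the sketch assumes that confirm and cancel always succeed.

```python
class Reservation:
    def __init__(self, name, available=True):
        self.name = name
        self.available = available
        self.state = "none"

    def try_reserve(self):
        # Try phase: hold the resource in a pending state.
        if not self.available:
            raise RuntimeError(self.name + " unavailable")
        self.state = "pending"

    def confirm(self):
        self.state = "confirmed"

    def cancel(self):
        self.state = "cancelled"

def run_tcc(services):
    """Try every service; confirm all if every Try succeeded, else cancel the pending ones."""
    tried = []
    for service in services:
        try:
            service.try_reserve()
        except RuntimeError:
            for pending in tried:
                pending.cancel()
            return False
        tried.append(service)
    for service in services:
        service.confirm()
    return True

services = [Reservation("flight"), Reservation("car"), Reservation("hotel", available=False)]
confirmed = run_tcc(services)
states = [s.state for s in services]
```

The flight and car reservations never reach a final state here; they move from pending straight to cancelled, which is why cancellation is simpler than undoing a completed booking.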
Event‑Driven Architecture
Similar to TCC, each service records a pending state and emits events to the next service. If a service crashes after persisting its record but before emitting the event, an auxiliary event table is needed to track progress.
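The auxiliary event table can be sketched with sqlite3: the business record and its event are written in one local transaction, and a separate step publishes any events not yet marked as sent. Table names, column names, and both functions are assumptions for illustration. Note that this gives at-least-once delivery, since a crash between publishing and marking the row would cause the event to be sent again.

```python
import sqlite3

def record_booking(conn, booking_id):
    """Write the pending booking and its event atomically in one local transaction."""
    with conn:
        conn.execute("INSERT INTO bookings (id, state) VALUES (?, 'pending')", (booking_id,))
        conn.execute(
            "INSERT INTO events (booking_id, type, published) VALUES (?, 'BookingPending', 0)",
            (booking_id,),
        )

def publish_pending(conn, publish):
    """Send every unpublished event, marking each one after it is sent."""
    rows = conn.execute(
        "SELECT rowid, booking_id, type FROM events WHERE published = 0 ORDER BY rowid"
    ).fetchall()
    for rowid, booking_id, event_type in rows:
        publish({"booking_id": booking_id, "type": event_type})
        with conn:
            conn.execute("UPDATE events SET published = 1 WHERE rowid = ?", (rowid,))
    return len(rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings (id TEXT, state TEXT)")
conn.execute("CREATE TABLE events (booking_id TEXT, type TEXT, published INTEGER)")

record_booking(conn, "b-1")
# Simulated crash here: the record exists but no event was emitted yet.

sent = []
first_run = publish_pending(conn, sent.append)   # after restart, the event is found and sent
second_run = publish_pending(conn, sent.append)  # nothing left to send
```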
Centralized vs Decentralized Implementations
Centralized sagas use a dedicated coordinator, simplifying consistency logic, while decentralized sagas let each service act as the coordinator, increasing autonomy but requiring each service to implement the consistency protocol.
Summary
The article compares saga with other consistency solutions, showing that saga scales better than 2PC and requires less business‑logic change than TCC while offering higher performance. A centralized saga design decouples services from consistency concerns and simplifies troubleshooting.