Understanding the Saga Pattern for Distributed Data Consistency in Microservices
This article explains why data consistency is critical in microservice architectures, introduces the Saga pattern and its execution and recovery mechanisms, compares it with two‑phase commit and TCC, and presents a centralized Saga design using ServiceComb for reliable distributed transactions.
Data Consistency in Monolithic Applications
In a monolithic travel‑booking system, a single database transaction can guarantee that flight, car‑rental, and hotel reservations either all succeed or all roll back, ensuring a successful itinerary for the customer.
Data Consistency in Microservice Scenarios
When the same business is split into independent services, each with its own database, a single transaction can no longer guarantee consistency, and coordinating the four services (flight, car, hotel, payment) becomes a complex challenge.
Sagas
A Saga is a long‑running transaction composed of a series of compensatable sub‑transactions; each sub‑transaction is a regular database transaction, and if any step fails, previously completed steps are compensated.
"Saga is a long‑lived transaction that can be broken into interleavable sub‑transactions, each of which maintains database consistency."
In the travel‑planning example, the overall transaction consists of four sub‑transactions: flight booking, car rental, hotel reservation, and payment.
Saga Execution Principle
Sub‑transactions are linked; if a sub‑transaction fails, a corresponding compensating transaction must be executed to undo the work of completed steps.
The system defines each sub‑transaction and its compensation, guaranteeing either full completion or a compensated rollback.
Saga Recovery Methods
The original paper describes two recovery strategies: backward recovery (compensate all completed steps when a failure occurs) and forward recovery (retry the failed step assuming eventual success).
Conditions for Using Saga
Saga allows only two nesting levels (top‑level Saga and simple sub‑transactions).
Full atomicity cannot be achieved; sagas may see partial results of other sagas.
Each sub‑transaction must be an independent atomic operation.
In the travel scenario, flight, car, hotel, and payment are naturally independent and can each be made atomic within its own service.
Compensations may not always restore the exact prior state, but they can semantically undo the effect (e.g., sending a follow‑up email to cancel a previous one).
Saga Log
To recover from crashes, the Saga system persists events such as SagaStarted, TransactionStarted, TransactionEnded, TransactionAborted, TransactionCompensated, and SagaEnded, allowing the Saga to resume from any intermediate state.
Saga Request Data Structure
The travel‑planning request is modeled as a directed acyclic graph where the root node starts the Saga and leaf nodes mark its completion, enabling parallel execution of independent services while preserving the overall order.
Parallel Saga
Parallel execution of flight, car, and hotel bookings can lead to situations where one service fails while others are still processing; the design must ensure that compensations are idempotent and that duplicate requests after compensation are ignored.
ACID and Saga
Saga does not provide full ACID guarantees; atomicity and isolation are relaxed, but consistency and durability are achieved through the compensating logic and persistent saga log.
Saga Architecture
The centralized architecture includes a Saga Execution Component that parses JSON requests into a graph, a TaskRunner that enforces execution order, and a TaskConsumer that writes events to the saga log and invokes remote services.
Two‑Phase Commit (2PC)
Two‑Phase Commit is a distributed algorithm that coordinates participants to either all commit or all abort a transaction.
2PC involves a voting phase and a decision phase, but it blocks resources, incurs high latency, and does not scale well as the number of participants grows.
Try‑Confirm/Cancel (TCC)
TCC splits a transaction into a try phase (reserve resources), a confirm phase (finalize), and a cancel phase (release resources). It offers clearer compensation but requires additional service states and double the communication overhead compared to Saga.
Event‑Driven Architecture
Similar to TCC, each service records a pending state and emits events to the next service; the final service notifies the previous one upon completion, allowing the whole saga to be reconstructed from persisted events.
Centralized vs Decentralized Implementations
Centralized sagas use a dedicated coordinator, simplifying service logic, while decentralized sagas let each service act as the coordinator for the next step, offering autonomy at the cost of added protocol complexity.
Summary
The article compares Saga with 2PC and TCC, showing that Saga scales better than 2PC and requires less business‑logic change than TCC; a centralized Saga design decouples consistency concerns from services, simplifies troubleshooting, and improves performance for compensable transactions.
Architect
Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.