Mastering Distributed Transactions: Theory, Patterns, and Practical Solutions

This article explains the fundamentals of distributed transactions, compares classic solutions such as XA, Saga, TCC, local message tables, and transaction messages, and presents a sub‑transaction barrier technique that gracefully handles idempotency, empty rollbacks, and hanging issues in microservice architectures.

ITFLY8 Architecture Home

Fundamental Theory

Before diving into specific solutions, we first review the basic concepts involved in distributed transactions.

Consider a money transfer: user A transfers 100 units to user B, which requires a debit of 100 from A and a credit of 100 to B. Both operations must succeed together or fail together.

Transaction

A transaction groups multiple statements into a single atomic unit, ensuring that all operations either fully succeed or fully fail.

Transactions have four properties—Atomicity, Consistency, Isolation, Durability—collectively known as ACID.

Atomicity : All operations in a transaction are completed or none are; on error the system rolls back to the state before the transaction.

Consistency : The database integrity (foreign keys, constraints) is preserved before and after the transaction.

Isolation : Concurrent transactions do not interfere with each other, preventing inconsistent data.

Durability : Once a transaction commits, its changes survive system failures.

If the business logic is simple and resides in a single database/service, a local database transaction can guarantee correct transfer.
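As a rough illustration of the all-or-nothing behavior a local transaction provides, the sketch below uses an in-memory map instead of a real database. The `Ledger` type and its `Transfer` method are hypothetical names introduced here; the copy-then-publish step only mimics commit and rollback and is not how a database engine implements them.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrInsufficientFunds is returned when the debit would overdraw the account.
var ErrInsufficientFunds = errors.New("insufficient funds")

// Ledger is a hypothetical in-memory stand-in for a single database.
type Ledger struct {
	balances map[string]int
}

// Transfer applies the debit and the credit to a private copy of the
// balances and publishes the copy only if both steps succeed, which
// mimics commit/rollback of a local transaction.
func (l *Ledger) Transfer(from, to string, amount int) error {
	work := make(map[string]int, len(l.balances))
	for k, v := range l.balances {
		work[k] = v
	}
	if work[from] < amount {
		return ErrInsufficientFunds // "rollback": the copy is discarded
	}
	work[from] -= amount
	if _, ok := work[to]; !ok {
		return fmt.Errorf("unknown account %q", to) // the debit is discarded too
	}
	work[to] += amount
	l.balances = work // "commit": both changes become visible together
	return nil
}

func main() {
	l := &Ledger{balances: map[string]int{"A": 150, "B": 0}}
	fmt.Println(l.Transfer("A", "B", 100), l.balances["A"], l.balances["B"])
	fmt.Println(l.Transfer("A", "C", 10), l.balances["A"])
}
```

The second call fails after the debit has been applied to the working copy, yet A's visible balance is unchanged, which is the atomicity property described above.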

Distributed Transaction

Cross‑bank transfers illustrate a typical distributed transaction scenario: the debit and credit happen on different banks, so a single‑node ACID transaction cannot guarantee correctness; a distributed transaction is required.

A distributed transaction involves a coordinator and multiple resource managers located on different nodes. It aims to ensure correct data operations across distributed systems.

To meet availability and performance needs, distributed transactions often relax strict ACID requirements and follow the BASE principles (Basically Available, Soft state, Eventual consistency). They still retain some ACID properties:

Atomicity: strictly enforced.

Consistency: strict after commit; may be relaxed during the transaction.

Isolation: parallel transactions should not affect each other; intermediate results may be visible under controlled conditions.

Durability: strictly enforced.

Distributed Transaction Solutions

No single solution can satisfy all business requirements; the choice depends on the specific characteristics of the workload.

Two‑Phase Commit (XA)

XA, defined by the X/Open group, specifies the interface between a global transaction manager (TM) and local resource managers (RM). Databases such as MySQL, Oracle, SQL Server, and PostgreSQL act as RMs.

XA consists of two phases:

1. Prepare : All participants lock required resources and report readiness to the TM.

2. Commit/Rollback : After the TM confirms all participants are ready, it sends a commit command; otherwise a rollback is issued.

XA involves a TM, one or more RMs, and the application program.

These three roles (RM, TM, AP) are the classic division that recurs in Saga, TCC, and other patterns.

A successful XA transaction for the transfer example follows the sequence diagram below:

XA transaction diagram

If any participant fails during the prepare phase, the TM instructs all prepared participants to roll back.
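The two phases can be sketched with a toy coordinator. Everything here is an assumption made for illustration: the `Participant` interface, the `account` type, and `RunTwoPhase` are hypothetical names, and real XA resource managers are driven through database commands rather than Go method calls.

```go
package main

import (
	"errors"
	"fmt"
)

// Participant is a hypothetical resource manager (RM) interface.
type Participant interface {
	Prepare() error
	Commit()
	Rollback()
}

// account is a toy RM: Prepare validates and locks, Commit applies the change.
type account struct {
	name    string
	balance int
	delta   int // pending change for this global transaction
	locked  bool
}

func (a *account) Prepare() error {
	if a.balance+a.delta < 0 {
		return errors.New(a.name + ": insufficient funds")
	}
	a.locked = true // resources stay locked until phase two
	return nil
}

func (a *account) Commit()   { a.balance += a.delta; a.locked = false }
func (a *account) Rollback() { a.locked = false }

// RunTwoPhase plays the TM role: prepare everyone, then commit everyone,
// or roll back the participants that had already prepared.
func RunTwoPhase(ps ...Participant) error {
	prepared := make([]Participant, 0, len(ps))
	for _, p := range ps {
		if err := p.Prepare(); err != nil {
			for _, q := range prepared {
				q.Rollback()
			}
			return err
		}
		prepared = append(prepared, p)
	}
	for _, p := range ps {
		p.Commit()
	}
	return nil
}

func main() {
	a := &account{name: "A", balance: 100, delta: -100}
	b := &account{name: "B", balance: 0, delta: +100}
	fmt.Println(RunTwoPhase(a, b), a.balance, b.balance)
}
```

The `locked` flag stays set between the two phases, which is the source of the long lock duration listed among XA's characteristics.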

Characteristics of XA:

Simple and easy to understand; development is straightforward.

Resources remain locked for a long time, resulting in low concurrency.

Further reading on XA implementations is available for Go, PHP, Python, Java, C#, and Node.js via the DTM project.

Saga

Saga, introduced by Hector Garcia‑Molina and Kenneth Salem in their 1987 paper “Sagas”, splits a long transaction into multiple short local transactions coordinated by a Saga orchestrator. If any step fails, compensating actions for the completed steps are executed in reverse order.

The transfer example’s successful Saga flow is illustrated below:

Saga transaction diagram

Compensating operations must eventually succeed: if a network fault prevents a compensation from returning success, the TM retries it until it does.

Characteristics of Saga:

High concurrency; no long‑term resource locks.

Requires definition of both normal and compensating operations, increasing development effort compared to XA.

Weaker consistency; for example, A’s account may be debited while the overall transfer ultimately fails.

For deeper exploration, refer to the DTM project, which includes examples of successful and failed Saga executions and handling of various network exceptions.

TCC (Try‑Confirm‑Cancel)

The TCC pattern, first described by Pat Helland in 2007, divides a transaction into three phases:

Try : Perform all business checks and reserve necessary resources (soft isolation).

Confirm : Execute the actual business logic using the reserved resources; must be idempotent and retryable on failure.

Cancel : Release the reserved resources; also required to be idempotent.

In the transfer example, the Try phase freezes the amount, Confirm actually deducts it, and Cancel releases the freeze.

TCC transaction diagram

If Confirm or Cancel encounters a network fault and cannot return success, the TM continuously retries until the operation succeeds.
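The freeze, deduct, and release steps of the transfer example might look like the sketch below. The `Account` type, its methods, and `Transfer` are hypothetical; the sketch runs in one process and leaves out the retries and idempotency checks that a real TCC participant needs.

```go
package main

import (
	"errors"
	"fmt"
)

// Account tracks the real balance and the amount frozen by pending Try calls.
type Account struct {
	Balance int
	Frozen  int
	Closed  bool
}

// TryDebit checks the available balance and freezes the amount.
func (a *Account) TryDebit(amount int) error {
	if a.Balance-a.Frozen < amount {
		return errors.New("insufficient available balance")
	}
	a.Frozen += amount
	return nil
}

// ConfirmDebit turns the frozen amount into a real deduction.
func (a *Account) ConfirmDebit(amount int) { a.Frozen -= amount; a.Balance -= amount }

// CancelDebit releases the frozen amount without touching the balance.
func (a *Account) CancelDebit(amount int) { a.Frozen -= amount }

// TryCredit only performs the business check on the receiving side.
func (a *Account) TryCredit() error {
	if a.Closed {
		return errors.New("receiving account is closed")
	}
	return nil
}

// ConfirmCredit adds the amount to the receiver.
func (a *Account) ConfirmCredit(amount int) { a.Balance += amount }

// Transfer plays the TM role: Try on every branch, then Confirm all or Cancel.
func Transfer(from, to *Account, amount int) error {
	if err := from.TryDebit(amount); err != nil {
		return err // nothing was reserved, so nothing to cancel
	}
	if err := to.TryCredit(); err != nil {
		from.CancelDebit(amount)
		return err
	}
	from.ConfirmDebit(amount)
	to.ConfirmCredit(amount)
	return nil
}

func main() {
	a, b := &Account{Balance: 100}, &Account{}
	fmt.Println(Transfer(a, b, 100), a.Balance, b.Balance)
}
```

Because Try only freezes funds, a failed transfer never changes `Balance`; only the `Frozen` field moves and is then released.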

Characteristics of TCC:

Higher concurrency; no long‑term locks.

More development effort due to the need for three distinct interfaces.

Better consistency than Saga: because Try only freezes the funds, the case where an account is debited and the transfer then fails does not occur.

Suited for order‑type business where intermediate states must be constrained.

Further study can be done via DTM.

Local Message Table

The local message table pattern, originally presented by eBay architect Dan Pritchett in 2008, uses asynchronous messages to ensure distributed tasks are eventually executed.

Workflow:

Local message table flow

The business operation and message write occur within a single database transaction, guaranteeing atomicity.

Fault tolerance:

If the balance‑deduction transaction fails, it rolls back entirely.

If message production or the balance‑addition transaction fails, the failed step is retried.

Characteristics:

Long transactions are split into multiple tasks; implementation is simple.

Requires an extra message table on the producer side.

Each message table needs polling.

If the consumer's logic cannot succeed even after retries, additional mechanisms are needed to roll back the earlier steps.

Suitable for asynchronous business where subsequent steps do not need rollback.

Transaction Message

RocketMQ (from version 4.3) supports transaction messages, effectively moving the local message table to the message broker, simplifying the producer side.

Transaction message flow:

Send a half‑message.

Broker stores the message and returns the write result.

Based on the write result, execute the local transaction (if write fails, the half‑message is invisible and the local logic is skipped).

Commit or rollback the message depending on the local transaction outcome.

Transaction message flow

Compensation process: for pending messages, the broker initiates a “check‑back” request; the producer replies with Commit or Rollback based on the local transaction status.
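The half-message lifecycle and the check-back can be modeled with a toy broker. The `Broker` type below is a hypothetical in-memory model written for this article; it is not the RocketMQ client API, and its method names are assumptions.

```go
package main

import "fmt"

// State is the lifecycle state of a transaction message.
type State int

const (
	Half       State = iota // stored on the broker but invisible to consumers
	Committed               // visible to consumers
	RolledBack              // discarded
)

// Broker is a hypothetical in-memory broker.
type Broker struct {
	msgs map[string]State
}

func NewBroker() *Broker { return &Broker{msgs: map[string]State{}} }

// SendHalf stores the half-message.
func (b *Broker) SendHalf(id string) { b.msgs[id] = Half }

// End resolves a half-message according to the local transaction outcome.
func (b *Broker) End(id string, commit bool) {
	if commit {
		b.msgs[id] = Committed
	} else {
		b.msgs[id] = RolledBack
	}
}

// Visible reports whether consumers can see the message.
func (b *Broker) Visible(id string) bool { return b.msgs[id] == Committed }

// CheckBack asks the producer about every message still in the Half state.
func (b *Broker) CheckBack(localTxCommitted func(id string) bool) {
	for id, st := range b.msgs {
		if st == Half {
			b.End(id, localTxCommitted(id))
		}
	}
}

func main() {
	b := NewBroker()
	localTx := map[string]bool{} // producer's record of committed local transactions

	b.SendHalf("transfer-1")
	localTx["transfer-1"] = true // local debit committed, but End is never sent
	fmt.Println(b.Visible("transfer-1"))

	b.CheckBack(func(id string) bool { return localTx[id] })
	fmt.Println(b.Visible("transfer-1"))
}
```

The example models a producer that crashes after its local transaction commits but before it resolves the message; the check-back is what eventually makes the message visible.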

Characteristics:

Long transactions are split into tasks with a simple check‑back interface.

Consumer retry failures still require additional rollback mechanisms.

Applicable to asynchronous business where later steps do not need rollback.

Best‑Effort Notification

This approach tries its best to notify the receiver of the business result, but the receiver must also be able to query the result if the notification fails.

Key points:

Provide an API for the receiver to query the processing result.

Message‑queue ACK mechanism: the notification is re‑sent at progressively longer intervals (1 min, 5 min, 10 min, 30 min, 1 h, 2 h, 5 h, 10 h) until the receiver acknowledges it or the configured notification window expires.

Used in scenarios such as WeChat payment results, where both callback notifications and query APIs are employed.

AT Transaction Model

AT is a transaction mode of Seata, Alibaba's open‑source project. It works like XA but handles rollbacks automatically, so no explicit compensation code is required. It offers better performance than XA, but it still holds locks for a long time and can encounter dirty rollbacks.

Exception Handling

Network and business failures can occur at any stage of a distributed transaction; the system must ensure idempotency, empty‑rollback handling, and protection against hanging.

Exception Scenarios (illustrated with TCC)

Empty Rollback: Cancel is called without a preceding Try; the Cancel logic must detect this and return success.

Idempotency: All branches must be idempotent because duplicate requests can arise from network issues.

Hanging: Cancel executes before Try due to network delays; the system must prevent the Try from being applied after Cancel.

Network exception diagram

Request 4: Cancel before Try → handle empty rollback.

Request 6: Cancel repeats → ensure idempotency.

Request 8: Try after Cancel → handle hanging.

The usual recommendation is for each service to use a unique key to check whether an operation has already completed, which adds complexity and burden to the business logic.

Sub‑Transaction Barrier

The DTM project introduces a sub‑transaction barrier that filters abnormal requests and allows only valid ones to pass, automatically handling empty rollbacks, idempotency, and hanging.

Sub‑transaction barrier diagram

The core API is:

func ThroughBarrierCall(db *sql.DB, transInfo *TransInfo, busiCall BusiFunc)

Developers implement their business logic in busiCall. ThroughBarrierCall guarantees that in empty‑rollback or hanging scenarios the business call is not executed, and that repeated calls are idempotent.

The barrier also works for Saga and transaction messages, and can be extended to other frameworks.

Barrier Implementation Details

A local table sub_trans_barrier stores a unique key composed of global‑transaction‑id, branch‑id, and branch‑type (try|confirm|cancel).

Start a DB transaction.

For a Try branch, attempt to INSERT IGNORE the unique key; if successful, execute the protected logic.

For a Confirm branch, INSERT IGNORE the confirm key; if successful, execute the protected logic.

For a Cancel branch, INSERT IGNORE the try key first, then the cancel key; if the try insert is ignored (the key already exists, so Try has run) and the cancel insert succeeds, execute the protected logic.

Commit or rollback the DB transaction based on the protected logic’s result.

This mechanism solves the three major network‑exception problems:

Empty compensation control : If Try never ran, Cancel's insert of the try key succeeds, which signals an empty rollback, so the business logic is skipped.

Idempotency control : The unique key prevents duplicate execution of any branch.

Hanging protection : If Cancel runs before Try, Cancel has already inserted the try key, so the later Try insert is ignored and the business logic is skipped.

The same principle applies to Saga and transaction‑message patterns.
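The insert rules can be checked with a small in-memory model. This is a sketch of the algorithm as described in this section, not DTM's implementation: the `Barrier` type and its methods are hypothetical, a map stands in for the sub_trans_barrier table, and the enclosing DB transaction (which would undo the inserted keys if the business logic fails) is omitted.

```go
package main

import "fmt"

// Barrier is an in-memory stand-in for the sub_trans_barrier table.
type Barrier struct {
	keys map[string]bool
}

func NewBarrier() *Barrier { return &Barrier{keys: map[string]bool{}} }

// insertIgnore mimics INSERT IGNORE on the unique key and reports whether
// a new row was actually written.
func (b *Barrier) insertIgnore(gid, branch, op string) bool {
	k := gid + "|" + branch + "|" + op
	if b.keys[k] {
		return false
	}
	b.keys[k] = true
	return true
}

// Call runs busi only if the request passes the barrier and reports
// whether busi was run.
func (b *Barrier) Call(gid, branch, op string, busi func()) bool {
	pass := false
	switch op {
	case "try", "confirm":
		pass = b.insertIgnore(gid, branch, op)
	case "cancel":
		// Insert the try key first: if that succeeds, Try never ran.
		tryInserted := b.insertIgnore(gid, branch, "try")
		cancelInserted := b.insertIgnore(gid, branch, "cancel")
		pass = !tryInserted && cancelInserted
	}
	if pass {
		busi()
	}
	return pass
}

func main() {
	b := NewBarrier()
	noop := func() {}
	fmt.Println(b.Call("g1", "b1", "cancel", noop)) // empty rollback: false
	fmt.Println(b.Call("g1", "b1", "try", noop))    // hanging Try: false
	fmt.Println(b.Call("g2", "b1", "try", noop))    // normal Try: true
	fmt.Println(b.Call("g2", "b1", "try", noop))    // duplicate Try: false
	fmt.Println(b.Call("g2", "b1", "cancel", noop)) // normal Cancel: true
	fmt.Println(b.Call("g2", "b1", "cancel", noop)) // duplicate Cancel: false
}
```

Transaction `g1` covers the empty-rollback and hanging cases, and `g2` covers the normal path plus duplicate requests.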

Sub‑Transaction Barrier Summary

The barrier, first introduced by the DTM project, offers a simple algorithm and easy‑to‑use interface that frees developers from handling complex network‑exception logic.

It currently works with the DTM transaction manager and provides SDKs for Go and Python; other language SDKs are planned.

Conclusion

This article introduced the fundamentals of distributed transactions, explained common solutions such as XA, Saga, TCC, local message tables, and transaction messages, and provided an elegant approach to handling transaction anomalies through a sub‑transaction barrier.
