Ensuring Distributed Eventual Consistency: Heavyweight and Lightweight Approaches, Principles, and Practices
This article explains the challenges of distributed eventual consistency, compares heavyweight transaction frameworks with lightweight techniques such as idempotency, retries, state machines, recovery logs, and asynchronous verification, and outlines the CAP theorem, the BASE principle, and practical implementation steps for backend systems.
Introduction
Distributed consistency issues appear everywhere; even a single machine can be seen as a tiny distributed system that relies on the operating system and database to guarantee consistency. In large-scale microservice architectures, single-database ACID transactions are insufficient, and cross-service calls must ensure distributed eventual consistency on their own.
Current mainstream distributed‑transaction frameworks fall into three categories:
Two‑phase commit (2PC) based on the XA protocol
TCC (Try‑Confirm‑Cancel) originally proposed by Alipay
Message‑queue asynchronous assurance originally proposed by eBay
In addition, lighter solutions such as idempotency/retry, state machines, recovery logs, and asynchronous verification can be chosen according to business needs.
Heavy Weapons
Using a distributed-transaction framework makes eventual consistency transparent to developers; the framework handles the complexity as a cross-cutting concern. However, 2PC/TCC/message-queue solutions have drawbacks: protocol fragility, high latency caused by multiple round-trips, and the need for asynchronous verification to cover edge cases. Nevertheless, they greatly reduce development effort and are well suited to domains like banking that demand strict consistency and high code quality.
Light Weapons
Programmers can also achieve eventual consistency through idempotency/retry, state machines, recovery logs, and asynchronous verification. These approaches are platform-agnostic and lightweight, but they require developers to implement the consistency logic themselves, which can introduce bugs.
The major distributed‑transaction frameworks themselves rely on these basic mechanisms, so the article focuses on the latter set of techniques.
Principles
1. CAP Theorem
The three properties are Consistency, Availability, and Partition tolerance; a distributed system can guarantee at most two of them simultaneously.
Consistency: every read receives the latest write or an error
Availability: every request receives a non‑error response
Partition tolerance: the system continues operating despite lost or delayed messages
Because network partitions cannot be avoided in any system that spans multiple machines, partition tolerance is effectively mandatory, and most systems choose between Consistency and Availability.
2. BASE Principle
Basically Available
Soft state
Eventual consistency
BASE relaxes strong consistency in favor of AP, allowing applications to achieve eventual consistency through appropriate techniques; NoSQL databases like Cassandra follow BASE, while some systems (e.g., HBase) opt for CP.
Practices
1. Retry
Retry mechanisms automatically recover inconsistent data, provided the retried operation is idempotent. Two styles are common:
Sync retry: a failed request is re‑issued synchronously, which can waste threads and amplify traffic.
Async retry: failures are re‑processed via a message queue or scheduler, often with exponential back‑off and alerting after a threshold.
Retry also improves availability; for example, a service with 98% availability reaches 99.96% after one retry and 99.9992% after two retries.
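The async-retry pattern and the availability arithmetic above can be sketched in Python. This is a minimal, illustrative sketch, not a production implementation: the function names (`retry_with_backoff`, `availability_after_retries`) and the raise-after-threshold alerting are assumptions of this example.

```python
import time

def retry_with_backoff(operation, max_attempts=3, base_delay=0.1):
    """Retry an idempotent operation with exponential back-off.

    After the attempt threshold is exceeded, the failure is re-raised
    so an outer layer can alert or park the item for manual handling.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # threshold reached: surface for alerting
            time.sleep(base_delay * (2 ** attempt))

def availability_after_retries(p, retries):
    """Availability math from the text: the call fails overall only
    if every attempt fails, i.e. with probability (1 - p)^(retries+1)."""
    return 1 - (1 - p) ** (retries + 1)
```

With `p = 0.98`, one retry gives 1 - 0.02² = 99.96% and two retries give 1 - 0.02³ = 99.9992%, matching the figures above.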
2. Idempotency
Mathematically, idempotency means:
f(f(x)) = f(x)
In practice, executing the same operation multiple times yields the same effect as executing it once. Techniques include:
Pessimistic lock via SELECT ... FOR UPDATE (must run inside an explicit transaction, i.e., with AUTOCOMMIT=0; not suitable for high concurrency).
Optimistic lock with version numbers, e.g., UPDATE stocktable SET stock = stock - 1, version = version + 1 WHERE product_id = 123 AND version = 1.
Unique‑index deduplication tables.
Global token checks using Redis atomic increments or MySQL unique indexes.
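The global-token technique above can be sketched as follows. This is a hedged, in-memory illustration: the set `processed_tokens` stands in for a Redis SETNX call or a MySQL unique index, and the names `process_once`/`processed_tokens` are inventions of this example.

```python
# In-memory stand-in for Redis SETNX or a MySQL unique index.
processed_tokens = set()

def process_once(token, operation):
    """Execute `operation` at most once per token.

    A repeated call with the same token is detected and skipped,
    so retries cannot apply the side effect twice.
    """
    if token in processed_tokens:
        return "duplicate"          # already applied: skip side effects
    processed_tokens.add(token)     # in Redis: SET token 1 NX
    operation()
    return "applied"
```

In a real system the dedup record and the business write must be made atomic (same transaction, or a conditional write), otherwise a crash between the two re-opens the duplication window.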
3. State Machine
A state machine models entity state transitions. Example: the lifecycle of a WeChat red‑packet.
INIT – creation after user submits amount and count.
SPLITTED – system splits amount into n parts.
PAID – asynchronous payment success.
NOTIFIED – notification sent to group.
RUNOUT – all packets claimed.
REFUND – timeout refund if not claimed.
Design state machines with minimal branching to simplify rollback and correction.
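The red-packet lifecycle above can be sketched as a transition table plus a guard. The transition table is an assumption reconstructed from the lifecycle description (e.g., that RUNOUT and REFUND both follow NOTIFIED); a real design would derive it from the business rules.

```python
# Assumed transition table for the red-packet lifecycle.
TRANSITIONS = {
    "INIT": {"SPLITTED"},
    "SPLITTED": {"PAID"},
    "PAID": {"NOTIFIED"},
    "NOTIFIED": {"RUNOUT", "REFUND"},
}

class RedPacket:
    def __init__(self):
        self.state = "INIT"

    def transition(self, target):
        """Advance the state, rejecting any transition not in the table."""
        if target not in TRANSITIONS.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state} -> {target}")
        self.state = target
```

Rejecting illegal transitions at a single choke point is what keeps rollback and correction simple: an inconsistent entity can only ever be in one of a few well-defined states.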
4. Recovery Log
Recovery logs record the full lifecycle of a request using a globally unique requestId:
REQUEST START requestId
For each modification, record (requestId, x, originalValue, destValue).
REQUEST END requestId
These logs enable redo/undo operations to bring the system back to a consistent state after failures.
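A minimal sketch of the log format and an undo pass, assuming an in-memory log and key-value store (the class and function names here are illustrative, not from the original):

```python
class RecoveryLog:
    """Records the lifecycle of a request: START, modifications, END."""
    def __init__(self):
        self.entries = []

    def start(self, request_id):
        self.entries.append(("START", request_id))

    def record(self, request_id, key, original, dest):
        # One entry per modification: (requestId, x, originalValue, destValue)
        self.entries.append(("MOD", request_id, key, original, dest))

    def end(self, request_id):
        self.entries.append(("END", request_id))

def undo(log, request_id, store):
    """Roll back every modification of an unfinished request, newest first,
    restoring each key to its recorded original value."""
    for entry in reversed(log.entries):
        if entry[0] == "MOD" and entry[1] == request_id:
            _, _, key, original, _ = entry
            store[key] = original
```

A request whose START has no matching END after a crash is a candidate for undo (or redo, if the destination values are replayed instead).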
5. Asynchronous Verification
Outside of consensus protocols such as Paxos, most business scenarios rely on external constraints plus asynchronous verification to detect anomalies and trigger recovery via scripts or manual correction. Verification can be:
Entity‑level: send business IDs to a verification service through a message queue.
Time‑window batch: schedule periodic checks based on timestamps.
If a reliable message queue or scheduler is unavailable, a local verification flag can be used, with the verification system updating the flag and retrying failed items.
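A time-window batch check can be sketched as below. This is a hypothetical example: the record shapes, the `verify_window` name, and the idea of comparing order amounts against payment records are assumptions chosen for illustration.

```python
def verify_window(orders, payments, since):
    """Return IDs of orders modified within the window whose amount
    disagrees with the corresponding payment record."""
    anomalies = []
    for oid, order in orders.items():
        if order["updated_at"] < since:
            continue  # outside this verification window
        paid = payments.get(oid, {}).get("amount")
        if paid != order["amount"]:
            anomalies.append(oid)  # trigger recovery script or alert
    return anomalies
```

A scheduler would call this periodically with `since` set to the start of the previous window, feeding the returned IDs into a recovery or alerting pipeline.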
Conclusion
Distributed eventual consistency is a frequent challenge for backend developers. In practice, a combination of idempotency, retry, asynchronous verification, state machines, and recovery logs is often required to achieve reliable consistency across microservices.
Hujiang Technology
We focus on the real-world challenges developers face, delivering authentic, practical content and a direct platform for technical networking among developers.