Handling Distributed Transaction Failures in Microservices: Blocking Retry, Async Queue, TCC, and Local Message Table
This article examines common strategies for handling inter‑service call failures in microservice architectures, comparing blocking retries, asynchronous queues, TCC compensation transactions, local message tables, and MQ‑based transactions, and discusses their advantages, drawbacks, and practical implementation considerations.
Introduction
In the era of distributed systems and microservice architectures, inter‑service call failures have become the norm rather than the exception. Handling these failures while guaranteeing data consistency is an unavoidable problem in microservice design.
Different business scenarios require different solutions. Common approaches include:
Blocking retry;
Traditional 2PC/3PC transactions;
Using a queue for asynchronous processing;
TCC compensation transactions;
Local message table (asynchronous assurance);
MQ transactions.
Since 2PC/3PC is already covered abundantly elsewhere, this article focuses on the other methods.
Blocking Retry
Blocking retry is a common technique in microservice architectures.
Pseudo‑code example:
m := db.Insert(sql)
err := request("B-Service", m)

func request(url string, body interface{}) error {
    var err error
    for i := 0; i < 3; i++ {
        _, err = http.Post(url, body)
        if err == nil {
            break
        }
        log.Print(err)
    }
    return err
}

When the API call to service B fails, the request is retried up to three times; if all attempts fail, the last error is logged and returned to the caller.
This approach brings several problems:
Service B may process the request successfully, but due to a network timeout the caller treats it as a failure and retries, resulting in duplicate data.
If the call fails after the caller has already inserted a record into its own DB, that record becomes dirty data.
Retries increase latency for the upstream service and amplify pressure on the downstream service when load is high.
Solutions:
Make the B‑service API idempotent to solve the first problem.
Use background scheduled tasks to correct data, though this is not ideal.
Accept the added latency as a necessary trade‑off for higher consistency and availability.
Blocking retry is suitable only when the business is not sensitive to data consistency; otherwise additional mechanisms are required.
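The idempotency fix for the duplicate-data problem can be sketched as follows. This is a minimal in-memory illustration, not the article's original code: the downstream service remembers request IDs it has already handled, so a retried request is acknowledged without being applied twice.

```go
package main

import "fmt"

// Service is an illustrative downstream receiver that deduplicates
// requests by ID, making its handler idempotent under retries.
type Service struct {
	seen    map[string]bool // request IDs already processed
	applied int             // how many times the business effect ran
}

// Handle applies the business effect at most once per request ID.
func (s *Service) Handle(requestID string) {
	if s.seen[requestID] {
		return // duplicate retry: acknowledge, do nothing
	}
	s.seen[requestID] = true
	s.applied++ // the actual business effect
}

func main() {
	svc := &Service{seen: map[string]bool{}}
	// The caller times out and retries the same request three times.
	for i := 0; i < 3; i++ {
		svc.Handle("order-42")
	}
	fmt.Println(svc.applied) // the effect was applied exactly once
}
```

A real service would persist the seen-ID set (e.g., a unique key in the database) rather than hold it in memory.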
Asynchronous Queue
Introducing a queue is a common and effective evolution of the solution.
m := db.Insert(sql)
err := mq.Publish("B-Service-topic", m)

After writing data to the DB, the service publishes a message to an MQ, and an independent consumer processes the business logic. However, the MQ publish call itself can also fail (network issues, service crash), leading to the same problem as blocking retry: the DB write succeeds but the message is never published.
In long‑running distributed systems, such failures are inevitable, making consistency a core design challenge.
TCC Compensation Transaction
When transactional guarantees are required and decoupling is difficult, TCC (Try‑Confirm‑Cancel) compensation transactions are a good choice.
TCC splits each service call into three phases:
Try: check business resources and reserve them (e.g., inventory check and pre‑deduction).
Confirm: commit the reserved resources (e.g., finalize inventory deduction).
Cancel: release the reserved resources if Try fails.
All services must implement these three APIs. Example pseudo‑code for an e‑commerce scenario involving inventory, payment, and points services:
m := db.Insert(sql)
aResult, aErr := A.Try(m)
bResult, bErr := B.Try(m)
cResult, cErr := C.Try(m)
if aErr != nil || bErr != nil || cErr != nil {
    A.Cancel()
    B.Cancel()
    C.Cancel()
} else {
    A.Confirm()
    B.Confirm()
    C.Confirm()
}

TCC solves cross‑service data consistency, but it still faces issues such as empty releases, ordering problems, and failure of the Confirm/Cancel calls themselves.
Empty Release
If C.Try() truly fails, the subsequent C.Cancel() may attempt to release a resource that was never locked, leading to an “empty release”. The caller cannot simply skip Cancel on a failed Try, either: a network glitch can make the caller believe Try failed while the resource was actually locked, and skipping Cancel would then leave it locked permanently.
Ordering
Network latency may cause C.Cancel() to arrive before C.Try(), creating the same empty‑release issue. Services should reject Try calls after a successful Cancel, typically by using a unique transaction ID.
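The transaction-ID guard described above can be sketched as follows. This is an in-memory illustration under assumed names (Resource, Try, Cancel): a Cancel that arrives first records the transaction as cancelled, so a late Try for the same ID is rejected instead of locking a resource nothing will ever release.

```go
package main

import "fmt"

// state tracks what this service has seen for a given transaction ID.
type state int

const (
	none state = iota // zero value: no call seen yet
	tried
	cancelled
)

// Resource is an illustrative TCC participant with a per-transaction
// state table used to reject out-of-order calls.
type Resource struct {
	tx     map[string]state
	locked int // how many reservations are currently held
}

// Try reserves the resource unless a Cancel for this ID already arrived.
func (r *Resource) Try(txID string) bool {
	if r.tx[txID] == cancelled {
		return false // Cancel overtook Try: reject the late Try
	}
	r.tx[txID] = tried
	r.locked++
	return true
}

// Cancel releases only what Try actually reserved, and remembers the
// cancellation so a later Try for the same ID is rejected.
func (r *Resource) Cancel(txID string) {
	if r.tx[txID] == tried {
		r.locked--
	}
	r.tx[txID] = cancelled
}

func main() {
	r := &Resource{tx: map[string]state{}}
	r.Cancel("tx-1")          // Cancel arrives before Try on the network
	ok := r.Try("tx-1")       // the late Try must be rejected
	fmt.Println(ok, r.locked) // false 0: nothing is permanently locked
}
```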
Call Failure
Both Cancel and Confirm can fail (e.g., network errors), leaving resources locked. Common mitigation strategies are blocking retry or logging the failure and handling it asynchronously, though these also have failure points.
Local Message Table
The local message table, originally proposed by eBay, stores messages in the same database as business data, allowing the use of local transactions to guarantee atomicity.
Implementation steps:
Insert business data and a corresponding message record within the same transaction.
If subsequent operations succeed, delete the message; otherwise, keep it and let an asynchronous listener retry.
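The atomicity that makes this work can be sketched with an in-memory stand-in for the database (store and insertTx are illustrative; real code would use a single database transaction): the business row and the message row are written in one atomic step, so either both exist or neither does.

```go
package main

import "fmt"

// store simulates one database holding both business data and the
// local message table.
type store struct {
	orders   []string
	messages []string
}

// insertTx simulates "INSERT order + INSERT message" inside one local
// transaction: if the transaction fails, neither row is persisted.
func (s *store) insertTx(order, message string, fail bool) error {
	if fail {
		return fmt.Errorf("tx rolled back")
	}
	s.orders = append(s.orders, order)
	s.messages = append(s.messages, message)
	return nil
}

func main() {
	s := &store{}
	_ = s.insertTx("order-1", "notify B-Service", false)
	err := s.insertTx("order-2", "notify B-Service", true)
	// One committed transaction left exactly one order and one message;
	// the rolled-back one left nothing.
	fmt.Println(len(s.orders), len(s.messages), err != nil)
}
```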
Combined with MQ
Pseudo‑code example:
messageTx := tc.NewTransaction("order")
messageTxSql := messageTx.TryPlan("content")
m, err := db.InsertTx(sql, messageTxSql) // business row + message row in one local transaction
if err != nil {
    return err
}
aErr := mq.Publish("B-Service-topic", m)
if aErr != nil { // MQ publish failed
    messageTx.Confirm() // keep the message: mark it for asynchronous retry
} else {
    messageTx.Cancel() // publish succeeded: delete the message
}

// Asynchronous retry of confirmed messages
func OnMessage(task *Task) {
    err := mq.Publish("B-Service-topic", task.Value())
    if err == nil {
        messageTx.Cancel() // retry succeeded: delete the message
    }
}

The SQL for inserting into the local message table is:

insert into `tcc_async_task` (`uid`, `name`, `value`, `status`)
values (?, ?, ?, ?)

This approach avoids the drawbacks of blocking retry while keeping the implementation simple.
Message Expiration
Handlers for Try and Confirm should check how long a message has been pending; if it exceeds a threshold (e.g., one hour), alert mechanisms such as email or SMS should trigger manual intervention.
Independent Message Service
An independent message service extracts the local message table into a separate service. Before performing an operation, the caller first registers a message with this service; if the operation succeeds, the message is deleted, otherwise it remains for later retry. This adds a “prepare” state to the message lifecycle.
MQ Transaction
Some MQ implementations (e.g., RocketMQ) support transactions, which are essentially a concrete form of the independent message service.
All operations first send a message to the MQ; on success the message is Confirmed, on failure it is Cancelled. The “prepare” state still requires the consumer to confirm business success.
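The half-message flow can be sketched with an in-memory broker standing in for the real MQ (broker, PrepareSend, Commit, and Rollback are illustrative names, not RocketMQ's actual API): the prepared message stays invisible to consumers until the local transaction commits.

```go
package main

import "fmt"

// broker simulates an MQ that supports transactional (half) messages.
type broker struct {
	half      map[string]string // prepared messages, invisible to consumers
	delivered []string          // committed messages, visible to consumers
}

// PrepareSend stores a half message; consumers cannot see it yet.
func (b *broker) PrepareSend(id, body string) { b.half[id] = body }

// Commit makes the half message visible to consumers.
func (b *broker) Commit(id string) {
	b.delivered = append(b.delivered, b.half[id])
	delete(b.half, id)
}

// Rollback discards the half message entirely.
func (b *broker) Rollback(id string) { delete(b.half, id) }

func main() {
	b := &broker{half: map[string]string{}}
	b.PrepareSend("tx-1", "deduct inventory")
	// The local transaction succeeded, so the message is committed;
	// on failure we would call Rollback instead.
	b.Commit("tx-1")
	fmt.Println(len(b.half), b.delivered) // 0 [deduct inventory]
}
```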
Summary
In distributed systems, guaranteeing data consistency inevitably requires additional mechanisms.
TCC’s advantages are its business‑layer focus, database‑agnostic nature, and flexible resource‑locking granularity, making it suitable for microservices. Its drawbacks are the need for each service to implement three APIs and the complexity of handling various failure scenarios; mature frameworks (e.g., Alibaba’s Fescar) can reduce this cost.
The local message table is simple, does not depend on external services, and works well with service calls and MQ in most scenarios, though it couples the message table with business tables.
MQ transactions and independent message services decouple transaction handling into a dedicated service, avoiding per‑service message tables but suffer from limited MQ transaction support and added latency.
References:
TCC: https://www.sofastack.tech/blog/seata-tcc-theory-design-realization/
MQ Transaction: https://www.jianshu.com/p/eb571e4065ec