Why a Missing Commit Crashed Our Payment System – A Deep Dive into Spring Transaction Pitfalls
A payment service reported successful orders but wrote no data to the database, and lock timeouts followed. The root cause was a missing commit in a special code path: it left a polluted connection behind, so other services inherited an uncommitted transaction and failed intermittently.
Incident Overview
Last week the online payment system started failing: users completed payments, yet the order table remained empty and occasional lock‑timeout errors appeared. The DBA noticed several transactions that never committed, locking rows in the order table.
Symptoms
Business code executed without errors and logs showed normal operation.
Log entries indicated commit was called, but no data was persisted.
Most of the time the operation failed, but occasionally it succeeded.
Emergency Fix
The immediate response was to restart the application, forcing all connections to be released. After the restart the payment flow worked again, but the underlying bug still needed to be identified.
Root Cause Discovery
Reviewing the recent deployment revealed a new business method that forgot to call commit in a special branch:
@Service
public class SomeService {
    // sqlSession, mapper, data and specialCondition are fields omitted for brevity
    public void handleSpecialCase() throws SQLException {
        // start a manual transaction on the underlying connection
        sqlSession.getConnection().setAutoCommit(false);
        // execute SQL
        mapper.insert(data);
        // special case: return without commit!
        if (specialCondition) {
            return; // BUG: commit missed here
        }
        sqlSession.commit();
    }
}
The early return exits the method without a commit or rollback, so the transaction stays open and the ConnectionHolder remains marked as holding an active transaction.
Quick Fix
Moving the method onto Spring's transaction manager, with an explicit commit before the early return and a rollback on failure, resolved the issue:
@Service
public class SomeService {
    public void handleSpecialCase() {
        // let the transaction manager open and track the transaction
        TransactionStatus status = transactionManager.getTransaction(new DefaultTransactionDefinition());
        try {
            mapper.insert(data);
            if (specialCondition) {
                transactionManager.commit(status); // ensure commit on the early-return path
                return;
            }
            transactionManager.commit(status);
        } catch (Exception e) {
            // undo partial work, then propagate the original failure
            transactionManager.rollback(status);
            throw e;
        }
    }
}
Why Did It Affect Unrelated Services?
After the fix, a deeper investigation showed why unrelated services were hit. Spring's doGetTransaction does not open a new connection; it looks up the ConnectionHolder that TransactionSynchronizationManager has bound to the current thread. When the previous method returned without committing, that ConnectionHolder still reported isTransactionActive() = true. The next request handled on the same thread (e.g., PaymentService.createOrder) picked up this polluted holder, and isExistingTransaction reported an existing transaction. Consequently, the new service joined the stale transaction instead of starting a fresh one.
The relevant code paths are:
protected Object doGetTransaction() {
    DataSourceTransactionObject txObject = new DataSourceTransactionObject();
    txObject.setSavepointAllowed(this.isNestedTransactionAllowed());
    // look up the holder bound to the current thread, if any
    ConnectionHolder conHolder = (ConnectionHolder) TransactionSynchronizationManager.getResource(this.obtainDataSource());
    txObject.setConnectionHolder(conHolder, false);
    return txObject;
}
protected boolean isExistingTransaction(Object transaction) {
    DataSourceTransactionObject txObject = (DataSourceTransactionObject) transaction;
    return txObject.hasConnectionHolder() && txObject.getConnectionHolder().isTransactionActive();
}
Because the holder was still marked as active, the subsequent service was treated as a participant in an existing transaction. When it later reached processCommit, the framework checked status.isNewTransaction(), which was false, so the actual connection.commit() never executed. The uncommitted data stayed on the polluted connection, where it kept holding row locks and contaminated further requests.
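The join-then-skip-commit behavior described above can be reproduced without Spring. The following is a simplified plain-Java model, not Spring's actual implementation: `Holder`, `Status`, `begin`, and `commit` are hypothetical stand-ins for ConnectionHolder, TransactionStatus, getTransaction, and processCommit.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of transaction joining; not Spring's real code. */
final class StaleTransactionModel {

    /** Stand-in for ConnectionHolder: remembers whether a transaction is open. */
    static final class Holder {
        boolean transactionActive;
        final List<String> pending = new ArrayList<>();   // written but not committed
        final List<String> committed = new ArrayList<>(); // durable rows
    }

    /** Stand-in for TransactionStatus. */
    static final class Status {
        final boolean newTransaction;
        Status(boolean newTransaction) { this.newTransaction = newTransaction; }
    }

    /** Joins if the holder already reports an active transaction, else starts one. */
    static Status begin(Holder holder) {
        if (holder.transactionActive) {
            return new Status(false); // participant in an existing (possibly stale) transaction
        }
        holder.transactionActive = true;
        return new Status(true);
    }

    /** Only the owner of the transaction performs the real commit. */
    static void commit(Holder holder, Status status) {
        if (status.newTransaction) {
            holder.committed.addAll(holder.pending);
            holder.pending.clear();
            holder.transactionActive = false;
        }
        // A participant does nothing here; it expects the owner to commit later.
    }

    /** Returns how many rows are committed after an order, optionally preceded by a leak. */
    static int committedRows(boolean leakFirst) {
        Holder holder = new Holder();
        if (leakFirst) {
            begin(holder);                      // buggy method opens a transaction,
            holder.pending.add("special-case"); // writes, then returns without commit
        }
        Status order = begin(holder);           // the order either owns or joins
        holder.pending.add("order-1");
        commit(holder, order);
        return holder.committed.size();
    }

    public static void main(String[] args) {
        System.out.println("clean holder: " + committedRows(false)); // prints "clean holder: 1"
        System.out.println("after leak: " + committedRows(true));    // prints "after leak: 0"
    }
}
```

In the leaked case the order's commit is a no-op because the order never owned the transaction, which matches the symptom of log entries showing a commit while nothing was persisted.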
Why Was It Intermittent?
TransactionSynchronizationManager keeps its state in thread‑local storage. If a request was handled by a clean thread (no polluted ConnectionHolder), the transaction succeeded. If it ran on a worker thread that still carried the polluted ConnectionHolder, the bug manifested. This explains the non‑deterministic behavior.
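The thread-affinity of the failure can be illustrated with a plain `ThreadLocal`. This is a minimal sketch under the assumption that each request runs on a pooled worker thread; `TX_ACTIVE` is a hypothetical stand-in for the per-thread resource map, and the two single-thread executors stand in for two workers of a request pool.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ThreadLocalLeakDemo {

    // Stand-in for the per-thread state kept by TransactionSynchronizationManager.
    static final ThreadLocal<Boolean> TX_ACTIVE = ThreadLocal.withInitial(() -> false);

    /** Returns {seenOnPollutedWorker, seenOnCleanWorker}. */
    static boolean[] run() {
        ExecutorService polluted = Executors.newSingleThreadExecutor();
        ExecutorService clean = Executors.newSingleThreadExecutor();
        try {
            // Buggy request: marks a transaction active and never clears the flag.
            polluted.submit(() -> TX_ACTIVE.set(true)).get();
            // Later requests only see the leftover state on the same worker thread.
            boolean onPolluted = polluted.submit(() -> TX_ACTIVE.get()).get();
            boolean onClean = clean.submit(() -> TX_ACTIVE.get()).get();
            return new boolean[] { onPolluted, onClean };
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            polluted.shutdown();
            clean.shutdown();
        }
    }

    public static void main(String[] args) {
        boolean[] seen = run();
        System.out.println("polluted worker sees stale tx: " + seen[0]); // true
        System.out.println("clean worker sees stale tx: " + seen[1]);    // false
    }
}
```

Which worker picks up a given request depends on scheduling, so from the outside the same call succeeds on one attempt and fails on the next.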
Prevention Measures
1. Connection‑Pool Health Checks
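spring:
  datasource:
    hikari:
      connection-test-query: SELECT 1
      validation-timeout: 3000
      connection-init-sql: SET autocommit=1
These settings validate a connection before it is handed out and make sure every new connection starts with autocommit enabled. Note that HikariCP runs connection-init-sql only when it creates a connection, not on every borrow, so this is a safety net rather than a substitute for committing or rolling back in code.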
2. Monitoring & Alerts
Add SQL to detect long‑running transactions and set up alerts:
SELECT * FROM information_schema.innodb_trx
WHERE TIME_TO_SEC(TIMEDIFF(NOW(), trx_started)) > 30;
3. Explicit Transaction Management
Always place commit at the end of the try block.
Put rollback in the catch block.
Release resources in a finally block.
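One way to follow these three rules with plain JDBC is to funnel every transaction through a single helper, so that no branch of business code can skip the commit. The sketch below assumes direct access to a `java.sql.Connection`; `JdbcTransactions`, `SqlWork`, and `inTransaction` are hypothetical names and are not part of Spring or MyBatis.

```java
import java.sql.Connection;
import java.sql.SQLException;

final class JdbcTransactions {

    /** Work to run inside the transaction (hypothetical helper type). */
    @FunctionalInterface
    interface SqlWork {
        void run(Connection connection) throws SQLException;
    }

    /** Runs the work in one transaction: commit on success, rollback on failure. */
    static void inTransaction(Connection connection, SqlWork work) throws SQLException {
        boolean previousAutoCommit = connection.getAutoCommit();
        connection.setAutoCommit(false);
        try {
            work.run(connection);
            connection.commit();   // single commit at the end of the try block
        } catch (SQLException | RuntimeException e) {
            connection.rollback(); // undo partial work before propagating
            throw e;
        } finally {
            // restore the original mode so the connection is returned in a clean state
            connection.setAutoCommit(previousAutoCommit);
        }
    }
}
```

Because the business logic runs inside the callback, an early return in it still falls through to the commit. In a Spring application, `@Transactional` or `TransactionTemplate` provide the same guarantee without hand-written commit and rollback calls.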
4. Database‑Level Monitoring
Track slow queries, long transactions, lock waits, and connection counts to catch issues that application logs may miss.
5. Debug Source Code When Issues Appear
Set breakpoints in getTransaction and isExistingTransaction to see how connections are reused. Relying solely on documentation can hide subtle bugs.
Takeaways
Connection pools are not just performance optimizations; they can propagate transaction state bugs.
Manual transaction code must clearly handle commit and rollback.
Monitoring must cover both application and database layers.
Debugging with real breakpoints often reveals problems that static code reading cannot.
By fixing the missing commit, adding pool validation, and improving monitoring, the team eliminated the intermittent payment failures and strengthened the overall reliability of the system.
