Databases 27 min read

Why MySQL Locks and Transaction Isolation Cause Data Loss in Concurrent Syncs

A detailed investigation reveals that MySQL's lock matrix and MVCC‑based transaction isolation caused a concurrent synchronization job to lose metric and dimension data, and proposes a distributed‑lock solution together with a long‑term read‑calc‑write refactor to eliminate the issue.

JD Retail Technology
JD Retail Technology
JD Retail Technology
Why MySQL Locks and Transaction Isolation Cause Data Loss in Concurrent Syncs

Incident Overview

On 2025‑08‑13 a logical fact table attempted to synchronize its metrics and dimensions to an atomic service. Two nearly simultaneous sync requests caused part of the data to be missing.

Root Cause

Both requests executed the same "delete‑then‑insert" workflow. The second request’s DELETE was blocked by the first request’s transaction lock, but the second request’s subsequent SELECT read a snapshot taken before the first transaction’s delete became visible, so it incorrectly assumed the data was already present.

MySQL InnoDB Lock Mechanism

InnoDB uses a lock matrix rather than a single lock type:

Table‑level shared ( LOCK TABLES … READ) and exclusive ( LOCK TABLES … WRITE) locks.

Row‑level shared lock ( SELECT … FOR SHARE) and exclusive lock ( UPDATE/DELETE/INSERT).

Gap locks that block inserts into a range.

Next‑Key locks (row lock + gap lock) – the default for InnoDB.

The delete statement DELETE FROM unify_metric_impl WHERE logic_table_id IN (…) does not satisfy the left‑most index rule on idx_metric_def_id (metric_def_id, logic_table_id). Consequently MySQL escalates to a table‑level exclusive lock, forcing the second transaction to wait until the first releases the lock.

Transaction Isolation and MVCC

InnoDB’s default isolation level is REPEATABLE READ , which prevents dirty and non‑repeatable reads; phantom reads are avoided by Next‑Key locks. Each row stores three hidden fields ( DB_TRX_ID, DB_ROLL_PTR, DB_DELETED) and a read view ( m_ids, min_trx_id, max_trx_id, creator_trx_id) determines visibility. Snapshot reads see rows whose trx_idmin_trx_id and not listed in m_ids. Redo logs guarantee durability; undo logs enable rollback and MVCC reads.

Practical Code Analysis

@Transactional(rollbackFor = Exception.class)
public Map<String, Object> driveToAtomService(Map logicTableData, String erp) {
    // obtain environment
    String env = driveLogicTable.getString(DRIVE_LOGIC_TABLE_ENV);
    // get or add logic table id
    Long logicTableId = getOrAddLogicTableId(...);
    // delete old metric implementations (may acquire table‑level X lock)
    metricImplMapper.deleteByLogicTableIds(Collections.singletonList(logicId));
    // fetch current metric list and existing implementations
    List<MetricImplBO> metricList = getMetricImpls(...);
    List<MetricImplRelBO> metricImpls = metricImplMapper.getMetricImpls(logicTableId);
    Set<Long> metricDefIdSet = metricImpls.stream()
        .map(MetricImplRelBO::getMetricDefId).collect(Collectors.toSet());
    List<MetricImplBO> addList = metricList.stream()
        .filter(s -> !metricDefIdSet.contains(s.getMetricDefId()))
        .collect(Collectors.toList());
    // insert only new metric implementations
    addMetricImpl(addList);
}

When the first transaction runs, the delete acquires a table‑level X lock. The second transaction’s SELECT sees the pre‑delete version because its snapshot was taken before the delete became visible, leading to the apparent "non‑empty" result.

Solution Options

Distributed lock per logical table – low implementation effort, resolves the immediate conflict but retains long‑running transactions.

Force current read (SELECT … FOR UPDATE) – also low effort but creates long‑lasting locks; not recommended for high‑concurrency workloads.

Split the workflow into "read‑calc‑write" – read data without locks, compute differences in the application layer, then write only the delta in a short transaction. This requires refactoring but eliminates the root cause and reduces lock contention.

Current Status

The short‑term distributed‑lock approach has been deployed and the incident is resolved. Future iterations will migrate to the read‑calc‑write pattern to shrink transaction size and improve concurrency.

concurrencyMySQLDistributed Locktransaction isolationLocksMVCC
JD Retail Technology
Written by

JD Retail Technology

Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.