
Why JED’s Lock Mechanism Caused Data Loss and How Distributed Locks Can Fix It

An in‑depth post‑mortem of a JED database incident reveals how its lock matrix and MVCC isolation caused metric data loss, explains the underlying lock granularity, transaction isolation levels, and MVCC visibility rules, and proposes short‑term distributed‑lock and long‑term read‑calc‑write solutions.

JD Retail Technology

1. Theory from Practice

On 2025‑08‑13, metric and dimension data disappeared while a fact‑logic table was syncing to an atomic service. The root cause was two nearly simultaneous syncs of the same logic table: JED's row locks serialized their writes, but MVCC snapshot reads let the second sync work from a view of the data that the first sync had already changed.

2. Rewind to the Incident

The sync logic follows a "delete‑then‑insert" pattern: delete old rows → compare new vs. old → insert new rows. The core Java method

@Transactional(rollbackFor = Exception.class)
public Map<String, Object> driveToAtomService(Map logicTableData, String erp) { … }

performs environment lookup, metric ID handling, deletion of related metrics, and batch insertion.

During the incident, 15 metrics and 64 dimensions were involved. Request 2's DELETE was blocked behind request 1's long‑running transaction, while request 2's plain SELECT still returned the old rows that request 1 had already deleted but not yet committed. Request 2 therefore compared against stale data, which violated the delete‑then‑insert logic.

3. Deep Dive into Conclusions

3.1 Lock Mechanism

JED does not use a single lock but a lock matrix with different granularities:

Table‑level : S‑lock (rare), X‑lock (rare), intention locks (IS/IX).

Row‑level : S‑lock via SELECT … FOR SHARE, X‑lock via UPDATE/DELETE/INSERT (default).

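Gap‑level : Gap S‑lock and Gap X‑lock. In InnoDB both have the same effect: they only stop other transactions from inserting into the gap, and gap locks held by different transactions do not conflict with each other.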

Next‑Key : Row + gap lock (default InnoDB algorithm).

A decision‑logic diagram (image) summarizes which lock type applies.
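The table‑level part of the matrix can be made concrete with a small compatibility check. This is a minimal sketch that assumes JED follows the standard InnoDB compatibility rules for IS, IX, S, and X; the `LockMode` enum and its method are illustrative names, not part of JED.

```java
/** Table-level lock modes from the list above (illustrative type, not a JED API). */
enum LockMode {
    IS, IX, S, X;

    /**
     * Returns true if another transaction may be granted `requested`
     * while this mode is already held on the same table.
     * Assumes the standard InnoDB table-level compatibility matrix.
     */
    boolean isCompatibleWith(LockMode requested) {
        switch (this) {
            case IS: return requested != X;                      // IS conflicts only with X
            case IX: return requested == IS || requested == IX;  // IX conflicts with S and X
            case S:  return requested == IS || requested == S;   // S conflicts with IX and X
            default: return false;                               // X conflicts with everything
        }
    }
}

final class LockMatrixDemo {
    public static void main(String[] args) {
        // Print the full held/requested matrix.
        for (LockMode held : LockMode.values()) {
            for (LockMode requested : LockMode.values()) {
                System.out.printf("held=%-2s requested=%-2s compatible=%b%n",
                        held, requested, held.isCompatibleWith(requested));
            }
        }
    }
}
```

Intention locks are compatible with each other, which is why many transactions can hold row locks in the same table at once; the actual conflicts in this incident happen at the row and gap level.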

3.2 Transaction Theory

ACID is a causal chain, not four independent properties. Consistency is the core; atomicity, isolation, and durability support it.

Isolation levels in JED:

RU (Read Uncommitted): no protection, minimal cost.

RC (Read Committed): prevents dirty reads, low cost.

RR (Repeatable Read, default): prevents dirty and non‑repeatable reads, medium cost; phantom reads are avoided by the MVCC snapshot for plain reads and by Next‑Key locks for current reads.

Serializable: prevents all concurrency anomalies, high cost.
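For reference, the four levels map directly onto the standard JDBC isolation constants. The sketch below is illustrative only: the `IsolationLevel` enum is a hypothetical helper, and the two flags simply restate the list above.

```java
import java.sql.Connection;
import java.sql.SQLException;

/** Isolation levels from the list above, mapped to java.sql.Connection constants. */
enum IsolationLevel {
    RU(Connection.TRANSACTION_READ_UNCOMMITTED, false, false),
    RC(Connection.TRANSACTION_READ_COMMITTED, true, false),
    RR(Connection.TRANSACTION_REPEATABLE_READ, true, true),
    SERIALIZABLE(Connection.TRANSACTION_SERIALIZABLE, true, true);

    final int jdbcConstant;
    final boolean preventsDirtyRead;
    final boolean preventsNonRepeatableRead;

    IsolationLevel(int jdbcConstant, boolean preventsDirtyRead, boolean preventsNonRepeatableRead) {
        this.jdbcConstant = jdbcConstant;
        this.preventsDirtyRead = preventsDirtyRead;
        this.preventsNonRepeatableRead = preventsNonRepeatableRead;
    }

    /** Applies this level to a connection; call it before the transaction starts. */
    void applyTo(Connection connection) throws SQLException {
        connection.setTransactionIsolation(jdbcConstant);
    }
}
```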

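MVCC relies on hidden fields in each row. In InnoDB these are DB_TRX_ID (the last transaction that modified the row), DB_ROLL_PTR (a pointer to the previous version in the undo log), and DB_ROW_ID; deletion is recorded as a delete‑mark flag in the record header rather than as a separate column. A read‑view consisting of m_ids, min_trx_id, max_trx_id, and creator_trx_id decides which version is visible.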

Visibility rules, checked in order against the row version's db_trx_id:

If db_trx_id == creator_trx_id, the row is visible (self‑modification).

If db_trx_id < min_trx_id, the row is visible (committed before the snapshot).

If db_trx_id >= max_trx_id, the row is invisible (created after the snapshot).

If min_trx_id ≤ db_trx_id < max_trx_id and db_trx_id is in m_ids, invisible (still active); otherwise visible.
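The four rules translate almost line for line into code. The following is a simplified sketch, not JED's or InnoDB's actual implementation: the `ReadView` class and the transaction IDs in the demo are made up for illustration, and only the field names come from the text above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Simplified read view; field names follow the text, the class itself is illustrative. */
final class ReadView {
    private final Set<Long> mIds;     // transactions still active when the snapshot was taken
    private final long minTrxId;      // smallest ID in mIds
    private final long maxTrxId;      // next ID to be assigned at snapshot time
    private final long creatorTrxId;  // transaction that owns this read view

    ReadView(Set<Long> mIds, long minTrxId, long maxTrxId, long creatorTrxId) {
        this.mIds = mIds;
        this.minTrxId = minTrxId;
        this.maxTrxId = maxTrxId;
        this.creatorTrxId = creatorTrxId;
    }

    /** Decides whether a row version written by dbTrxId is visible to this read view. */
    boolean isVisible(long dbTrxId) {
        if (dbTrxId == creatorTrxId) return true;  // rule 1: own modification
        if (dbTrxId < minTrxId) return true;       // rule 2: committed before the snapshot
        if (dbTrxId >= maxTrxId) return false;     // rule 3: started after the snapshot
        return !mIds.contains(dbTrxId);            // rule 4: visible only if no longer active
    }
}

final class ReadViewDemo {
    public static void main(String[] args) {
        // Hypothetical IDs: transaction 100 is still running when 101 takes its snapshot.
        Set<Long> active = new HashSet<>(Arrays.asList(100L, 101L));
        ReadView view = new ReadView(active, 100L, 102L, 101L);

        System.out.println(view.isVisible(90L));   // true:  committed before the snapshot
        System.out.println(view.isVisible(100L));  // false: writer is still active
        System.out.println(view.isVisible(101L));  // true:  the reader's own change
        System.out.println(view.isVisible(105L));  // false: started after the snapshot
    }
}
```

When a version is invisible, the reader follows DB_ROLL_PTR to the previous version and applies the same check again, so an uncommitted delete by transaction 100 leaves the older committed version visible to transaction 101.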

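Current reads (SELECT … FOR UPDATE, SELECT … FOR SHARE, and the row lookups inside UPDATE and DELETE) bypass the read‑view: they read the latest committed version and lock it, waiting if another transaction holds a conflicting lock.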

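Transaction logs divide the work: the redo log guarantees durability by recording changes as fast sequential writes before data pages are flushed, while the undo log guarantees atomicity by making rollback possible and also holds the old row versions that MVCC reads.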

3.3 Practical Findings

Two concurrent transactions were simulated:

-- Transaction 1
begin;
select * from unify_metric_impl where logic_table_id = 45631;   -- snapshot read
delete from unify_metric_impl where logic_table_id in (45631);  -- current read, takes row X-locks
-- insert many rows …
commit;

-- Transaction 2 (started while Transaction 1 is still open)
begin;
select * from unify_metric_impl where logic_table_id = 45631;   -- snapshot read
delete from unify_metric_impl where logic_table_id in (45631);  -- blocks until Transaction 1 commits
commit;

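Inside Transaction 1, a SELECT issued after its own DELETE returned empty, because a transaction always sees its own changes (the delete‑marked version, V2). Transaction 2's SELECT is a snapshot read: Transaction 1 was still active when Transaction 2's read‑view was created, so Transaction 1's ID was in m_ids and its delete was invisible. Transaction 2 therefore saw the previous committed version (V1), while its DELETE, a current read, waited on Transaction 1's row locks.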

4. Solutions

Three mitigation strategies were evaluated:

Option 1 – Distributed lock per logical table : low implementation cost, short‑term fix, but still leaves long transactions.

Option 2 – Force current read (SELECT … FOR UPDATE) in Transaction 2 : low cost but creates long‑running locks and is not recommended.

Option 3 – Split long transaction into "read‑calc‑write" : read data without locks, compute differences in application code, then write only the delta in short transactions; higher refactor cost but provides a long‑term solution.
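Option 1 can be sketched as a guard that serializes syncs per logic table. Everything named here is an assumption for illustration: the `DistributedLock` interface, the key format, and the in‑memory stand‑in, which only works inside one JVM. A real deployment would back the interface with a shared store and give each lock an expiry.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Hypothetical lock abstraction; not a JED or framework API. */
interface DistributedLock {
    /** Returns an owner token if the lock was acquired, or null if it is already held. */
    String tryLock(String key);

    /** Releases the lock only if the token matches the current owner. */
    void unlock(String key, String token);
}

/** In-memory stand-in so the sketch runs locally; it is NOT distributed. */
final class InMemoryLock implements DistributedLock {
    private final Map<String, String> owners = new ConcurrentHashMap<>();

    @Override
    public String tryLock(String key) {
        String token = UUID.randomUUID().toString();
        return owners.putIfAbsent(key, token) == null ? token : null;
    }

    @Override
    public void unlock(String key, String token) {
        owners.remove(key, token);  // atomic compare-and-remove
    }
}

final class LogicTableSyncGuard {
    private final DistributedLock lock;

    LogicTableSyncGuard(DistributedLock lock) {
        this.lock = lock;
    }

    /** Runs the sync only if no other sync holds the lock for the same logic table. */
    <T> T runExclusive(long logicTableId, Supplier<T> sync) {
        String key = "sync:logic_table:" + logicTableId;  // hypothetical key format
        String token = lock.tryLock(key);
        if (token == null) {
            throw new IllegalStateException("sync already running for logic table " + logicTableId);
        }
        try {
            return sync.get();
        } finally {
            lock.unlock(key, token);
        }
    }
}
```

The guard should wrap the call to the transactional method rather than sit inside it. With Spring's @Transactional, the commit happens after the method returns, so a lock released inside the method would let a second sync start before the first one's changes are visible.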

The short‑term distributed‑lock approach has already been applied, and the read‑calc‑write refactor is planned for future releases.
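The "calc" step of the read‑calc‑write refactor could look like the sketch below. It assumes rows can be keyed by some business key and compared by value; `Delta` and `DeltaCalculator` are hypothetical names, and the key and value types are left generic because the article does not describe the row schema.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

/** Changes needed to turn the existing rows into the desired rows (illustrative type). */
final class Delta<K, V> {
    final Map<K, V> toInsert = new HashMap<>();
    final Map<K, V> toUpdate = new HashMap<>();
    final Set<K> toDelete = new HashSet<>();
}

final class DeltaCalculator {
    /** Pure in-memory comparison; holds no database locks while it runs. */
    static <K, V> Delta<K, V> diff(Map<K, V> existing, Map<K, V> desired) {
        Delta<K, V> delta = new Delta<>();
        for (Map.Entry<K, V> entry : desired.entrySet()) {
            K key = entry.getKey();
            if (!existing.containsKey(key)) {
                delta.toInsert.put(key, entry.getValue());   // new row
            } else if (!Objects.equals(existing.get(key), entry.getValue())) {
                delta.toUpdate.put(key, entry.getValue());   // changed row
            }
        }
        for (K key : existing.keySet()) {
            if (!desired.containsKey(key)) {
                delta.toDelete.add(key);                     // row no longer wanted
            }
        }
        return delta;
    }
}
```

Only the three collections in `Delta` are then written, each in a short transaction, instead of deleting and reinserting every row. Because the read takes no locks, the rows can change between the read and the write, so concurrent syncs of the same table would likely still need a guard such as the per‑table lock or a version check.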

5. Appendix

Definitions:

Fact‑logic table : a semantic view joining fact tables and dimension tables, serving as the source of metric data.

Atomic service : an implementation of a metric; a metric may have multiple implementations.
