
How to Eliminate 30‑Minute Master‑Slave Lag in High‑Volume Inventory Systems

This article analyzes why a warehouse management system’s master‑slave database replication lagged up to 30 minutes during nightly inventory snapshot generation, evaluates several mitigation strategies, and details the chosen big‑data‑driven solution that moved snapshots to Elasticsearch, reducing lag and disk usage.

JD Tech

Background

As business volume grew, the delay between the primary and replica databases in the JX cluster increased, with some instances experiencing over 30 minutes of lag, especially during the nightly inventory snapshot generation between 05:00‑07:00.

Cause Analysis

Data analysis showed that the lag coincided with the automatic generation of daily inventory snapshots, which are essential for analysis, traceability, and reconciliation. The JX cluster processes massive volumes (up to 6.9 billion rows, 484.2 billion items, per day), creating snapshots via INSERT INTO … SELECT FROM … LIMIT … statements across 152 data sources. Parallel execution across sources, heavy binlog generation, high I/O, and large transactions all prolong replication.
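The article does not show the snapshot job's exact statements, but the chunked INSERT … SELECT … LIMIT pattern it describes can be sketched as below (table names, column names, and batch sizes are hypothetical). Each batch is one transaction, so every chunk adds its full row set to the binlog that the replica must replay, which is why the job saturates replication:

```python
def snapshot_statements(source_table: str, snapshot_table: str,
                        total_rows: int, batch_size: int) -> list[str]:
    """Build the chunked INSERT ... SELECT statements a snapshot job
    might issue. Each statement commits as one transaction, so each
    batch contributes its entire row set to the binlog stream."""
    statements = []
    for offset in range(0, total_rows, batch_size):
        statements.append(
            f"INSERT INTO {snapshot_table} "
            f"SELECT * FROM {source_table} "
            f"ORDER BY id LIMIT {offset}, {batch_size}"
        )
    return statements

# One million rows in 200k-row batches -> 5 transactions per table,
# repeated across all 152 data sources in parallel.
stmts = snapshot_statements("stock", "stock_snapshot", 1_000_000, 200_000)
print(len(stmts))
```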

Impact Analysis

The lag leaves replica data stale, so reporting and big‑data extraction queries that read from the replica return out‑of‑date results. While it does not directly affect order processing, it degrades reporting, export, and analytical workflows. In addition, retaining 20 days of snapshots consumes about 5 TB of disk space, keeping disk utilization high on every database instance.

Solution Options

1. Deploy a dedicated snapshot database instance isolated from production.

2. Partition the data more finely (increase sharding granularity).

3. Use SQL management tools to import/export snapshot data daily.

4. Enable writeset in the SQL tool to improve parallel replication (requires transaction_write_set_extraction=XXHASH64 and binlog_transaction_dependency_tracking=WRITESET).

5. Leverage big‑data extraction to store snapshots in Elasticsearch.
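For reference, the writeset settings named in option 4 correspond to the following MySQL system variables (a my.cnf sketch; the replica‑side parallel‑apply settings and worker count are assumptions added for completeness, not part of the original article):

```ini
# my.cnf — WRITESET-based parallel replication (MySQL 5.7.22+ / 8.0)
[mysqld]
transaction_write_set_extraction = XXHASH64
binlog_transaction_dependency_tracking = WRITESET

# Replica side: apply independent transactions in parallel
# (illustrative values, tune to workload)
slave_parallel_type = LOGICAL_CLOCK
slave_parallel_workers = 16
```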

Options 1‑4 are database‑level optimizations, but none of them fully removes the replication bottleneck, so the team selected option 5.

Chosen Solution: Big‑Data‑Driven Snapshot Storage

Daily inventory snapshots are extracted offline using the BDP workflow, creating a new partition (dt) for each day. The data is then written from Hive to Elasticsearch (ES) via the hive2ES connector. Each day creates a new ES shard, retaining 20 days of data; older shards are deleted automatically.
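The per‑day partitioning and 20‑day retention can be sketched as follows. The article speaks of one new ES "shard" per day; in practice that usually means one index per day, and the index naming here is hypothetical:

```python
from datetime import date, timedelta

def daily_index(day: date, prefix: str = "stock_snapshot") -> str:
    """Name of the ES index holding one day's snapshot partition (dt)."""
    return f"{prefix}_{day:%Y%m%d}"

def expired_indices(existing: list[str], today: date, keep_days: int = 20,
                    prefix: str = "stock_snapshot") -> list[str]:
    """Indices older than the retention window, eligible for deletion."""
    keep = {daily_index(today - timedelta(days=i), prefix)
            for i in range(keep_days)}
    return sorted(i for i in existing if i.startswith(prefix) and i not in keep)

# With 25 daily indices on disk and a 20-day window, 5 are deletable.
today = date(2023, 5, 1)
existing = [daily_index(today - timedelta(days=i)) for i in range(25)]
print(expired_indices(existing, today))
```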

Implementation Steps

1. Build offline extraction tasks in the BDP workflow, scheduled for 05:00

Split the collection into 10 tasks, each handling 16 databases.
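A sketch of that split, assuming contiguous assignment (the article says 10 tasks of 16 databases each; with 152 sources the last group comes out smaller):

```python
def assign_databases(n_databases: int, n_tasks: int) -> list[range]:
    """Split databases 0..n-1 into n_tasks contiguous groups of at most
    ceil(n/n_tasks) databases each."""
    per_task = -(-n_databases // n_tasks)  # ceiling division
    return [range(start, min(start + per_task, n_databases))
            for start in range(0, n_databases, per_task)]

# 152 data sources across 10 extraction tasks -> nine groups of 16,
# one trailing group of 8.
groups = assign_databases(152, 10)
print([len(g) for g in groups])
```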

2. Switch the existing inventory snapshot table to a new offline table

Identify dependent jobs and replace the old table with fdm_jdl_scm_wms_stock_st_stock_st_stock_dayly_st.

3. Plan disk capacity and ES shard allocation

Assess daily data volume and allocate a new ES shard per day.
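From the figures above (roughly 5 TB for 20 days of snapshots on the SQL side), the daily volume the ES cluster must absorb works out to:

```python
# Back-of-the-envelope capacity estimate from the article's numbers.
total_tb = 5        # disk used by 20 days of SQL snapshots
days = 20
per_day_gb = total_tb * 1024 / days
print(f"~{per_day_gb:.0f} GB of snapshot data per day")
```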

4. Configure Hive‑to‑ES extraction jobs

Set up hive2ES tasks to write daily snapshots into new ES partitions.

5. Update backend services to query/export from ES

Use EasyData to wrap ES queries, allowing SQL‑style access for reporting.
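EasyData is a JD‑internal tool, so its API is not shown here. As a sketch, a SQL‑style report filter might translate into ES query DSL like this (field names are hypothetical):

```python
def snapshot_query(dt: str, warehouse_id: int, size: int = 100) -> dict:
    """Build ES query DSL for one warehouse's snapshot on day `dt`,
    roughly equivalent to:
    SELECT * FROM snapshot WHERE dt = ? AND warehouse_id = ? LIMIT ?"""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"dt": dt}},
                    {"term": {"warehouse_id": warehouse_id}},
                ]
            }
        },
        "size": size,
        "sort": [{"sku_id": "asc"}],
    }

q = snapshot_query("2023-05-01", 42)
print(q["query"]["bool"]["filter"])
```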

6. Stop the old snapshot cron job after ES holds 20 days of data

Since the past 20 days of data already exist in ES, the switch can be performed without a data gap.
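One way to confirm the switch is gap‑free before stopping the cron job is to check that every dt partition in the 20‑day window exists in ES (a sketch; the dt naming is an assumption):

```python
from datetime import date, timedelta

def missing_days(available: set[str], today: date,
                 keep_days: int = 20) -> list[str]:
    """Return the dt partitions in the retention window absent from ES.
    An empty result means the old snapshot job can be stopped safely."""
    window = [(today - timedelta(days=i)).isoformat()
              for i in range(keep_days)]
    return [dt for dt in window if dt not in available]

today = date(2023, 5, 20)
have = {(today - timedelta(days=i)).isoformat() for i in range(20)}
print(missing_days(have, today))  # [] -> no data gap
```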

7. Migrate reporting and export queries to ES data source

After the 20‑day window, the web reports fully transition to ES.

8. Clean up old snapshot tables and reclaim disk space

Delete the historical snapshot data from the production database via the SQL management tool, then defragment the tables to reclaim disk space.

Results

On the first day after stopping the SQL snapshot writes, the long‑standing master‑slave lag disappeared; remaining latency differences were due to monitoring metric variations, not actual delay. Disk utilization on the primary database dropped below 60%, and the system now operates within healthy thresholds.
