Databases 9 min read

Why Does Database Master‑Slave Replication Lag Every Day at 5‑7 AM?

The article investigates why master‑slave replication delay spikes each morning between 05:00 and 07:00, tracing it to the inventory‑snapshot worker that floods binlog traffic, evaluates five mitigation strategies, implements a big‑data extraction pipeline to Elasticsearch, and reports that the nightly delay disappeared and disk utilization improved.

ITPUB
ITPUB
ITPUB
Why Does Database Master‑Slave Replication Lag Every Day at 5‑7 AM?

Background

As business volume grew, the delay between the primary and replica databases increased, reaching over 30 minutes for some instances, especially in the JX cluster.

Cause Analysis

Data‑feature analysis showed the delay concentrates between 05:00‑07:00. A daily inventory‑snapshot worker creates a full snapshot of stock data, which is essential for analysis, traceability, and reconciliation in the WMS.

In the JX cluster each day 6.9 × 10⁸ rows (484.2 billion items) are generated. The snapshot is produced with INSERT INTO … SELECT FROM … LIMIT … across 152 logical databases. Execution for each data source is serial with paging, while multiple sources run in parallel, leading to high load.

The worker generates massive binlog traffic; the primary‑replica sync therefore handles a large data volume, high disk I/O, and long synchronization time, which explains the observed high replication lag.

Impact Analysis

During the lag window, data in the replica is inaccurate, causing reporting queries and big‑data extraction to return wrong results. Order processing is not directly affected. Additionally, retaining 20 days of snapshots (≈1.38 × 10¹⁰ rows, ~5 TB) keeps disk utilization high on each database instance.

Solution Options

Option 1: Deploy a dedicated snapshot database instance isolated from the production primary.

Option 2: Increase sharding granularity, spreading data across more partitioned instances.

Option 3: Use the SQL management tool to export/import snapshot data daily.

Option 4: Enable writeset replication to improve parallelism (hash‑based transaction identification, parameters transaction_write_set_extraction=XXHASH64 and binlog_transaction_dependency_tracking=WRITESET).

Option 5: Employ a big‑data extraction pipeline (BDP) to store snapshots in Elasticsearch, creating a daily partition and retaining 20 days.

After evaluation, Option 5 was selected because the other approaches could not break the fundamental bottleneck of the SQL management tool.

Special Governance Steps

Build offline extraction tasks at 5 AM : Using a BDP workflow, split the job into 10 tasks, each handling 16 databases.

Switch to a new offline table : Replace the old fdm table with fdm_jdl_scm_wms_stock_st_stock_st_stock_dayly_st.

Plan disk capacity and ES shards : Allocate one new ES shard per day based on the estimated data volume.

Configure Hive‑to‑ES extraction jobs : Set up daily Hive jobs that write snapshot data into Elasticsearch partitions.

Update reporting to use ES : The web layer now queries Elasticsearch through EasyData (Web → Controller → EasyData → ES), allowing SQL‑style queries.

Stop the old snapshot scheduler : After 20 days of ES data are ready, the nightly snapshot task is disabled.

Switch reporting queries to ES : Once the 20‑day window is complete, all report queries and exports use the ES data source.

Clean old snapshot data : Delete snapshot tables from the SQL management tool, defragment disks, and free space.

Results

The day after stopping the snapshot writes, the long‑standing nightly master‑replica lag vanished. Monitoring confirmed only a statistical discrepancy, not real latency. Disk utilization of the production database fell below 60 %, indicating a healthy level.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Elasticsearchdatabase replicationwritesetmaster‑slave delayinventory snapshotbig data extractionBDP
ITPUB
Written by

ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.