Why Does Database Master‑Slave Replication Lag Every Day at 5‑7 AM?
The article investigates why master‑slave replication delay spikes each morning between 05:00 and 07:00, tracing it to the inventory‑snapshot worker that floods binlog traffic, evaluates five mitigation strategies, implements a big‑data extraction pipeline to Elasticsearch, and reports that the nightly delay disappeared and disk utilization improved.
Background
As business volume grew, the delay between the primary and replica databases increased, reaching over 30 minutes for some instances, especially in the JX cluster.
Cause Analysis
Data‑feature analysis showed the delay concentrates between 05:00‑07:00. A daily inventory‑snapshot worker creates a full snapshot of stock data, which is essential for analysis, traceability, and reconciliation in the WMS.
In the JX cluster each day 6.9 × 10⁸ rows (484.2 billion items) are generated. The snapshot is produced with INSERT INTO … SELECT FROM … LIMIT … across 152 logical databases. Execution for each data source is serial with paging, while multiple sources run in parallel, leading to high load.
The worker generates massive binlog traffic; the primary‑replica sync therefore handles a large data volume, high disk I/O, and long synchronization time, which explains the observed high replication lag.
Impact Analysis
During the lag window, data in the replica is inaccurate, causing reporting queries and big‑data extraction to return wrong results. Order processing is not directly affected. Additionally, retaining 20 days of snapshots (≈1.38 × 10¹⁰ rows, ~5 TB) keeps disk utilization high on each database instance.
Solution Options
Option 1: Deploy a dedicated snapshot database instance isolated from the production primary.
Option 2: Increase sharding granularity, spreading data across more partitioned instances.
Option 3: Use the SQL management tool to export/import snapshot data daily.
Option 4: Enable writeset replication to improve parallelism (hash‑based transaction identification, parameters transaction_write_set_extraction=XXHASH64 and binlog_transaction_dependency_tracking=WRITESET).
Option 5: Employ a big‑data extraction pipeline (BDP) to store snapshots in Elasticsearch, creating a daily partition and retaining 20 days.
After evaluation, Option 5 was selected because the other approaches could not break the fundamental bottleneck of the SQL management tool.
Special Governance Steps
Build offline extraction tasks at 5 AM : Using a BDP workflow, split the job into 10 tasks, each handling 16 databases.
Switch to a new offline table : Replace the old fdm table with fdm_jdl_scm_wms_stock_st_stock_st_stock_dayly_st.
Plan disk capacity and ES shards : Allocate one new ES shard per day based on the estimated data volume.
Configure Hive‑to‑ES extraction jobs : Set up daily Hive jobs that write snapshot data into Elasticsearch partitions.
Update reporting to use ES : The web layer now queries Elasticsearch through EasyData (Web → Controller → EasyData → ES), allowing SQL‑style queries.
Stop the old snapshot scheduler : After 20 days of ES data are ready, the nightly snapshot task is disabled.
Switch reporting queries to ES : Once the 20‑day window is complete, all report queries and exports use the ES data source.
Clean old snapshot data : Delete snapshot tables from the SQL management tool, defragment disks, and free space.
Results
The day after stopping the snapshot writes, the long‑standing nightly master‑replica lag vanished. Monitoring confirmed only a statistical discrepancy, not real latency. Disk utilization of the production database fell below 60 %, indicating a healthy level.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
