Why Master‑Slave Replication Lags 5‑7 AM and How a Big‑Data Snapshot Fixes It

This article analyzes why master‑slave database replication lags by 30 minutes or more each morning between 05:00 and 07:00, traces the cause to massive inventory‑snapshot jobs, evaluates several mitigation options, and details a big‑data extraction workflow that eliminates the lag while reducing disk usage.

dbaplus Community

Background

As business volume grew, the delay between the primary and replica databases increased, with some instances showing more than 30 minutes of lag, especially in the JX cluster during the 05:00‑07:00 window.

Root‑Cause Analysis

Data‑feature analysis revealed that the delay coincides with a daily inventory‑snapshot worker that runs at 05:00. The worker creates a full snapshot of inventory data used for reporting, reconciliation, and traceability. In the JX and KA clusters, the snapshot generates 6.9 × 10⁸ rows (≈4.842 trillion items) each day, with single‑layer inventory rows reaching millions.

The snapshot is built with INSERT INTO … SELECT FROM … LIMIT … queries. Although the system has 152 data sources, the worker processes them serially, reading each source with id‑based paging. It also applies gradient‑interval throttling between batches to limit the load on any single DB instance.
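The paging pattern can be sketched as follows. This is a minimal illustration, not the team's worker: it uses Python's built-in sqlite3 so it runs anywhere, the table names `stock` and `stock_snapshot` and their columns are hypothetical, and a fixed pause stands in for the gradient‑interval throttle. It also resolves each page to an id range first instead of putting LIMIT directly in the INSERT … SELECT.

```python
import sqlite3
import time


def snapshot_by_id_paging(conn, batch_size, pause_seconds=0.0):
    """Copy stock into stock_snapshot one id-ordered page at a time.

    Returns the number of batches committed.
    """
    last_id = 0  # assumes positive integer ids
    batches = 0
    while True:
        # Upper id bound of the next page; None when no rows remain.
        upper = conn.execute(
            "SELECT MAX(id) FROM "
            "(SELECT id FROM stock WHERE id > ? ORDER BY id LIMIT ?)",
            (last_id, batch_size),
        ).fetchone()[0]
        if upper is None:
            return batches
        conn.execute(
            "INSERT INTO stock_snapshot (id, sku, qty) "
            "SELECT id, sku, qty FROM stock WHERE id > ? AND id <= ?",
            (last_id, upper),
        )
        conn.commit()  # one small transaction per page
        batches += 1
        last_id = upper
        time.sleep(pause_seconds)  # fixed pause standing in for the throttle


def demo():
    """Build a tiny in-memory source table and snapshot it in pages of 2."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE stock (id INTEGER PRIMARY KEY, sku TEXT, qty INTEGER)")
    conn.execute(
        "CREATE TABLE stock_snapshot (id INTEGER PRIMARY KEY, sku TEXT, qty INTEGER)"
    )
    conn.executemany(
        "INSERT INTO stock VALUES (?, ?, ?)",
        [(1, "A", 10), (2, "B", 0), (3, "C", 7), (5, "D", 4), (8, "E", 1)],
    )
    conn.commit()
    batches = snapshot_by_id_paging(conn, batch_size=2)
    copied = conn.execute("SELECT COUNT(*) FROM stock_snapshot").fetchone()[0]
    return batches, copied
```

Paging by `id > last_id` keeps every page an index range scan regardless of how far the copy has progressed. On a replicated database, though, every committed page still has to be written to the binlog and replayed on the replica, so paging smooths the load without reducing the total volume.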

During the worker execution, massive binlog traffic is produced, causing high write load, heavy disk I/O, and long replication time, which explains the observed master‑slave lag.

Impact Analysis

The lag leaves replica data stale, which affects the report queries and big‑data extraction jobs that read from the replica. Order processing is not directly impacted. Additionally, retaining 20 days of snapshots (≈5 TB, about 13.8 billion rows at 6.9 × 10⁸ rows per day) keeps disk utilization high on each DB instance.

Mitigation Options

1. Deploy a dedicated snapshot database instance isolated from the production database.

2. Increase sharding granularity to spread data across more instances.

3. Use SQL management tools to import/export snapshots.

4. Enable writeset replication (hash‑based transaction identification, e.g., XXHASH64) to improve parallelism; requires transaction_write_set_extraction=XXHASH64 and binlog_transaction_dependency_tracking=WRITESET.

5. Implement a big‑data extraction pipeline that writes daily snapshots to Hive and then to Elasticsearch (ES).
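To make option 4 concrete, here is a toy model of writeset‑based dependency tracking. It is a simplified sketch, not MySQL's implementation: the function names are hypothetical, `hashlib.blake2b` stands in for XXHASH64 because the latter is not in the Python standard library, and real server logic handles more cases. Each transaction gets a `last_committed` value equal to the sequence number of the most recent earlier transaction that wrote one of the same rows. A replica worker may start the transaction once everything up to that number has committed.

```python
import hashlib


def writeset_hash(table, primary_key):
    """Hash the identity of one modified row (stand-in for XXHASH64)."""
    digest = hashlib.blake2b(
        f"{table}:{primary_key}".encode(), digest_size=8
    ).digest()
    return int.from_bytes(digest, "big")


def assign_last_committed(transactions):
    """Return a (sequence_number, last_committed) pair per transaction.

    transactions: list of transactions, each a list of (table, primary_key)
    tuples naming the rows that transaction wrote.
    """
    history = {}  # row hash -> sequence number of the last writer
    result = []
    for seq, rows in enumerate(transactions, start=1):
        hashes = [writeset_hash(table, pk) for table, pk in rows]
        # Depend only on the latest earlier transaction touching the same rows.
        last_committed = max((history.get(h, 0) for h in hashes), default=0)
        for h in hashes:
            history[h] = seq
        result.append((seq, last_committed))
    return result


txns = [
    [("stock_snapshot", 1)],  # writes row 1
    [("stock_snapshot", 2)],  # writes row 2, independent of the first
    [("stock_snapshot", 1)],  # writes row 1 again, must follow the first
]
deps = assign_last_committed(txns)
```

In this example the first two transactions both get `last_committed = 0` and could be applied in parallel, while the third depends on the first. Snapshot inserts touch distinct rows, so they parallelize well under this scheme, but the replica still has to apply the full write volume.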

Options 1‑4 improve parallelism or isolate workloads but still face the inherent bottleneck of the SQL management tool. The team selected option 5.

Solution Implementation (Big‑Data Extraction)

1. Build a BDP workflow that runs at 05:00 and is split into 10 tasks, each handling 16 databases and producing an offline snapshot.

2. Replace the old fdm table with a new offline table fdm_jdl_scm_wms_stock_st_stock_st_stock_dayly_st, migrating dependent jobs.

3. Plan disk capacity and provision ES shards, writing one new ES partition per day.

4. Configure Hive‑to‑ES jobs to transfer snapshot data from the Hive warehouse to ES.

5. Switch backend reporting and export services to query ES via EasyData, which wraps ES queries behind a SQL‑like interface.

6. Stop the original snapshot writer in the SQL management tool.

7. After 20 days of ES data accumulation, migrate report queries and exports fully to the ES data source.

8. Clean old snapshot tables in the SQL management tool, defragment disks, and release space.
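The article does not show how the Hive‑to‑ES jobs write their data. As a rough illustration of what such a transfer ultimately produces, the sketch below builds a request body in the newline‑delimited JSON format that the Elasticsearch bulk API accepts: one action line followed by one document line per row. The index name, the row fields, and the choice of the row id as document `_id` are all assumptions for the example.

```python
import json


def build_bulk_payload(index_name, rows, id_field="id"):
    """Serialize rows as an Elasticsearch _bulk body (NDJSON)."""
    if not rows:
        return ""
    lines = []
    for row in rows:
        # Action line: index this document under a deterministic _id so that
        # re-running the same day's job overwrites rather than duplicates.
        action = {"index": {"_index": index_name, "_id": str(row[id_field])}}
        lines.append(json.dumps(action))
        # Source line: the document itself.
        lines.append(json.dumps(row))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


payload = build_bulk_payload(
    "stock_snapshot-2024.01.01",  # hypothetical daily index name
    [{"id": 7, "sku": "A", "qty": 3}],
)
```

Using a deterministic `_id` is a design choice that makes a retried job idempotent. Whether the team's jobs do this is not stated in the source.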

Results

On the day the snapshot writer was stopped, the long‑standing master‑slave lag disappeared. Subsequent monitoring confirmed that the remaining difference was only a statistical discrepancy, not real lag. Disk utilization on the production DB fell below 60 %. The ES store now holds daily snapshots under a 20‑day retention policy and automatically deletes older partitions.
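The retention rule can be sketched as a small date calculation. This assumes each daily partition is a separate index named with a date suffix. The prefix `stock_snapshot` and the naming pattern are hypothetical, and the sketch only decides which names have expired; it does not delete anything.

```python
from datetime import date, timedelta


def daily_index_name(day, prefix="stock_snapshot"):
    """Name of the partition holding the snapshot for one calendar day."""
    return f"{prefix}-{day:%Y.%m.%d}"


def expired_indices(existing, today, retention_days=20, prefix="stock_snapshot"):
    """Return the existing daily indices that fall outside the retention window."""
    keep = {
        daily_index_name(today - timedelta(days=n), prefix)
        for n in range(retention_days)
    }
    return sorted(
        name
        for name in existing
        if name.startswith(prefix + "-") and name not in keep
    )


today = date(2024, 3, 21)
# 22 daily indices: today plus the 21 days before it.
existing = [daily_index_name(today - timedelta(days=n)) for n in range(22)]
to_delete = expired_indices(existing, today)
```

With a 20‑day window ending today, the two oldest indices in this example are the ones selected for deletion. Dropping a whole daily index is much cheaper than deleting rows from a shared table, which is one reason a partition‑per‑day layout suits snapshot data.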
