Databases 10 min read

Analysis of Order Query Failures Caused by MySQL Master‑Slave Replication Lag and Mitigation Strategies

The e‑commerce platform’s order‑query alerts were caused by MySQL master‑slave replication lag—specifically, delayed redo‑log flushing let the slave receive binlog events before the master committed—so queries returned empty, a condition mitigated by retry logic, delayed MQ delivery, or emitting post‑commit business events.

DeWu Technology
DeWu Technology
DeWu Technology
Analysis of Order Query Failures Caused by MySQL Master‑Slave Replication Lag and Mitigation Strategies

The e‑commerce platform experienced a surge in order volume, leading to a split‑database architecture where buyer and seller orders are stored in separate databases. The seller database is populated by listening to the buyer database's binlog.

During normal operation, queries to the seller order main table succeed, but on the early morning of the 19th a large number of alerts indicated that the main table could not be found. The system retries via MQ, assuming the missing data is transient.

Analysis revealed three main possibilities:

Business impact : The retry mechanism eventually persisted the data, so there was no functional impact, only noisy alerts.

Master‑slave delay : The first query often returned empty because the master had not yet committed the transaction while the slave had already applied the binlog, suggesting replication lag.

The project uses a MySQL proxy (Rainbow Bridge) with read‑write separation. By default, reads are routed to the master; annotations can force reads from the slave. MyBatis interceptors pass routing hints to the proxy.

Data consistency during order creation : The possibility of saving child rows before the parent was dismissed because the entire operation runs within a single transaction.

Internal DB issues : Investigation focused on the redo‑log two‑phase commit. When IO pressure is high, flushing the redo log to disk can be delayed, causing the binlog to be sent to the slave before the transaction is committed on the master. Consequently, the subscriber (treated as a slave) receives the binlog and forwards a message while the master has not yet committed, leading to empty‑result queries.

Monitoring showed elevated IO on the slave during the incident, supporting this hypothesis.

The database operates in semi‑synchronous replication with two modes:

AFTER_COMMIT : The master commits first, then waits for the slave to acknowledge receipt of the relay log. If the master crashes before the slave writes the log, phantom reads or data loss may occur.

AFTER_SYNC : The master waits for the slave’s ACK before committing. This guarantees that once the master commits, the transaction is already persisted on the slave, avoiding phantom reads.

The system uses AFTER_SYNC, meaning the master waits for the slave’s ACK before committing. Under high IO, the slave’s relay‑log write is slow, so the subscriber receives the binlog before the master commits, explaining the missing‑order alerts.

Solution proposals include:

1. Retry mechanism : Leverage the existing MQ retry to re‑query after the transaction commits.

2. Delayed messages : Introduce a configurable delay (via MQ delayed messages) before delivering binlog events to the business service.

3. Do not rely on binlog : Emit business‑level events (e.g., order‑created messages) after the transaction commits, ensuring the data is available when downstream services query.

These approaches mitigate the alert noise without requiring changes to the core business logic.

DatabaseMySQLbinlogMaster‑SlaveReplicationRetry
DeWu Technology
Written by

DeWu Technology

A platform for sharing and discussing tech knowledge, guiding you toward the cloud of technology.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.