Efficient Full‑Table Updates in MySQL: Handling Row‑Based Binlog, Deep Pagination, and Batch Processing
The article explains why executing a full‑table UPDATE on large MySQL tables using row‑based binlog can overload replication, discusses the inefficiencies of deep pagination and IN clauses, and presents a batch‑processing strategy that leverages SQL_NO_CACHE and FORCE INDEX to safely and efficiently migrate data.
When a business needs to migrate data—such as converting stored URLs from http to https—a full‑table UPDATE works for small tables but becomes problematic for tables with hundreds of millions of rows. In a master‑slave MySQL setup that uses row‑based binlog, each row change generates a large amount of binlog data, which then has to be streamed to the replica, causing high I/O, replication lag, and a risk of failure.
Binlog Formats
| Format | Description |
| --- | --- |
| statement | Logs the original SQL statement; low log volume, but can break replication with nondeterministic functions (e.g., RAND()). |
| row | Logs each row's before/after image; high log volume for bulk DELETE/UPDATE. |
| mixed | Combines both; usually statement format for simple DML, switching to row format for complex or unsafe statements. |
Because the production environment uses the row format, a naïve full‑table UPDATE would flood the master’s binlog and block the replica.
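Before picking a migration strategy, it is worth confirming which format the server actually uses. A minimal sketch (the `run_query` callable is a placeholder for a thin wrapper around a real cursor's execute/fetchall; the stub below only illustrates the shape of the result):

```python
def get_binlog_format(run_query):
    """Return the server's binlog format ('ROW', 'STATEMENT', or 'MIXED').

    `run_query` is any callable that executes SQL and returns rows as
    (Variable_name, Value) tuples, as SHOW VARIABLES does.
    """
    rows = run_query("SHOW VARIABLES LIKE 'binlog_format'")
    return rows[0][1].upper()

# Stubbed example; a real run_query would hit the database.
print(get_binlog_format(lambda sql: [("binlog_format", "ROW")]))  # ROW
```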
Direct UPDATE Problems
An initial attempt used a simple statement:
update tb_user_info set user_img=replace(user_img,'http://','https://');

This generated massive binlog traffic and was deemed unacceptable.
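Whichever variant is used, the replacement string deserves care: replacing the bare substring http is not idempotent, because re-running it against an already-converted row corrupts the URL. A quick check in Python (the sample URL is hypothetical):

```python
url = "http://img.example.com/a.png"  # hypothetical sample value

# Replacing the full prefix converts once and is safe to re-run:
once = url.replace("http://", "https://")
again = once.replace("http://", "https://")
assert once == again == "https://img.example.com/a.png"

# Replacing the bare substring corrupts already-converted rows:
broken = once.replace("http", "https")
print(broken)  # httpss://img.example.com/a.png
```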
Deep Pagination Issue
Trying to batch the work with LIMIT offsets runs into "deep pagination": to serve LIMIT 1000000, 1000, MySQL must walk the index and read and discard the first million rows before returning the page, so the cost grows with the offset and later batches approach a full‑table scan.
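The difference shows up directly in the SQL each batch issues. A sketch of the two query shapes (table and column names are taken from the article; this only builds the strings):

```python
def offset_page(batch_no, size=1000):
    # Deep pagination: MySQL must read and discard batch_no * size rows first.
    return f"SELECT id FROM tb_user_info ORDER BY id LIMIT {batch_no * size}, {size}"

def keyset_page(last_id, size=1000):
    # Keyset pagination: seeks straight past last_id in the primary-key index.
    return f"SELECT id FROM tb_user_info WHERE id > {last_id} ORDER BY id LIMIT {size}"

print(offset_page(500))     # LIMIT 500000, 1000 -> scans half a million rows first
print(keyset_page(500000))  # WHERE id > 500000 -> one index seek
```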
IN‑Clause Inefficiency
Another idea was to collect a batch of IDs and update with WHERE id IN (...). Although MySQL can use index lookups for the IN list, the IDs still have to be gathered first and each batch pays one lookup per listed ID, so the approach remains slow for large batches.
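For comparison, the IN-list approach ends up generating statements like the following; even chunked, each batch still needs the IDs collected up front (a sketch only, with the table name and chunk size taken from the article's context):

```python
def chunked_in_updates(ids, chunk=1000):
    """Yield UPDATE statements covering `ids` in chunks of `chunk` IDs."""
    for i in range(0, len(ids), chunk):
        id_list = ", ".join(str(x) for x in ids[i:i + chunk])
        yield ("UPDATE tb_user_info SET user_img = "
               f"REPLACE(user_img, 'http://', 'https://') WHERE id IN ({id_list})")

stmts = list(chunked_in_updates(list(range(1, 2501))))
print(len(stmts))  # 3 statements: 1000 + 1000 + 500 IDs
```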
Final Batch‑Processing Solution
After discussions with DBAs, the following strategy was adopted:
Fetch the upper bound of the next window of primary‑key IDs with SELECT /*!40001 SQL_NO_CACHE */ id FROM tb_user_info FORCE INDEX(`PRIMARY`) WHERE id > {start} ORDER BY id LIMIT 1000,1;
Update the window with UPDATE tb_user_info SET user_img=replace(user_img,'http://','https://') WHERE id > {start} AND id <= {end}; then advance {start} to {end} for the next batch (with a strict < on both sides, the boundary row would never be updated).
The SQL_NO_CACHE hint tells MySQL not to store these one‑off results in the query cache, so the migration scan does not evict cached results that are still useful. FORCE INDEX(`PRIMARY`) guarantees the primary‑key index is used, and ordering by id keeps the scan sequential along the clustered index.
select /*!40001 SQL_NO_CACHE */ id from tb_user_info FORCE INDEX(`PRIMARY`) where id > {start} order by id limit 1000,1;

update tb_user_info set user_img=replace(user_img,'http://','https://') where id > {start} and id <= {end};

This method allows the update rate to be throttled via an API while monitoring replication lag, IOPS, and memory usage. The process can also be parallelized with a thread pool to increase throughput while still controlling the overall load.
Other Considerations
If primary keys come from an auto‑increment column or a Snowflake‑style generator, rows are naturally ordered by id, so the range scans above work directly. With random UUID primary keys there is no such ordering, and a preprocessing step is required first, such as rewriting the IDs or staging the work in a separate pre‑migration pass.
Architect's Guide