How to Safely Update Billion‑Row MySQL Tables Without Overloading Binlog
This article explains why a single full-table UPDATE on a massive MySQL table can cripple master-slave replication, analyzes the inefficiency of deep pagination, and presents a step-by-step batch-update strategy using SQL_NO_CACHE and FORCE INDEX to keep binlog volume and cache impact under control.
Preface
When a business iteration requires updating a whole MySQL table, small tables (tens of thousands of rows) can be updated directly, but once the data reaches a large scale (hundreds of millions or billions of rows) the binlog generated by row‑format replication can overwhelm the master‑slave synchronization, making a naïve UPDATE infeasible.
Our production MySQL uses row-format binlog, which records every changed row rather than the statement that changed it. A full-table UPDATE on a table with billions of rows would therefore generate an enormous volume of binlog events, and the replica would have to apply every one of them, risking severe replication lag and performance degradation.
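As a back-of-the-envelope illustration of that volume, the sketch below assumes the default binlog_row_image=FULL setting, under which each updated row is logged with both a before image and an after image. The function name and the 200-byte average row size are hypothetical, and per-event overhead is ignored, so treat the result as a rough lower bound rather than a measurement.

```python
def estimate_binlog_bytes(rows, avg_row_bytes, images_per_row=2):
    """Rough lower bound on binlog volume for a row-format UPDATE.

    images_per_row=2 models one before image plus one after image per row.
    """
    return rows * avg_row_bytes * images_per_row


# Hypothetical figures: one billion rows averaging 200 bytes each.
total = estimate_binlog_bytes(1_000_000_000, 200)
print(f"{total / 1e9:.0f} GB of row images")
```

Even under these modest assumptions the replica has to receive and apply hundreds of gigabytes of row images, which is why the update needs to be split into batches and paced.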
Direct UPDATE Problems
We needed to replace the "http://" prefix with "https://" in a user image column for a table containing tens of millions of rows. The initial naive statement was:
update tb_user_info set user_img = replace(user_img, 'http://', 'https://');
Storing full URL paths in the database is discouraged because any protocol or domain change then requires a massive batch update and unnecessarily inflates storage.
Deep Pagination Issue
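Batching the update with LIMIT and an ever-increasing offset runs into the classic deep-pagination problem: to honor the offset, MySQL must walk the B-tree leaf nodes from the start and discard every row before it, so the later batches each approach a full-table scan. Note also that MySQL's UPDATE accepts only LIMIT row_count, not LIMIT offset, row_count, so the statement below illustrates the idea rather than something the server will run; in practice the offset would sit in a SELECT that locates each batch.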
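To put rough numbers on that cost, here is a simplified model (an assumption for illustration, not a measurement of MySQL itself) that counts how many rows the server walks when every batch is located by offset, compared with locating each batch by a primary-key range plus one boundary lookup. Both function names are hypothetical.

```python
def rows_scanned_with_offset(total_rows, batch_size):
    """Each batch walks `offset` discarded rows plus the batch itself."""
    scanned = 0
    for offset in range(0, total_rows, batch_size):
        scanned += min(offset + batch_size, total_rows)
    return scanned


def rows_scanned_with_key_range(total_rows, batch_size):
    """Each batch costs one boundary lookup plus the rows it updates."""
    scanned = 0
    remaining = total_rows
    while remaining > 0:
        # The boundary lookup walks at most batch_size + 1 index entries.
        probe = min(batch_size + 1, remaining)
        batch = min(batch_size, remaining)
        scanned += probe + batch
        remaining -= batch
    return scanned


print(rows_scanned_with_offset(10_000, 1_000))
print(rows_scanned_with_key_range(10_000, 1_000))
```

In this model the offset approach grows quadratically with the number of batches, while the key-range approach stays linear in the table size.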
update tb_user_info set user_img = replace(user_img, 'http://', 'https://') limit 1,1000;
Efficiency of IN Clause
Fetching a batch of IDs and then updating them with an IN clause also performed poorly, even though MySQL applies some optimizations to IN lists.
select id from tb_user_info where id > {index} order by id limit 100;
update tb_user_info set user_img = replace(user_img, 'http://', 'https://') where id in ({id1}, {id2}, {id3});
Final Solution
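After several discussions with the DBA, we adopted a two-step approach. First, find the boundary of each batch by selecting IDs in primary-key order, using FORCE INDEX(`PRIMARY`) and the /*!40001 SQL_NO_CACHE */ hint. Then update the rows in that ID range.
select /*!40001 SQL_NO_CACHE */ id from tb_user_info FORCE INDEX(`PRIMARY`) where id >= {start_id} order by id limit 1000,1;
update tb_user_info set user_img = replace(user_img, 'http://', 'https://') where id >= {start_id} and id < {end_id};
The SELECT skips 1000 rows and returns the ID of the next one, which becomes {end_id} for the current batch and {start_id} for the following batch, so each UPDATE touches exactly 1000 rows and no boundary row is skipped. When the SELECT returns nothing, the final UPDATE runs with only the id >= {start_id} condition. The SQL_NO_CACHE hint keeps the result of this one-off scan out of the query cache; it does not control the InnoDB buffer pool, and it has no effect on MySQL 8.0, where the query cache was removed. FORCE INDEX(`PRIMARY`) guarantees the primary-key index is used, and ordering by id is what makes range-based updates possible.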
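The range walk can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the production job: it uses Python's built-in sqlite3 module as a stand-in for MySQL so that it runs anywhere, it omits the MySQL-specific SQL_NO_CACHE and FORCE INDEX hints, and the names update_in_ranges and make_demo_db are hypothetical. It uses half-open ranges (id >= start, id < end) so that no boundary row is missed.

```python
import sqlite3

UPDATE_SQL = (
    "UPDATE tb_user_info "
    "SET user_img = replace(user_img, 'http://', 'https://') "
)


def update_in_ranges(conn, batch_size=1000):
    """Update the table one primary-key range at a time; return the batch count."""
    start = conn.execute("SELECT MIN(id) FROM tb_user_info").fetchone()[0]
    batches = 0
    while start is not None:
        # Skip batch_size rows from `start`; the next id begins the following batch.
        row = conn.execute(
            "SELECT id FROM tb_user_info WHERE id >= ? ORDER BY id LIMIT 1 OFFSET ?",
            (start, batch_size),
        ).fetchone()
        end = row[0] if row else None
        if end is None:
            # Last batch: no upper boundary remains.
            conn.execute(UPDATE_SQL + "WHERE id >= ?", (start,))
        else:
            conn.execute(UPDATE_SQL + "WHERE id >= ? AND id < ?", (start, end))
        conn.commit()  # one short transaction per batch
        batches += 1
        start = end
    return batches


def make_demo_db(rows=2500):
    """Build an in-memory table whose id sequence has gaps."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tb_user_info (id INTEGER PRIMARY KEY, user_img TEXT)")
    conn.executemany(
        "INSERT INTO tb_user_info (id, user_img) VALUES (?, ?)",
        [(i * 3 + 1, f"http://img.example.com/{i}.png") for i in range(rows)],
    )
    conn.commit()
    return conn
```

Calling update_in_ranges(make_demo_db()) walks the demo table in batches of 1000 rows. Because each batch commits separately, the binlog on a real server would receive many small transactions instead of one enormous one, and the job can be paused or resumed at any boundary.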
Exposing the update rate through an API lets us watch replication lag, IOPS, and memory usage while the job runs and adjust the throttling accordingly. The work can also be parallelized with a thread pool to raise throughput, as long as the combined rate stays within those limits.
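One way such throttling might look is sketched below. Everything here is an assumption for illustration: the function names, the thresholds, and the linear back-off rule are hypothetical, and read_lag stands for whatever callable reports replica lag in seconds (on MySQL it could be derived from the Seconds_Behind_Master field of SHOW SLAVE STATUS).

```python
import time


def next_pause(lag_seconds, base_pause=0.1, max_lag=5.0, max_pause=30.0):
    """Return how long to sleep before the next batch, given replica lag."""
    if lag_seconds is None:
        # Lag unknown (for example, replication is stopped): back off fully.
        return max_pause
    if lag_seconds <= max_lag:
        return base_pause
    # Add one second of pause per second of lag beyond the threshold.
    return min(max_pause, base_pause + (lag_seconds - max_lag))


def run_throttled(batches, apply_batch, read_lag, sleep=time.sleep):
    """Apply each batch, then pause according to the current replica lag."""
    applied = 0
    for batch in batches:
        apply_batch(batch)
        applied += 1
        sleep(next_pause(read_lag()))
    return applied
```

Passing sleep and read_lag as parameters keeps the pacing logic testable without a database, and the same next_pause function can be shared by every worker if the job is later spread across a thread pool.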
Other Considerations
If primary keys are generated by Snowflake or auto-increment, rows are inserted in key order, which makes range selection straightforward. With UUID keys, batch ranges are far less predictable, so the data has to be pre-processed before insertion instead.
Conclusion
Large‑scale data updates are tedious, but collaborating with DBAs revealed several key MySQL insights:
Row-format binlog can put massive pressure on replication during bulk updates.
Deep pagination with large offsets degrades severely as the offset grows.
SQL_NO_CACHE keeps one-off scan results out of the query cache; it does not protect the InnoDB buffer pool.
These techniques help safely refresh massive tables without jeopardizing production stability.
Java Backend Technology
Focus on Java-related technologies: SSM, Spring ecosystem, microservices, MySQL, MyCat, clustering, distributed systems, middleware, Linux, networking, multithreading. Occasionally cover DevOps tools like Jenkins, Nexus, Docker, and ELK. Also share technical insights from time to time, committed to Java full-stack development!
