How to Efficiently Scan Billions of Rows in MySQL: Pagination Strategies
This article examines the challenges of querying massive follower lists stored in MySQL, compares limit‑based pagination with tag‑record and range‑limit methods, and offers practical indexing and query‑optimization recommendations for handling tables containing hundreds of millions of rows.
1. System Overview and Problem Description
How can we traverse and query over a hundred million rows in MySQL? The system maintains follow relationships between JD users and business objects, providing queries such as a user’s followed items or whether a user follows a specific item. A new requirement asks for a fan‑list interface for a target object, but some objects have fan counts in the tens of millions or even over a hundred million.
All fan data is stored in MySQL, sharded by business object ID into 16 shards and 256 tables. Each table can hold more than 200 million rows, and all fans of the same object are routed to the same table.
2. Solution Ideas and Methods
Database table structure example:
CREATE TABLE follow_fans_[0-255] (
  id BIGINT(11) NOT NULL AUTO_INCREMENT COMMENT 'auto-increment id',
  biz_content VARCHAR(50) DEFAULT NULL COMMENT 'business object ID',
  source VARCHAR(50) DEFAULT NULL COMMENT 'source',
  pin VARCHAR(50) DEFAULT NULL COMMENT 'user pin',
  ext VARCHAR(5000) DEFAULT NULL COMMENT 'extra info',
  status TINYINT(2) DEFAULT 1 COMMENT 'status, 0=invalid, 1=normal',
  created_time DATETIME DEFAULT NULL COMMENT 'creation time',
  modified_time DATETIME DEFAULT NULL COMMENT 'modification time',
  PRIMARY KEY (id),
  UNIQUE INDEX uniq_biz_content_pin (biz_content, pin)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='follow fans table';

Limit-Based Pagination
Using LIMIT offset, n works for small fan counts but degrades sharply as the offset grows, because MySQL must scan offset + n rows and discard the first offset rows before returning the page. Example:
SELECT id, biz_content, pin FROM follow_fans_1 WHERE biz_content = #{bizContent} ORDER BY id DESC LIMIT 10, 10;

Pros: Simple implementation; supports jump-page queries.
Cons: Performance worsens with deeper pages on large data sets.
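For illustration, here is a hedged sketch of what a deep page looks like with this approach; the offset of 1,000,000 is a hypothetical value, not a figure from the original tests:

-- Hypothetical deep page: MySQL walks the rows matching biz_content,
-- reads 1,000,010 of them, and discards the first 1,000,000 before
-- returning 10 rows.
SELECT id, biz_content, pin
FROM follow_fans_1
WHERE biz_content = #{bizContent}
ORDER BY id DESC
LIMIT 1000000, 10;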
Tag‑Record Method
Record the last retrieved primary-key id (lastId) and use it as a cursor for the next page, eliminating large offsets:
SELECT id, biz_content, pin FROM follow_fans_1 WHERE biz_content = #{bizContent} AND id < #{lastId} ORDER BY id DESC LIMIT 10;

Pros: Avoids the deep-offset slowdown; tests show tens of milliseconds per page even with tens of millions of rows.
Cons: Supports only sequential forward pagination, not random page jumps; on the final page, when fewer rows than the page size remain, the query may still scan through the rest of the table looking for matches.
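A minimal sketch of how the cursor is seeded and advanced with this method, assuming a page size of 10; lastId is the smallest id returned by the previous page:

-- First page: no cursor yet.
SELECT id, biz_content, pin
FROM follow_fans_1
WHERE biz_content = #{bizContent}
ORDER BY id DESC
LIMIT 10;

-- Subsequent pages: pass the smallest id from the previous page as lastId.
SELECT id, biz_content, pin
FROM follow_fans_1
WHERE biz_content = #{bizContent} AND id < #{lastId}
ORDER BY id DESC
LIMIT 10;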
Range‑Limit Method
Combine the tag‑record approach with a lower bound (minId) to define a closed interval, allowing the engine to stop scanning at the end of the range:
SELECT id, biz_content, pin FROM follow_fans_1 WHERE biz_content = #{bizContent} AND id < #{lastId} AND id >= #{minId} ORDER BY id DESC LIMIT 10;

First obtain minId:
SELECT MIN(id) FROM follow_fans_1 WHERE biz_content = #{bizContent};

Because the initial MIN(id) query can time out on tables with billions of rows, the solution caches minId after an asynchronous computation. If the cache misses, an async task fetches minId and stores it with an expiration. For very large tables, an offline big-data job periodically computes minId for each biz_content and pushes the result to the cache.
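As a sketch of what such a precomputation might run (the per-object grouping and the idea of iterating over every shard table are assumptions, not details from the original solution, and a real offline job may run outside MySQL entirely):

-- Hypothetical offline precomputation for one shard table; the same job
-- would be repeated for follow_fans_0 through follow_fans_255 and the
-- results pushed to the cache keyed by biz_content.
SELECT biz_content, MIN(id) AS min_id
FROM follow_fans_1
GROUP BY biz_content;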
Pros: Keeps per‑page latency in the tens of milliseconds even for hundred‑million‑row tables.
Cons: Requires sequential pagination, depends on the cache and offline jobs, and still risks a timeout when minId must be computed on the fly.
3. Reflections on SQL Optimization Governance
Each method has trade‑offs; the best choice depends on actual data volume and business constraints. General SQL optimization recommendations include:
Ensure Indexes on Query Conditions
InnoDB uses a clustered index (the primary key) plus secondary indexes. The primary key automatically becomes the clustered index; if no primary key exists, the first unique index on non-null columns is used, and failing that InnoDB generates a hidden row ID as the clustered index.
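One way to check that the fan-list query actually hits an index is to inspect the EXPLAIN output; the key column shows the chosen index and rows gives the estimated scan size. This is a general verification sketch, not a step from the original article:

-- Verify the chosen index and estimated rows for the cursor-based query.
EXPLAIN
SELECT id, biz_content, pin
FROM follow_fans_1
WHERE biz_content = #{bizContent} AND id < #{lastId}
ORDER BY id DESC
LIMIT 10;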
Use Indexes Correctly to Avoid Index Loss
Follow the left-most prefix rule; index matching stops at the first range condition (see the sketches after this list).
Equality (=) and IN predicates can appear in any order within a composite index.
Prefer high‑cardinality columns for indexing.
Do not wrap indexed columns in functions or type casts.
Extend existing indexes rather than creating many new ones.
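A few hedged sketches of these rules against the uniq_biz_content_pin (biz_content, pin) index from the schema above; the literal values and the created_time examples are illustrative assumptions only:

-- Uses the composite index: equality on the leftmost column biz_content.
SELECT id FROM follow_fans_1 WHERE biz_content = '1001' AND pin = 'user_1';

-- Cannot use the composite index: the leftmost column biz_content is missing.
SELECT id FROM follow_fans_1 WHERE pin = 'user_1';

-- Wrapping an indexed column in a function prevents index use
-- (assumes a hypothetical index on created_time for illustration).
SELECT id FROM follow_fans_1 WHERE DATE(created_time) = '2023-01-01';

-- Rewriting as a constant range keeps the column bare so the index can be used.
SELECT id FROM follow_fans_1
WHERE created_time >= '2023-01-01 00:00:00'
  AND created_time < '2023-01-02 00:00:00';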
Reduce Selected Columns to Avoid Back‑Table Lookups
Covering indexes (index-only scans) eliminate the need to go back to the clustered index and fetch the full row after locating its primary key in a secondary index. When EXPLAIN shows Using index in the Extra column, the query is covered; Using index condition instead means index condition pushdown is in effect and matching rows are still fetched from the table (a back-table lookup).
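As an illustration (not taken from the original article), a query that selects only columns present in uniq_biz_content_pin stays index-only, because InnoDB stores the primary key id in every secondary index entry:

-- Covered query: biz_content and pin come from the secondary index and id
-- is appended to each index entry, so EXPLAIN should report "Using index"
-- with no back-table lookup.
EXPLAIN
SELECT id, biz_content, pin
FROM follow_fans_1
WHERE biz_content = #{bizContent};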
JD Cloud Developers
JD Cloud Developers is JD Technology Group's platform for technical sharing and communication among AI, cloud computing, IoT, and related developers. It publishes technical information on JD products, industry content, and tech event news. Embracing technology and partnering with developers to envision the future.
