Databases 14 min read

How to Handle Billion‑Row MySQL Tables: Performance Tests, Indexing, and Sharding Strategies

This article explores MySQL performance when storing and querying billions of financial transaction records, detailing table creation, massive data generation, the impact of indexes on query and insert speed, pagination limits, and a practical time‑based sharding approach for bank‑level cash‑flow logs.

ITPUB
ITPUB
ITPUB
How to Handle Billion‑Row MySQL Tables: Performance Tests, Indexing, and Sharding Strategies

The author investigates MySQL performance and handling of massive (up to 100 million rows) cash‑flow data, a typical scenario for bank transaction logs, and shares practical experiments, pitfalls, and architectural recommendations.

Table definition and data generation

A cash‑flow table yun_cashflow is created with eight columns (id, userid, type, operatoruserid, withdrawdepositid, money, runid, createtime) and an InnoDB engine.

DROP TABLE IF EXISTS `yun_cashflow`;
CREATE TABLE `yun_cashflow` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `userid` int(11) DEFAULT NULL,
  `type` int(11) DEFAULT NULL COMMENT '1、入账,2提现',
  `operatoruserid` int(11) DEFAULT NULL COMMENT '操作员ID',
  `withdrawdepositid` bigint(20) DEFAULT NULL COMMENT '提现ID',
  `money` double DEFAULT NULL COMMENT '钱数',
  `runid` bigint(20) DEFAULT NULL COMMENT '工单ID',
  `createtime` timestamp NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=63 DEFAULT CHARSET=utf8;

To populate the table, a stored procedure test_insert inserts 10 000 rows per call using a loop and random values. The initial approach proved very slow (≈40‑60 rows/s), prompting a bulk‑insert technique that concatenates many VALUES rows in a single statement, dramatically increasing insert speed.

Data size and baseline query performance

Inserting 100 million rows results in a data file of about 7.78 GB. Without any indexes, simple SELECT queries (time range, user filter, money filter, or their combinations) take roughly 1‑2 minutes, confirming that full‑table scans on such a dataset are impractical.

Adding indexes

Indexes are added on userid, money, and createtime:

ALTER TABLE yun_cashflow ADD INDEX index_userid (userid);
ALTER TABLE yun_cashflow ADD INDEX index_money (money);
ALTER TABLE yun_cashflow ADD INDEX index_createtime (createtime);

Creating these indexes on the 7.8 GB table takes about 1 400 seconds (≈23 minutes). After indexing, queries that return a small result set (e.g., a few dozen rows) execute in sub‑second time. However, when the result set grows beyond a few million rows, query time rises proportionally (e.g., 1 000 000 rows ≈ 1 second, 10 000 000 rows ≈ several seconds).

Pagination impact

Using LIMIT offset, count shows that performance degrades as the offset increases: low offsets (<1 000) are fast, but offsets in the millions become noticeably slower, making deep pagination unsuitable for very large tables.

Insert speed with indexes

With indexes in place, insert speed drops to about 20 rows per second, confirming that maintaining indexes during heavy writes can be a bottleneck.

Sharding strategy

To mitigate both read and write bottlenecks, the author proposes time‑based sharding: split the massive table into multiple tables (e.g., one million rows per table) and maintain a routing table CreateTimeIndexTable that maps time ranges to the appropriate shard.

CreateTimeIndexTable
ID  TableName        CreateTimeStart          CreateTimeEnd
1   yun_cashflow_1   2018-10-22 09:06:58      2018-10-26 09:06:58
2   yun_cashflow_2   2018-10-26 09:06:58      2018-10-29 09:06:58
3   yun_cashflow_3   2018-11-12 09:06:58      2018-11-22 09:06:58
4   yun_cashflow_4   2018-11-22 09:06:58      2018-11-26 09:06:58

Query routing works in two steps: first query the index table to find the relevant shard(s), then query the actual shard(s). If multiple shards match, results are merged in the application layer.

Multi‑threaded query considerations

Running many concurrent connections (e.g., 100 threads) on a single MySQL instance quickly exhausts connection limits, leading to timeouts. The author suggests using asynchronous coordination mechanisms (e.g., Java CountDownLatch or Akka) to aggregate results from parallel shard queries.

Practical recommendations for bank transaction logs

Separate write‑heavy operation tables from read‑heavy historical tables; keep indexes only on the historical table.

Periodically move older data from the operation table to the appropriate time‑sharded historical tables (e.g., nightly batch).

Index the most selective fields (userid, createtime, money) to accelerate common queries.

Limit query time ranges (e.g., three months to one year) to keep result sets manageable.

Use pagination with reasonable offsets; avoid deep offsets on a single massive table.

Consider a sharding scheme of ~1 million rows per shard to keep index rebuild times short.

Overall, the experiments demonstrate that raw MySQL can handle hundred‑million‑row tables, but careful indexing, sharding, and query design are essential to achieve acceptable performance for banking transaction workloads.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

performanceindexingmysqlpaginationLarge Data
ITPUB
Written by

ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.