How to Handle Large MySQL Tables: Partitioning, Sharding, and Archiving Strategies
This article explains why large MySQL tables cause slow queries, analyzes table size, disk usage, and instance capacity, and presents three practical solutions—table partitioning, database sharding (horizontal/vertical), and hot‑cold data archiving—helping developers choose the appropriate strategy for their workloads.
Scenario
As the data volume in a business table grows, you may run into the following problems:
Insert and query latency becomes high.
Adding new columns for future business requirements has a large impact.
Only a subset of rows (e.g., within a recent time window) is actually needed.
Evaluating Table Data Volume
We can assess data volume from three aspects: table capacity, disk space, and instance capacity.
Table Capacity
Table capacity is evaluated by record count, average row length, growth rate, read/write throughput, and total size. For OLTP tables, it is recommended not to exceed 20 million rows or 15 GB total size, with read/write operations under 1,600 ops/s.
Typical SQL to count rows:
select count(*) from table_name;
select count(1) from table_name;
When the table is huge, these statements trigger a full scan and may time out, so alternative methods are needed.
use database_name;
show table status like 'table_name'\G
The \G terminator (which replaces the semicolon) formats the output vertically, showing details such as the storage engine, version, an estimated row count, and average row length.
Disk Space
Query the size of each table in a specific database:
select table_schema as 'Database',
table_name as 'Table',
table_rows as 'Rows',
truncate(data_length/1024/1024,2) as 'Data Size (MB)',
truncate(index_length/1024/1024,2) as 'Index Size (MB)'
from information_schema.tables
where table_schema = 'database_name'
order by data_length desc, index_length desc;
The result shows each table's disk usage. It is recommended to keep disk usage below 70% of capacity; for rapidly growing data, consider archiving older rows to cheaper, slower storage.
Instance Capacity
MySQL uses a one-thread-per-connection model; under high concurrency, context switching and lock contention can keep a single instance from fully utilizing the CPU, limiting throughput. Depending on the workload, a different instance architecture (e.g., read replicas or multiple instances) may be needed.
Root Causes of Slow Queries on Large Tables
When a table reaches tens of millions of rows or more, indexes become less effective: the B-tree grows deeper, each lookup requires more disk I/O, and queries slow down.
How many records can a B‑tree store?
InnoDB stores data in 16 KB pages. Leaf pages store full rows; internal pages store key + child-pointer pairs. Assuming a 1 KB row, a leaf page holds 16 rows. With an 8-byte bigint primary key and a 6-byte pointer, an internal page holds about 16384 / (8 + 6) ≈ 1,170 entries. A height-2 B-tree can therefore store roughly 1,170 × 16 = 18,720 rows, and a height-3 tree about 1,170 × 1,170 × 16 ≈ 21.9 million rows, which covers most tables up to the tens of millions of rows.
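The arithmetic above can be checked in a few lines (a back-of-the-envelope sketch using the same assumptions as the text: 16 KB pages, 1 KB rows, 8-byte keys, 6-byte pointers):

```python
# Estimate how many rows a B-tree of a given height can hold,
# using the assumptions stated in the text.
PAGE_SIZE = 16 * 1024  # InnoDB default page size in bytes

rows_per_leaf = PAGE_SIZE // 1024             # 1 KB rows -> 16 rows per leaf page
pointers_per_internal = PAGE_SIZE // (8 + 6)  # 8-byte key + 6-byte pointer -> ~1170 entries

height2 = pointers_per_internal * rows_per_leaf       # root -> leaves
height3 = pointers_per_internal ** 2 * rows_per_leaf  # root -> internal -> leaves

print(rows_per_leaf, pointers_per_internal)  # 16 1170
print(height2)                               # 18720
print(height3)                               # 21902400 (~21.9 million)
```

The real numbers vary with row width and key size, but the shape of the result holds: adding one level to the tree multiplies capacity by roughly a thousand, at the cost of one extra page read per lookup.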
How to Solve Large‑Table Query Performance Issues
Three common solutions are presented: table partitioning, database sharding (horizontal/vertical), and hot‑cold data archiving.
Solution 1: Table Partitioning
Partitioning splits a table into logical sections based on a condition, reducing the query range and improving index hit rates. The data still belongs to a single logical table but is stored in separate files.
-- Check if the server supports partitioning (MySQL 5.6 and earlier)
show variables like '%partition%';
-- On MySQL 5.7+, the have_partitioning variable was removed; check the plugin list instead
show plugins;
-- Example to create a range partitioned table
CREATE TABLE orders (
id BIGINT,
order_date DATE,
amount DECIMAL(10,2)
) PARTITION BY RANGE (YEAR(order_date)) (
PARTITION p2020 VALUES LESS THAN (2021),
PARTITION p2021 VALUES LESS THAN (2022),
PARTITION p2022 VALUES LESS THAN (2023)
);
Benefits include easier data deletion (dropping a whole partition instead of running a large DELETE), better query pruning, and parallel aggregation (e.g., COUNT, SUM) across partitions.
Limitations:
A table can have at most 8,192 partitions (1,024 before MySQL 5.6.7).
Older MySQL versions require integer partition expressions.
Every unique key, including the primary key, must include all columns used in the partitioning expression.
Foreign keys are not supported on partitioned tables.
Indexes are local: data and indexes are partitioned together, and global indexes are not supported.
Solution 2: Database Sharding (Horizontal & Vertical)
Sharding reduces the size of each table, lowering B‑tree height and I/O. Horizontal sharding splits rows across multiple tables/databases; vertical sharding splits columns.
Horizontal example: a 40 million‑row table is split into four tables of 10 million rows each.
Vertical example: separate frequently accessed columns from rarely used ones, storing them in different tables linked by primary key.
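The vertical split can be sketched as a simple column partition; the column names below (name, status, profile_blob) are hypothetical, chosen only to illustrate hot versus rarely used fields:

```python
# Vertical sharding sketch: frequently read columns stay in the "hot" table,
# wide or rarely used columns move to an extension table, and the two tables
# share the same primary key.
HOT_COLUMNS = {"id", "name", "status"}  # accessed on every request (assumed)

def split_row(row: dict) -> tuple[dict, dict]:
    hot = {k: v for k, v in row.items() if k in HOT_COLUMNS}
    cold = {k: v for k, v in row.items() if k not in HOT_COLUMNS}
    cold["id"] = row["id"]  # shared primary key links the two tables
    return hot, cold

hot, cold = split_row({"id": 1, "name": "alice", "status": "active",
                       "profile_blob": "...", "last_login_ua": "..."})
print(hot)   # {'id': 1, 'name': 'alice', 'status': 'active'}
print(cold)  # {'profile_blob': '...', 'last_login_ua': '...', 'id': 1}
```

Reads that only need the hot columns never touch the wide rows, which keeps the hot table's pages dense and its B-tree shallow.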
Sharding strategies:
Modulo (hash) sharding: target_table = user_id % 4 determines which table stores the row.
Range sharding: rows with IDs in a certain range go to a specific table.
Combined hash‑range: first hash to a database group, then range to a table within that group.
Advantages: balanced data distribution, reduced hotspot risk.
Disadvantages: data migration is hard when scaling out, and cross‑shard joins become impossible or costly.
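The three routing strategies above can be sketched as small functions; the shard counts, range size, and table-name scheme here are illustrative assumptions, not prescriptions:

```python
# Routing sketches for the three sharding strategies described above.
NUM_TABLES = 4           # number of modulo shards (assumed)
RANGE_SIZE = 10_000_000  # rows per range shard (assumed)

def route_modulo(user_id: int) -> str:
    """Modulo (hash) sharding: even spread, but painful to rebalance."""
    return f"orders_{user_id % NUM_TABLES}"

def route_range(user_id: int) -> str:
    """Range sharding: easy to extend, but recent IDs form a hotspot."""
    return f"orders_r{user_id // RANGE_SIZE}"

def route_combined(user_id: int, num_dbs: int = 2) -> tuple[str, str]:
    """Combined: hash to a database group first, then range within the group."""
    db = f"db_{user_id % num_dbs}"
    return db, f"orders_r{user_id // RANGE_SIZE}"

print(route_modulo(123457))    # orders_1
print(route_range(123457))     # orders_r0
print(route_combined(123457))  # ('db_1', 'orders_r0')
```

Note the rebalancing trade-off in code form: changing NUM_TABLES changes the result of route_modulo for almost every existing ID, which is exactly why scaling out a modulo-sharded system forces a data migration.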
Solution 3: Hot‑Cold Data Archiving
When data shows clear hot‑cold patterns (e.g., recent weeks are hot), move older “cold” data to separate archive tables or databases, keeping hot tables small and fast.
Create an archive table with the same schema as the original.
Initialize the archive with historical data.
Continuously move new cold data into the archive.
Provide a mechanism to query both hot and archived data when needed.
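The first three steps can be sketched end to end; this uses sqlite3 only as a convenient stand-in for MySQL, and the table and column names (orders, created_at) are hypothetical. In production the same SQL would run against MySQL, typically in small batches via a scheduled job or a tool such as pt-archiver:

```python
# Hot-cold archiving sketch: copy cold rows to an archive table, then
# delete them from the hot table. sqlite3 stands in for MySQL here.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT)")
# Step 1: archive table with the same schema as the original.
conn.execute("CREATE TABLE orders_archive (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "2020-01-01"), (2, "2020-06-01"), (3, "2024-01-01")])

CUTOFF = "2023-01-01"  # rows older than this are "cold" (assumed policy)
# Steps 2-3: copy cold rows into the archive, then remove them from the hot table.
conn.execute("INSERT INTO orders_archive SELECT * FROM orders WHERE created_at < ?",
             (CUTOFF,))
conn.execute("DELETE FROM orders WHERE created_at < ?", (CUTOFF,))
conn.commit()

print(conn.execute("SELECT count(*) FROM orders").fetchone()[0])          # 1
print(conn.execute("SELECT count(*) FROM orders_archive").fetchone()[0])  # 2
```

Step 4 (querying across hot and archived data) is usually handled in the application: query the hot table first and fall back to the archive only when the requested time range extends past the cutoff.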
Choosing the Right Approach
Use partitioning when you need fine‑grained query pruning and parallel aggregation on a single logical table. Choose sharding when the overall data volume is too large for a single instance or when you need to scale write throughput across multiple servers. Apply hot‑cold archiving when the workload naturally separates recent active data from older, rarely accessed data.
Each technique can be combined; for example, sharding a set of partitioned tables or archiving within each shard.
Additional Considerations
Transaction management becomes complex across shards; distributed transaction frameworks may incur high overhead.
Cross‑shard joins are not supported directly; you may need to denormalize or perform multiple queries and merge results in the application.
Data management overhead increases: locating data, handling duplicate CRUD logic, and merging results for ranking queries.
By understanding the trade‑offs and the characteristics of your workload, you can select the most suitable solution to keep MySQL performant as data grows.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.