Vertical vs Horizontal Sharding in MySQL: Strategy Selection and Middleware Guide
The article explains vertical and horizontal sharding concepts, compares their advantages and drawbacks, evaluates popular sharding middleware such as ShardingSphere, details distributed ID generation with Snowflake, offers cross‑database query solutions, and provides practical guidance on when and how to apply sharding in MySQL.
1. Vertical and Horizontal Sharding
1.1 Vertical Sharding
Concept: Split tables into different databases according to business functions, similar to data partitioning in micro‑service architectures.
Implementation:
Vertical database split: Separate by business module.
Original database: user_db
-- After split:
user_db: user, user_info, user_address
order_db: order, order_item, payment
product_db: product, category, inventoryVertical table split: Split a wide table by columns.
-- Original user table
CREATE TABLE user (
id BIGINT,
username VARCHAR(50),
password VARCHAR(100),
email VARCHAR(100),
phone VARCHAR(20),
profile_text TEXT, -- infrequently accessed
preferences JSON, -- infrequently accessed
created_at TIMESTAMP
);
-- After split
CREATE TABLE user_base ( -- high‑frequency fields
id BIGINT,
username VARCHAR(50),
password VARCHAR(100),
email VARCHAR(100),
phone VARCHAR(20),
created_at TIMESTAMP
);
CREATE TABLE user_ext ( -- low‑frequency fields
user_id BIGINT,
profile_text TEXT,
preferences JSON
);Advantages:
Clear business boundaries and split rules
Facilitates micro‑service implementation
Reduces data volume per database
Enables hot‑cold data separation for performance
Disadvantages:
Cross‑database transactions become complex
Does not solve single‑table size explosion
Requires substantial code changes
1.2 Horizontal Sharding
Concept: Partition rows of the same table across multiple databases or tables using a sharding rule; each shard has identical schema.
Sharding strategies:
Range sharding: Split by ID or time range.
-- user table split into 4 databases by id range
user_0 (id: 1‑1000000)
user_1 (id: 1000001‑2000000)
user_2 (id: 2000001‑3000000)
user_3 (id: 3000001‑4000000)Hash sharding: Distribute rows by hash value.
shard_id = user_id % 4
user_id % 4 = 0 → user_0
user_id % 4 = 1 → user_1
user_id % 4 = 2 → user_2
user_id % 4 = 3 → user_3Consistent hash: Uses virtual nodes to minimise data migration during scaling.
Geographic sharding: Partition by region (e.g., North China → Beijing, East China → Shanghai).
Advantages:
Solves single‑table data‑size bottleneck
Improves query performance
Supports horizontal scaling
Disadvantages:
Sharding rules can be complex
Cross‑shard queries may be inefficient
Data migration and scaling are non‑trivial
2. Sharding Middleware Comparison
2.1 Mainstream Middleware Overview
2.2 ShardingSphere Details
Architecture components:
ShardingSphere-JDBC: lightweight Java framework (jar)
ShardingSphere-Proxy: transparent proxy service, standalone deployment
ShardingSphere-Sidecar: cloud‑native mode (planned)YAML configuration example:
dataSources:
ds_0:
dataSourceClassName: com.zaxxer.hikari.HikariDataSource
url: jdbc:mysql://localhost:3306/db0
username: root
password: root
ds_1:
dataSourceClassName: com.zaxxer.hikari.HikariDataSource
url: jdbc:mysql://localhost:3306/db1
username: root
password: root
rules:
- !SHARDING
tables:
user:
actualDataNodes: ds_${0..1}.user_${0..1}
tableStrategy:
standard:
shardingColumn: user_id
shardingAlgorithmName: user_table_inline
keyGenerateStrategy:
column: user_id
keyGeneratorName: snowflake
shardingAlgorithms:
user_table_inline:
type: INLINE
props:
algorithm-expression: user_${user_id % 2}2.3 Selection Advice
Proxy mode is suitable when:
Cross‑language support is required
Business code should remain unchanged
The team has middleware operation experience
Client (JDBC) mode is suitable when:
Maximum performance is the priority
The stack is Java‑centric
Some code intrusion is acceptable
3. Distributed ID Generation
3.1 Requirements
Global uniqueness – no ID collisions across the system
Monotonically increasing – benefits index performance
Strictly increasing – aids sorting needs
High availability – ID service must be reliable
Security – IDs should not expose business volume
3.2 Snowflake Algorithm Details
64‑bit ID structure:
0 - 0000000000 0000000000 0000000000 0000000000 0 - 00000 - 00000 - 000000000000
↑ ↑ ↑ ↑ ↑ ↑
sign timestamp(41) datacenter(5) machine(5) sequence(12)Java implementation (core logic):
public class SnowflakeIdGenerator {
// custom epoch
private final long twepoch = 1609459200000L; // 2021‑01‑01 00:00:00
// bit allocations
private final long workerIdBits = 5L;
private final long datacenterIdBits = 5L;
private final long sequenceBits = 12L;
// max values
private final long maxWorkerId = -1L ^ (-1L << workerIdBits);
private final long maxDatacenterId = -1L ^ (-1L << datacenterIdBits);
// shifts
private final long workerIdShift = sequenceBits;
private final long datacenterIdShift = sequenceBits + workerIdBits;
private final long timestampLeftShift = sequenceBits + workerIdBits + datacenterIdBits;
private long sequence = 0L;
private long lastTimestamp = -1L;
public synchronized long nextId() {
long timestamp = timeGen();
if (timestamp < lastTimestamp) {
throw new RuntimeException("Clock moved backwards");
}
if (lastTimestamp == timestamp) {
sequence = (sequence + 1) & sequenceMask;
if (sequence == 0) {
timestamp = tilNextMillis(lastTimestamp);
}
} else {
sequence = 0L;
}
lastTimestamp = timestamp;
return ((timestamp - twepoch) << timestampLeftShift)
| (datacenterId << datacenterIdShift)
| (workerId << workerIdShift)
| sequence;
}
// timeGen() and tilNextMillis() omitted for brevity
}4. Cross‑Database Query Solutions
4.1 Challenges
JOIN operations cannot span multiple databases directly.
Sorting and pagination require merging results from many shards.
Aggregations (SUM, AVG, etc.) need global computation.
Distributed transactions make consistency hard.
4.2 Solutions
Global (broadcast) tables: Small dictionary or config tables are duplicated in every shard.
CREATE TABLE global_config (
id INT PRIMARY KEY,
config_key VARCHAR(50),
config_value VARCHAR(255)
) ENGINE=InnoDB;ER relationship binding: Bind related tables to use the same sharding rule.
shardingRule:
bindingTables:
- order, order_item # both share the same sharding strategyMultiple queries + in‑memory processing:
// 1. Query each shard separately
List<Order> shard0Orders = queryShard0(condition);
List<Order> shard1Orders = queryShard1(condition);
// 2. Merge, sort, paginate in memory
List<Order> allOrders = mergeAndSort(shard0Orders, shard1Orders);
List<Order> pageResult = paginate(allOrders, pageNo, pageSize);Redundant fields: Store frequently needed fields (e.g., user_name) directly in the order table to avoid cross‑shard JOINs.
CREATE TABLE order (
order_id BIGINT,
user_id BIGINT,
user_name VARCHAR(50), -- redundant
user_phone VARCHAR(20), -- redundant
amount DECIMAL(10,2),
status INT
);Search‑engine sync: Sync MySQL binlog to Elasticsearch via Canal; query through ES to bypass cross‑shard joins.
MySQL shard → Binlog → Canal → Elasticsearch
Business queries read from ES4.3 Pagination Optimization
Wrong approach: Query 1000 rows per shard, sort in memory, then take the last 10.
SELECT * FROM user ORDER BY create_time DESC LIMIT 10 OFFSET 990;Business compromise: Restrict queries to a time range to reduce data per shard.
SELECT * FROM user WHERE create_time BETWEEN '2023-01-01' AND '2023-12-31' ORDER BY create_time DESC LIMIT 10 OFFSET 990;Two‑stage query: First fetch top N IDs from each shard, merge and sort IDs in memory, then retrieve full rows for the final IDs.
-- Stage 1: fetch IDs
SELECT id FROM user_0 ORDER BY create_time DESC LIMIT 1000;
SELECT id FROM user_1 ORDER BY create_time DESC LIMIT 1000;
-- Merge IDs, pick top 10
-- Stage 2: fetch full rows
SELECT * FROM user WHERE id IN (id1, id2, ..., id10);Cursor pagination: Use the last record of the previous page as the cursor.
SELECT * FROM user WHERE create_time < 'last_page_timestamp' ORDER BY create_time DESC LIMIT 10;5. Practical Recommendations
5.1 When to Apply Sharding?
Table size exceeds 5 million rows (or is expected to within a year).
Complex query latency > 3 seconds.
QPS > 2000 and connection pool frequently saturated.
Single‑database storage > 500 GB.
5.2 Sharding Key Selection Principles
High query frequency – often appears in WHERE clauses.
Even data distribution – avoids hotspot skew.
Business relevance – keeps related data in the same shard.
Stability – once chosen, the key should not be altered.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Senior Xiao Ying
Dedicated to sharing Java backend technical experience and original tutorials, offering career transition advice and resume editing. Recognized as a rising star in CSDN's Java backend community and ranked Top 3 in the 2022 New Star Program for Java backend.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
