Databases 13 min read

Vertical vs Horizontal Sharding in MySQL: Strategy Selection and Middleware Guide

The article explains vertical and horizontal sharding concepts, compares their advantages and drawbacks, evaluates popular sharding middleware such as ShardingSphere, details distributed ID generation with Snowflake, offers cross‑database query solutions, and provides practical guidance on when and how to apply sharding in MySQL.

Senior Xiao Ying
Senior Xiao Ying
Senior Xiao Ying
Vertical vs Horizontal Sharding in MySQL: Strategy Selection and Middleware Guide

1. Vertical and Horizontal Sharding

1.1 Vertical Sharding

Concept: Split tables into different databases according to business functions, similar to data partitioning in micro‑service architectures.

Implementation:

Vertical database split: Separate by business module.

Original database: user_db
-- After split:
user_db: user, user_info, user_address
order_db: order, order_item, payment
product_db: product, category, inventory

Vertical table split: Split a wide table by columns.

-- Original user table
CREATE TABLE user (
  id BIGINT,
  username VARCHAR(50),
  password VARCHAR(100),
  email VARCHAR(100),
  phone VARCHAR(20),
  profile_text TEXT,   -- infrequently accessed
  preferences JSON,    -- infrequently accessed
  created_at TIMESTAMP
);

-- After split
CREATE TABLE user_base ( -- high‑frequency fields
  id BIGINT,
  username VARCHAR(50),
  password VARCHAR(100),
  email VARCHAR(100),
  phone VARCHAR(20),
  created_at TIMESTAMP
);

CREATE TABLE user_ext ( -- low‑frequency fields
  user_id BIGINT,
  profile_text TEXT,
  preferences JSON
);

Advantages:

Clear business boundaries and split rules

Facilitates micro‑service implementation

Reduces data volume per database

Enables hot‑cold data separation for performance

Disadvantages:

Cross‑database transactions become complex

Does not solve single‑table size explosion

Requires substantial code changes

1.2 Horizontal Sharding

Concept: Partition rows of the same table across multiple databases or tables using a sharding rule; each shard has identical schema.

Sharding strategies:

Range sharding: Split by ID or time range.

-- user table split into 4 databases by id range
user_0 (id: 1‑1000000)
user_1 (id: 1000001‑2000000)
user_2 (id: 2000001‑3000000)
user_3 (id: 3000001‑4000000)

Hash sharding: Distribute rows by hash value.

shard_id = user_id % 4
user_id % 4 = 0 → user_0
user_id % 4 = 1 → user_1
user_id % 4 = 2 → user_2
user_id % 4 = 3 → user_3

Consistent hash: Uses virtual nodes to minimise data migration during scaling.

Geographic sharding: Partition by region (e.g., North China → Beijing, East China → Shanghai).

Advantages:

Solves single‑table data‑size bottleneck

Improves query performance

Supports horizontal scaling

Disadvantages:

Sharding rules can be complex

Cross‑shard queries may be inefficient

Data migration and scaling are non‑trivial

2. Sharding Middleware Comparison

2.1 Mainstream Middleware Overview

2.2 ShardingSphere Details

Architecture components:

ShardingSphere-JDBC: lightweight Java framework (jar)
ShardingSphere-Proxy: transparent proxy service, standalone deployment
ShardingSphere-Sidecar: cloud‑native mode (planned)

YAML configuration example:

dataSources:
  ds_0:
    dataSourceClassName: com.zaxxer.hikari.HikariDataSource
    url: jdbc:mysql://localhost:3306/db0
    username: root
    password: root
  ds_1:
    dataSourceClassName: com.zaxxer.hikari.HikariDataSource
    url: jdbc:mysql://localhost:3306/db1
    username: root
    password: root

rules:
  - !SHARDING
    tables:
      user:
        actualDataNodes: ds_${0..1}.user_${0..1}
        tableStrategy:
          standard:
            shardingColumn: user_id
            shardingAlgorithmName: user_table_inline
        keyGenerateStrategy:
          column: user_id
          keyGeneratorName: snowflake
    shardingAlgorithms:
      user_table_inline:
        type: INLINE
        props:
          algorithm-expression: user_${user_id % 2}

2.3 Selection Advice

Proxy mode is suitable when:

Cross‑language support is required

Business code should remain unchanged

The team has middleware operation experience

Client (JDBC) mode is suitable when:

Maximum performance is the priority

The stack is Java‑centric

Some code intrusion is acceptable

3. Distributed ID Generation

3.1 Requirements

Global uniqueness – no ID collisions across the system

Monotonically increasing – benefits index performance

Strictly increasing – aids sorting needs

High availability – ID service must be reliable

Security – IDs should not expose business volume

3.2 Snowflake Algorithm Details

64‑bit ID structure:

0 - 0000000000 0000000000 0000000000 0000000000 0 - 00000 - 00000 - 000000000000
↑   ↑               ↑          ↑        ↑          ↑
sign timestamp(41) datacenter(5) machine(5) sequence(12)

Java implementation (core logic):

public class SnowflakeIdGenerator {
    // custom epoch
    private final long twepoch = 1609459200000L; // 2021‑01‑01 00:00:00
    // bit allocations
    private final long workerIdBits = 5L;
    private final long datacenterIdBits = 5L;
    private final long sequenceBits = 12L;
    // max values
    private final long maxWorkerId = -1L ^ (-1L << workerIdBits);
    private final long maxDatacenterId = -1L ^ (-1L << datacenterIdBits);
    // shifts
    private final long workerIdShift = sequenceBits;
    private final long datacenterIdShift = sequenceBits + workerIdBits;
    private final long timestampLeftShift = sequenceBits + workerIdBits + datacenterIdBits;
    private long sequence = 0L;
    private long lastTimestamp = -1L;
    public synchronized long nextId() {
        long timestamp = timeGen();
        if (timestamp < lastTimestamp) {
            throw new RuntimeException("Clock moved backwards");
        }
        if (lastTimestamp == timestamp) {
            sequence = (sequence + 1) & sequenceMask;
            if (sequence == 0) {
                timestamp = tilNextMillis(lastTimestamp);
            }
        } else {
            sequence = 0L;
        }
        lastTimestamp = timestamp;
        return ((timestamp - twepoch) << timestampLeftShift)
                | (datacenterId << datacenterIdShift)
                | (workerId << workerIdShift)
                | sequence;
    }
    // timeGen() and tilNextMillis() omitted for brevity
}

4. Cross‑Database Query Solutions

4.1 Challenges

JOIN operations cannot span multiple databases directly.

Sorting and pagination require merging results from many shards.

Aggregations (SUM, AVG, etc.) need global computation.

Distributed transactions make consistency hard.

4.2 Solutions

Global (broadcast) tables: Small dictionary or config tables are duplicated in every shard.

CREATE TABLE global_config (
    id INT PRIMARY KEY,
    config_key VARCHAR(50),
    config_value VARCHAR(255)
) ENGINE=InnoDB;

ER relationship binding: Bind related tables to use the same sharding rule.

shardingRule:
  bindingTables:
    - order, order_item  # both share the same sharding strategy

Multiple queries + in‑memory processing:

// 1. Query each shard separately
List<Order> shard0Orders = queryShard0(condition);
List<Order> shard1Orders = queryShard1(condition);
// 2. Merge, sort, paginate in memory
List<Order> allOrders = mergeAndSort(shard0Orders, shard1Orders);
List<Order> pageResult = paginate(allOrders, pageNo, pageSize);

Redundant fields: Store frequently needed fields (e.g., user_name) directly in the order table to avoid cross‑shard JOINs.

CREATE TABLE order (
    order_id BIGINT,
    user_id BIGINT,
    user_name VARCHAR(50),   -- redundant
    user_phone VARCHAR(20), -- redundant
    amount DECIMAL(10,2),
    status INT
);

Search‑engine sync: Sync MySQL binlog to Elasticsearch via Canal; query through ES to bypass cross‑shard joins.

MySQL shard → Binlog → Canal → Elasticsearch
Business queries read from ES

4.3 Pagination Optimization

Wrong approach: Query 1000 rows per shard, sort in memory, then take the last 10.

SELECT * FROM user ORDER BY create_time DESC LIMIT 10 OFFSET 990;

Business compromise: Restrict queries to a time range to reduce data per shard.

SELECT * FROM user WHERE create_time BETWEEN '2023-01-01' AND '2023-12-31' ORDER BY create_time DESC LIMIT 10 OFFSET 990;

Two‑stage query: First fetch top N IDs from each shard, merge and sort IDs in memory, then retrieve full rows for the final IDs.

-- Stage 1: fetch IDs
SELECT id FROM user_0 ORDER BY create_time DESC LIMIT 1000;
SELECT id FROM user_1 ORDER BY create_time DESC LIMIT 1000;
-- Merge IDs, pick top 10
-- Stage 2: fetch full rows
SELECT * FROM user WHERE id IN (id1, id2, ..., id10);

Cursor pagination: Use the last record of the previous page as the cursor.

SELECT * FROM user WHERE create_time < 'last_page_timestamp' ORDER BY create_time DESC LIMIT 10;

5. Practical Recommendations

5.1 When to Apply Sharding?

Table size exceeds 5 million rows (or is expected to within a year).

Complex query latency > 3 seconds.

QPS > 2000 and connection pool frequently saturated.

Single‑database storage > 500 GB.

5.2 Sharding Key Selection Principles

High query frequency – often appears in WHERE clauses.

Even data distribution – avoids hotspot skew.

Business relevance – keeps related data in the same shard.

Stability – once chosen, the key should not be altered.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

ShardingMySQLShardingSphereSnowflake IDVertical ShardingHorizontal ShardingCross-Database Queries
Senior Xiao Ying
Written by

Senior Xiao Ying

Dedicated to sharing Java backend technical experience and original tutorials, offering career transition advice and resume editing. Recognized as a rising star in CSDN's Java backend community and ranked Top 3 in the 2022 New Star Program for Java backend.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.