Databases 21 min read

Why Split Databases? Master Sharding Concepts, Strategies, and Practical SQL Routing

This article explains the fundamental concepts of database sharding—including data nodes, logical and broadcast tables, sharding keys, strategies, algorithms, SQL parsing, routing, rewriting, execution, result merging, distributed primary keys, data masking, distributed transactions, data migration, and shadow databases—providing clear examples and code snippets for real‑world implementation.

Su San Talks Tech
Su San Talks Tech
Su San Talks Tech
Why Split Databases? Master Sharding Concepts, Strategies, and Practical SQL Routing

Data Sharding

Sharding (horizontal partitioning) splits a large table such as t_order into multiple smaller tables across different databases (e.g., DB_1 and DB_2) to improve query performance for billion‑row datasets.

Data Nodes

A data node is the smallest indivisible unit in sharding, represented by a database name and a table name, such as DB_1.t_order_1 or DB_2.t_order_2.

Logical Table

A logical table is the name used in business SQL (e.g., t_order). During execution, the framework rewrites it to the actual physical tables ( t_order_0, t_order_1, …) based on sharding rules. select * from t_order where order_no='A11111' The real SQL after routing might become:

select * from DB_1.t_order_n where order_no='A11111'

Broadcast Table

A broadcast table has identical structure and data on every shard. It is ideal for small, rarely‑updated reference data (e.g., dictionary tables) and reduces JOIN network overhead.

Data is fully consistent across all shards.

Read queries need only one shard.

JOINs are always possible because data is identical.

Single Table

A single table exists only once across all shards, suitable for low‑volume data that does not require partitioning.

Sharding Key

The sharding key determines the target node for each row. For example, using order_no % 2 distributes orders between DB_1 and DB_2.

Sharding Strategy

A sharding strategy combines a sharding algorithm with one or more sharding keys to decide how data is allocated.

Sharding Algorithms

Hash Sharding : Uses the hash of the sharding key.

Range Sharding : Allocates based on value ranges.

Modulo Sharding : Uses key % N to pick a shard.

Binding Table

Binding tables share the same sharding rule, enabling efficient JOIN operations without cross‑shard queries (e.g., t_order and t_order_item bound by order_no).

SELECT * FROM t_order o JOIN t_order_item i ON o.order_no=i.order_no

SQL Parsing

SQL parsing consists of lexical analysis and syntax analysis, producing an abstract syntax tree (AST) that extracts fields, tables, conditions, order‑by, group‑by, limit, etc., and identifies rewrite points.

SELECT order_no FROM t_order WHERE order_status>0 AND user_id=10086

Executor Optimization

The optimizer reorders predicates based on statistics (e.g., using an indexed user_id first) to improve execution efficiency.

SELECT * FROM t_order WHERE user_id=10086 AND order_status>0

SQL Routing

Routing uses the sharding context and strategy to compute the target node(s). Types include:

Sharding routing (direct, standard, Cartesian).

Broadcast routing (full‑library, full‑instance, etc.).

Standard Routing

Recommended for most cases; works when the sharding key uses =. Range operators ( BETWEEN, IN) may produce multiple physical queries.

SELECT * FROM t_order WHERE t_order_id IN (1,2)
SELECT * FROM t_order_0 WHERE t_order_id IN (1,2)
SELECT * FROM t_order_1 WHERE t_order_id IN (1,2)

Direct Routing

Routes SQL directly to a specific database/table, useful when the sharding key is absent or for complex queries.

Cartesian Routing

Occurs when non‑binding tables are joined, leading to cross‑shard Cartesian products and poor performance.

SELECT * FROM t_order_0 t LEFT JOIN t_user_0 u ON u.user_id = t.user_id WHERE t.user_id=1
SELECT * FROM t_order_0 t LEFT JOIN t_user_1 u ON u.user_id = t.user_id WHERE t.user_id=1
...

SQL Rewrite

After routing, the logical table name is replaced with the actual physical table name (e.g., t_ordert_order_n).

SELECT * FROM t_order_n

SQL Execution

The rewritten SQL is sent to the appropriate data source(s) with resource‑aware connection management.

Result Merging

Results from multiple shards are merged, and operations such as sorting, grouping, pagination, and aggregation are applied on the merged set.

Distributed Primary Key

Sharding creates duplicate auto‑increment IDs across shards. To avoid collisions, a global unique ID (distributed ID) generated by a dedicated ID generator should be used.

Data Masking

Sensitive fields (e.g., name, address, phone) can be masked during sharding by applying encryption or randomization algorithms, protecting privacy while keeping data usable.

Distributed Transaction

Ensures atomicity across multiple data sources. Options include XA (strong consistency) and Seata (flexible consistency). Sharding increases transaction complexity, so avoid it unless necessary.

Data Migration

After sharding, data migration moves existing data to the new partitioned schema. It must handle data volume, consistency, and speed, typically using batch migration for historical data and dual‑write for incremental data.

Shadow Database

A shadow database mirrors the production schema and data for testing migrations, performance testing, or configuration changes without affecting live traffic.

Summary

This article introduced 21 core concepts of sharding architecture, laying the groundwork for deeper topics such as read/write separation, data masking, distributed primary keys, distributed transactions, configuration and registration centers, and proxy services.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Data MigrationshardingDistributed TransactionsSQL Routingdatabase partitioningbroadcast tables
Su San Talks Tech
Written by

Su San Talks Tech

Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.