Mastering Sharding: 21 Core Concepts for Scalable Database Design
This article introduces the essential concepts of sharding architecture—including data partitioning, logical and physical tables, sharding keys, routing strategies, SQL rewriting, distributed primary keys, data masking, transactions, migration, and shadow databases—providing a comprehensive guide for building high‑performance, horizontally scaled database systems.
Before diving into hands‑on implementation, it is important to grasp the common concepts of sharding architecture, which helps understand various sharding tools despite differences in their implementations.
Data Sharding
Sharding typically follows a horizontal splitting model. A large table such as
t_orderis divided into multiple smaller tables with identical structures, e.g.,
t_order_0,
t_order_1, …,
t_order_n, each storing a portion of the original data.
Data Nodes
A data node is the smallest indivisible unit in sharding, represented by a combination of data source name and table name, such as
DB_1.t_order_1or
DB_2.t_order_2.
Logical Table
A logical table is the abstract name shared by all horizontally split tables. Developers continue to write SQL against
t_order; before execution, the SQL is parsed and rewritten to target the appropriate physical tables.
<code>select * from t_order where order_no='A11111'</code>After routing, the real SQL might become:
<code>select * from DB_1.t_order_n where order_no='A11111'</code>Real Table
A real table is the physical table that actually exists in a database, e.g.,
DB_1.t_order_n.
Broadcast Table
A broadcast table has identical structure and data across all shards. It is suitable for small, rarely updated reference data such as dictionary tables, reducing JOIN overhead across shards.
Single Table
A single table exists only once across all shards and is used when the data volume is modest and sharding is unnecessary.
Sharding Key
The sharding key determines the target data node. For example, using
order_no % 2distributes orders between
DB_1and
DB_2. The same key is used for both sharding and routing.
Sharding Strategy
A sharding strategy combines a sharding algorithm with one or more sharding keys to decide how data is allocated to nodes.
Sharding Algorithm
Hash Sharding : Uses the hash of the sharding key.
Range Sharding : Allocates data based on value ranges.
Modulo Sharding : Applies
key % Nto select a node.
Binding Table
Binding tables share the same sharding rule, ensuring that related tables (e.g.,
t_orderand
t_order_item) are stored in the same shard, which avoids cross‑database JOINs.
<code>SELECT * FROM t_order o JOIN t_order_item i ON o.order_no=i.order_no</code>SQL Parsing
SQL parsing consists of lexical and syntactic analysis, producing a context that includes fields, tables, conditions, ORDER BY, GROUP BY, LIMIT, etc., and marks parts that may need rewriting.
<code>SELECT order_no FROM t_order WHERE order_status>0 AND user_id=10086</code>Executor Optimization
Based on statistics and indexes, the executor may reorder predicates to improve performance.
<code>SELECT * FROM t_order WHERE user_id=10086 AND order_status>0</code>SQL Routing
Using the parsed context and sharding strategy, the system computes the target data node(s). Routing types include sharding routing, broadcast routing, direct routing, and Cartesian product routing.
SQL Rewriting
After routing, logical table names are replaced with the actual physical table names.
<code>SELECT * FROM t_order_n</code>SQL Execution
The rewritten SQL is sent to the appropriate data source(s) with resource‑aware connection handling.
Result Merging
Results from multiple shards are merged, and operations such as sorting, grouping, pagination, and aggregation are performed on the merged set.
Distributed Primary Key
Because each shard generates its own auto‑increment IDs, collisions can occur. A global unique ID generator (a “snowflake” service) is used to assign IDs that are unique across all shards.
Data Masking
Sensitive fields (e.g., name, address, phone) can be masked during sharding by applying encryption or pseudorandom transformations before persisting the data.
Distributed Transaction
Cross‑shard transactions require either strong consistency protocols such as XA or flexible solutions like Seata to ensure atomicity across multiple data sources.
Data Migration
When migrating to a sharded architecture, both historical (offline) data and ongoing (online) data must be handled, often using batch migration for the former and dual‑write for the latter.
Shadow Database
A shadow database mirrors the production schema and data for testing schema changes, upgrades, or performance testing without affecting live traffic.
Summary
The article covered 21 fundamental concepts of sharding, laying the groundwork for deeper topics such as read/write separation, data masking, distributed primary keys, distributed transactions, configuration centers, service registries, and proxy services.
macrozheng
Dedicated to Java tech sharing and dissecting top open-source projects. Topics include Spring Boot, Spring Cloud, Docker, Kubernetes and more. Author’s GitHub project “mall” has 50K+ stars.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.