Databases 21 min read

Mastering Sharding: 21 Core Concepts for Scalable Database Design

This article introduces the essential concepts of sharding architecture—including data partitioning, logical and physical tables, sharding keys, routing strategies, SQL rewriting, distributed primary keys, data masking, transactions, migration, and shadow databases—providing a comprehensive guide for building high‑performance, horizontally scaled database systems.

macrozheng
macrozheng
macrozheng
Mastering Sharding: 21 Core Concepts for Scalable Database Design

Before diving into hands‑on implementation, it is important to grasp the common concepts of sharding architecture, which helps understand various sharding tools despite differences in their implementations.

Data Sharding

Sharding typically follows a horizontal splitting model. A large table such as

t_order

is divided into multiple smaller tables with identical structures, e.g.,

t_order_0

,

t_order_1

, …,

t_order_n

, each storing a portion of the original data.

Data Nodes

A data node is the smallest indivisible unit in sharding, represented by a combination of data source name and table name, such as

DB_1.t_order_1

or

DB_2.t_order_2

.

Logical Table

A logical table is the abstract name shared by all horizontally split tables. Developers continue to write SQL against

t_order

; before execution, the SQL is parsed and rewritten to target the appropriate physical tables.

<code>select * from t_order where order_no='A11111'</code>

After routing, the real SQL might become:

<code>select * from DB_1.t_order_n where order_no='A11111'</code>

Real Table

A real table is the physical table that actually exists in a database, e.g.,

DB_1.t_order_n

.

Broadcast Table

A broadcast table has identical structure and data across all shards. It is suitable for small, rarely updated reference data such as dictionary tables, reducing JOIN overhead across shards.

Single Table

A single table exists only once across all shards and is used when the data volume is modest and sharding is unnecessary.

Sharding Key

The sharding key determines the target data node. For example, using

order_no % 2

distributes orders between

DB_1

and

DB_2

. The same key is used for both sharding and routing.

Sharding Strategy

A sharding strategy combines a sharding algorithm with one or more sharding keys to decide how data is allocated to nodes.

Sharding Algorithm

Hash Sharding : Uses the hash of the sharding key.

Range Sharding : Allocates data based on value ranges.

Modulo Sharding : Applies

key % N

to select a node.

Binding Table

Binding tables share the same sharding rule, ensuring that related tables (e.g.,

t_order

and

t_order_item

) are stored in the same shard, which avoids cross‑database JOINs.

<code>SELECT * FROM t_order o JOIN t_order_item i ON o.order_no=i.order_no</code>

SQL Parsing

SQL parsing consists of lexical and syntactic analysis, producing a context that includes fields, tables, conditions, ORDER BY, GROUP BY, LIMIT, etc., and marks parts that may need rewriting.

<code>SELECT order_no FROM t_order WHERE order_status>0 AND user_id=10086</code>

Executor Optimization

Based on statistics and indexes, the executor may reorder predicates to improve performance.

<code>SELECT * FROM t_order WHERE user_id=10086 AND order_status>0</code>

SQL Routing

Using the parsed context and sharding strategy, the system computes the target data node(s). Routing types include sharding routing, broadcast routing, direct routing, and Cartesian product routing.

SQL Rewriting

After routing, logical table names are replaced with the actual physical table names.

<code>SELECT * FROM t_order_n</code>

SQL Execution

The rewritten SQL is sent to the appropriate data source(s) with resource‑aware connection handling.

Result Merging

Results from multiple shards are merged, and operations such as sorting, grouping, pagination, and aggregation are performed on the merged set.

Distributed Primary Key

Because each shard generates its own auto‑increment IDs, collisions can occur. A global unique ID generator (a “snowflake” service) is used to assign IDs that are unique across all shards.

Data Masking

Sensitive fields (e.g., name, address, phone) can be masked during sharding by applying encryption or pseudorandom transformations before persisting the data.

Distributed Transaction

Cross‑shard transactions require either strong consistency protocols such as XA or flexible solutions like Seata to ensure atomicity across multiple data sources.

Data Migration

When migrating to a sharded architecture, both historical (offline) data and ongoing (online) data must be handled, often using batch migration for the former and dual‑write for the latter.

Shadow Database

A shadow database mirrors the production schema and data for testing schema changes, upgrades, or performance testing without affecting live traffic.

Summary

The article covered 21 fundamental concepts of sharding, laying the groundwork for deeper topics such as read/write separation, data masking, distributed primary keys, distributed transactions, configuration centers, service registries, and proxy services.

distributed systemsShardingSQL routingdatabase partitioningbroadcast table
macrozheng
Written by

macrozheng

Dedicated to Java tech sharing and dissecting top open-source projects. Topics include Spring Boot, Spring Cloud, Docker, Kubernetes and more. Author’s GitHub project “mall” has 50K+ stars.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.