Databases 21 min read

Mastering Sharding: 21 Core Concepts for Scalable Database Design

This article introduces the essential concepts of sharding architecture—including data partitioning, logical and physical tables, sharding keys, routing strategies, SQL rewriting, distributed primary keys, data masking, transactions, migration, and shadow databases—providing a comprehensive guide for building high‑performance, horizontally scaled database systems.

macrozheng

Dec 28, 2023

Mastering Sharding: 21 Core Concepts for Scalable Database Design

Before diving into hands‑on implementation, it is important to grasp the common concepts of sharding architecture, which helps understand various sharding tools despite differences in their implementations.

Data Sharding

Sharding typically follows a horizontal splitting model. A large table such as t_order is divided into multiple smaller tables with identical structures, e.g., t_order_0, t_order_1, …, t_order_n, each storing a portion of the original data.

Data Nodes

A data node is the smallest indivisible unit in sharding, represented by a combination of data source name and table name, such as DB_1.t_order_1 or DB_2.t_order_2.

Logical Table

A logical table is the abstract name shared by all horizontally split tables. Developers continue to write SQL against t_order; before execution, the SQL is parsed and rewritten to target the appropriate physical tables. select * from t_order where order_no='A11111' After routing, the real SQL might become:

select * from DB_1.t_order_n where order_no='A11111'

Real Table

A real table is the physical table that actually exists in a database, e.g., DB_1.t_order_n.

Broadcast Table

A broadcast table has identical structure and data across all shards. It is suitable for small, rarely updated reference data such as dictionary tables, reducing JOIN overhead across shards.

Single Table

A single table exists only once across all shards and is used when the data volume is modest and sharding is unnecessary.

Sharding Key

The sharding key determines the target data node. For example, using order_no % 2 distributes orders between DB_1 and DB_2. The same key is used for both sharding and routing.

Sharding Strategy

A sharding strategy combines a sharding algorithm with one or more sharding keys to decide how data is allocated to nodes.

Sharding Algorithm

Hash Sharding : Uses the hash of the sharding key.

Range Sharding : Allocates data based on value ranges.

Modulo Sharding : Applies key % N to select a node.

Binding Table

Binding tables share the same sharding rule, ensuring that related tables (e.g., t_order and t_order_item) are stored in the same shard, which avoids cross‑database JOINs.

SELECT * FROM t_order o JOIN t_order_item i ON o.order_no=i.order_no

SQL Parsing

SQL parsing consists of lexical and syntactic analysis, producing a context that includes fields, tables, conditions, ORDER BY, GROUP BY, LIMIT, etc., and marks parts that may need rewriting.

SELECT order_no FROM t_order WHERE order_status>0 AND user_id=10086

Executor Optimization

Based on statistics and indexes, the executor may reorder predicates to improve performance.

SELECT * FROM t_order WHERE user_id=10086 AND order_status>0

SQL Routing

Using the parsed context and sharding strategy, the system computes the target data node(s). Routing types include sharding routing, broadcast routing, direct routing, and Cartesian product routing.

SQL Rewriting

After routing, logical table names are replaced with the actual physical table names.

SELECT * FROM t_order_n

SQL Execution

The rewritten SQL is sent to the appropriate data source(s) with resource‑aware connection handling.

Result Merging

Results from multiple shards are merged, and operations such as sorting, grouping, pagination, and aggregation are performed on the merged set.

Distributed Primary Key

Because each shard generates its own auto‑increment IDs, collisions can occur. A global unique ID generator (a “snowflake” service) is used to assign IDs that are unique across all shards.

Data Masking

Sensitive fields (e.g., name, address, phone) can be masked during sharding by applying encryption or pseudorandom transformations before persisting the data.

Distributed Transaction

Cross‑shard transactions require either strong consistency protocols such as XA or flexible solutions like Seata to ensure atomicity across multiple data sources.

Data Migration

When migrating to a sharded architecture, both historical (offline) data and ongoing (online) data must be handled, often using batch migration for the former and dual‑write for the latter.

Shadow Database

A shadow database mirrors the production schema and data for testing schema changes, upgrades, or performance testing without affecting live traffic.

Summary

The article covered 21 fundamental concepts of sharding, laying the groundwork for deeper topics such as read/write separation, data masking, distributed primary keys, distributed transactions, configuration centers, service registries, and proxy services.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

distributed systems Sharding SQL Routing database partitioning Broadcast Table

Written by

macrozheng

Dedicated to Java tech sharing and dissecting top open-source projects. Topics include Spring Boot, Spring Cloud, Docker, Kubernetes and more. Author’s GitHub project “mall” has 50K+ stars.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.