
Understanding Sharding: 21 Common Concepts of Database Partitioning

This article introduces the fundamental concepts of database sharding, including data fragmentation, nodes, logical and physical tables, broadcast and single tables, sharding keys, strategies, algorithms, binding tables, SQL parsing, routing, rewriting, execution, result merging, distributed primary keys, data masking, distributed transactions, migration, and shadow databases, providing a comprehensive guide for designing and operating a sharded architecture.


The article begins by explaining why a well‑designed system may need to be split into multiple databases and tables, using the example of a t_order table that has grown to billions of rows and is divided into two databases ( DB_1 , DB_2 ) and further into tables ( t_order_1 , t_order_2 ).

Data Sharding (horizontal partitioning) splits a large table like t_order into many smaller tables ( t_order_0 , t_order_1 , …, t_order_n ) that share the same schema.

Data Nodes are the smallest indivisible units in sharding, identified by a data source name and table name, e.g., DB_1.t_order_1 and DB_2.t_order_2 .

Logical Table is the abstract name that represents a set of horizontally split tables with identical structure, such as t_order being represented by t_order_0 … t_order_9 . Application code continues to use the logical name; the sharding layer rewrites it to the actual physical table at execution time.

Example of business‑logic SQL and the rewritten real SQL:

Logical SQL: select * from t_order where order_no='A11111'
Real SQL: select * from DB_1.t_order_n where order_no='A11111'

Real Table refers to the physical table that actually exists in a database, e.g., DB_1.t_order_n .

Broadcast Table is a small table whose schema and data are duplicated across all shards, typically used for dictionary or configuration data to avoid cross‑shard joins. Modifications must be synchronized across all shards.

Single Table denotes a table that exists only once in the whole sharding topology, suitable for low‑volume data that does not require partitioning.

Sharding Key determines the placement of a row. For t_order , the order_no field might be used with a modulo operation ( order_no % 2 ) to decide whether the row goes to DB_1 or DB_2 .
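A minimal sketch of sharding‑key placement over the two‑database layout described above. Since order_no is a string (e.g. 'A11111') and modulo needs an integer, the CRC32 hash step is an assumption added for illustration; the database names follow the article's DB_1 / DB_2 example.

```python
from zlib import crc32

DATABASES = ["DB_1", "DB_2"]

def database_for(order_no: str) -> str:
    """Map a sharding-key value to one of the two databases.

    The hash-then-modulo step is an illustrative assumption:
    string keys must first be reduced to an integer.
    """
    return DATABASES[crc32(order_no.encode()) % len(DATABASES)]
```

The same key always maps to the same database, which is what makes later lookups by order_no routable to a single shard.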

Sharding Strategy combines a sharding algorithm with one or more sharding keys to map data to nodes. Strategies can be configured independently for databases and tables.

Sharding Algorithms include hash, range, modulo, etc., each suitable for different scenarios.
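Hypothetical sketches of the three algorithm families named above. The two‑shard count and the range boundary are illustrative assumptions, not values from the article.

```python
from zlib import crc32

def modulo_shard(order_id: int, shards: int = 2) -> int:
    """Modulo: cheap and uniform for integer keys."""
    return order_id % shards

def hash_shard(order_no: str, shards: int = 2) -> int:
    """Hash: handles string keys by reducing them to an integer first."""
    return crc32(order_no.encode()) % shards

def range_shard(order_id: int) -> int:
    """Range: e.g. ids below 50 million on shard 0, the rest on shard 1.

    Good for time- or id-ordered data, but can hot-spot the newest shard.
    """
    return 0 if order_id < 50_000_000 else 1
```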

Binding Table groups tables that share the same sharding rule (e.g., t_order and t_order_item both sharded by order_no ) so that joins can be performed without cross‑shard traffic.
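A sketch of why binding works: because t_order and t_order_item shard by the same key with the same rule, an order row and its item rows always land on the same table suffix, so a join between them never crosses shards. The hash function here is an illustrative assumption.

```python
def table_suffix(order_no: str, tables: int = 2) -> int:
    """Shared sharding rule for both bound tables (illustrative hash)."""
    return sum(order_no.encode()) % tables

def real_tables(order_no: str) -> tuple:
    """Both bound tables resolve to the same suffix for a given key."""
    n = table_suffix(order_no)
    return f"t_order_{n}", f"t_order_item_{n}"
```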

SQL parsing consists of lexical and syntactic analysis, producing a context that contains fields, tables, conditions, order‑by, group‑by, limit, etc., which later guides routing and rewriting.

Executor Optimization reorders predicates based on statistics and indexes to improve performance.

SQL Routing uses the parsed context and the configured strategy to calculate the target node(s). Routing types include standard routing, direct routing, Cartesian‑product routing, broadcast routing, and others.
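A sketch contrasting two of the routing types over the DB_1 / DB_2 topology from the article: standard routing hits exactly one data node when the sharding key is present, while broadcast routing (e.g. for DDL) targets every node. The helper names are assumptions, not a real middleware API.

```python
from zlib import crc32

# All data nodes in a 2-database x 2-table topology.
NODES = [f"DB_{d}.t_order_{t}" for d in (1, 2) for t in (0, 1)]

def standard_route(order_no: str) -> list:
    """Sharding key present in the WHERE clause: route to one node."""
    i = crc32(order_no.encode()) % len(NODES)
    return [NODES[i]]

def broadcast_route() -> list:
    """No sharding key (e.g. DDL, full scans): route to every node."""
    return list(NODES)
```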

SQL Rewrite replaces logical table names with the actual physical tables determined by routing, e.g., SELECT * FROM t_order becomes SELECT * FROM t_order_n .
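A minimal sketch of the rewrite step. A production rewriter works on the parsed AST from the SQL‑parsing phase, not on raw strings; the regex here only illustrates the logical‑to‑real substitution, with the word boundary keeping t_order from matching inside t_order_item.

```python
import re

def rewrite(sql: str, logical: str, real: str) -> str:
    """Replace the logical table name with a routed physical table name."""
    return re.sub(rf"\b{re.escape(logical)}\b", real, sql)

rewritten = rewrite("SELECT * FROM t_order WHERE order_no='A11111'",
                    "t_order", "t_order_1")
# rewritten == "SELECT * FROM t_order_1 WHERE order_no='A11111'"
```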

SQL Execution sends the rewritten statements to the appropriate data sources while balancing connection and memory usage.

Result Merging aggregates result sets from multiple shards and applies ordering, grouping, pagination, and aggregation on the merged set.
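A sketch of one merging technique: when each shard already returns rows sorted by the ORDER BY column, the results can be interleaved as streams rather than loaded wholesale, after which LIMIT is applied to the merged sequence. The sample rows are illustrative.

```python
import heapq

# Each shard returns (sort_key, row) pairs already sorted locally.
shard_1 = [(1, "A"), (3, "C"), (5, "E")]
shard_2 = [(2, "B"), (4, "D"), (6, "F")]

# Streaming k-way merge gives the global ORDER BY without re-sorting.
merged = list(heapq.merge(shard_1, shard_2))

# Pagination (LIMIT 4) is applied only after the merge.
page = merged[:4]
```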

Distributed Primary Key addresses ID collisions caused by independent auto‑increment columns across shards; solutions include custom ID generators (snowflake, etc.) or setting different start values and steps.
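A simplified snowflake‑style generator, as one of the custom‑ID approaches mentioned: a 41‑bit millisecond timestamp, 10‑bit worker id, and 12‑bit per‑millisecond sequence. The epoch and bit widths are conventional assumptions, not values from the article.

```python
import threading
import time

class SnowflakeId:
    """Snowflake-style unique ids: time-ordered, collision-free per worker."""

    EPOCH = 1_600_000_000_000  # custom epoch in milliseconds (assumption)

    def __init__(self, worker_id: int):
        assert 0 <= worker_id < 1024  # must fit in 10 bits
        self.worker_id = worker_id
        self.sequence = 0
        self.last_ms = -1
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            now = int(time.time() * 1000)
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF  # 12-bit sequence
                if self.sequence == 0:
                    # Sequence exhausted within this millisecond: spin to the next.
                    while now <= self.last_ms:
                        now = int(time.time() * 1000)
            else:
                self.sequence = 0
            self.last_ms = now
            return ((now - self.EPOCH) << 22) | (self.worker_id << 12) | self.sequence
```

Assigning each shard or service instance a distinct worker_id is what prevents the cross‑shard collisions that independent auto‑increment columns would produce.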

Data Masking can be applied at the sharding layer to protect sensitive fields by encrypting or randomizing them before storage.
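A sketch of one masking approach for a phone‑number field: store an irreversible hash for equality lookups plus a partially redacted display value. Real deployments often use reversible encryption (e.g. AES) when the plaintext must be recoverable; this example is only an illustration and the field choice is an assumption.

```python
import hashlib

def mask_phone(phone: str) -> tuple:
    """Return (lookup_digest, display_value) for a sensitive field."""
    digest = hashlib.sha256(phone.encode()).hexdigest()  # for equality queries
    display = phone[:3] + "****" + phone[-4:]            # for safe display
    return digest, display

digest, display = mask_phone("13812345678")
# display == "138****5678"
```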

Distributed Transaction ensures atomicity across multiple data sources, using either strong‑consistency XA protocols or flexible solutions like Seata.

Data Migration moves existing and incremental data to a new sharded cluster, often using batch jobs for historical data and dual‑write for live data.

Shadow Database (shadow table) is a full copy of the production schema used for testing migrations or performance testing without affecting live traffic.

The article concludes by listing the next topics to explore, such as read/write separation, further data‑masking techniques, distributed primary keys, transaction models, configuration and registration centers, and proxy services.

Written by Wukong Talks Architecture

Explaining distributed systems and architecture through stories. Author of the "JVM Performance Tuning in Practice" column, open-source author of "Spring Cloud in Practice PassJava", and independent developer of a PMP practice-quiz mini-program.
