When and How to Split Databases: Strategies, Benefits, and Pitfalls
This article explains why and when to shard databases, compares sharding methods such as key‑based, range‑based, and dictionary approaches, and outlines the performance gains, availability improvements, and new challenges like increased complexity, ID handling, cross‑shard queries, and distributed transactions.
As data grows, database pressure increases, often requiring database splitting.
Sharding (splitting databases) by business dimension resolves IO contention and single‑machine capacity limits, while table splitting addresses capacity and disk/bandwidth IO pressure.
Splitting offers benefits such as easier horizontal scaling, improved query performance by reducing scanned rows, and higher availability because failures affect only a portion of the system.
However, it also adds development and dimensional complexity and can disable features like joins and transactions.
Typical scenarios for splitting include data growth exceeding a single database's storage, read/write loads surpassing a single node or master‑slave capacity, and network bandwidth pressure causing slow responses or timeouts.
Data Splitting Methods
(1) Key‑Based Sharding
Hash a specified key to obtain a numeric shard identifier. This common method requires no mapping maintenance and distributes data evenly, avoiding hotspots, but adding or removing servers forces a rehash of existing data.
(2) Range‑Based Sharding
When a column holds numeric values (e.g., price), data can be divided by value ranges.
This approach is simple but may create hotspots if many records fall into a single range.
(3) Dictionary‑Based Sharding
A dictionary table maps each key to its shard, offering great flexibility and easy server scaling.
The downside is the need to query the dictionary for every operation, creating a performance bottleneck and a single point of failure.
Problems Introduced by Sharding and Solutions
1. Auto‑Increment ID Issue – Resolve by using separate tables, different auto‑increment steps, or distributed ID generators.
2. Cross‑Shard Join, Sorting, and Pagination – Options include scanning all tables and aggregating, creating global tables, heterogeneous query dimensions, or syncing data to Elasticsearch for search.
3. Distributed Transaction Challenges – Employ transaction tables, compensation mechanisms, TCC (Try/Confirm/Cancel), or Saga patterns, and favor eventual consistency over strong consistency.
In some cases, complementing relational databases with NoSQL solutions can help address these challenges.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Java High-Performance Architecture
Sharing Java development articles and resources, including SSM architecture and the Spring ecosystem (Spring Boot, Spring Cloud, MyBatis, Dubbo, Docker), Zookeeper, Redis, architecture design, microservices, message queues, Git, etc.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
