Mastering Sharding: When and How to Use Partitioning, Sharding, and NoSQL
This article explains why modern applications dealing with massive user, order, and transaction data must move beyond single‑table designs, compares partitioning, sharding (分库分表), and NoSQL/NewSQL solutions, and provides practical guidance on choosing middleware, sharding columns, and hybrid architectures such as es + HBase.
Why Sharding Matters
In the mobile‑Internet era, services generate billions of rows daily—user tables, order tables, transaction logs—far exceeding the optimal size of a single MySQL table (≈1 KB of BTREE index height). When a single table can no longer hold the data efficiently, three common strategies appear:
Partitioning
Sharding (分库分表)
NoSQL/NewSQL
All three are variations of data distribution; sharding can be viewed as a combination of database and table splitting. NoSQL examples include MongoDB and Elasticsearch; NewSQL examples include TiDB.
Why Not NoSQL/NewSQL?
Relational DBMSs still dominate because they offer a mature ecosystem, absolute stability, and strong transactional guarantees. Most companies keep RDBMS as the primary store and treat NoSQL/NewSQL as complementary, not a replacement.
Why Not Simple Partitioning?
Partitioned tables hide sharding details and work without a sharding key, but they inherit single‑instance limits on connections, network throughput, and overall concurrency, making them unsuitable for high‑traffic internet services.
Why Sharding (分库分表) Is Preferred
Sharding distributes data across multiple databases and tables, allowing horizontal scaling. Numerous middleware solutions exist, such as Alibaba's TDDL/DRDS, Sharding‑Sphere (formerly Sharding‑JDBC), MyCAT, 360 Atlas, and Meituan Zebra. These fall into two deployment models:
Client mode (e.g., TDDL, Sharding‑Sphere client)
Proxy mode (e.g., Cobar, MyCAT)
Both models share the same core steps: SQL parsing, rewriting, routing, execution, and result merging. The author prefers client mode for its simpler architecture, lower performance overhead, and easier operations.
Practical Sharding Cases
The first step is selecting the sharding column , which should align with the most frequent API traffic (e.g., user_id for user‑centric services). Common patterns include:
Single sharding column
Multiple sharding columns (each with its own set of shards)
Sharding column plus Elasticsearch for full‑text search
Examples:
Order Table
Alibaba’s order system uses three sharding columns: order_id, user_id, and merchant_code. The table can be implemented as either a fully redundant set (all columns stored in every shard) or a primary‑key‑only full table with secondary relationship tables. The trade‑off is speed versus storage cost.
Fully redundant tables offer faster queries but consume several times more storage and increase maintenance effort.
User Table
User data often requires sharding on user_id, mobile_no, email, or username. Because the total user count (≈7‑10 billion) is still manageable, a single sharding column plus Elasticsearch is usually sufficient.
Account Table
Account APIs commonly filter by account_no, making it a natural sharding key.
Handling Queries Without a Sharding Key
When a query lacks a sharding column, the middleware must route the request to every shard and merge results, which can degrade performance dramatically as the number of shards grows.
For complex or fuzzy searches, the typical solution is to replicate all data to Elasticsearch and let it handle the query, optionally storing the full data in HBase for massive storage.
Hybrid es + HBase Architecture
Elasticsearch provides fast multi‑condition search, while HBase offers massive, low‑latency row‑key lookups. The workflow is: query Elasticsearch for matching rowkeys, then fetch the full records from HBase. This separation leverages the strengths of both systems.
Summary
Sharding is not a one‑size‑fits‑all solution; it requires careful analysis of business traffic, sharding column selection, and complementary technologies such as Elasticsearch and HBase. The choice between client and proxy middleware, full‑redundant tables versus relationship tables, and the addition of search/index layers determines scalability, storage cost, and operational complexity.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
