Why Sharding (Splitting Databases and Tables) Beats Partitioned Tables and NoSQL for Massive Data
The article explains why sharding (splitting databases and tables) is the preferred solution for handling massive user, order, and transaction data in high‑traffic internet applications, comparing it with partitioning and NoSQL/NewSQL alternatives, and detailing practical middleware choices, sharding column selection, and integration with Elasticsearch and HBase.
In the mobile‑internet era, core tables such as user, order, and transaction logs quickly reach billions of rows, far more than a single MySQL table can serve well. Although a single table can technically hold on the order of a billion rows, query performance degrades sharply long before that point (a widely cited rule of thumb is to keep one table at roughly ten million rows), so data must be distributed across multiple databases or tables.
Why Not NoSQL/NewSQL?
RDBMS still offers a mature ecosystem, proven stability, and strong transactional guarantees. When reliability is paramount, newer NoSQL/NewSQL solutions (e.g., MongoDB, Elasticsearch, TiDB) do not yet match that track record. Most companies therefore keep RDBMS as the primary store and treat NoSQL/NewSQL as a supplement.
Why Not Simple Partitioning?
Partitioned tables hide sharding details from the application, but every request still enters through a single MySQL instance, which caps concurrency and network throughput. They also cannot use foreign keys or full‑text indexes, although modern projects rarely need either.
Why Sharding (Database‑Table Splitting)?
Sharding is the de‑facto method for internet‑scale data. Numerous middleware solutions exist, such as Alibaba's TDDL/DRDS, Cobar, Sharding‑Sphere (formerly Sharding‑JDBC), MyCAT, 360 Atlas, and Meituan Zebra. These fall into two architectural patterns:
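CLIENT mode: a library embedded in the application parses and rewrites the SQL, routes it to the target shards, executes it, and merges the results (e.g., TDDL, Sharding‑JDBC).
PROXY mode: a separate proxy server sits between the application and the databases and handles routing and merging (e.g., Cobar, MyCAT).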
The author prefers CLIENT mode for its simpler architecture, lower performance overhead, and reduced operational cost.
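To make the CLIENT pattern concrete, the following is a minimal sketch in which the routing logic lives inside the application process. It is not how TDDL or Sharding‑JDBC are implemented: the class `ClientModeRouter`, the table name `t_user`, and the modulo rule are all hypothetical, and in‑memory SQLite databases stand in for separate MySQL instances.

```python
import sqlite3


class ClientModeRouter:
    """Hypothetical in-process router: the application itself picks the shard."""

    def __init__(self, shard_count):
        # Each in-memory SQLite database stands in for one MySQL instance.
        self.shards = [sqlite3.connect(":memory:") for _ in range(shard_count)]
        for index, conn in enumerate(self.shards):
            conn.execute(
                f"CREATE TABLE t_user_{index} (user_id INTEGER PRIMARY KEY, name TEXT)"
            )

    def route(self, user_id):
        # Route step: assumed rule, sharding column modulo shard count.
        return user_id % len(self.shards)

    def rewrite(self, sql, shard_index):
        # Rewrite step: logical table name -> physical table name.
        # A naive string replace; real middleware parses the SQL instead.
        return sql.replace("t_user", f"t_user_{shard_index}")

    def execute(self, sql, user_id, params=()):
        shard_index = self.route(user_id)
        conn = self.shards[shard_index]
        return conn.execute(self.rewrite(sql, shard_index), params).fetchall()


router = ClientModeRouter(shard_count=4)
router.execute(
    "INSERT INTO t_user (user_id, name) VALUES (?, ?)", user_id=10, params=(10, "alice")
)
rows = router.execute(
    "SELECT user_id, name FROM t_user WHERE user_id = ?", user_id=10, params=(10,)
)
print(rows)
```

In PROXY mode the same route and rewrite steps would run in a separate server process, which adds a network hop and one more component to operate.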
Practical Sharding Cases
The most critical step is choosing the sharding column. It should be the field that appears most often as a query condition in high‑traffic APIs (e.g., user_id for user‑centric services). Examples:
Order table: three sharding columns (order_id, user_id, merchant_code), each handling a major query pattern.
User table: possible sharding columns include mobile_no, email, username, and user_id.
Account table: account_no is the natural sharding key.
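Once a sharding column is chosen, its value has to map to a physical database and table. The sketch below shows one common way to do this, hashing followed by modulo. The shard counts, the `db_N.table_M` naming scheme, and the function names are assumptions made for illustration; the article does not prescribe a particular rule.

```python
import zlib

DB_COUNT = 4        # assumed number of physical databases
TABLES_PER_DB = 8   # assumed number of tables inside each database


def locate(sharding_value):
    """Return (database index, table index) for a sharding column value."""
    if isinstance(sharding_value, int):
        slot = sharding_value
    else:
        # crc32 is stable across processes, unlike the built-in hash() for str,
        # so string keys such as mobile_no or account_no route consistently.
        slot = zlib.crc32(str(sharding_value).encode("utf-8"))
    slot %= DB_COUNT * TABLES_PER_DB
    return slot // TABLES_PER_DB, slot % TABLES_PER_DB


def physical_table(logical_table, sharding_value):
    """Build the physical table name for a logical table and sharding value."""
    db_index, table_index = locate(sharding_value)
    return f"db_{db_index}.{logical_table}_{table_index}"


print(physical_table("t_order", 1001))  # db_1.t_order_1
```

A fixed modulo like this is simple but makes later resharding expensive, because changing either count moves most keys to a new location.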
Two data‑redundancy strategies exist:
Full‑copy tables: every sharding column gets its own sharded copy of the complete dataset, offering the best query speed at the cost of several times more storage and higher maintenance.
Relation‑only tables: only one sharding column holds the full data; the others store only index relations back to it, saving space but requiring an extra lookup for queries on non‑primary sharding keys.
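The extra lookup in the relation‑only strategy can be sketched as follows. Plain dictionaries stand in for sharded tables, and the choice of user_id as the primary sharding column, the shard count, and the function names are assumptions made for illustration.

```python
SHARDS = 4  # assumed shard count for both table groups

# Full rows live only in the tables sharded by user_id.
orders_by_user = [dict() for _ in range(SHARDS)]   # shard -> {order_id: row}
# Relation tables sharded by order_id hold only order_id -> user_id.
order_to_user = [dict() for _ in range(SHARDS)]


def save_order(order_id, user_id, amount):
    """Write the full row once and the small relation row once."""
    row = {"order_id": order_id, "user_id": user_id, "amount": amount}
    orders_by_user[user_id % SHARDS][order_id] = row
    order_to_user[order_id % SHARDS][order_id] = user_id


def find_by_order_id(order_id):
    """Resolve order_id -> user_id first, then read the full row."""
    user_id = order_to_user[order_id % SHARDS].get(order_id)   # lookup 1
    if user_id is None:
        return None
    return orders_by_user[user_id % SHARDS].get(order_id)      # lookup 2


save_order(order_id=9001, user_id=42, amount=19.9)
print(find_by_order_id(9001))
```

With full‑copy tables, `find_by_order_id` would need only one read, but `save_order` would have to write the whole row to every copy.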
Complex Queries Without Sharding Columns
When a query lacks a sharding key, performance drops because the request must be routed to every shard and results merged, which can become a bottleneck. The common remedy is to replicate the full dataset into Elasticsearch (or Solr) and let the search engine handle fuzzy or multi‑condition queries.
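The cost of a query without a sharding key is easier to see in a small scatter‑gather sketch. The in‑memory lists below stand in for shards, and the field names and sample rows are invented for illustration. Every shard is queried, each returns its own top results, and the caller merges them.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

# Hypothetical sample data: three shards of an order table.
shards = [
    [{"order_id": 1, "status": "PAID", "created": 5},
     {"order_id": 4, "status": "NEW", "created": 9}],
    [{"order_id": 2, "status": "PAID", "created": 7}],
    [{"order_id": 3, "status": "PAID", "created": 2}],
]


def query_shard(rows, status, limit):
    """Per-shard work: filter, sort newest first, keep at most `limit` rows."""
    matched = [r for r in rows if r["status"] == status]
    matched.sort(key=lambda r: r["created"], reverse=True)
    return matched[:limit]


def scatter_gather(status, limit):
    """Fan the query out to every shard, then merge the sorted partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(lambda rows: query_shard(rows, status, limit), shards))
    merged = heapq.merge(*partials, key=lambda r: r["created"], reverse=True)
    return list(islice(merged, limit))


print([r["order_id"] for r in scatter_gather("PAID", limit=2)])  # [2, 1]
```

Each request touches all shards and the slowest shard sets the latency, which is why such queries are usually offloaded to a search engine instead.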
ES + HBase Combination
For massive tables, sharding + Elasticsearch can overload the ES cluster. A better pattern stores only the searchable fields in ES while keeping the full rows in HBase. Queries first retrieve the matching rowkeys from ES, then fetch the complete records from HBase, leveraging HBase’s lightning‑fast row‑key lookups.
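The two‑step read path can be sketched with in‑memory stand‑ins. `SearchIndex` and `RowStore` below are hypothetical classes that only imitate the roles of Elasticsearch and HBase; they do not use either system's client API, and the field names are invented.

```python
from collections import defaultdict


class SearchIndex:
    """Stand-in for ES: holds only the searchable fields, returns rowkeys."""

    def __init__(self):
        self._postings = defaultdict(set)   # (field, value) -> set of rowkeys

    def index(self, rowkey, fields):
        for name, value in fields.items():
            self._postings[(name, value)].add(rowkey)

    def search(self, **conditions):
        sets = [self._postings.get(item, set()) for item in conditions.items()]
        return sorted(set.intersection(*sets)) if sets else []


class RowStore:
    """Stand-in for HBase: full rows addressed by rowkey."""

    def __init__(self):
        self._rows = {}

    def put(self, rowkey, row):
        self._rows[rowkey] = row

    def multi_get(self, rowkeys):
        return [self._rows[k] for k in rowkeys if k in self._rows]


index, store = SearchIndex(), RowStore()
full_row = {"order_id": "o-1", "merchant_code": "m-7", "status": "PAID", "items": ["book"]}
store.put("o-1", full_row)                                                # full row
index.index("o-1", {"merchant_code": "m-7", "status": "PAID"})            # searchable fields only

rowkeys = index.search(merchant_code="m-7", status="PAID")   # step 1: search for rowkeys
records = store.multi_get(rowkeys)                           # step 2: fetch rows by rowkey
print(records)
```

The search index stays small because it never holds the non‑searchable fields (`items` in this sketch), while the row store is only ever hit with exact rowkey reads.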
Summary
Sharding is essential for high‑concurrency, large‑scale data, but it is not a silver bullet. Choose middleware that fits the workload, use a well‑designed sharding column, offload complex or fuzzy queries to Elasticsearch, and store raw data in HBase for scalability. Proper analysis of business patterns, storage costs, and maintenance effort is required to build a robust, end‑to‑end data architecture.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
