Strategies for Massive Data Storage and Sharding in High‑Scale Systems

This article examines various approaches to storing massive data, including table partitioning, NoSQL/NewSQL, and database sharding, analyzes their advantages and drawbacks, and presents practical sharding designs for user, forum, and order databases, with considerations for caching, redundancy, and query routing.

Qunar Tech Salon

Nov 20, 2018

Background: The author, a senior R&D engineer at Qunar, discusses the need to store and query massive order data efficiently, and asks why index‑based search (e.g., Elasticsearch) is used instead of traditional database sharding.

Massive Data Storage Options: Three main solutions are presented: table partitioning, NoSQL/NewSQL databases, and database sharding (splitting data across multiple databases and tables). Each is described with its strengths and limitations.

1. Table Partitioning: Splits a MySQL table’s physical files into smaller chunks, allowing transparent queries but limited by a single MySQL instance’s resources and inflexible partition keys.

2. NoSQL/NewSQL: Covers MongoDB, Elasticsearch, TiDB, RocksDB, and similar systems, contrasting the mature ecosystem and stability of relational databases with these newer, less feature‑complete options.

3. Sharding (Split‑by‑Database/Table): The most common solution in internet companies, requiring middleware for routing. Various middleware examples are listed (e.g., Alibaba TDDL, DRDS, sharding‑jdbc, MyCAT, Vitess, etc.) and two architectural styles are compared: client‑side sharding and server‑side proxy.

Client‑Side Sharding: Simple deployment as a library in each application; drawbacks include connection‑pool explosion and complex routing configuration.
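The connection‑pool concern can be made concrete with simple arithmetic. The sketch below is illustrative only: the instance count, shard count, and pool size are assumed numbers, not figures from the article.

```python
def total_connections(app_instances: int, shards: int, pool_size: int) -> int:
    """Connections opened when every app instance keeps one pool per shard."""
    return app_instances * shards * pool_size


def connections_per_shard(app_instances: int, pool_size: int) -> int:
    """Connections a single shard's database has to accept."""
    return app_instances * pool_size


# Hypothetical deployment: 200 app instances, 32 shards, 10 connections per pool.
print(total_connections(200, 32, 10))   # 64000
print(connections_per_shard(200, 10))   # 2000
```

Because the per‑shard figure grows with the number of application instances, scaling out the application tier puts connection pressure on every database, which is the cost of embedding the routing library in each client.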

Server‑Side Proxy: A centralized proxy manages connections and routing, which simplifies scaling but adds operational complexity and makes the proxy a potential single point of failure.

Core Sharding Steps: SQL parsing, rewriting, routing, execution, and result merging.
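These steps can be sketched in miniature. The example below is a toy under stated assumptions, not how any of the listed middleware is implemented: in‑memory SQLite databases stand in for shards, the table name `orders`, the shard count, and the modulo routing rule are all hypothetical, and the parsing step is skipped by hard‑coding the two query shapes.

```python
import sqlite3

SHARD_COUNT = 4  # hypothetical

# One in-memory SQLite database stands in for each physical shard.
shards = [sqlite3.connect(":memory:") for _ in range(SHARD_COUNT)]
for i, conn in enumerate(shards):
    conn.execute(f"CREATE TABLE orders_{i} (id INTEGER PRIMARY KEY, amount INTEGER)")


def route(order_id: int) -> int:
    """Routing: derive the shard from the sharding key (modulo hash)."""
    return order_id % SHARD_COUNT


def rewrite(sql: str, shard: int) -> str:
    """Rewriting: swap the logical table name for the physical one (naive)."""
    return sql.replace("orders", f"orders_{shard}")


def insert_order(order_id: int, amount: int) -> None:
    shard = route(order_id)
    sql = rewrite("INSERT INTO orders (id, amount) VALUES (?, ?)", shard)
    shards[shard].execute(sql, (order_id, amount))


def get_order(order_id: int):
    """Sharding key present: execute on exactly one shard."""
    shard = route(order_id)
    sql = rewrite("SELECT id, amount FROM orders WHERE id = ?", shard)
    return shards[shard].execute(sql, (order_id,)).fetchone()


def top_orders(limit: int):
    """No sharding key: execute on every shard, then merge the results."""
    partial = []
    for shard, conn in enumerate(shards):
        sql = rewrite(
            "SELECT id, amount FROM orders ORDER BY amount DESC LIMIT ?", shard
        )
        partial.extend(conn.execute(sql, (limit,)).fetchall())
    # Result merging: re-sort the per-shard top rows and truncate.
    partial.sort(key=lambda row: row[1], reverse=True)
    return partial[:limit]


for oid, amount in [(1, 50), (2, 900), (3, 20), (4, 300), (5, 700), (6, 10)]:
    insert_order(oid, amount)

print(get_order(5))    # (5, 700)
print(top_orders(3))   # [(2, 900), (5, 700), (4, 300)]
```

The second query shows why queries without the sharding key are expensive: every shard does work, and the merge step has to redo the ordering and the limit.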

Database Group vs. Sharding: A database group (a master‑slave cluster) holds copies of the same data and provides read/write separation and high availability. Sharding, by contrast, distributes data across independent nodes, each holding a disjoint subset rather than a replica, which addresses storage capacity and write pressure.
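The two concepts answer different questions: sharding decides which slice of data a row belongs to, and a database group decides which copy of that slice serves a request. The sketch below assumes each shard is backed by its own master‑slave group, which is a common arrangement but an assumption beyond the text; the node names and round‑robin read policy are hypothetical.

```python
import itertools


class DatabaseGroup:
    """One master plus read replicas that all hold the same data."""

    def __init__(self, master, replicas):
        self.master = master
        self._replicas = itertools.cycle(replicas)

    def node_for(self, is_write):
        # Writes go to the master; reads rotate across the replicas.
        return self.master if is_write else next(self._replicas)


# Sharding: each group owns a disjoint slice of the data.
groups = [
    DatabaseGroup("g0-master", ["g0-replica-a", "g0-replica-b"]),
    DatabaseGroup("g1-master", ["g1-replica-a", "g1-replica-b"]),
]


def pick_node(uid, is_write):
    group = groups[uid % len(groups)]  # which shard owns the row
    return group.node_for(is_write)    # which copy inside that shard


print(pick_node(7, is_write=True))    # g1-master
print(pick_node(7, is_write=False))   # g1-replica-a
print(pick_node(7, is_write=False))   # g1-replica-b
```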

Sharding Strategies: Range‑based and hash‑based partitioning are explained, with their respective pros and cons regarding data distribution and scaling.
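A small sketch can illustrate both strategies and the scaling trade‑off. The range boundaries, shard counts, and key set are assumed values chosen for illustration, and plain modulo stands in for whatever hash function a real deployment would use.

```python
import bisect

# Range-based: each shard owns a contiguous interval of IDs.
# Hypothetical exclusive upper bounds; IDs >= 3,000,000 go to a fourth shard.
RANGE_UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000]


def range_shard(key: int) -> int:
    # bisect_right returns the index of the first bound greater than key.
    return bisect.bisect_right(RANGE_UPPER_BOUNDS, key)


# Hash-based: the shard is derived from the key modulo the shard count.
def hash_shard(key: int, shard_count: int) -> int:
    return key % shard_count


def moved_fraction(keys, old_count: int, new_count: int) -> float:
    """Share of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for k in keys if hash_shard(k, old_count) != hash_shard(k, new_count)
    )
    return moved / len(keys)


keys = range(1, 100_001)
print(range_shard(999_999), range_shard(1_000_000), range_shard(3_500_000))  # 0 1 3
print(moved_fraction(keys, 4, 8))  # 0.5
print(moved_fraction(keys, 4, 5))  # 0.8
```

Range sharding scales by appending a new interval without moving existing rows, but sequentially increasing IDs concentrate new writes on the last shard. Modulo hashing spreads writes evenly, but changing the shard count relocates a large share of the data; doubling the count keeps that share at half.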

Practical Sharding Cases:

User Database: Uses UID as the sharding key; discusses handling login queries by username/email via redundant mapping tables, caching UID‑username relations in Redis, or embedding username features into UID generation.
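The third option, embedding a username feature into the UID, might look like the sketch below. It is one possible reading of the idea, not the article's exact scheme: the bit width, the CRC32 hash, and the in‑process counter (a stand‑in for a real distributed ID generator) are all assumptions.

```python
import itertools
import zlib

SHARD_BITS = 4                    # hypothetical: 2**4 = 16 user shards
SHARD_COUNT = 1 << SHARD_BITS
_sequence = itertools.count(1)    # stand-in for a distributed ID generator


def username_gene(username: str) -> int:
    """Stable hash of the username, reduced to SHARD_BITS bits."""
    return zlib.crc32(username.encode("utf-8")) & (SHARD_COUNT - 1)


def new_uid(username: str) -> int:
    """Place the username-derived bits in the low bits of the UID."""
    return (next(_sequence) << SHARD_BITS) | username_gene(username)


def shard_by_uid(uid: int) -> int:
    return uid & (SHARD_COUNT - 1)


def shard_by_username(username: str) -> int:
    return username_gene(username)


uid = new_uid("alice")
# A login by username and a lookup by UID reach the same shard.
print(shard_by_uid(uid) == shard_by_username("alice"))  # True
```

In this sketch a login by username can be routed without a mapping table, because the shard is computable from the username alone. The trade‑off is that the username is fixed into the UID at creation time, so a second login field such as email would still need a mapping table or cache.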

Forum Post Database: Handles high‑volume post retrieval by ID; suggests using Elasticsearch as an external index and optionally storing UID‑post mappings for user‑centric queries.

Order Database: Primary query by order ID, plus buyer and seller lookups; proposes redundant storage of buyer/seller IDs, vertical sharding by agent, and archiving cold data to HBase with ES for search.
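One way to read the redundancy proposal is to keep the order sharded by order ID and maintain extra lookup copies sharded by buyer ID and by seller ID. The sketch below uses in‑memory dictionaries as stand‑ins for the shards; the shard count, field names, and modulo routing are hypothetical.

```python
from collections import defaultdict

SHARD_COUNT = 4  # hypothetical

orders_by_id = [dict() for _ in range(SHARD_COUNT)]
buyer_index = [defaultdict(list) for _ in range(SHARD_COUNT)]
seller_index = [defaultdict(list) for _ in range(SHARD_COUNT)]


def save_order(order_id, buyer_id, seller_id, amount):
    order = {
        "order_id": order_id,
        "buyer_id": buyer_id,
        "seller_id": seller_id,
        "amount": amount,
    }
    # Primary copy, routed by order ID.
    orders_by_id[order_id % SHARD_COUNT][order_id] = order
    # Redundant lookup entries, routed by the other two query dimensions.
    buyer_index[buyer_id % SHARD_COUNT][buyer_id].append(order_id)
    seller_index[seller_id % SHARD_COUNT][seller_id].append(order_id)


def get_order(order_id):
    return orders_by_id[order_id % SHARD_COUNT].get(order_id)


def orders_of_buyer(buyer_id):
    ids = buyer_index[buyer_id % SHARD_COUNT].get(buyer_id, [])
    return [get_order(i) for i in ids]


def orders_of_seller(seller_id):
    ids = seller_index[seller_id % SHARD_COUNT].get(seller_id, [])
    return [get_order(i) for i in ids]


save_order(1001, buyer_id=7, seller_id=42, amount=300)
save_order(1002, buyer_id=7, seller_id=43, amount=120)
print([o["order_id"] for o in orders_of_buyer(7)])    # [1001, 1002]
print([o["order_id"] for o in orders_of_seller(42)])  # [1001]
```

Each query dimension reaches a single shard, at the cost of three writes per order. In a real system those writes land on different databases, so keeping them consistent needs its own mechanism, which this sketch does not model.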

Qunar Ticketing Example: Describes a vertical sharding strategy by travel agency (agent) dimension, redundant agent codes in order IDs, and ES for auxiliary queries, noting data imbalance and cold‑data archiving challenges.
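Carrying the agent code inside the order ID lets a router pick the database from the ID alone. The sketch below illustrates the idea only: the ID layout (a three‑letter agent code, a date, and a sequence number), the agent codes, and the database names are hypothetical and do not describe Qunar's actual format.

```python
from collections import Counter

# Hypothetical mapping from agent code to the database that holds its orders.
AGENT_SHARDS = {"aaa": "orders_db_1", "bbb": "orders_db_2"}
DEFAULT_SHARD = "orders_db_shared"  # smaller agents share one database


def make_order_id(agent_code: str, date: str, seq: int) -> str:
    """Hypothetical layout: agent code + yyyymmdd + zero-padded sequence."""
    return f"{agent_code}{date}{seq:08d}"


def agent_of(order_id: str) -> str:
    return order_id[:3]


def shard_for_order(order_id: str) -> str:
    """Route by the agent code embedded in the order ID."""
    return AGENT_SHARDS.get(agent_of(order_id), DEFAULT_SHARD)


order_ids = [make_order_id("aaa", "20181120", n) for n in range(1, 8)]
order_ids += [make_order_id("bbb", "20181120", n) for n in range(1, 3)]
order_ids += [make_order_id("ccc", "20181120", 1)]

print(order_ids[0])                   # aaa2018112000000001
print(shard_for_order(order_ids[0]))  # orders_db_1
print(Counter(shard_for_order(o) for o in order_ids))
```

The counts in the last line show the imbalance the article notes: when one agent produces most of the orders, its database grows much faster than the others, which is part of why cold data needs archiving.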

Summary Table: Provides a comparative matrix of different sharding scenarios (single, multiple, multiple+ES, multiple+redundancy+ES+HBase) covering applicability, query timeliness, storage capacity, and architectural complexity.

Conclusion: Effective massive data storage requires careful selection of sharding schemes, ID generation, routing, and consistency mechanisms; no single middleware solves all problems, and designs must align with specific business requirements.

