Common Solutions for Massive Data Storage and High‑Performance Access
The article outlines a range of techniques—including caching, read‑write splitting, vertical partitioning, sharding, cold‑hot data separation, Elasticsearch, NoSQL and NewSQL—to store and retrieve billions of user‑generated records efficiently while maintaining high throughput and low latency.
Cache Acceleration
A cache stores frequently accessed data in memory to avoid repeated disk I/O: a single Memcached server can handle over 50,000 TPS, and Redis can exceed 100,000 QPS. In‑memory reads complete in sub‑millisecond time versus tens of milliseconds for disk reads, dramatically improving throughput and concurrency.
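The read path described above is usually implemented cache-aside: check the cache first, fall back to the database on a miss, then populate the cache so the next read is a memory hit. A minimal sketch — the `ConcurrentHashMap` stands in for Redis/Memcached, and `dbLoad` for a real DAO:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Cache-aside read: try the cache, fall back to the loader (the "database"),
// then populate the cache so subsequent reads are served from memory.
public class CacheAside {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> dbLoad; // stand-in for a disk-backed lookup
    int dbHits = 0; // counts how often we fell through to the "database"

    public CacheAside(Function<String, String> dbLoad) {
        this.dbLoad = dbLoad;
    }

    public String get(String key) {
        String v = cache.get(key);
        if (v == null) {          // cache miss: pay the slow read once
            v = dbLoad.apply(key);
            dbHits++;
            cache.put(key, v);    // next read for this key is a memory hit
        }
        return v;
    }
}
```

A write path would also need invalidation (delete the cache entry on update), which is omitted here.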
Local vs. Distributed Cache
Local cache resides on each application server and expires quickly (seconds to minutes) to prevent stale data, while distributed cache forms a cluster, offers horizontal scaling, client‑side routing and stronger consistency with only ~1 ms network overhead.
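The short expiry that keeps a local cache from serving stale data can be sketched as a TTL map. The clock is injected so expiry is testable; a production local cache would use a library such as Caffeine or Guava instead:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

// Local in-process cache where every entry expires after a short TTL,
// bounding how stale the data on any one application server can get.
public class TtlCache {
    private static class Entry {
        final String value; final long expiresAt;
        Entry(String v, long e) { value = v; expiresAt = e; }
    }

    private final Map<String, Entry> entries = new HashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injected so tests can advance time

    public TtlCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    public void put(String key, String value) {
        entries.put(key, new Entry(value, clock.getAsLong() + ttlMillis));
    }

    // Returns null once the entry's TTL has elapsed, forcing a fresh load.
    public String get(String key) {
        Entry e = entries.get(key);
        if (e == null || clock.getAsLong() >= e.expiresAt) {
            entries.remove(key);
            return null;
        }
        return e.value;
    }
}
```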
Note: After introducing cache, consider read‑write splitting if database traffic remains high.
Read‑Write Splitting
Most internet services are read‑heavy and write‑light; separating read and write traffic allows independent scaling of read replicas (master‑slave architecture).
Definition
Writes go to the master and are replicated (typically asynchronously, so replicas may briefly lag) to one or more standby nodes; reads are served from these replicas.
Typical Scenarios
Read‑heavy, write‑light workloads
Large data volumes
High query frequency with strict performance requirements
Implementation Approaches
Modify application code to write to the master and read from replicas (simple but tightly coupled).
Use binlog‑based tools such as Canal to capture changes and sync them to replicas, decoupling business logic from replication.
What if a query arrives before replication has caught up? Three strategies, each wrapped in code:
Reject the query outright (poor user experience).
Route lag‑tolerant queries to replicas, prompting the user that data may be slightly stale.
Route latency‑critical (read‑your‑writes) queries to the master.
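These strategies can be wrapped in a small router that picks a datasource per query. A sketch, assuming the caller can say whether the query tolerates replica lag; the `ReplicationState` check is a stand-in for comparing binlog positions:

```java
// Routes each query to the master or a replica depending on whether the
// caller tolerates replication lag and whether the replica has caught up.
public class ReadWriteRouter {
    public enum Target { MASTER, REPLICA, REPLICA_WITH_STALE_NOTICE }

    // Stand-in for a real lag check (e.g. comparing binlog positions).
    public interface ReplicationState { boolean replicaCaughtUp(); }

    private final ReplicationState state;

    public ReadWriteRouter(ReplicationState state) { this.state = state; }

    public Target routeRead(boolean latencyCritical) {
        if (latencyCritical) return Target.MASTER;   // read-your-writes goes to master
        if (!state.replicaCaughtUp()) {
            // lag-tolerant query on a stale replica: serve it, but tell the user
            return Target.REPLICA_WITH_STALE_NOTICE;
        }
        return Target.REPLICA;
    }

    public Target routeWrite() { return Target.MASTER; } // all writes hit the master
}
```

Rejecting the query outright is omitted because, as noted above, it is usually a poor user experience.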
Vertical Partitioning
Separate business domains into distinct databases (e.g., user, product, order) to isolate load, simplify maintenance, and improve performance, at the cost of introducing distributed transactions and cross‑service joins.
Database and Table Sharding
When a single database reaches its performance ceiling, split data across multiple databases (sharding) or tables.
Vertical Table Sharding narrows wide tables into narrower ones based on column usage, update frequency, size, or logical grouping.
Horizontal Table Sharding keeps the schema unchanged but distributes rows across multiple tables based on a sharding key.
Technical Points
SQL rewriting to route queries to the correct physical table.
Database routing logic for multi‑database deployments.
Result aggregation when queries span multiple shards.
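Routing and SQL rewriting reduce to computing a shard index from the sharding key and substituting the physical table name. A hash-based sketch — the `t_order` table and the split into 4 physical tables are illustrative:

```java
// Hash-based horizontal sharding: map a sharding key to one of N physical
// tables, and rewrite the logical table name in the SQL accordingly.
public class ShardRouter {
    private final String logicalTable;
    private final int tableCount;

    public ShardRouter(String logicalTable, int tableCount) {
        this.logicalTable = logicalTable;
        this.tableCount = tableCount;
    }

    // floorMod keeps the index non-negative even for negative keys.
    public String physicalTable(long shardingKey) {
        int idx = (int) Math.floorMod(shardingKey, tableCount);
        return logicalTable + "_" + idx;
    }

    // Naive rewrite: substitute the logical name with the physical one.
    // Real middleware (e.g. sharding-jdbc) parses the SQL instead.
    public String rewrite(String sql, long shardingKey) {
        return sql.replace(logicalTable, physicalTable(shardingKey));
    }
}
```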
Two common implementation models:
Proxy mode – a separate service handles SQL rewriting, routing and result merging.
Client mode – libraries such as sharding-jdbc are embedded in the application, performing the same functions with lower overhead.
Key steps include choosing a uniform sharding key, defining sharding strategies (range, hash, or hybrid), handling business code changes, and migrating existing data (incremental binlog sync, full data copy, dual‑write, cut‑over).
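Of the migration steps above, the dual-write window can be sketched as a write path that updates both the old single-table store and the new sharded store until reads are cut over. `Store` is a hypothetical interface standing in for the two DAOs:

```java
import java.util.ArrayList;
import java.util.List;

// Dual-write migration: during the migration window every write goes to both
// the old store and the new sharded store, so the two stay in lockstep and
// either side can serve reads at cut-over time.
public class DualWriteDao {
    public interface Store { void save(String row); }

    private final Store oldStore;
    private final Store newStore;
    private boolean readFromNew = false; // flipped at cut-over

    public DualWriteDao(Store oldStore, Store newStore) {
        this.oldStore = oldStore;
        this.newStore = newStore;
    }

    public void write(String row) {
        oldStore.save(row);  // old store stays authoritative until cut-over
        newStore.save(row);  // new sharded store is kept in sync
    }

    public void cutOver() { readFromNew = true; }
    public boolean readsFromNew() { return readFromNew; }
}
```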
Cold‑Hot Data Separation
Based on the 80/20 rule, a small fraction of data (hot) generates most traffic. Separate hot and cold data to optimize storage and query performance.
Typical Criteria
Time‑based (e.g., orders older than 3 months are cold).
Status‑based (completed orders are cold).
Combined time + status criteria.
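The combined time-plus-status criterion reduces to a single predicate. A sketch assuming orders are cold when completed and older than 90 days; the cutoff is illustrative:

```java
import java.time.Duration;
import java.time.Instant;

// Classifies a record as cold when it is both completed and older than the
// retention window -- the combined time + status criterion.
public class ColdDataPolicy {
    private final Duration retention;

    public ColdDataPolicy(Duration retention) { this.retention = retention; }

    public boolean isCold(boolean completed, Instant createdAt, Instant now) {
        boolean expired = createdAt.isBefore(now.minus(retention));
        return completed && expired; // both conditions must hold
    }
}
```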
Implementation Methods
Inline business‑logic checks (simple but may miss time‑driven expiration).
Binlog listeners to trigger migration.
Scheduled batch jobs that move cold rows to separate storage during off‑peak hours.
Migration steps: identify cold rows, insert them into a cold table, then delete them from the hot table.
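The three migration steps map to a small batch loop: select cold rows, copy them into cold storage, then delete them from the hot table. An in-memory sketch — lists stand in for the hot and cold tables; a real job would run per-batch SQL in a transaction during off-peak hours:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

// Moves rows matching the cold predicate from the hot table to the cold
// table: copy first, delete second, so a crash between the two steps
// duplicates a row rather than losing it.
public class ColdMigrationJob {
    public static int migrate(List<String> hotTable, List<String> coldTable,
                              Predicate<String> isCold) {
        int moved = 0;
        Iterator<String> it = hotTable.iterator();
        while (it.hasNext()) {
            String row = it.next();
            if (isCold.test(row)) {   // step 1: identify cold rows
                coldTable.add(row);   // step 2: insert into the cold table
                it.remove();          // step 3: delete from the hot table
                moved++;
            }
        }
        return moved;
    }
}
```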
Elasticsearch for Complex Search
Elasticsearch uses inverted indexes to provide distributed full‑text search. Data must be denormalized into a single document because joins are not supported.
Concept mapping between MySQL and Elasticsearch:

MySQL    | Elasticsearch
---------|--------------
Database | Index
Table    | Type
Row      | Document
Column   | Field
Pagination Flow
Coordinator node distributes the request to all shards.
Each shard returns (from+size) document IDs and scores.
Coordinator merges, sorts globally, and selects the final page.
Coordinator fetches full documents by ID from the shards and returns them to the client.
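The cost of from+size pagination follows directly from step 2: each of the S shards returns up to from+size candidates, so the coordinator globally sorts roughly S × (from+size) entries just to emit one page. A sketch of the coordinator's merge step, with scores standing in for shard hits:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Coordinator-side merge for from+size pagination: every shard contributes
// its local top (from + size) candidates; the coordinator sorts them all
// globally and keeps only the requested page.
public class PageMerger {
    public static List<Double> merge(List<List<Double>> perShardScores,
                                     int from, int size) {
        List<Double> all = new ArrayList<>();
        for (List<Double> shard : perShardScores) {
            // each shard has already truncated to its local top (from + size)
            all.addAll(shard.subList(0, Math.min(shard.size(), from + size)));
        }
        all.sort(Comparator.reverseOrder());     // global sort by score
        int end = Math.min(all.size(), from + size);
        return all.subList(Math.min(from, end), end); // the final page
    }
}
```

As from grows, every shard and the coordinator do proportionally more work, which is why deep pagination degrades.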
Deep pagination degrades performance because every shard must return from+size candidates; max_result_window (default 10,000) caps how deep from+size may go. For deeper navigation, use search_after to continue past the last sort value of the previous page (e.g., id > 20000 limit 10), though jump‑to‑page is not supported.
NoSQL
NoSQL abandons strict ACID guarantees to achieve horizontal scalability. Five major categories:
Key‑Value stores (e.g., Redis) – fast, support complex value types.
Document stores (e.g., MongoDB, CouchDB) – schema‑free, ideal for heterogeneous product data.
Column‑family stores (e.g., HBase, Cassandra) – efficient for read‑heavy column‑centric workloads.
Graph databases – optimized for traversing highly connected data.
Time‑series databases (e.g., InfluxDB) – excel at metric aggregation but lack update capability.
Typical use case: aggregating 10 million metric points over the last 60 days with hourly granularity.
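The hourly-granularity aggregation in that use case reduces to bucketing timestamps by hour and averaging each bucket. An in-memory sketch of what a time-series database does for such a query, using epoch-millisecond timestamps:

```java
import java.util.Map;
import java.util.TreeMap;

// Buckets (timestamp, value) points into hourly windows and averages each
// bucket -- the aggregation a time-series database runs for a query like
// "hourly granularity over the last 60 days".
public class HourlyAggregator {
    private static final long HOUR_MS = 3_600_000L;

    public static Map<Long, Double> hourlyAverage(long[] timestamps, double[] values) {
        Map<Long, double[]> acc = new TreeMap<>(); // bucketStart -> [sum, count]
        for (int i = 0; i < timestamps.length; i++) {
            long bucket = (timestamps[i] / HOUR_MS) * HOUR_MS; // floor to the hour
            double[] sc = acc.computeIfAbsent(bucket, b -> new double[2]);
            sc[0] += values[i];
            sc[1] += 1;
        }
        Map<Long, Double> out = new TreeMap<>();
        acc.forEach((bucket, sc) -> out.put(bucket, sc[0] / sc[1]));
        return out;
    }
}
```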
NewSQL
NewSQL combines the scalability of distributed systems with full SQL support and ACID transactions, making it suitable for online transaction processing (OLTP). Examples include Google Cloud Spanner, Alibaba OceanBase, and CockroachDB.
Advantages: native SQL, strong consistency, high performance, horizontal scaling, and fault tolerance.
Recommended Reading
Spring Boot Integration with Kafka
Distributed Configuration Center Selection – Why Apollo?
Why MySQL Uses Repeatable Read as Default Isolation Level?