Common Solutions for Massive Data Storage and High‑Performance Access
The article outlines a range of techniques—including caching, read‑write splitting, vertical partitioning, sharding, cold‑hot data separation, Elasticsearch, NoSQL and NewSQL—to store and retrieve billions of user‑generated records efficiently while maintaining high throughput and low latency.
Cache Acceleration
A cache stores frequently accessed data in memory to avoid repeated disk I/O: a single Memcached server can handle over 50,000 TPS, and Redis can exceed 100,000 QPS. In‑memory reads complete in sub‑millisecond time versus tens of milliseconds for disk reads, dramatically improving throughput and concurrency.
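The read path described above is usually implemented cache-aside: check the cache first, fall back to the database on a miss, then populate the cache so the next read is a memory hit. A minimal sketch — the `ConcurrentHashMap` stands in for Redis/Memcached, and `dbLoad` for a real DAO:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Cache-aside read: try the cache, fall back to the loader (the "database"),
// then populate the cache so subsequent reads are served from memory.
public class CacheAside {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> dbLoad; // stand-in for a disk-backed lookup
    int dbHits = 0; // counts how often we fell through to the "database"

    public CacheAside(Function<String, String> dbLoad) {
        this.dbLoad = dbLoad;
    }

    public String get(String key) {
        String v = cache.get(key);
        if (v == null) {          // cache miss: pay the slow read once
            v = dbLoad.apply(key);
            dbHits++;
            cache.put(key, v);    // next read for this key is a memory hit
        }
        return v;
    }
}
```

A write path would also need invalidation (delete the cache entry on update), which is omitted here.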
Local vs. Distributed Cache
Local cache resides on each application server and expires quickly (seconds to minutes) to prevent stale data, while distributed cache forms a cluster, offers horizontal scaling, client‑side routing and stronger consistency with only ~1 ms network overhead.
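The short expiry that keeps a local cache from serving stale data can be sketched as a TTL map. The clock is injected so expiry is testable; a production local cache would use a library such as Caffeine or Guava instead:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

// Local in-process cache where every entry expires after a short TTL,
// bounding how stale the data on any one application server can get.
public class TtlCache {
    private static class Entry {
        final String value; final long expiresAt;
        Entry(String v, long e) { value = v; expiresAt = e; }
    }

    private final Map<String, Entry> entries = new HashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injected so tests can advance time

    public TtlCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    public void put(String key, String value) {
        entries.put(key, new Entry(value, clock.getAsLong() + ttlMillis));
    }

    // Returns null once the entry's TTL has elapsed, forcing a fresh load.
    public String get(String key) {
        Entry e = entries.get(key);
        if (e == null || clock.getAsLong() >= e.expiresAt) {
            entries.remove(key);
            return null;
        }
        return e.value;
    }
}
```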
Note: After introducing cache, consider read‑write splitting if database traffic remains high.
Read‑Write Splitting
Most internet services are read‑heavy and write‑light; separating read and write traffic allows independent scaling of read replicas (master‑slave architecture).
Definition
Writes go to the master and are replicated (typically asynchronously, so replicas may briefly lag) to one or more standby nodes; reads are served from these replicas.
Typical Scenarios
Read‑heavy, write‑light workloads
Large data volumes
High query frequency with strict performance requirements
Implementation Approaches
Modify application code to write to the master and read from replicas (simple but tightly coupled).
Use binlog‑based tools such as Canal to capture changes and sync them to replicas, decoupling business logic from replication.
What if a query arrives before replication has caught up? Three strategies, each wrapped in code:
Reject the query outright (poor user experience).
Route lag‑tolerant queries to replicas, prompting the user that data may be slightly stale.
Route latency‑critical (read‑your‑writes) queries to the master.
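These strategies can be wrapped in a small router that picks a datasource per query. A sketch, assuming the caller can say whether the query tolerates replica lag; the `ReplicationState` check is a stand-in for comparing binlog positions:

```java
// Routes each query to the master or a replica depending on whether the
// caller tolerates replication lag and whether the replica has caught up.
public class ReadWriteRouter {
    public enum Target { MASTER, REPLICA, REPLICA_WITH_STALE_NOTICE }

    // Stand-in for a real lag check (e.g. comparing binlog positions).
    public interface ReplicationState { boolean replicaCaughtUp(); }

    private final ReplicationState state;

    public ReadWriteRouter(ReplicationState state) { this.state = state; }

    public Target routeRead(boolean latencyCritical) {
        if (latencyCritical) return Target.MASTER;   // read-your-writes goes to master
        if (!state.replicaCaughtUp()) {
            // lag-tolerant query on a stale replica: serve it, but tell the user
            return Target.REPLICA_WITH_STALE_NOTICE;
        }
        return Target.REPLICA;
    }

    public Target routeWrite() { return Target.MASTER; } // all writes hit the master
}
```

Rejecting the query outright is omitted because, as noted above, it is usually a poor user experience.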
Vertical Partitioning
Separate business domains into distinct databases (e.g., user, product, order) to isolate load, simplify maintenance, and improve performance, at the cost of introducing distributed transactions and cross‑service joins.
Database and Table Sharding
When a single database reaches its performance ceiling, split data across multiple databases (sharding) or tables.
Vertical Table Sharding narrows wide tables into narrower ones based on column usage, update frequency, size, or logical grouping.
Horizontal Table Sharding keeps the schema unchanged but distributes rows across multiple tables based on a sharding key.
Technical Points
SQL rewriting to route queries to the correct physical table.
Database routing logic for multi‑database deployments.
Result aggregation when queries span multiple shards.
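Routing and SQL rewriting reduce to computing a shard index from the sharding key and substituting the physical table name. A hash-based sketch — the `t_order` table and the split into 4 physical tables are illustrative:

```java
// Hash-based horizontal sharding: map a sharding key to one of N physical
// tables, and rewrite the logical table name in the SQL accordingly.
public class ShardRouter {
    private final String logicalTable;
    private final int tableCount;

    public ShardRouter(String logicalTable, int tableCount) {
        this.logicalTable = logicalTable;
        this.tableCount = tableCount;
    }

    // floorMod keeps the index non-negative even for negative keys.
    public String physicalTable(long shardingKey) {
        int idx = (int) Math.floorMod(shardingKey, tableCount);
        return logicalTable + "_" + idx;
    }

    // Naive rewrite: substitute the logical name with the physical one.
    // Real middleware (e.g. sharding-jdbc) parses the SQL instead.
    public String rewrite(String sql, long shardingKey) {
        return sql.replace(logicalTable, physicalTable(shardingKey));
    }
}
```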
Two common implementation models:
Proxy mode – a separate service handles SQL rewriting, routing and result merging.
Client mode – libraries such as sharding-jdbc are embedded in the application, performing the same functions with lower overhead.
Key steps include choosing a uniform sharding key, defining sharding strategies (range, hash, or hybrid), handling business code changes, and migrating existing data (incremental binlog sync, full data copy, dual‑write, cut‑over).
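Of the migration steps above, the dual-write window can be sketched as a write path that updates both the old single-table store and the new sharded store until reads are cut over. `Store` is a hypothetical interface standing in for the two DAOs:

```java
import java.util.ArrayList;
import java.util.List;

// Dual-write migration: during the migration window every write goes to both
// the old store and the new sharded store, so the two stay in lockstep and
// either side can serve reads at cut-over time.
public class DualWriteDao {
    public interface Store { void save(String row); }

    private final Store oldStore;
    private final Store newStore;
    private boolean readFromNew = false; // flipped at cut-over

    public DualWriteDao(Store oldStore, Store newStore) {
        this.oldStore = oldStore;
        this.newStore = newStore;
    }

    public void write(String row) {
        oldStore.save(row);  // old store stays authoritative until cut-over
        newStore.save(row);  // new sharded store is kept in sync
    }

    public void cutOver() { readFromNew = true; }
    public boolean readsFromNew() { return readFromNew; }
}
```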
Cold‑Hot Data Separation
Based on the 80/20 rule, a small fraction of data (hot) generates most traffic. Separate hot and cold data to optimize storage and query performance.
Typical Criteria
Time‑based (e.g., orders older than 3 months are cold).
Status‑based (completed orders are cold).
Combined time + status criteria.
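The combined time-plus-status criterion reduces to a single predicate. A sketch assuming orders are cold when completed and older than 90 days; the cutoff is illustrative:

```java
import java.time.Duration;
import java.time.Instant;

// Classifies a record as cold when it is both completed and older than the
// retention window -- the combined time + status criterion.
public class ColdDataPolicy {
    private final Duration retention;

    public ColdDataPolicy(Duration retention) { this.retention = retention; }

    public boolean isCold(boolean completed, Instant createdAt, Instant now) {
        boolean expired = createdAt.isBefore(now.minus(retention));
        return completed && expired; // both conditions must hold
    }
}
```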
Implementation Methods
Inline business‑logic checks (simple but may miss time‑driven expiration).
Binlog listeners to trigger migration.
Scheduled batch jobs that move cold rows to separate storage during off‑peak hours.
Migration steps: identify cold rows, insert them into a cold table, then delete them from the hot table.
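The three migration steps map to a small batch loop: select cold rows, copy them into cold storage, then delete them from the hot table. An in-memory sketch — lists stand in for the hot and cold tables; a real job would run per-batch SQL in a transaction during off-peak hours:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

// Moves rows matching the cold predicate from the hot table to the cold
// table: copy first, delete second, so a crash between the two steps
// duplicates a row rather than losing it.
public class ColdMigrationJob {
    public static int migrate(List<String> hotTable, List<String> coldTable,
                              Predicate<String> isCold) {
        int moved = 0;
        Iterator<String> it = hotTable.iterator();
        while (it.hasNext()) {
            String row = it.next();
            if (isCold.test(row)) {   // step 1: identify cold rows
                coldTable.add(row);   // step 2: insert into the cold table
                it.remove();          // step 3: delete from the hot table
                moved++;
            }
        }
        return moved;
    }
}
```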
Elasticsearch for Complex Search
Elasticsearch uses inverted indexes to provide distributed full‑text search. Data must be denormalized into a single document because joins are not supported.
Concept mapping between MySQL and Elasticsearch:

MySQL    | Elasticsearch
---------|--------------
Database | Index
Table    | Type
Row      | Document
Column   | Field
Pagination Flow
Coordinator node distributes the request to all shards.
Each shard returns (from+size) document IDs and scores.
Coordinator merges, sorts globally, and selects the final page.
Coordinator fetches full documents by ID from the shards and returns them to the client.
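The cost of from+size pagination follows directly from step 2: each of the S shards returns up to from+size candidates, so the coordinator globally sorts roughly S × (from+size) entries just to emit one page. A sketch of the coordinator's merge step, with scores standing in for shard hits:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Coordinator-side merge for from+size pagination: every shard contributes
// its local top (from + size) candidates; the coordinator sorts them all
// globally and keeps only the requested page.
public class PageMerger {
    public static List<Double> merge(List<List<Double>> perShardScores,
                                     int from, int size) {
        List<Double> all = new ArrayList<>();
        for (List<Double> shard : perShardScores) {
            // each shard has already truncated to its local top (from + size)
            all.addAll(shard.subList(0, Math.min(shard.size(), from + size)));
        }
        all.sort(Comparator.reverseOrder());     // global sort by score
        int end = Math.min(all.size(), from + size);
        return all.subList(Math.min(from, end), end); // the final page
    }
}
```

As from grows, every shard and the coordinator do proportionally more work, which is why deep pagination degrades.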
Deep pagination degrades performance because every shard must return from+size candidates; max_result_window (default 10,000) caps how deep from+size may go. For deeper navigation, use search_after to continue past the last sort value of the previous page (e.g., id > 20000 limit 10), though jump‑to‑page is not supported.
NoSQL
NoSQL abandons strict ACID guarantees to achieve horizontal scalability. Five major categories:
Key‑Value stores (e.g., Redis) – fast, support complex value types.
Document stores (e.g., MongoDB, CouchDB) – schema‑free, ideal for heterogeneous product data.
Column‑family stores (e.g., HBase, Cassandra) – efficient for read‑heavy column‑centric workloads.
Graph databases – optimized for traversing highly connected data.
Time‑series databases (e.g., InfluxDB) – excel at metric aggregation but lack update capability.
Typical use case: aggregating 10 million metric points over the last 60 days with hourly granularity.
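The hourly-granularity aggregation in that use case reduces to bucketing timestamps by hour and averaging each bucket. An in-memory sketch of what a time-series database does for such a query, using epoch-millisecond timestamps:

```java
import java.util.Map;
import java.util.TreeMap;

// Buckets (timestamp, value) points into hourly windows and averages each
// bucket -- the aggregation a time-series database runs for a query like
// "hourly granularity over the last 60 days".
public class HourlyAggregator {
    private static final long HOUR_MS = 3_600_000L;

    public static Map<Long, Double> hourlyAverage(long[] timestamps, double[] values) {
        Map<Long, double[]> acc = new TreeMap<>(); // bucketStart -> [sum, count]
        for (int i = 0; i < timestamps.length; i++) {
            long bucket = (timestamps[i] / HOUR_MS) * HOUR_MS; // floor to the hour
            double[] sc = acc.computeIfAbsent(bucket, b -> new double[2]);
            sc[0] += values[i];
            sc[1] += 1;
        }
        Map<Long, Double> out = new TreeMap<>();
        acc.forEach((bucket, sc) -> out.put(bucket, sc[0] / sc[1]));
        return out;
    }
}
```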
NewSQL
NewSQL combines the scalability of distributed systems with full SQL support and ACID transactions, making it suitable for online transaction processing (OLTP). Examples include Google Cloud Spanner, Alibaba OceanBase, and CockroachDB.
Advantages: native SQL, strong consistency, high performance, horizontal scaling, and fault tolerance.
Recommended Reading
Spring Boot Integration with Kafka
Distributed Configuration Center Selection – Why Apollo?
Why MySQL Uses Repeatable Read as Default Isolation Level?