Designing a High‑Availability, Auto‑Scaling KV Storage System Based on Memcached and Redis
This article examines common NoSQL key‑value stores such as Memcached and Redis, compares their strengths and limitations, and proposes a distributed architecture with routing, storage, management, and migration nodes that achieves high availability, automatic fault‑tolerance, load balancing, and elastic scaling.
Unlike the early Internet era, modern social and mobile applications generate massive read/write requests and explosive data growth, making traditional relational databases a performance and scalability bottleneck. NoSQL databases, especially key‑value (KV) stores like Memcached and Redis, offer simple data models, high performance, and easy scalability.
KV stores are widely used both as caches and as primary data stores. The classic cache‑plus‑DB pattern stores frequently accessed data in a cache (e.g., Memcached) to reduce database load, while writes still go directly to the DB and cache entries are invalidated.
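The cache-plus-DB pattern above can be sketched as a few lines of code. This is a minimal illustration, not a real Memcached client: `cache` and `db` are in-memory dicts standing in for the cache layer and the relational database, and the function names are assumptions.

```python
cache = {}   # stands in for Memcached
db = {}      # stands in for the relational database

def read(key):
    # Serve from the cache when possible; on a miss, fall back to the
    # DB and populate the cache for subsequent reads.
    if key in cache:
        return cache[key]
    value = db.get(key)
    if value is not None:
        cache[key] = value
    return value

def write(key, value):
    # Writes go straight to the DB; the cache entry is invalidated
    # rather than updated, so the next read repopulates it.
    db[key] = value
    cache.pop(key, None)
```

Invalidating on write (rather than updating the cache in place) is what keeps the cache from serving stale data when two writers race.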
Memcached stores data only in memory, making it suitable for cache‑layer use cases such as static pages, configuration data, hot user profiles, and near‑real‑time statistics. However, it cannot improve write performance, suffers frequent cache misses when hot data is widely dispersed, and loses all data on node failure, requiring a cache warm‑up period that can stress the backing database.
Redis improves on Memcached by providing persistence (AOF and RDB), master‑slave replication with read/write separation, and a rich set of data structures (string, hash, list, set, sorted set, etc.). These features enable Redis to serve both as a cache and as a primary store for persistent data, offering higher read/write performance, atomic operations, and use‑case flexibility such as counters, leaderboards, de‑duplication, and precise expiration.
Despite its advantages, Redis lacks built‑in automatic fault‑tolerance; a master or slave failure can cause temporary request failures and may lead to data inconsistency. Full‑copy replication consumes significant memory and can impact cluster performance, and online scaling is complex, requiring careful capacity planning.
The article proposes a high‑availability, auto‑elastic KV storage system that combines the protocols of Memcached and Redis, uses memory as the primary medium, and provides ultra‑high read/write throughput, automatic failover, load balancing, and elastic scaling.
System Architecture
The system is distributed across four node types: stateless routing nodes that hash keys to storage nodes; storage nodes that handle reads and writes and report heartbeats; a management node that maintains routing tables and decides on failover, load balancing, and scaling; and a migration node that copies or moves data during rebalancing.
All nodes except the routing nodes run in a primary‑backup (master‑slave) mode, and multiple routing nodes can sit behind an LVS load balancer for fault tolerance.
Key Technical Points
Data Distribution
Data is distributed using consistent hashing with virtual nodes to achieve balanced key placement and monotonicity, ensuring minimal data movement when nodes are added or removed.
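A minimal consistent-hash ring with virtual nodes can make the scheme concrete. The hash function, replica count, and class name here are illustrative assumptions, not the article's actual implementation:

```python
import bisect
import hashlib

class HashRing:
    """Consistent-hash ring; each physical node owns `vnodes` points."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self.ring = []          # sorted list of (hash, node)
        for node in nodes:
            self.add(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        # Insert one ring point per virtual node, keeping the ring sorted.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node):
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def lookup(self, key):
        # The first virtual node clockwise from the key's hash owns it.
        h = self._hash(key)
        i = bisect.bisect(self.ring, (h, ""))
        return self.ring[i % len(self.ring)][1]
```

Monotonicity follows directly: removing a node only reassigns the keys that node owned, since every other key's nearest clockwise ring point is unchanged.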
Automatic Fault‑Tolerance and Recovery
Storage nodes periodically send heartbeats to the management node. If a master stops reporting, the backup is promoted, the routing table is updated, and traffic is redirected. The system supports both full‑copy and incremental recovery modes, automatically choosing based on the downtime duration.
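The heartbeat-driven failover logic might look like the sketch below. The timeout values, the routing-table shape, and the rule for choosing a recovery mode are all assumptions for illustration; the article does not specify them.

```python
import time

HEARTBEAT_TIMEOUT = 5.0      # seconds without a heartbeat => node failed
INCREMENTAL_WINDOW = 60.0    # short outages recover incrementally

class ManagementNode:
    def __init__(self, routing_table):
        # routing_table maps shard -> {"master": node, "backup": node}
        self.routing_table = routing_table
        self.last_seen = {}

    def heartbeat(self, node, now=None):
        self.last_seen[node] = now if now is not None else time.time()

    def check(self, now):
        """Promote backups for failed masters; yield failover events."""
        for shard, pair in self.routing_table.items():
            master = pair["master"]
            down_for = now - self.last_seen.get(master, now)
            if down_for > HEARTBEAT_TIMEOUT:
                # Promote the backup and update the routing table.
                pair["master"], pair["backup"] = pair["backup"], master
                # Pick the recovery mode for the failed node by outage length.
                mode = "incremental" if down_for <= INCREMENTAL_WINDOW else "full-copy"
                yield shard, master, mode
```

The key idea is that the decision is centralized: storage nodes only report, and the management node alone mutates the routing table, so routing nodes always see a consistent view.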
Load Balancing and Elastic Scaling
Storage nodes report I/O and capacity metrics. The management node detects load or capacity imbalance and instructs the migration node to move virtual nodes (and their keys) from overloaded to under‑utilized nodes. When cluster load exceeds thresholds, new nodes are added; when load drops, nodes are removed, all while preserving consistent‑hash monotonicity.
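The rebalancing decision described above can be reduced to a small policy function. The 20% imbalance threshold and the single scalar load metric are assumptions here; a real system would weigh I/O, capacity, and migration cost separately.

```python
IMBALANCE_THRESHOLD = 1.2   # migrate when a node is 20% above the average

def plan_migration(loads):
    """loads: dict of node -> load metric.

    Returns a (source, destination) pair when the most-loaded node
    exceeds the cluster average by the threshold, else None.
    """
    avg = sum(loads.values()) / len(loads)
    src = max(loads, key=loads.get)
    dst = min(loads, key=loads.get)
    if loads[src] > avg * IMBALANCE_THRESHOLD and src != dst:
        return src, dst
    return None
```

In the proposed architecture the management node would run a check like this periodically, then hand the chosen virtual nodes to the migration node, so data movement never blocks the storage nodes' request path.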
Images illustrating the NoSQL landscape, system architecture, consistent‑hash process, fault‑tolerance flow, and node scaling are included in the original article.
Conclusion
Providing high availability and automatic elastic scaling for a KV storage system addresses critical operational challenges such as node failures, disk crashes, and sudden traffic spikes. It also reduces operational overhead and risk, ensuring a durable, stable service.
Art of Distributed System Architecture Design