Mastering NoSQL at Scale: KV Architecture Evolution, Codis+, and Aerospike Insights
In the era of big data, more is asked of DBAs and NoSQL technologies have come to the fore. This article covers GeTui's KV storage architecture evolution, operational challenges, NoSQL selection criteria, Codis+ enhancements, Aerospike evaluation, monitoring practices, and best-practice recommendations for scalable database operations.
In the big‑data era, enterprises demand more from DBAs, and NoSQL has attracted increasing attention. Based on the DBA work of GeTui’s SRA team, this article shares two main topics: the evolution of the company’s KV storage architecture and the operational problems it must solve, and thoughts on NoSQL selection and future development.
NoSQL Origins
The first general‑purpose computer appeared in 1946, but only with the advent of RDBMS in the 1970s did a universal data‑storage solution emerge. In the 21st century, data volume became a critical issue, prompting Google and Amazon to propose their own NoSQL solutions, such as Google’s Bigtable in 2006. The term “NoSQL” was formally introduced at a 2009 conference, and by April 2018 there were 225 NoSQL solutions, a small subset of which GeTui uses.
Key Differences Between NoSQL and RDBMS
NoSQL offers schema‑less flexibility and native scalability. Replication provides read‑scale and high availability, while sharding solves both read/write and capacity scaling. Most NoSQL products combine replication and sharding.
Sharding Techniques
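Sharding divides data either by range (e.g., HBase row-key) or by hash. Plain hash sharding remaps a large share of keys when nodes are added or removed (poor monotonicity) and can spread load unevenly (poor balance). Virtual nodes are widely used to address both problems; Codis also adopts virtual nodes (its slots), creating a mapping layer between data shards and host servers.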
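The mapping layer described above can be sketched in a few lines. This is a minimal illustration, not Codis's implementation: it assumes the CRC32-mod-1024 slot scheme that Codis documents, and the group names and helper functions (`build_slot_table`, `route`, `move_slots`) are hypothetical.

```python
import zlib

NUM_SLOTS = 1024  # assumption: Codis-style fixed slot space


def slot_for_key(key: str) -> int:
    """Map a key to a virtual node (slot) with CRC32."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SLOTS


def build_slot_table(groups: list[str]) -> dict[int, str]:
    """Assign contiguous slot ranges to server groups (hypothetical layout)."""
    return {slot: groups[slot * len(groups) // NUM_SLOTS] for slot in range(NUM_SLOTS)}


def route(key: str, table: dict[int, str]) -> str:
    """Resolve key -> slot -> server group."""
    return table[slot_for_key(key)]


def move_slots(table: dict[int, str], slots: range, target: str) -> None:
    """Rebalance by reassigning slots; keys in other slots keep their group."""
    for slot in slots:
        table[slot] = target


table = build_slot_table(["group-1", "group-2", "group-3"])
print(route("user:42", table))
```

Because keys hash to slots rather than directly to hosts, adding a server group only requires moving some slots to it; every key outside those slots stays where it was.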
GeTui’s Common NoSQL Solutions
[Diagram: scale of GeTui's Redis system] Initially, the architecture used Redis for caching and MySQL for persistence. From 2012 to 2016, rapid business growth made a single node insufficient, which led to a self-developed Redis sharding solution and a custom client that supports read/write ratios, fault detection, slow-query monitoring, and health checks. Later, the open-source Codis project from the Wandoujia team was adopted.
Advantages of GeTui Codis+
Codis is a proxy-based architecture that supports native clients, provides web-based cluster operations and monitoring, and integrates Redis Sentinel, improving operational efficiency and HA deployment. Codis+ adds three enhancements:
2N+1 replica scheme to eliminate master single‑point failure during faults.
Redis semi‑synchronous reads, allowing slaves to serve reads within a configurable timeout (e.g., 5 seconds).
Resource pooling similar to HBase’s region server expansion.
Additional features include rack awareness and cross‑IDC synchronization, which are typically enterprise‑grade capabilities.
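One possible reading of the semi-synchronous read enhancement is that a slave may serve reads only while its replication lag stays within the configured threshold, with reads falling back to the master otherwise. The sketch below illustrates that reading only; the source does not describe the Codis+ mechanism, and `Replica`, `lag_seconds`, and `pick_read_target` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    lag_seconds: float  # hypothetical: seconds the slave is behind the master


def pick_read_target(master: str, replicas: list[Replica],
                     max_lag_seconds: float = 5.0) -> str:
    """Return the freshest slave within the lag budget, else the master."""
    fresh = [r for r in replicas if r.lag_seconds <= max_lag_seconds]
    if not fresh:
        return master  # no slave is fresh enough, so read from the master
    return min(fresh, key=lambda r: r.lag_seconds).name


replicas = [Replica("slave-a", 0.4), Replica("slave-b", 12.0)]
print(pick_read_target("master-1", replicas))
```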
Why Not Native Redis Cluster?
Native Redis cluster couples routing and data management, so a failure in one component can corrupt data. Its peer‑to‑peer consensus becomes slow in large clusters, whereas Codis uses a tree‑type architecture that avoids this bottleneck. Moreover, native clusters lack endorsement from large platforms.
Evaluating Aerospike
GeTui is testing Aerospike as a replacement for part of the Redis cluster because Redis’s in‑memory model incurs high TCO. Aerospike can store data on SSDs with optimizations, offers resource pooling, and supports rack awareness and cross‑IDC sync (enterprise version). Two internal services using Aerospike achieved nearly 100 k QPS on a single physical machine with an Intel NVMe SSD, making it cost‑effective for large‑capacity, moderate‑QPS workloads.
Operational Challenges and Practices
Standardized installation covers three parts: OS standardization, Redis file/directory standards, and Redis parameter standardization. All three are implemented with SaltStack and a CMDB.
Scaling and shrinking have become easier thanks to Codis, and Aerospike further simplifies these operations.
Monitoring is critical to reduce operational cost. GeTui recommends reading “Site Reliability Engineering: How Google Runs Production Systems.” The monitoring system tracks three kinds of objects (clusters, instances, and hosts) and maintains the metadata relationships among them for global aggregation. Zabbix was the primary monitoring platform for three years but suffers from MySQL TPS limits and inflexibility. Open-Falcon solved some issues but lacked alert flexibility; custom extensions were added.
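The cluster/instance/host metadata relationship can be illustrated with a small roll-up function. This is a sketch under assumptions, not GeTui's monitoring code: the `Instance` record, the `aggregate` helper, and the sample values are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    instance_id: str  # e.g. "10.0.0.1:6379"
    cluster: str      # logical cluster the instance belongs to
    host: str         # physical host the instance runs on


def aggregate(instances: list[Instance], samples: dict[str, float],
              level: str) -> dict[str, float]:
    """Sum a per-instance metric up to the 'cluster' or 'host' level."""
    if level not in ("cluster", "host"):
        raise ValueError("level must be 'cluster' or 'host'")
    totals: dict[str, float] = defaultdict(float)
    for inst in instances:
        totals[getattr(inst, level)] += samples.get(inst.instance_id, 0.0)
    return dict(totals)


instances = [
    Instance("10.0.0.1:6379", "push", "host-1"),
    Instance("10.0.0.2:6379", "push", "host-2"),
    Instance("10.0.0.1:6380", "profile", "host-1"),
]
qps = {"10.0.0.1:6379": 100.0, "10.0.0.2:6379": 200.0, "10.0.0.1:6380": 50.0}
print(aggregate(instances, qps, "cluster"))
print(aggregate(instances, qps, "host"))
```

Keeping the metadata separate from the samples means the same per-instance metric can answer both "which cluster is hot?" and "which host is overloaded?".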
Common Pitfalls
Master-slave reset (the slave falling back to a full resynchronization) can cause master overload and service disruption. Causes include a small repl-backlog-size (default 1 MB), a short repl-timeout (default 60 s), and a low tcp-backlog. Using Redis 2.8.20 makes resets frequent.
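Why a small repl-backlog-size leads to resets can be shown with simple arithmetic: a slave can resume with a partial resynchronization only if everything written during its disconnect still fits in the master's backlog. The sketch below is a rule-of-thumb estimate, not an official Redis formula; the function names, the safety factor, and the example write rate are assumptions.

```python
MB = 1024 * 1024


def survives_disconnect(backlog_bytes: int, write_bytes_per_sec: float,
                        disconnect_sec: float) -> bool:
    """True if writes made during the disconnect still fit in the backlog."""
    return write_bytes_per_sec * disconnect_sec <= backlog_bytes


def min_repl_backlog_bytes(write_bytes_per_sec: float, max_disconnect_sec: float,
                           safety_factor: float = 2.0) -> int:
    """Estimate a backlog size that covers the longest expected disconnect."""
    return int(write_bytes_per_sec * max_disconnect_sec * safety_factor)


# Hypothetical workload: 2 MB/s of writes, slave disconnected for 10 s.
print(survives_disconnect(1 * MB, 2 * MB, 10))   # default 1 MB backlog
print(min_repl_backlog_bytes(2 * MB, 60) // MB)  # suggested size in MB
```

With the default 1 MB backlog, half a second of writes at this hypothetical rate is already enough to force a full resynchronization.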
Oversized nodes lead to long persistence times, high swap usage, and increased risk of master‑slave reset. Splitting nodes efficiently and increasing shard count (e.g., from 500 to 1024 or 16384) mitigates this.
Case Studies
Case 1: Master‑slave reset during peak messaging – High load caused TCP packet loss, triggering repl-timeout and master‑slave reset. The root cause was default parameters and oversized nodes.
Case 2: Codis master‑slave switch issue – After a host failure, Codis performed a master‑slave switch, but re‑establishing replication failed because the master’s buffer overflowed while the slave was still syncing, causing a reset loop.
Best‑Practice Recommendations
Configure CPU affinity for Redis processes.
Keep node size around 10 GB.
Ensure host free memory > node size + 10 GB; avoid swap.
Increase tcp-backlog, repl-backlog-size, and repl-timeout appropriately.
Let the master avoid persistence; let slaves use AOF with periodic resets.
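The sizing recommendations above (node size around 10 GB, host free memory greater than node size plus 10 GB, no swap) lend themselves to a simple pre-deployment check. The following is a minimal sketch; `NodePlan`, `check_plan`, and `shards_needed` are hypothetical helpers, and the thresholds simply restate the numbers in the list.

```python
from dataclasses import dataclass

GB = 1024 ** 3


@dataclass
class NodePlan:
    node_bytes: int       # expected memory footprint of one Redis node
    host_free_bytes: int  # free memory on the host
    swap_used_bytes: int  # swap currently in use on the host


def check_plan(plan: NodePlan, target_node_bytes: int = 10 * GB,
               headroom_bytes: int = 10 * GB) -> list[str]:
    """Return a warning for every recommendation the plan violates."""
    warnings = []
    if plan.node_bytes > target_node_bytes:
        warnings.append("node larger than target size; consider splitting")
    if plan.host_free_bytes <= plan.node_bytes + headroom_bytes:
        warnings.append("host free memory must exceed node size plus headroom")
    if plan.swap_used_bytes > 0:
        warnings.append("host is using swap")
    return warnings


def shards_needed(dataset_bytes: int, target_node_bytes: int = 10 * GB) -> int:
    """Ceiling division: nodes needed to keep each at or under the target."""
    return -(-dataset_bytes // target_node_bytes)


print(check_plan(NodePlan(node_bytes=24 * GB, host_free_bytes=30 * GB, swap_used_bytes=0)))
print(shards_needed(95 * GB))
```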
NoSQL Selection Guidelines
Match the solution to business logic (KV, graph, etc.).
Consider load characteristics (QPS, TPS, latency).
Account for data scale; massive TB/PB workloads often require Hadoop‑style systems.
Evaluate operational cost, monitoring ease, and scalability.
Check for successful case studies, documentation, community support, and vendor backing.
Conclusion
NoSQL has evolved from “know SQL” in the 1980s to “Not only SQL” in 2005 and now “No SQL.” Its progress reflects continuous learning, thoughtful design, and relentless experimentation by engineers.
Author: Internet Architect
