How Xiaomi Scaled Redis with Kubernetes: Resource Isolation, Automation, and Proxy Design
This article details Xiaomi's migration of tens of thousands of Redis instances from physical servers to Kubernetes, explaining the motivations for containerization, the deployment architecture with StatefulSets and a custom Redis Proxy, performance comparisons of proxy options, operational challenges, and the resulting benefits in resource utilization, automation, and service stability.
Background
Xiaomi operates a massive Redis deployment with tens of thousands of instances and hundreds of trillions of daily accesses, supporting almost all product lines. The original physical‑machine setup lacked resource isolation, causing high CPU contention, unpredictable latency, and heavy manual intervention for node failures.
Why Kubernetes?
Resource isolation : Kubernetes allows setting CPU requests and limits, preventing a single Redis node from monopolizing CPU and reducing latency spikes.
Automation : Deploying Redis clusters via StatefulSets and ConfigMaps reduces the manual, hour‑long process of locating free machines, editing configs, and running redis_trib to just a few minutes.
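The CPU requests and limits described above are set per container in the pod template. The following is a minimal sketch that builds such a fragment as a plain Python dictionary mirroring the Kubernetes container schema. The container name, image tag, and the specific CPU and memory values are illustrative assumptions, not Xiaomi's actual settings.

```python
import json


def redis_container(cpu_request: str, cpu_limit: str, memory: str) -> dict:
    """Build a container spec fragment with CPU/memory requests and limits."""
    return {
        "name": "redis",       # hypothetical container name
        "image": "redis:5.0",  # hypothetical image tag
        "resources": {
            # The scheduler uses requests to decide where the pod fits.
            "requests": {"cpu": cpu_request, "memory": memory},
            # The limit caps what the container may consume at runtime.
            "limits": {"cpu": cpu_limit, "memory": memory},
        },
    }


if __name__ == "__main__":
    # Illustrative values only: request half a core, cap at one core.
    print(json.dumps(redis_container("500m", "1", "4Gi"), indent=2))
```

Because the CPU limit is enforced per container, a busy Redis node is throttled at its own limit instead of taking CPU time from its neighbors on the same host.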
How Kubernetes Is Used
Clients access Redis through an LVS VIP, which forwards traffic to a Redis Proxy. The Redis Cluster runs as a StatefulSet and persists its RDB/AOF files on Ceph Block Service PVs. Although Ceph adds 100‑200 ms of storage latency, Redis writes its RDB/AOF files asynchronously, so the added latency has little effect on request performance.
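A StatefulSet gets per-pod persistent storage through volumeClaimTemplates. The sketch below shows roughly what such a manifest looks like, again as a Python dictionary. The resource name, storage class name (ceph-rbd), volume size, mount path, and image are all assumptions made for illustration; the article does not give Xiaomi's actual manifest.

```python
import json


def redis_statefulset(name: str, replicas: int,
                      storage_class: str, size: str) -> dict:
    """Build a StatefulSet manifest whose pods each get their own PVC."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name},
        "spec": {
            "serviceName": name,  # headless Service, assumed to exist
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": "redis",
                        "image": "redis:5.0",  # hypothetical image tag
                        # RDB/AOF files are written under this path.
                        "volumeMounts": [
                            {"name": "data", "mountPath": "/data"},
                        ],
                    }],
                },
            },
            # One PersistentVolumeClaim is created per pod from this template.
            "volumeClaimTemplates": [{
                "metadata": {"name": "data"},
                "spec": {
                    "accessModes": ["ReadWriteOnce"],
                    "storageClassName": storage_class,  # hypothetical name
                    "resources": {"requests": {"storage": size}},
                },
            }],
        },
    }


if __name__ == "__main__":
    print(json.dumps(redis_statefulset("redis-cluster", 6, "ceph-rbd", "10Gi"),
                     indent=2))
```

Each pod keeps the same claim across restarts, so a rescheduled Redis node finds its previous RDB/AOF files on the reattached volume.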
Proxy Selection and Performance Test
Open‑source proxies considered were Codis, Twemproxy, redis‑cluster‑proxy, Cerberus, and Predixy. Xiaomi chose to evaluate Cerberus and Predixy on Kubernetes.
Test environment
Benchmark tool: redis-benchmark
Proxy CPU: 2 cores
Client CPU: 2 cores
Redis Cluster: 3 master nodes, 1 CPU per node
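A run against such a setup could be scripted along the following lines. This is a sketch that only assembles redis-benchmark command lines; the host name, port, value sizes, and request counts are assumptions, since the article does not list the exact parameters used.

```python
import shlex


def benchmark_command(host: str, port: int, value_size: int,
                      clients: int = 50, requests: int = 100000) -> list:
    """Assemble a redis-benchmark invocation for SET/GET at one value size."""
    return [
        "redis-benchmark",
        "-h", host,              # proxy address under test
        "-p", str(port),
        "-t", "set,get",         # restrict the run to SET and GET
        "-d", str(value_size),   # value size in bytes
        "-c", str(clients),      # parallel client connections
        "-n", str(requests),     # total number of requests
        "-q",                    # quiet mode: print only the QPS summary
    ]


if __name__ == "__main__":
    # Hypothetical proxy endpoint and value sizes, chosen for illustration.
    for size in (64, 512, 4096):
        cmd = benchmark_command("redis-proxy.example.internal", 6379, size)
        print(shlex.join(cmd))
```

Running the same command list against each proxy, with only the host changed, keeps the comparison on equal terms.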
Test results
Under the same workload, Predixy achieved 33‑60 % higher QPS than Cerberus at comparable latency, with the advantage most visible as key/value size increased. Consequently, Predixy was selected.
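For reference, a "percent higher QPS" figure of this kind is a relative difference against the slower proxy. The sketch below shows the arithmetic; the QPS inputs are invented round numbers chosen only to illustrate the formula and are not measurements from Xiaomi's tests.

```python
def relative_gain_percent(baseline_qps: float, candidate_qps: float) -> float:
    """Return how much higher candidate_qps is than baseline_qps, in percent."""
    if baseline_qps <= 0:
        raise ValueError("baseline_qps must be positive")
    return (candidate_qps - baseline_qps) / baseline_qps * 100.0


if __name__ == "__main__":
    # Invented inputs, not benchmark results.
    print(round(relative_gain_percent(50000, 80000), 1))
    print(round(relative_gain_percent(60000, 80000), 1))
```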
Proxy Deployment and Features
Deployment : Proxy runs as a Deployment (stateless), fronted by LVS for load balancing and dynamic scaling.
Auto‑scaling : Kubernetes HPA scales Proxy pods based on average CPU usage, with a 171‑second grace period before scaling down to avoid rapid fluctuations.
Dynamic cluster switching : Proxy can add or switch backend Redis clusters on the fly, enabling seamless upgrades and expansions without client changes.
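The auto-scaling behavior can be sketched as follows. The replica formula and the default tolerance of 0.1 follow the documented Kubernetes HPA algorithm; modeling the 171-second delay as a scale-down stabilization window is an assumption, as are the class and function names and the CPU figures in the example.

```python
import math
from collections import deque


def desired_replicas(current_replicas: int, current_cpu: float,
                     target_cpu: float, tolerance: float = 0.1) -> int:
    """desired = ceil(current_replicas * current_metric / target_metric)."""
    ratio = current_cpu / target_cpu
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no change
    return math.ceil(current_replicas * ratio)


class ScaleDownStabilizer:
    """Apply the highest recommendation seen within a trailing window.

    Scale-ups pass through immediately (a new maximum wins at once), while
    scale-downs take effect only after the window holds no higher value.
    """

    def __init__(self, window_seconds: float = 171.0):
        self.window_seconds = window_seconds
        self._history = deque()  # (timestamp, recommendation) pairs

    def recommend(self, now: float, desired: int) -> int:
        self._history.append((now, desired))
        # Drop recommendations that have aged out of the window.
        while now - self._history[0][0] > self.window_seconds:
            self._history.popleft()
        return max(rec for _, rec in self._history)


if __name__ == "__main__":
    stabilizer = ScaleDownStabilizer()
    # Hypothetical samples: (time in seconds, average CPU %), target 60 %.
    for t, cpu in [(0, 90), (60, 30), (200, 30)]:
        want = desired_replicas(4, cpu, 60)
        print(t, want, stabilizer.recommend(t, want))
```

Holding the higher replica count for the length of the window is what keeps a brief dip in CPU usage from removing Proxy pods that would be needed again moments later.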
Why a Proxy?
IP stability: Kubernetes pod IPs can change on restart, breaking client connections; the Proxy hides cluster topology behind a stable VIP.
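Connection load: Clients connect to the Proxy rather than to Redis directly, so the Proxy can multiplex many client connections over fewer backend connections, reducing the CPU that Redis nodes spend on connection handling.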
Zero‑downtime migrations: Proxy allows data sync and cluster switch without restarting client applications.
Benefits of the Kubernetes‑Based Architecture
Ease of deployment : API‑driven cluster creation reduces operational effort.
Port management : Pods have unique IPs, eliminating the port exhaustion issue faced on physical machines.
Lower client barrier : Applications use a simple non‑smart client to connect to the VIP.
Improved client performance : Reduced client‑side hashing overhead.
Dynamic scaling and upgrades : API‑controlled scaling and rolling updates.
Higher resource utilization and stability : Kubernetes isolates resources while maintaining service reliability.
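The client-side hashing that a non-smart client avoids is the Redis Cluster key-to-slot mapping: CRC16 of the key modulo 16384, with hash tags in braces taken into account. The sketch below implements that mapping as defined in the Redis Cluster specification, using Python's binascii.crc_hqx with an initial value of 0 as the CRC16 variant. The function name is hypothetical, and this is not code from Xiaomi's Proxy.

```python
import binascii

SLOT_COUNT = 16384  # number of hash slots in a Redis Cluster


def key_slot(key: bytes) -> int:
    """Map a key to its Redis Cluster hash slot."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        # Hash only the tag if the braces enclose at least one byte.
        if end != -1 and end != start + 1:
            key = key[start + 1:end]
    return binascii.crc_hqx(key, 0) % SLOT_COUNT


if __name__ == "__main__":
    print(key_slot(b"foo"))  # 12182
    # Keys sharing a hash tag land in the same slot.
    print(key_slot(b"{user1000}.following") == key_slot(b"{user1000}.followers"))
```

A smart client runs this computation for every command and must also track which node owns each slot. With the Proxy in front, both tasks move out of the application.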
Encountered Issues and Mitigations
Pod restart data loss : A StatefulSet with a pre‑stop hook (sleep 171 seconds) gives LVS time to drain traffic before the pod terminates.
StatefulSet limitations : Native StatefulSet cannot fully satisfy Redis Cluster placement constraints; a custom CRD RedisStatefulSet with anti‑affinity strategies was developed.
LVS mapping latency : The extra hop through LVS adds a measured delay of less than 0.1 ms for get/set operations.
Security and auditing : Proxy adds password‑based role separation and logs high‑risk commands, addressing the lack of native ACLs in Redis versions before 6.0.
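Two of the mitigations above, the pre-stop sleep and anti-affinity placement, map onto standard pod spec fields. The sketch below uses native Kubernetes fields only and is not Xiaomi's custom RedisStatefulSet CRD, whose schema the article does not describe. The container name, image, label, and the 30-second shutdown margin are assumptions.

```python
import json


def pod_spec_with_drain(app: str, drain_seconds: int = 171,
                        shutdown_margin: int = 30) -> dict:
    """Pod spec fragment: sleep before shutdown, one pod per host."""
    return {
        # The preStop hook runs inside the termination grace period, and the
        # Kubernetes default is 30 seconds, so the period must be raised
        # above the sleep or the pod is killed before the sleep finishes.
        "terminationGracePeriodSeconds": drain_seconds + shutdown_margin,
        "affinity": {
            "podAntiAffinity": {
                # Never schedule two pods with the same app label on one host.
                "requiredDuringSchedulingIgnoredDuringExecution": [{
                    "labelSelector": {"matchLabels": {"app": app}},
                    "topologyKey": "kubernetes.io/hostname",
                }],
            },
        },
        "containers": [{
            "name": app,                      # hypothetical container name
            "image": "example/redis:latest",  # hypothetical image
            "lifecycle": {
                "preStop": {
                    "exec": {"command": ["sleep", str(drain_seconds)]},
                },
            },
        }],
    }


if __name__ == "__main__":
    print(json.dumps(pod_spec_with_drain("redis"), indent=2))
```

The hostname-level anti-affinity rule shown here is the simplest form of placement constraint. Spreading the masters and replicas of one shard across hosts needs finer-grained rules, which is the kind of gap the custom CRD was built to close.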
Conclusion
After more than six months of operation, dozens of Redis clusters serving multiple business lines run on Kubernetes with significantly reduced operational overhead and improved stability. Ongoing work focuses on further enhancing resource efficiency, security, and multi‑cluster support.
dbaplus Community
