How Xiaomi Scaled Redis with Kubernetes: Resource Isolation, Automation, and Proxy Design

This article details Xiaomi's migration of tens of thousands of Redis instances from physical servers to Kubernetes, explaining the motivations for containerization, the deployment architecture with StatefulSets and a custom Redis Proxy, performance comparisons of proxy options, operational challenges, and the resulting benefits in resource utilization, automation, and service stability.

dbaplus Community

Nov 16, 2020

Background

Xiaomi operates a massive Redis deployment with tens of thousands of instances and hundreds of trillions of daily accesses, supporting almost all product lines. The original physical‑machine setup lacked resource isolation, causing high CPU contention, unpredictable latency, and heavy manual intervention for node failures.

Why Kubernetes?

Resource isolation : Kubernetes allows setting CPU requests and limits, preventing a single Redis node from monopolizing CPU and reducing latency spikes.

Automation : Deploying Redis clusters via StatefulSets and ConfigMaps reduces the manual, hour‑long process of locating free machines, editing configs, and running redis_trib to just a few minutes.
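As a rough illustration of the resource isolation point, the sketch below builds the container section of a pod spec with CPU requests and limits. The image tag, the memory value, and the function names are assumptions made for this example; the article does not publish Xiaomi's actual manifests.

```python
def cpu_to_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity such as "500m" or "1.5" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def redis_container(cpu_request: str, cpu_limit: str, memory: str) -> dict:
    """Build a container spec whose CPU share is bounded by requests and limits."""
    if cpu_to_millicores(cpu_request) > cpu_to_millicores(cpu_limit):
        raise ValueError("CPU request must not exceed CPU limit")
    return {
        "name": "redis",
        "image": "redis:5.0",  # assumption: any Redis image would do here
        "resources": {
            # The scheduler reserves the requested amount on the node.
            "requests": {"cpu": cpu_request, "memory": memory},
            # The limit caps usage, so one node cannot starve its neighbours.
            "limits": {"cpu": cpu_limit, "memory": memory},
        },
    }


if __name__ == "__main__":
    print(redis_container("1", "1", "4Gi"))
```

Setting the request equal to the limit, as in the usage line, is one common choice for latency-sensitive workloads because the container then gets a predictable CPU share.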

How Kubernetes Is Used

Clients access Redis through an LVS VIP, which forwards traffic to a Redis Proxy. The Redis Cluster runs as a StatefulSet and persists its RDB/AOF files on Ceph Block Service PVs. Ceph adds 100‑200 ms of read/write latency, but Redis writes its RDB/AOF files asynchronously, so this storage latency does not affect request handling.
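A StatefulSet that gives each Redis pod its own persistent volume might look roughly like the manifest generated below. This is a minimal sketch: the names `redis-demo` and `ceph-rbd`, the image tag, and the volume size are placeholders chosen for illustration, not values from Xiaomi's environment.

```python
import json


def redis_statefulset(name: str, replicas: int, storage_class: str, size: str) -> dict:
    """Build a StatefulSet manifest with one PersistentVolumeClaim per pod."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name},
        "spec": {
            "serviceName": name,  # headless Service that gives pods stable DNS names
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": "redis",
                        "image": "redis:5.0",  # assumption
                        # RDB/AOF files land on the per-pod volume mounted here.
                        "volumeMounts": [{"name": "data", "mountPath": "/data"}],
                    }],
                },
            },
            "volumeClaimTemplates": [{
                "metadata": {"name": "data"},
                "spec": {
                    "accessModes": ["ReadWriteOnce"],
                    "storageClassName": storage_class,  # hypothetical Ceph-backed class
                    "resources": {"requests": {"storage": size}},
                },
            }],
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON manifests as well as YAML.
    print(json.dumps(redis_statefulset("redis-demo", 6, "ceph-rbd", "10Gi"), indent=2))
```

Because the claim comes from `volumeClaimTemplates`, a pod that is rescheduled reattaches to the same volume and can reload its persisted data.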

Proxy Selection and Performance Test

Open‑source proxies considered were Codis, Twemproxy, redis‑cluster‑proxy, Cerberus, and Predixy. Xiaomi chose to evaluate Cerberus and Predixy on Kubernetes.

Test environment

Benchmark tool: redis-benchmark

Proxy CPU: 2 cores

Client CPU: 2 cores

Redis Cluster: 3 master nodes, 1 CPU per node
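A benchmark sweep of this kind could be scripted along the following lines. The sketch only assembles `redis-benchmark` command lines for a range of value sizes; the host, port, request count, client count, and the sizes themselves are assumptions, since the article does not list the exact parameters used.

```python
import shlex


def benchmark_commands(host: str, port: int, value_sizes: list,
                       requests: int = 100000, clients: int = 50) -> list:
    """Return one redis-benchmark argument list per value size."""
    commands = []
    for size in value_sizes:
        commands.append([
            "redis-benchmark",
            "-h", host,
            "-p", str(port),
            "-t", "set,get",      # only run the SET and GET tests
            "-d", str(size),      # value size in bytes
            "-n", str(requests),  # total number of requests
            "-c", str(clients),   # parallel client connections
            "-q",                 # quiet mode: print only the QPS summary
        ])
    return commands


if __name__ == "__main__":
    # Placeholder address; point this at the proxy under test.
    for cmd in benchmark_commands("127.0.0.1", 6379, [64, 1024, 10240]):
        print(shlex.join(cmd))
```

Running the same sweep against each proxy, with identical CPU limits, gives QPS figures that can be compared directly.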

Test results

Under the same workload, Predixy achieved 33‑60 % higher QPS than Cerberus with comparable latency, and its advantage grew as key/value size increased. Consequently, Predixy was selected.

Proxy Deployment and Features

Deployment : Proxy runs as a Deployment (stateless), fronted by LVS for load balancing and dynamic scaling.

Auto‑scaling : Kubernetes HPA scales Proxy pods based on average CPU usage, with a 171‑second grace period before scaling down to avoid rapid fluctuations.

Dynamic cluster switching : Proxy can add or switch backend Redis clusters on the fly, enabling seamless upgrades and expansions without client changes.
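The auto-scaling behaviour can be approximated with the replica formula from the Kubernetes HPA documentation, desired = ceil(current × currentMetric / targetMetric), plus a scale-down window that keeps the highest recent recommendation. The sketch below assumes the 171-second grace period acts like such a stabilization window; the article does not say how Xiaomi configured it, and the function names are illustrative.

```python
import math


def desired_replicas(current_replicas: int, current_cpu: float,
                     target_cpu: float, tolerance: float = 0.1) -> int:
    """Apply the HPA formula, skipping changes inside the tolerance band."""
    ratio = current_cpu / target_cpu
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


def stabilized_replicas(recommendations: list, now: float,
                        window_seconds: float = 171) -> int:
    """Pick the highest recommendation seen within the window.

    `recommendations` is a list of (timestamp_seconds, replicas) pairs and
    must include at least one entry inside the window.
    """
    recent = [replicas for ts, replicas in recommendations
              if now - ts <= window_seconds]
    return max(recent)


if __name__ == "__main__":
    print(desired_replicas(4, current_cpu=90, target_cpu=60))
    print(stabilized_replicas([(0, 6), (100, 4), (200, 2)], now=200))
```

Taking the maximum over the window means a brief dip in CPU usage does not remove Proxy pods that may be needed again moments later.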

Why a Proxy?

IP stability: Kubernetes pod IPs can change on restart, breaking client connections; the Proxy hides cluster topology behind a stable VIP.

Connection load: Client connections terminate at the Proxy, which keeps only a small number of connections to each Redis node, so the nodes spend less CPU on connection handling.

Zero‑downtime migrations: Proxy allows data sync and cluster switch without restarting client applications.
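For context on what the Proxy takes off the client's hands: without a proxy, a cluster-aware client must map every key to one of the 16384 Redis Cluster hash slots and track which node owns each slot. The sketch below shows the slot calculation defined in the Redis Cluster specification (CRC16 of the key, modulo 16384, with hash-tag support). It is a standalone illustration, not code from Predixy or from Xiaomi.

```python
import binascii

SLOT_COUNT = 16384


def key_slot(key: str) -> int:
    """Return the Redis Cluster hash slot for a key."""
    data = key.encode("utf-8")
    # Hash tags: if the key contains "{...}" with a non-empty body,
    # only the part between the first "{" and the next "}" is hashed.
    start = data.find(b"{")
    if start != -1:
        end = data.find(b"}", start + 1)
        if end != -1 and end != start + 1:
            data = data[start + 1:end]
    # crc_hqx with an initial value of 0 is the CRC16 variant Redis Cluster uses.
    return binascii.crc_hqx(data, 0) % SLOT_COUNT


if __name__ == "__main__":
    print(key_slot("foo"))  # 12182, the same slot CLUSTER KEYSLOT foo reports
    print(key_slot("{user1000}.following") == key_slot("{user1000}.followers"))
```

With a proxy in front of the cluster, this routing logic and the slot-to-node table live in the Proxy, so applications can use any plain Redis client.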

Benefits of the Kubernetes‑Based Architecture

Ease of deployment : API‑driven cluster creation reduces operational effort.

Port management : Pods have unique IPs, eliminating the port exhaustion issue faced on physical machines.

Lower client barrier : Applications use a simple non‑smart client to connect to the VIP.

Improved client performance : Reduced client‑side hashing overhead.

Dynamic scaling and upgrades : API‑controlled scaling and rolling updates.

Higher resource utilization and stability : Kubernetes isolates resources while maintaining service reliability.

Encountered Issues and Mitigations

Pod restart data loss : Using StatefulSet with pre‑stop hooks (sleep 171 seconds) ensures LVS drains traffic before pod termination.

StatefulSet limitations : Native StatefulSet cannot fully satisfy Redis Cluster placement constraints; a custom CRD RedisStatefulSet with anti‑affinity strategies was developed.

LVS mapping latency : Measured sub‑0.1 ms delay for get/set operations.

Security and auditing : Proxy adds password‑based role separation and logs high‑risk commands, compensating for the lack of native ACLs in Redis versions before 6.0.
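The pre-stop mitigation above can be expressed as a pod spec fragment like the one generated below. This is a sketch under stated assumptions: the container name, the image, and the 30-second margin are hypothetical. One detail worth noting is that the pre-stop hook runs inside the pod's termination grace period, which defaults to 30 seconds, so the grace period has to be raised above the sleep duration for the full 171 seconds to elapse.

```python
def drained_pod_spec(drain_seconds: int = 171, shutdown_margin: int = 30) -> dict:
    """Build a pod spec fragment that sleeps before the container is stopped."""
    return {
        # Must exceed the preStop sleep, otherwise the kubelet kills the
        # container before the load balancer has finished draining traffic.
        "terminationGracePeriodSeconds": drain_seconds + shutdown_margin,
        "containers": [{
            "name": "proxy",                    # hypothetical container name
            "image": "example/predixy:latest",  # hypothetical image
            "lifecycle": {
                "preStop": {
                    "exec": {"command": ["sleep", str(drain_seconds)]},
                },
            },
        }],
    }


if __name__ == "__main__":
    print(drained_pod_spec())
```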

Conclusion

After more than six months of operation, dozens of Redis clusters serving multiple business lines run on Kubernetes with significantly reduced operational overhead and improved stability. Ongoing work focuses on further enhancing resource efficiency, security, and multi‑cluster support.
