How Xiaomi Scaled Redis with Kubernetes: Deploying Redis Cluster on K8s
This article explains how Xiaomi migrated tens of thousands of Redis instances from bare‑metal servers to Kubernetes, using Redis Proxy, StatefulSets, and Ceph storage to achieve resource isolation, automated deployment, dynamic scaling, and improved reliability while addressing latency, IP‑change, and security challenges.
Background
Xiaomi operates tens of thousands of Redis instances, handling billions of accesses every day and supporting almost all of the company's product lines. Previously, all Redis nodes ran on physical machines with no resource isolation, causing high operational overhead, frequent manual intervention for node failures, and unpredictable latency due to CPU contention.
Because CPU was not isolated, heavy RDB persistence or traffic spikes could drive up one node's CPU usage and affect other clusters on the same machine, increasing latency. In addition, the Redis Cluster protocol requires smart clients with many configuration parameters that developers often struggle to set correctly, adding load to application servers.
Why Kubernetes
Resource Isolation
Physical deployments mixed many business lines, leading to CPU contention and difficult troubleshooting. Kubernetes allows explicit CPU requests and limits, improving utilization while preventing resource starvation.
Automated Deployment
On bare metal, deploying a new Redis Cluster required manual resource discovery, configuration edits, and running `redis_trib`, taking one to two hours. With Kubernetes StatefulSets and ConfigMaps, a new cluster can be provisioned in minutes.
How Kubernetes
Clients access Redis through an LVS VIP, which forwards requests to a Redis Proxy that then routes traffic to the Redis Cluster.
Redis Cluster Deployment
The cluster runs as a StatefulSet, persisting RDB/AOF files on the Ceph Block Service through PVCs. Although Ceph adds 100–200 ms of I/O latency compared with local disk, Redis persistence writes are asynchronous, so the impact on service latency is negligible.
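A minimal sketch of such a StatefulSet is shown below. The names, image, replica count, and the `ceph-rbd` StorageClass are illustrative assumptions, not Xiaomi's actual manifests:

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-cluster            # hypothetical name
spec:
  serviceName: redis-cluster
  replicas: 6                    # e.g. 3 masters + 3 slaves
  selector:
    matchLabels:
      app: redis-cluster
  template:
    metadata:
      labels:
        app: redis-cluster
    spec:
      containers:
      - name: redis
        image: redis:5.0
        command: ["redis-server", "/conf/redis.conf"]
        resources:               # explicit CPU/memory isolation
          requests: { cpu: "1", memory: 4Gi }
          limits:   { cpu: "1", memory: 4Gi }
        volumeMounts:
        - name: data             # RDB/AOF files live here
          mountPath: /data
        - name: conf
          mountPath: /conf
      volumes:
      - name: conf
        configMap:
          name: redis-cluster-conf   # cluster-enabled config from a ConfigMap
  volumeClaimTemplates:              # one Ceph RBD volume per pod
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: ceph-rbd     # assumed StorageClass backed by Ceph Block Service
      resources:
        requests:
          storage: 10Gi
```

Each pod gets its own PVC, so a restarted pod reattaches to the same Ceph volume and can reload its last RDB/AOF files.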
Proxy Selection
Open-source proxies were evaluated first. Codis and Twemproxy were excluded because they do not support Redis Cluster, and the official `redis-cluster-proxy` lacks a stable release. The two remaining candidates, Cerberus and Predixy, were benchmarked with `redis-benchmark` on a 2-core proxy and a 2-core client against three master nodes (1 CPU each). Predixy delivered 33–60% higher QPS with comparable latency, especially for larger keys, and was therefore selected.
Predixy was further enhanced with dynamic backend switching, black‑/white‑listing, and audit logging.
Proxy Deployment
The proxy runs as a stateless Deployment behind a LoadBalancer, enabling easy horizontal scaling. Dynamic backend switching allows seamless cluster upgrades without client changes.
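A sketch of this stateless proxy tier under the same assumptions (image name and config source are hypothetical; 7617 is Predixy's default listen port):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: predixy
spec:
  replicas: 3
  selector:
    matchLabels:
      app: predixy
  template:
    metadata:
      labels:
        app: predixy
    spec:
      containers:
      - name: predixy
        image: predixy:latest        # hypothetical image
        ports:
        - containerPort: 7617
        volumeMounts:
        - name: conf
          mountPath: /etc/predixy
      volumes:
      - name: conf
        configMap:
          name: predixy-conf         # proxy config, including backend cluster addresses
---
apiVersion: v1
kind: Service
metadata:
  name: predixy
spec:
  type: LoadBalancer               # fronted by the LVS VIP in this architecture
  selector:
    app: predixy
  ports:
  - port: 6379                     # clients speak plain Redis protocol to the VIP
    targetPort: 7617
```

Because the Deployment holds no state, adding or replacing proxy replicas is just a matter of changing the replica count.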
Proxy Autoscaling
Kubernetes HPA monitors average CPU usage; when it exceeds a threshold, the replica count is increased. After scaling, LVS detects new pods and redirects traffic. To avoid rapid oscillations, scaling down is suppressed for five minutes after a scale‑up event, and minimum/maximum pod counts are configurable.
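The same policy can be expressed with a standard `autoscaling/v2` HPA; the threshold and replica bounds below are assumed values, and the `behavior` stanza (available in newer Kubernetes releases) encodes the five-minute scale-down suppression:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: predixy
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: predixy
  minReplicas: 3                     # configurable floor
  maxReplicas: 10                    # configurable ceiling
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70       # assumed CPU threshold
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # no scale-down for 5 min after a change
```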
Why Proxy
IP Changes on Pod Restart
Redis Cluster clients need fixed IP/Port endpoints; pod restarts in Kubernetes can change IPs, breaking client connections. The proxy hides cluster topology changes, exposing a stable VIP.
Connection Load
Before Redis 6.0, Redis handled most work in a single thread, so many client connections increased CPU usage and latency. Offloading connections to the proxy reduces Redis load.
Cluster Migration
Traditional physical-machine migrations require manual data synchronization and service restarts, and carry real risk. With a proxy, creating the new backend cluster, syncing its data, and switching traffic are all invisible to clients; a single proxy command moves traffic to the new cluster.
Security
Clients authenticate only to the proxy, never directly to Redis. This allows separate passwords for administrators and ordinary users, restricts dangerous operations (e.g., blocking `FLUSHDB` and `CONFIG SET`), and enables audit logging.
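Predixy's configuration supports this split natively through its Authority block; a sketch with illustrative passwords, where write-mode users cannot run admin-level commands such as FLUSHDB or CONFIG:

```
Authority {
    Auth "user-password" {      # ordinary applications
        Mode write              # data read/write commands only
    }
    Auth "admin-password" {     # operators
        Mode admin              # additionally permits admin commands
    }
}
```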
Proxy‑Induced Issues
Extra Hop Latency
The additional hop adds ~0.2‑0.3 ms latency, which is generally acceptable.
Pod IP Changes
Proxy pods also change IP on restart; the LVS configuration updates automatically to route traffic to new pods.
LVS Mapping Latency
Tests show LVS adds less than 0.1 ms for get/set operations of varying data sizes.
Benefits Brought by Kubernetes
Easy Deployment
Deploying via the operations platform’s K8s API dramatically speeds up provisioning.
Port Management
Each pod receives its own IP, eliminating the need for thousands of unique host ports and avoiding port exhaustion.
Lower Client Barrier
Applications can use a simple non‑smart client against the VIP, avoiding complex cluster‑aware client configuration.
Improved Client Performance
Offloading hashing and routing to the proxy reduces client CPU usage.
Dynamic Upgrade & Scaling
Proxy commands add or switch backend clusters without client awareness. Scaling from 30 to 60 nodes can be done with API‑driven creation, data sync, and a single proxy command, eliminating service restarts.
Stability & Resource Utilization
K8s native resource isolation allows mixed workloads while maintaining Redis’s high availability and latency requirements.
Encountered Problems
Pod Restart Data Loss
If a master pod restarts without AOF enabled, its in-memory data is lost and only the last RDB snapshot is reloaded. The mitigation is to force a failover to a slave before restarting the master, and to rely on PVC-backed persistence for the snapshot files.
LVS Mapping Delay
To give LVS time to update mappings, a preStop hook sleeps for 171 seconds before pod termination.
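In the pod spec this is a container-level lifecycle hook, roughly:

```yaml
lifecycle:
  preStop:
    exec:
      # Keep the pod alive while LVS updates its real-server table,
      # so in-flight traffic is not sent to a dead endpoint
      command: ["/bin/sh", "-c", "sleep 171"]
```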
StatefulSet Limitations
A native StatefulSet cannot guarantee that master-slave pairs are scheduled on different nodes, or that no more than a quarter of a cluster's nodes share a single host, and working around this leads to low resource utilization. A custom CRD, `RedisStatefulSet`, was therefore created to apply these anti-affinity strategies and provide additional management features.
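The kind of placement rule such a controller generates can be expressed with standard pod anti-affinity; a sketch keeping a master and its slave apart (labels are hypothetical):

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: redis-cluster
          shard: shard-0                      # master/slave pair of the same shard
      topologyKey: kubernetes.io/hostname     # never co-locate on one host
```

The custom CRD goes further than this, also bounding how many of a cluster's nodes may land on any single host.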
Summary
Several business units have migrated dozens of Redis clusters to Kubernetes, achieving lower operational effort and higher stability. While many challenges remain—such as data‑loss handling, IP volatility, and StatefulSet constraints—the migration demonstrates that Kubernetes can effectively host stateful services like Redis when combined with a proxy layer and custom operators.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, publishing original technical articles on operations.