
How Xiaomi Scaled Redis with Kubernetes: Deploying Redis Cluster on K8s

This article explains how Xiaomi migrated tens of thousands of Redis instances from bare‑metal servers to Kubernetes, using Redis Proxy, StatefulSets, and Ceph storage to achieve resource isolation, automated deployment, dynamic scaling, and improved reliability while addressing latency, IP‑change, and security challenges.


Background

Xiaomi operates tens of thousands of Redis instances serving trillions of accesses per day, supporting almost all product lines. Previously, all Redis nodes ran on physical machines without resource isolation, causing high operational overhead, frequent manual intervention for node failures, and unpredictable latency due to CPU contention.

Because CPU was not isolated, heavy RDB writes or traffic spikes could raise a node’s CPU usage and affect other clusters, increasing latency. The Redis Cluster protocol requires intelligent clients with many configuration parameters, which many developers struggle to set correctly, adding load to application servers.

Why Kubernetes

Resource Isolation

Physical deployments mixed many business lines, leading to CPU contention and difficult troubleshooting. Kubernetes allows explicit CPU requests and limits, improving utilization while preventing resource starvation.
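As a minimal sketch of what this isolation looks like in a pod spec (the names and values here are illustrative assumptions, not Xiaomi's actual configuration):

```yaml
# Hypothetical pod spec fragment: explicit requests/limits keep a noisy
# neighbor's RDB write or traffic spike from starving co-located Redis pods.
apiVersion: v1
kind: Pod
metadata:
  name: redis-demo
spec:
  containers:
    - name: redis
      image: redis:6.2
      resources:
        requests:
          cpu: "2"        # guaranteed share used for scheduling
          memory: 8Gi
        limits:
          cpu: "2"        # hard ceiling; prevents CPU contention
          memory: 8Gi
```

With requests equal to limits, the pod gets the Guaranteed QoS class, which is the usual choice for latency-sensitive services like Redis.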

Automated Deployment

On bare metal, deploying a new Redis Cluster required manual resource discovery, configuration edits, and running redis_trib, taking one to two hours. With Kubernetes StatefulSets and ConfigMaps, a new cluster can be provisioned in minutes.
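A shared redis.conf can be shipped as a ConfigMap and mounted into every pod, so provisioning no longer involves hand-editing files on each host. A sketch under assumed names and settings:

```yaml
# Hypothetical ConfigMap carrying the cluster-mode Redis configuration.
apiVersion: v1
kind: ConfigMap
metadata:
  name: redis-cluster-config
data:
  redis.conf: |
    cluster-enabled yes
    cluster-config-file nodes.conf
    appendonly yes
    dir /data
```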

How Kubernetes

Clients access Redis through an LVS VIP, which forwards requests to a Redis Proxy that then routes traffic to the Redis Cluster.

Redis Cluster Deployment

The cluster runs as a StatefulSet, persisting RDB/AOF files on Ceph Block Service via PVCs. Although Ceph adds 100‑200 ms of I/O latency, Redis writes are asynchronous, so the impact on service latency is negligible.
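A sketch of such a StatefulSet, persisting /data on a Ceph-backed storage class (the image, class name, and sizes are assumptions):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis-cluster
spec:
  serviceName: redis-cluster
  replicas: 6                      # e.g. 3 masters + 3 replicas
  selector:
    matchLabels:
      app: redis-cluster
  template:
    metadata:
      labels:
        app: redis-cluster
    spec:
      containers:
        - name: redis
          image: redis:6.2
          command: ["redis-server", "--cluster-enabled", "yes",
                    "--appendonly", "yes", "--dir", "/data"]
          volumeMounts:
            - name: data           # RDB/AOF files land on the Ceph volume
              mountPath: /data
  volumeClaimTemplates:            # one PVC per pod, survives pod restarts
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: ceph-rbd # hypothetical Ceph block storage class
        resources:
          requests:
            storage: 20Gi
```

Because each pod gets its own PVC from the volumeClaimTemplates, a restarted pod reattaches to the same Ceph volume and can reload its persisted data.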

Proxy Selection

Open‑source proxies were evaluated; Codis and Twemproxy were excluded because they do not support Redis Cluster, and the official redis‑cluster‑proxy lacks a stable release. Two candidates, Cerberus and Predixy, were benchmarked with redis‑benchmark on a 2‑core proxy and a 2‑core client against three master nodes (1 CPU each). Predixy delivered 33‑60% higher QPS with comparable latency, especially for larger keys, and was selected.

Predixy was further enhanced with dynamic backend switching, black‑/white‑listing, and audit logging.

Proxy Deployment

The proxy runs as a stateless Deployment behind a LoadBalancer, enabling easy horizontal scaling. Dynamic backend switching allows seamless cluster upgrades without client changes.
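A stateless proxy tier can be sketched as a Deployment plus a Service (the image name is hypothetical; port 7617 is Predixy's default listen port, and the Service exposes the familiar 6379 externally):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-proxy
spec:
  replicas: 3
  selector:
    matchLabels:
      app: redis-proxy
  template:
    metadata:
      labels:
        app: redis-proxy
    spec:
      containers:
        - name: predixy
          image: example/predixy:latest   # hypothetical image
          ports:
            - containerPort: 7617
---
apiVersion: v1
kind: Service
metadata:
  name: redis-proxy
spec:
  type: LoadBalancer       # fronted by the LVS VIP in this architecture
  selector:
    app: redis-proxy
  ports:
    - port: 6379
      targetPort: 7617
```

Because the Deployment carries no state, replicas can be added or replaced freely and the Service keeps the client-facing endpoint stable.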

Proxy Autoscaling

Kubernetes HPA monitors average CPU usage; when it exceeds a threshold, the replica count is increased. After scaling, LVS detects new pods and redirects traffic. To avoid rapid oscillations, scaling down is suppressed for five minutes after a scale‑up event, and minimum/maximum pod counts are configurable.
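The behavior described above maps closely onto an autoscaling/v2 HorizontalPodAutoscaler; the threshold and replica bounds here are assumptions chosen to match the description:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: redis-proxy
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: redis-proxy
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70       # assumed CPU threshold
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # hold scale-down for five minutes
```

The scaleDown stabilization window is the built-in mechanism for the five-minute suppression described above, avoiding replica-count oscillation after a spike.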

Why Proxy

IP Changes on Pod Restart

Redis Cluster clients need fixed IP/Port endpoints; pod restarts in Kubernetes can change IPs, breaking client connections. The proxy hides cluster topology changes, exposing a stable VIP.

Connection Load

Before Redis 6.0, Redis handled most work in a single thread, so many client connections increased CPU usage and latency. Offloading connections to the proxy reduces Redis load.

Cluster Migration

Traditional physical‑machine migrations require manual data sync, service restarts, and risk. With a proxy, backend cluster creation, sync, and switch are invisible to clients; a simple command moves traffic to the new cluster.

Security

Clients authenticate only to the proxy, not directly to Redis, allowing separate passwords for administrators and users, restricting dangerous operations (e.g., blocking FLUSHDB and CONFIG SET), and enabling audit logging.

Proxy‑Induced Issues

Extra Hop Latency

The additional hop adds ~0.2‑0.3 ms latency, which is generally acceptable.

Pod IP Changes

Proxy pods also change IP on restart; the LVS configuration updates automatically to route traffic to new pods.

LVS Mapping Latency

Tests show LVS adds less than 0.1 ms for get/set operations of varying data sizes.

Benefits Brought by Kubernetes

Easy Deployment

Deploying via the operations platform’s K8s API dramatically speeds up provisioning.

Port Management

Each pod receives its own IP, eliminating the need for thousands of unique host ports and avoiding port exhaustion.

Lower Client Barrier

Applications can use a simple non‑smart client against the VIP, avoiding complex cluster‑aware client configuration.

Improved Client Performance

Offloading hashing and routing to the proxy reduces client CPU usage.

Dynamic Upgrade & Scaling

Proxy commands add or switch backend clusters without client awareness. Scaling from 30 to 60 nodes can be done with API‑driven creation, data sync, and a single proxy command, eliminating service restarts.

Stability & Resource Utilization

K8s native resource isolation allows mixed workloads while maintaining Redis’s high availability and latency requirements.

Encountered Problems

Pod Restart Data Loss

If a master pod restarts without AOF, in‑memory data is lost and only the last RDB snapshot is reloaded. The solution is to force a slave to failover before restart and rely on PVC‑backed persistence.

LVS Mapping Delay

To give LVS time to update mappings, a preStop hook sleeps for 171 seconds before pod termination.
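In manifest form, such a hook is a small pod-template fragment (illustrative):

```yaml
# Container lifecycle fragment: delay termination so LVS can remove the
# old real server before the proxy pod actually exits. The pod's
# terminationGracePeriodSeconds must exceed the sleep duration.
lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 171"]
```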

StatefulSet Limitations

Native StatefulSet cannot guarantee that master‑slave pairs land on different nodes, or that no more than a quarter of a cluster's nodes share a host, leading to low utilization. A custom CRD, RedisStatefulSet, was created to apply anti‑affinity strategies and additional management features.
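Native pod anti-affinity can spread pods across hosts, but it cannot express role-aware rules such as "a master and its own replica on different nodes" — that is what the custom controller adds. For contrast, the host-spreading part alone looks like this (labels are assumptions):

```yaml
# Pod template fragment: forbid two pods of the same Redis cluster on one
# host. Role-aware placement (master vs. slave) still needs a controller.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: redis-cluster
        topologyKey: kubernetes.io/hostname
```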

Summary

Several business units have migrated dozens of Redis clusters to Kubernetes, achieving lower operational effort and higher stability. While many challenges remain—such as data‑loss handling, IP volatility, and StatefulSet constraints—the migration demonstrates that Kubernetes can effectively host stateful services like Redis when combined with a proxy layer and custom operators.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career.
