Zero‑Downtime Redis Cluster Expansion in Production

This guide details a step‑by‑step, zero‑downtime expansion of a 3‑master‑3‑slave Redis Cluster to a 4‑master‑4‑slave setup, covering node standardization, network checks, big‑key handling, full backups, monitoring, slot migration planning, progressive migration methods, replica addition, post‑expansion validation, rollback procedures, and practical lessons learned.

Linux Cloud-Native Ops Stack

Scenario

Expand a Redis Cluster from 3 masters + 3 replicas (6 nodes) to 4 masters + 4 replicas (8 nodes) with zero downtime, read/write latency variation <10%, no data loss, and fully automated execution.

Core Expansion Principle

Redis Cluster divides the keyspace into 16384 hash slots, and each key maps to a slot via CRC16(key) mod 16384. Adding nodes and then migrating slots to them in small steps scales the cluster with low impact.
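
To make the key-to-slot mapping concrete, here is a minimal sketch of the calculation in plain shell. It assumes ASCII-only keys; `crc16` and `keyslot` are illustrative helper names, not part of Redis. On a live cluster, `CLUSTER KEYSLOT` gives the authoritative answer.

```shell
# CRC16/XMODEM (polynomial 0x1021, initial value 0), computed bit by bit.
# Assumption: the key contains only ASCII characters.
crc16() (
  s=$1
  crc=0
  while [ -n "$s" ]; do
    c=${s%"${s#?}"}                 # first character of the remaining key
    s=${s#?}                        # drop that character
    byte=$(printf '%d' "'$c")       # character -> byte value
    crc=$(( crc ^ (byte << 8) ))
    j=0
    while [ "$j" -lt 8 ]; do
      if [ $(( crc & 0x8000 )) -ne 0 ]; then
        crc=$(( ((crc << 1) ^ 0x1021) & 0xFFFF ))
      else
        crc=$(( (crc << 1) & 0xFFFF ))
      fi
      j=$(( j + 1 ))
    done
  done
  echo "$crc"
)

# Slot = CRC16(key) mod 16384. If the key has a non-empty {hash tag},
# only the text between the first "{" and the next "}" is hashed.
keyslot() (
  key=$1
  case $key in
    *\{*\}*)
      rest=${key#*\{}
      tag=${rest%%\}*}
      if [ -n "$tag" ]; then key=$tag; fi
      ;;
  esac
  echo $(( $(crc16 "$key") % 16384 ))
)

keyslot 123456789              # CRC16 check value 0x31C3, so slot 12739
keyslot "user:{42}:profile"    # same slot as the next key: both hash only "42"
keyslot "user:{42}:orders"
```

Keys that share a hash tag always land in the same slot, so they migrate together and stay usable in multi-key commands during the expansion.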

1. Pre‑Expansion Preparation

1.1 Deploy new nodes

New instances must match the existing cluster in Redis version, hardware (memory, CPU, disk), OS version, kernel parameters (overcommit_memory, swappiness, THP), and Redis configuration (only ports 7002‑7003 differ).

# Create data and log directories for both new instances
mkdir -p /app/redis-cluster/{7002,7003}/{data,log}
chown -R redis:redis /app/redis-cluster

# Copy redis.conf into each instance directory and set the port (7002 / 7003)
# Start Redis
su -s /bin/bash redis -c "/app/redis/bin/redis-server /app/redis-cluster/7002/redis.conf"
su -s /bin/bash redis -c "/app/redis/bin/redis-server /app/redis-cluster/7003/redis.conf"

1.2 Verify network connectivity

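# Test TCP connectivity to an existing node
telnet 192.168.1.200 7000

# Test the client ports of the new instances
redis-cli -a redis123 -h 192.168.1.200 -p 7002 ping
redis-cli -a redis123 -h 192.168.1.200 -p 7003 ping

# Test the cluster bus ports (client port + 10000 by default)
telnet 192.168.1.200 17002
telnet 192.168.1.200 17003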

1.3 Scan and handle big keys

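Keys larger than 10 MB can block slot migration for seconds to minutes, because MIGRATE blocks both the source and the target instance while a key is transferred.

# Scan big keys; repeat for every master (-h/-p select the node)
# Note: --bigkeys ranks collections by element count; MEMORY USAGE <key> reports bytes
redis-cli -a redis123 -h 192.168.1.200 -p 7000 --bigkeys -i 0.1 > bigkeys.log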

# Typical handling
# 1. Split large hash/set/list into smaller keys
# 2. Delete unused large keys
# 3. For keys >100 MB, migrate via application code
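
Once key sizes are known in bytes, the 10 MB rule can be applied mechanically. The sketch below assumes a hypothetical input format of one `<key> <bytes>` pair per line (for example collected with `MEMORY USAGE`), and keys without spaces; `big_keys` is an illustrative helper name.

```shell
# Print "<bytes> <key>" for every key above the limit (default 10 MB), largest first.
# Input on stdin: one "<key> <bytes>" pair per line (hypothetical format).
big_keys() (
  limit=${1:-10485760}
  awk -v limit="$limit" '$2 + 0 > limit + 0 { print $2, $1 }' | sort -rn
)

# Sample data, made up for illustration
printf '%s\n' \
  'cart:1001 2048' \
  'feed:global 157286400' \
  'session:index 31457280' | big_keys
```

With the sample data, `feed:global` (150 MB) is listed first and `session:index` (30 MB) second; `cart:1001` is below the limit and is not printed.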

1.4 Full backup on replicas

# Run BGSAVE on each replica (pass -p with that replica's port)
redis-cli -a redis123 -h 192.168.1.203 bgsave
# Copy backup files to off‑site storage

1.5 Monitoring thresholds

CPU < 70 %

Memory < 80 %

Network bandwidth < 50 %

Redis average latency < 5 ms

Redis P99 latency < 20 ms

Replication lag < 500 ms
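
These thresholds can double as a go/no-go gate between migration batches. The following is a rough sketch that assumes the six values have already been collected as integers from your monitoring system; `within_thresholds` and the metric labels are illustrative names.

```shell
# Return 0 only if every metric is below its threshold from the list above.
# Arguments (integers): cpu% mem% net% avg_latency_ms p99_latency_ms repl_lag_ms
within_thresholds() {
  rc=0
  set -- cpu_pct "$1" 70 mem_pct "$2" 80 net_pct "$3" 50 \
         avg_latency_ms "$4" 5 p99_latency_ms "$5" 20 repl_lag_ms "$6" 500
  # Walk the (name, value, limit) triples
  while [ "$#" -ge 3 ]; do
    if [ "$2" -ge "$3" ]; then
      echo "STOP: $1=$2 (threshold: < $3)"
      rc=1
    fi
    shift 3
  done
  return "$rc"
}

within_thresholds 45 62 30 2 11 120 && echo "OK to continue"
within_thresholds 45 85 30 2 25 120 || echo "pause migration"
```

The second call reports the memory and P99 latency violations before printing "pause migration".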

2. Low‑Impact Expansion Steps

2.1 Add the new master node

redis-cli -a redis123 --cluster add-node \
192.168.1.202:7002 192.168.1.200:7000

Verify the node appears as master with no slots assigned:

redis-cli -c -a redis123 -h 192.168.1.200 -p 7000 cluster nodes

2.2 Calculate slot migration plan

Target: 4 masters → 4096 slots each (16384 ÷ 4). The existing masters hold ≈5461 slots each (16384 ÷ 3), so each must give up ≈1365 slots (5461 − 4096), for 4096 slots in total.
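
The arithmetic can be checked with a few lines of shell. This sketch assumes exactly one new master is being added and takes the current slot counts of the existing masters as arguments; `slot_plan` is an illustrative helper, and 5461/5462/5461 is the usual split for a three-master cluster.

```shell
# Print how many slots each existing master gives up so that all masters,
# old and new, end up with an equal share of the 16384 slots.
# Assumption: one new master joins the masters listed as arguments.
slot_plan() {
  target=$(( 16384 / ($# + 1) ))
  total=0
  for held in "$@"; do
    give=$(( held - target ))
    total=$(( total + give ))
    echo "master with $held slots gives up $give"
  done
  echo "new master receives $total (target per master: $target)"
}

slot_plan 5461 5462 5461
```

For this input the three masters give up 1365, 1366, and 1365 slots, which adds up to the 4096 the new master needs.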

2.3 Progressive slot migration

Production‑grade recommendation: manual batch migration, moving a small number of slots per round and observing load.

Method 1 (automatic bulk, low‑load clusters)

redis-cli -a redis123 --cluster reshard 192.168.1.200:7000 \
--cluster-to 2195e7e8cd64a6d7df7aed6df34b8e4602d488f6 \
--cluster-from 7ae117f0a7611dbdc5ce1119915c582c4f414946,3489925d874c6bdc7d720330a620701383132359,6292aeb504c432231aab205790b86a64a4ba0a39 \
--cluster-slots 4096 \
--cluster-yes

Method 2 (manual batch, preferred)

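# Migrate 4096 slots in 41 batches: 40 batches of 100 plus a final batch of 96
total=4096; batch=100; moved=0; i=0
while (( moved < total )); do
  i=$(( i + 1 ))
  n=$(( total - moved < batch ? total - moved : batch ))
  echo "Batch $i: migrating $n slots"
  redis-cli -a redis123 --cluster reshard 192.168.1.200:7000 \
    --cluster-to 2195e7e8cd64a6d7df7aed6df34b8e4602d488f6 \
    --cluster-from 7ae117f0a7611dbdc5ce1119915c582c4f414946,3489925d874c6bdc7d720330a620701383132359,6292aeb504c432231aab205790b86a64a4ba0a39 \
    --cluster-slots "$n" \
    --cluster-yes || break   # stop on the first failed batch
  moved=$(( moved + n ))
  sleep 10   # observe load before the next batch
done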
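
A fixed sleep between batches does not confirm that the cluster is still healthy. One option is to gate each batch on the output of `CLUSTER INFO`. The sketch below only parses that output from stdin, so it can be tried without a cluster; `cluster_healthy` is an illustrative helper name.

```shell
# Read CLUSTER INFO output on stdin; succeed only if the cluster reports
# state "ok", all 16384 slots assigned, and no slots in pfail/fail state.
cluster_healthy() {
  tr -d '\r' | awk -F: '
    { v[$1] = $2 }
    END {
      ok = (v["cluster_state"] == "ok" &&
            v["cluster_slots_assigned"] == "16384" &&
            v["cluster_slots_pfail"] == "0" &&
            v["cluster_slots_fail"] == "0")
      exit (ok ? 0 : 1)
    }'
}

# Sample input, shaped like real CLUSTER INFO output (CRLF line endings)
printf 'cluster_state:ok\r\ncluster_slots_assigned:16384\r\ncluster_slots_pfail:0\r\ncluster_slots_fail:0\r\n' \
  | cluster_healthy && echo "safe to start next batch"

# In a migration loop this could be used as:
#   redis-cli -a redis123 -h 192.168.1.200 -p 7000 cluster info | cluster_healthy || break
```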

2.4 Verify even slot distribution

redis-cli -a redis123 --cluster check 192.168.1.200:7000

Expected: each master holds ~4096 slots and all 16384 slots are covered.
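
For an exact per-master count, the `CLUSTER NODES` output can be summed directly. This sketch reads that output from stdin; `slots_per_master` is an illustrative helper, and the sample lines (addresses, epochs, and the all-zero replica ID) are made up for demonstration.

```shell
# Read CLUSTER NODES output on stdin and print "<address> <slot count>" per master.
# Fields 9 and up hold slots as "N" or "A-B"; bracketed entries mark in-flight migrations.
slots_per_master() {
  awk '$3 ~ /master/ {
    n = 0
    for (i = 9; i <= NF; i++) {
      if ($i ~ /^\[/) continue          # skip entries for slots still being migrated
      split($i, r, "-")
      n += (r[2] == "" ? 1 : r[2] - r[1] + 1)
    }
    print $2, n
  }'
}

slots_per_master <<'EOF'
7ae117f0a7611dbdc5ce1119915c582c4f414946 192.168.1.200:7000@17000 myself,master - 0 0 1 connected 1365-5460
2195e7e8cd64a6d7df7aed6df34b8e4602d488f6 192.168.1.202:7002@17002 master - 0 1700000000000 7 connected 0-1364 5461-6826 10923-12287
0000000000000000000000000000000000000000 192.168.1.200:7003@17003 slave 2195e7e8cd64a6d7df7aed6df34b8e4602d488f6 0 1700000000000 7 connected
EOF
```

Both masters in the sample come out at 4096 slots, and the replica line is ignored. Against a live cluster, pipe the output of `cluster nodes` into the function instead of the here-document.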

2.5 Add a replica node to the new master

redis-cli --cluster add-node 192.168.1.200:7003 192.168.1.200:7000 \
  -a redis123 --cluster-slave --cluster-master-id 2195e7e8cd64a6d7df7aed6df34b8e4602d488f6

Verify the new node appears as slave linked to its master.

2.6 Validate read/write functionality

# Write test keys
redis-cli -c -a redis123 -h 192.168.1.200 -p 7000 set test_expand_key_1 value1
redis-cli -c -a redis123 -h 192.168.1.200 -p 7000 set test_expand_key_2 value2
redis-cli -c -a redis123 -h 192.168.1.200 -p 7000 set test_expand_key_3 value3

# Read test keys
redis-cli -c -a redis123 -h 192.168.1.200 -p 7000 get test_expand_key_1
redis-cli -c -a redis123 -h 192.168.1.200 -p 7000 get test_expand_key_2
redis-cli -c -a redis123 -h 192.168.1.200 -p 7000 get test_expand_key_3

# Verify key slots
redis-cli -c -a redis123 -h 192.168.1.200 -p 7000 cluster keyslot test_expand_key_1
redis-cli -c -a redis123 -h 192.168.1.200 -p 7000 cluster keyslot test_expand_key_2
redis-cli -c -a redis123 -h 192.168.1.200 -p 7000 cluster keyslot test_expand_key_3

3. Post‑Expansion Optimization & Verification

3.1 Load‑balancing check

redis-cli -a redis123 --cluster info 192.168.1.200:7000

3.2 Client configuration

Update client node lists to include the new addresses for better connection efficiency and availability.

3.3 Monitoring adjustments

Add new node metrics to Prometheus scrape configuration.

Extend Grafana dashboards to display new node status.

Add alerting rules for the new nodes.

4. Core Technical Principles

4.1 Redirection mechanisms

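MOVED: the slot is now owned by another node; the client permanently updates its routing table and retries on that node.

ASK: the slot is being migrated and the requested key is not on the source node; the client sends ASKING to the target node and retries there for this one request, without updating its routing table.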
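
The difference between the two redirections is easiest to see from the client side. The sketch below classifies an error reply of the form `MOVED <slot> <host>:<port>` or `ASK <slot> <host>:<port>` and prints the action a client would take; `handle_redirect` is an illustrative helper, not a real client API.

```shell
# Classify a Redis Cluster error reply and print what a client should do next.
handle_redirect() {
  # Split the reply into its three words
  read -r kind slot addr <<EOF
$1
EOF
  case $kind in
    MOVED) echo "update slot $slot -> $addr in the routing table, then retry on $addr" ;;
    ASK)   echo "send ASKING to $addr and retry there once; keep the slot $slot mapping unchanged" ;;
    *)     echo "not a redirection" ;;
  esac
}

handle_redirect "MOVED 3999 127.0.0.1:6381"
handle_redirect "ASK 3999 127.0.0.1:6381"
```

Cluster-aware client libraries implement this logic internally; the point here is only that MOVED changes the cached slot map while ASK does not.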

4.2 Atomicity guarantees

Key migration uses the MIGRATE command, which is atomic.

Source node deletes the key only after the target confirms successful storage.

A key therefore exists on at least one of the two nodes at every point in the migration, so no data is lost in transit.

4.3 Progressive migration

Keys inside a slot are moved in small batches.

Source node continues to serve reads/writes during migration.

Migration speed can be throttled, for example with redis-cli's --cluster-pipeline option (keys moved per MIGRATE call, default 10), to limit performance impact.
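
Under the hood, moving one slot follows a fixed command sequence. The sketch below only prints that sequence for review and executes nothing. `print_slot_move` is an illustrative helper, the batch size of 10 keys and the 5000 ms timeout are example values, and authentication options are left out for brevity.

```shell
# Print (do not execute) the commands that move one slot between two masters.
# Arguments: slot, source host, source port, source node ID,
#            target host, target port, target node ID.
print_slot_move() {
  slot=$1; src_host=$2; src_port=$3; src_id=$4
  dst_host=$5; dst_port=$6; dst_id=$7
  # 1. Target starts accepting keys for the slot
  echo "redis-cli -h $dst_host -p $dst_port cluster setslot $slot importing $src_id"
  # 2. Source answers ASK for keys it no longer holds
  echo "redis-cli -h $src_host -p $src_port cluster setslot $slot migrating $dst_id"
  # 3. Move keys in small batches
  echo "# repeat the next two commands until getkeysinslot returns no keys:"
  echo "redis-cli -h $src_host -p $src_port cluster getkeysinslot $slot 10"
  echo "redis-cli -h $src_host -p $src_port migrate $dst_host $dst_port \"\" 0 5000 keys <key> [<key> ...]"
  # 4. Assign the slot to the target on both nodes
  echo "redis-cli -h $dst_host -p $dst_port cluster setslot $slot node $dst_id"
  echo "redis-cli -h $src_host -p $src_port cluster setslot $slot node $dst_id"
}

print_slot_move 100 192.168.1.200 7000 7ae117f0a7611dbdc5ce1119915c582c4f414946 \
  192.168.1.202 7002 2195e7e8cd64a6d7df7aed6df34b8e4602d488f6
```

`redis-cli --cluster reshard` drives this same sequence for every slot it moves, which is why the source keeps serving the slot until the last key has been transferred.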

5. Expansion Rollback Plan

Pause all migration operations.

Return any slots already migrated to the new master to their original masters.

Remove the newly added master and replica nodes from the cluster.

Verify the cluster returns to a healthy state.
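
The rollback steps map onto `redis-cli --cluster` subcommands roughly as follows. This sketch only prints the commands so they can be reviewed first. `print_rollback` is an illustrative helper, `<new-replica-id>` is a placeholder, the slot counts in the usage example are illustrative, and authentication options are omitted.

```shell
# Print (do not execute) rollback commands.
# Arguments: entry node host:port, new master ID, new replica ID,
#            then one "<original master ID>:<slots to return>" pair per original master.
print_rollback() {
  entry=$1; new_master=$2; new_replica=$3
  shift 3
  # Hand slots back from the new master to each original master
  for pair in "$@"; do
    echo "redis-cli --cluster reshard $entry --cluster-from $new_master --cluster-to ${pair%%:*} --cluster-slots ${pair##*:} --cluster-yes"
  done
  # Confirm full slot coverage, then remove the replica before the (now empty) master
  echo "redis-cli --cluster check $entry"
  echo "redis-cli --cluster del-node $entry $new_replica"
  echo "redis-cli --cluster del-node $entry $new_master"
}

print_rollback 192.168.1.200:7000 2195e7e8cd64a6d7df7aed6df34b8e4602d488f6 '<new-replica-id>' \
  7ae117f0a7611dbdc5ce1119915c582c4f414946:1365 \
  3489925d874c6bdc7d720330a620701383132359:1366 \
  6292aeb504c432231aab205790b86a64a4ba0a39:1365
```

Resharding back restores the slot counts, not necessarily the original slot numbers, and `del-node` refuses to remove a master that still owns slots, so the master removal comes last.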

6. Practical Takeaways

Plan capacity growth 3‑6 months ahead.

Deploy new nodes with identical version, hardware, OS, kernel parameters, and Redis configuration.

Scan and handle all large keys before migration.

Prefer off‑peak execution (e.g., 02:00‑04:00).

Use batch slot migration and monitor load after each batch.

Continuously monitor the metrics listed in section 1.5.

Validate backups before starting.
