Master Redis Cluster Ops in 5 Minutes: Fix 90% of Production Issues
This comprehensive guide walks you through Redis cluster architecture, deployment, enterprise‑grade configuration management, logging, monitoring, queue handling, performance tuning, and fault‑tolerance practices, providing step‑by‑step instructions, scripts, and best‑practice recommendations to quickly resolve the majority of production problems.
Redis Cluster Architecture Principles
Cluster Mode Overview
Redis Cluster is Redis's built-in distributed mode. It scales horizontally by sharding data across several master nodes and stays available through master-slave replication. The keyspace is split into 16,384 hash slots, and each master node serves a subset of them. Slaves hold copies of their master's data and do not own slots themselves.
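The even split shown in the example below can be reproduced with a short sketch. It assumes the rounding approach redis-cli uses when it proposes a layout for `--cluster create`: a fractional per-master share, rounded half up, with the last master always ending at slot 16383. `allocate_slots` is an illustrative name, not a Redis API.

```python
import math
from typing import List, Tuple

TOTAL_SLOTS = 16384  # Redis Cluster always has exactly 16,384 hash slots


def allocate_slots(masters: int, total: int = TOTAL_SLOTS) -> List[Tuple[int, int]]:
    """Split slots 0..total-1 into one contiguous, inclusive range per master."""
    if masters < 1:
        raise ValueError("need at least one master")
    per_node = total / masters  # fractional share of slots per master
    ranges = []
    first, cursor = 0, 0.0
    for i in range(masters):
        # Round half up so the fractional leftovers are spread across masters
        last = math.floor(cursor + per_node - 1 + 0.5)
        if i == masters - 1:
            last = total - 1  # the final master always ends at the last slot
        last = max(last, first)
        ranges.append((first, last))
        first = last + 1
        cursor += per_node
    return ranges


if __name__ == "__main__":
    for first, last in allocate_slots(3):
        print(f"{first}-{last} ({last - first + 1} slots)")
```

For three masters this prints 0-5460, 5461-10922 and 10923-16383, matching the diagram. The middle master ends up with one extra slot (5,462).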
Cluster node distribution example:
Master-1 (0-5460)     Master-2 (5461-10922)     Master-3 (10923-16383)
        |                       |                          |
     Slave-1                 Slave-2                    Slave-3
Data Sharding Principle
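Redis hashes each key with CRC16 (the XMODEM variant) and takes the result modulo 16384 to find the slot that stores the key:
HASH_SLOT = CRC16(key) mod 16384
If the key contains a hash tag, only that tag is hashed. A hash tag is a non-empty substring between the first "{" and the next "}". Keys such as {user1000}.following and {user1000}.followers therefore land in the same slot and can be used together in multi-key commands.
Fault Detection and Migration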
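Nodes exchange heartbeats and cluster state over a dedicated cluster bus using a Gossip protocol. The bus listens on the client port plus 10000, for example 17000. A node that does not answer for longer than cluster-node-timeout is flagged as possibly failing (PFAIL). Once a majority of masters report it, it is marked FAIL. If the failed node was a master, one of its replicas is automatically promoted to take over its slots.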
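To make the slot formula in the Data Sharding Principle section concrete, the pure-Python sketch below computes HASH_SLOT, including the hash-tag rule. It uses the CRC16/XMODEM variant (polynomial 0x1021, initial value 0) described in the Redis Cluster specification. The function names are illustrative. On a live cluster, `CLUSTER KEYSLOT <key>` is the authoritative answer.

```python
def crc16_xmodem(data: bytes) -> int:
    """Bitwise CRC16/XMODEM: polynomial 0x1021, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def hash_slot(key: str) -> int:
    """Map a key to one of the 16384 cluster slots, honouring {hash tags}."""
    raw = key.encode("utf-8")
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        # Only a non-empty substring between the first "{" and the next "}" counts
        if end != -1 and end != start + 1:
            raw = raw[start + 1:end]
    return crc16_xmodem(raw) % 16384


if __name__ == "__main__":
    for key in ("foo", "{user1000}.following", "{user1000}.followers"):
        print(key, hash_slot(key))
```

`foo` maps to slot 12182. Both `{user1000}` keys map to the same slot because only `user1000` is hashed.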
Redis Cluster Deployment Configuration
Environment Preparation
System Requirements
Linux distribution: CentOS 7+, Ubuntu 18.04+
Redis version: 5.0+ (6.2+ recommended)
Minimum memory: 2 GB per node
Network latency between nodes < 1 ms
Server Planning
# 6-node cluster plan (3 masters, 3 slaves)
192.168.1.10:7000 # Master-1
192.168.1.11:7000 # Slave-1
192.168.1.12:7000 # Master-2
192.168.1.13:7000 # Slave-2
192.168.1.14:7000 # Master-3
192.168.1.15:7000 # Slave-3
System Optimization Configuration
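Before tuning, it can help to record what a host currently has. The sketch below reads the kernel settings covered in this section from /proc/sys and /sys and compares them with the values recommended here. `EXPECTED_SYSCTL`, `audit_host` and the injectable `read` parameter are illustrative assumptions. The expected values simply mirror this guide's configuration.

```python
from pathlib import Path
from typing import Callable, List

# Values recommended in this section (sysctl key -> expected value)
EXPECTED_SYSCTL = {
    "vm.overcommit_memory": "1",
    "net.core.somaxconn": "65535",
    "net.ipv4.tcp_max_syn_backlog": "65535",
    "vm.swappiness": "0",
}
THP_FILES = (
    "/sys/kernel/mm/transparent_hugepage/enabled",
    "/sys/kernel/mm/transparent_hugepage/defrag",
)


def read_file(path: str) -> str:
    try:
        return Path(path).read_text().strip()
    except OSError as exc:  # e.g. THP files are missing inside some containers
        return f"<unreadable: {exc.strerror}>"


def sysctl_path(key: str) -> str:
    # vm.overcommit_memory -> /proc/sys/vm/overcommit_memory
    return "/proc/sys/" + key.replace(".", "/")


def selected_thp_mode(content: str) -> str:
    # The kernel marks the active mode with brackets, e.g. "always madvise [never]"
    for token in content.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return content.strip()


def audit_host(read: Callable[[str], str] = read_file) -> List[str]:
    """Return human-readable findings; an empty list means everything matches."""
    findings = []
    for key, expected in EXPECTED_SYSCTL.items():
        actual = read(sysctl_path(key))
        if actual != expected:
            findings.append(f"{key} = {actual} (expected {expected})")
    for path in THP_FILES:
        mode = selected_thp_mode(read(path))
        if mode != "never":
            findings.append(f"{path}: {mode} (expected never)")
    return findings


if __name__ == "__main__":
    for line in audit_host() or ["all checked kernel settings match"]:
        print(line)
```

Run it before and after applying the settings. The files it reads are normally world-readable. An empty result means every checked value matches.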
Kernel Parameter Tuning
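# /etc/sysctl.conf (apply with: sysctl -p)
# Let fork() for BGSAVE / AOF rewrite succeed even when memory looks overcommitted
vm.overcommit_memory = 1
# Accept-queue limits; Redis's tcp-backlog cannot exceed net.core.somaxconn
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 65535
# Avoid swapping Redis memory
vm.swappiness = 0
System Limits Configuration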
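# /etc/security/limits.conf
# Applies to login sessions (PAM). Services started by systemd ignore this file
# and need LimitNOFILE= in their unit file instead.
redis soft nofile 65535
redis hard nofile 65535
redis soft nproc 65535
redis hard nproc 65535
Transparent Hugepage Disable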
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
# make permanent: append the same commands to rc.local
echo 'echo never > /sys/kernel/mm/transparent_hugepage/enabled' >> /etc/rc.local
echo 'echo never > /sys/kernel/mm/transparent_hugepage/defrag' >> /etc/rc.local
# CentOS 7 runs rc.local at boot only if the file is executable
chmod +x /etc/rc.d/rc.local
Redis Installation and Configuration
Compile and Install Redis
# Install dependencies (CentOS; on Ubuntu: apt-get install -y build-essential)
yum install -y gcc gcc-c++ make
# Download source
wget http://download.redis.io/releases/redis-6.2.7.tar.gz
tar xzf redis-6.2.7.tar.gz
cd redis-6.2.7
# Build and install
make PREFIX=/usr/local/redis install
# Create user and directories
useradd -r -s /bin/false redis
mkdir -p /usr/local/redis/{conf,data,logs}
chown -R redis:redis /usr/local/redis
Cluster Configuration File
# /usr/local/redis/conf/redis-7000.conf
bind 0.0.0.0
port 7000
# Stay in the foreground: systemd supervises the process (default Type=simple)
daemonize no
pidfile /var/run/redis_7000.pid
logfile /usr/local/redis/logs/redis-7000.log
dir /usr/local/redis/data
# Cluster settings
cluster-enabled yes
cluster-config-file nodes-7000.conf
cluster-node-timeout 15000
# Address that other nodes and clients should use; set it to each node's own IP
cluster-announce-ip 192.168.1.10
cluster-announce-port 7000
cluster-announce-bus-port 17000
# Memory settings
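# Keep well below physical RAM: fork copy-on-write (BGSAVE / AOF rewrite) and
# client/replication buffers need headroom. 2gb assumes a host with at least 4 GB.
maxmemory 2gb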
maxmemory-policy allkeys-lru
# Persistence
save 900 1
save 300 10
save 60 10000
appendonly yes
appendfilename "appendonly-7000.aof"
appendfsync everysec
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
# Security
requirepass "your_redis_password"
masterauth "your_redis_password"
# Network
tcp-keepalive 60
timeout 300
tcp-backlog 511
Systemd Service
# /etc/systemd/system/redis-7000.service
[Unit]
Description=Redis In-Memory Data Store (Port 7000)
After=network.target
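[Service]
Type=simple
User=redis
Group=redis
ExecStart=/usr/local/redis/bin/redis-server /usr/local/redis/conf/redis-7000.conf
# No ExecStop: on stop, systemd sends SIGTERM, which Redis handles as a clean SHUTDOWN.
# (A plain "redis-cli shutdown" would be rejected with NOAUTH because requirepass is set.)
# limits.conf does not apply to systemd services
LimitNOFILE=65535
Restart=always
[Install]
WantedBy=multi-user.target
Cluster Initialization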
Start All Nodes
# Start Redis on all nodes (reload systemd first so it picks up the new unit file)
systemctl daemon-reload
systemctl start redis-7000
systemctl enable redis-7000
# Verify status
systemctl status redis-7000
Create Cluster
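# Using redis-cli (Redis 5.0+). With --cluster-replicas 1, redis-cli takes the
# masters from the front of the address list, so the planned masters are listed
# first. It prints the proposed master/replica pairing and waits for "yes";
# compare it with the server plan before confirming.
/usr/local/redis/bin/redis-cli --cluster create \
192.168.1.10:7000 192.168.1.12:7000 192.168.1.14:7000 \
192.168.1.11:7000 192.168.1.13:7000 192.168.1.15:7000 \
--cluster-replicas 1 -a your_redis_password
# Or using redis-trib.rb (Redis 3.x/4.x only; from 5.0 it is replaced by redis-cli --cluster)
./redis-trib.rb create --replicas 1 \
192.168.1.10:7000 192.168.1.12:7000 192.168.1.14:7000 \
192.168.1.11:7000 192.168.1.13:7000 192.168.1.15:7000
Validate Cluster State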
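The `cluster info` command below prints one `field:value` pair per line. A minimal sketch of turning that text into a pass/fail check follows. It assumes the six-node, three-master layout from the server plan. `parse_cluster_info` and `check_cluster_info` are illustrative names. The text would come from running the command, for example through subprocess or a Redis client.

```python
from typing import Dict, List


def parse_cluster_info(text: str) -> Dict[str, str]:
    """Parse CLUSTER INFO output ("field:value" per line, CRLF or LF endings)."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if line and ":" in line:
            field, value = line.split(":", 1)
            info[field] = value
    return info


def check_cluster_info(info: Dict[str, str], expected_nodes: int = 6,
                       expected_masters: int = 3) -> List[str]:
    """Return a list of problems; an empty list means the cluster looks healthy."""
    problems = []
    if info.get("cluster_state") != "ok":
        problems.append(f"cluster_state is {info.get('cluster_state')!r}")
    for field in ("cluster_slots_assigned", "cluster_slots_ok"):
        if int(info.get(field, 0)) != 16384:
            problems.append(f"{field} is {info.get(field)}, expected 16384")
    for field in ("cluster_slots_pfail", "cluster_slots_fail"):
        if int(info.get(field, 0)) != 0:
            problems.append(f"{field} is {info.get(field)}")
    if int(info.get("cluster_known_nodes", 0)) != expected_nodes:
        problems.append(f"cluster_known_nodes is {info.get('cluster_known_nodes')}, "
                        f"expected {expected_nodes}")
    # cluster_size counts the masters that serve at least one slot
    if int(info.get("cluster_size", 0)) != expected_masters:
        problems.append(f"cluster_size is {info.get('cluster_size')}, expected {expected_masters}")
    return problems


SAMPLE = (
    "cluster_state:ok\r\ncluster_slots_assigned:16384\r\ncluster_slots_ok:16384\r\n"
    "cluster_slots_pfail:0\r\ncluster_slots_fail:0\r\ncluster_known_nodes:6\r\n"
    "cluster_size:3\r\ncluster_current_epoch:6\r\n"
)

if __name__ == "__main__":
    print(check_cluster_info(parse_cluster_info(SAMPLE)) or "healthy")
```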
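# Check cluster info (expect cluster_state:ok and cluster_slots_assigned:16384)
redis-cli -c -h 192.168.1.10 -p 7000 -a your_redis_password cluster info
# List nodes with their roles, master links and slot ranges
redis-cli -c -h 192.168.1.10 -p 7000 -a your_redis_password cluster nodes
Enterprise Configuration Management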
Configuration Templatization
Ansible Configuration Management
# redis-cluster-playbook.yml
---
- hosts: redis_cluster
  become: yes
  vars:
    redis_port: 7000
    redis_password: "{{ vault_redis_password }}"
    redis_maxmemory: "{{ ansible_memtotal_mb // 2 }}mb"
  tasks:
    - name: Install Redis dependencies
      yum:
        name: "{{ item }}"
        state: present
      loop:
        - gcc
        - gcc-c++
        - make
    - name: Create Redis user
      user:
        name: redis
        system: yes
        shell: /bin/false
    - name: Create Redis directories
      file:
        path: "{{ item }}"
        state: directory
        owner: redis
        group: redis
        mode: '0755'
      loop:
        - /usr/local/redis/conf
        - /usr/local/redis/data
        - /usr/local/redis/logs
    - name: Deploy Redis configuration
      template:
        src: redis.conf.j2
        dest: /usr/local/redis/conf/redis-{{ redis_port }}.conf
        owner: redis
        group: redis
        mode: '0640'
      notify: restart redis
    - name: Deploy systemd service
      template:
        src: redis.service.j2
        dest: /etc/systemd/system/redis-{{ redis_port }}.service
      notify: reload systemd
  handlers:
    - name: reload systemd
      systemd:
        daemon_reload: yes
    - name: restart redis
      systemd:
        name: redis-{{ redis_port }}
        state: restarted
Configuration Template (redis.conf.j2)
bind 0.0.0.0
port {{ redis_port }}
# systemd supervises the process, so stay in the foreground
daemonize no
pidfile /var/run/redis_{{ redis_port }}.pid
logfile /usr/local/redis/logs/redis-{{ redis_port }}.log
dir /usr/local/redis/data
# Cluster
cluster-enabled yes
cluster-config-file nodes-{{ redis_port }}.conf
cluster-node-timeout 15000
cluster-announce-ip {{ ansible_default_ipv4.address }}
cluster-announce-port {{ redis_port }}
cluster-announce-bus-port {{ redis_port | int + 10000 }}
# Memory
maxmemory {{ redis_maxmemory }}
maxmemory-policy allkeys-lru
# Persistence
save 900 1
save 300 10
save 60 10000
appendonly yes
appendfilename "appendonly-{{ redis_port }}.aof"
appendfsync everysec
# Security
requirepass "{{ redis_password }}"
masterauth "{{ redis_password }}"
# Network
tcp-keepalive 60
timeout 300
tcp-backlog 511
Configuration Version Control
# Initialize config repository
mkdir /opt/redis-config
cd /opt/redis-config
git init
# Directory layout
mkdir -p environments/{dev,test,prod}/group_vars templates scripts monitoring
# Example environment file (prod)
# environments/prod/group_vars/all.yml
redis_cluster_nodes:
  - host: 192.168.1.10
    port: 7000
    role: master
  - host: 192.168.1.11
    port: 7000
    role: slave
Log Management and Monitoring
Log Configuration and Classification
Log Level Configuration
# Redis log level
# debug – verbose, for development
# verbose – many details
# notice – moderate, suitable for production
# warning – only important messages
loglevel notice
logfile /usr/local/redis/logs/redis-7000.log
syslog-enabled yes
syslog-ident redis-7000
syslog-facility local0
Log Rotation
# /etc/logrotate.d/redis
# No postrotate step is needed. Redis reopens its log file in append mode every
# time it writes a log line, so after rotation new lines go to a fresh file.
# SIGUSR1 is not a log-reopen signal for redis-server, so do not send it.
/usr/local/redis/logs/*.log {
    daily
    missingok
    rotate 30
    compress
    delaycompress
    notifempty
    create 640 redis redis
}
Monitoring Metrics Collection
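The exporter deployed in this section derives its memory metrics largely from `INFO memory`. The sketch below computes the same used/max ratio that the RedisHighMemoryUsage rule in this guide checks, working directly from raw INFO text. It treats `maxmemory:0` (no limit) as "no ratio" rather than dividing by zero. `parse_info` and `memory_usage_ratio` are illustrative names.

```python
from typing import Dict, Optional


def parse_info(text: str) -> Dict[str, str]:
    """Parse INFO output: skip '# Section' headers and blank lines, split on the first ':'."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key] = value
    return fields


def memory_usage_ratio(info_text: str) -> Optional[float]:
    """Return used_memory / maxmemory, or None when maxmemory is 0 (unlimited)."""
    fields = parse_info(info_text)
    used = int(fields["used_memory"])
    limit = int(fields.get("maxmemory", "0"))
    if limit == 0:
        return None
    return used / limit


SAMPLE = ("# Memory\r\nused_memory:1610612736\r\nused_memory_human:1.50G\r\n"
          "maxmemory:2147483648\r\nmaxmemory_human:2.00G\r\n")

if __name__ == "__main__":
    ratio = memory_usage_ratio(SAMPLE)
    print("no limit" if ratio is None else f"{ratio:.1%}")
```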
Prometheus Monitoring Configuration
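# prometheus.yml
global:
  scrape_interval: 15s
# Load the Redis alert rules (alertmanager-rules.yml) defined in this guide
rule_files:
  - alertmanager-rules.yml
scrape_configs:
  - job_name: 'redis-cluster'
    scrape_interval: 10s
    metrics_path: /metrics
    static_configs:
      # one redis_exporter per Redis node (default exporter port 9121)
      - targets:
          - '192.168.1.10:9121'
          - '192.168.1.11:9121'
          - '192.168.1.12:9121'
          - '192.168.1.13:9121'
          - '192.168.1.14:9121'
          - '192.168.1.15:9121'
Redis Exporter Deployment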
# Download Redis Exporter
wget https://github.com/oliver006/redis_exporter/releases/download/v1.45.0/redis_exporter-v1.45.0.linux-amd64.tar.gz
tar xzf redis_exporter-v1.45.0.linux-amd64.tar.gz
# the archive unpacks into a versioned directory
cp redis_exporter-v1.45.0.linux-amd64/redis_exporter /usr/local/bin/
# Systemd service
cat > /etc/systemd/system/redis-exporter.service <<'EOF'
[Unit]
Description=Redis Exporter
After=network.target
[Service]
Type=simple
User=redis
ExecStart=/usr/local/bin/redis_exporter \
-redis.addr=redis://localhost:7000 \
-redis.password=your_redis_password
Restart=always
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl start redis-exporter
systemctl enable redis-exporter
Key Monitoring Metrics
# Memory usage
redis_memory_used_bytes
redis_memory_max_bytes
redis_memory_used_rss_bytes
# Connection count
redis_connected_clients
redis_blocked_clients
redis_rejected_connections_total
# Command stats
redis_commands_processed_total
redis_commands_duration_seconds_total
# Cluster status (from INFO and CLUSTER INFO)
redis_cluster_enabled
redis_cluster_known_nodes
redis_cluster_slots_assigned
redis_cluster_slots_ok
redis_cluster_slots_pfail
redis_cluster_slots_fail
# Replication
redis_replication_backlog_bytes
redis_connected_slave_lag_seconds
redis_master_repl_offset
Log Analysis and Alerting
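The Filebeat and Logstash setup in this section has to parse Redis's own log line format. For Redis 3.0 and later, each line looks like `pid:role dd Mon yyyy HH:MM:SS.mmm level message`. The role is a single letter: M master, S replica, C forked child, X sentinel. The level is a symbol: `.` debug, `-` verbose, `*` notice, `#` warning. A small Python sketch of the same parse is handy for testing a pattern against sample lines. `parse_redis_log_line` is an illustrative name, and the sketch assumes English month abbreviations.

```python
import re
from datetime import datetime
from typing import Optional

LOG_RE = re.compile(
    r"^(?P<pid>\d+):(?P<role>[MSCX]) "
    r"(?P<timestamp>\d{2} \w{3} \d{4} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<level>[.\-*#]) (?P<message>.*)$"
)
LEVELS = {".": "debug", "-": "verbose", "*": "notice", "#": "warning"}
ROLES = {"M": "master", "S": "replica", "C": "child", "X": "sentinel"}


def parse_redis_log_line(line: str) -> Optional[dict]:
    """Return the parsed fields, or None for lines that do not match the format."""
    m = LOG_RE.match(line.rstrip("\n"))
    if not m:
        return None
    return {
        "pid": int(m["pid"]),
        "role": ROLES[m["role"]],
        # Redis formats the date with %d %b %Y %H:%M:%S plus milliseconds
        "time": datetime.strptime(m["timestamp"], "%d %b %Y %H:%M:%S.%f"),
        "level": LEVELS[m["level"]],
        "message": m["message"],
        "alert": m["level"] == "#",  # warning is the most severe level Redis writes
    }
```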
ELK Stack Integration (filebeat.yml)
# filebeat.yml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /usr/local/redis/logs/*.log
    fields:
      service: redis
      environment: production
    fields_under_root: true
output.logstash:
  hosts: ["logstash:5044"]
processors:
  - add_host_metadata:
      when.not.contains.tags: forwarded
Logstash Configuration (logstash-redis.conf)
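# input
input {
  beats {
    port => 5044
  }
}
filter {
  if [service] == "redis" {
    # Redis 3.0+ log line: "<pid>:<role> <dd Mon yyyy HH:MM:SS.mmm> <level> <message>"
    # role: M master, S replica, C child process, X sentinel
    # level is a symbol: . debug, - verbose, * notice, # warning
    grok {
      match => { "message" => "%{POSINT:pid}:(?<role>[MSCX]) (?<timestamp>%{MONTHDAY} %{MONTH} %{YEAR} %{TIME}) (?<level>[.*#-]) %{GREEDYDATA:message}" }
      overwrite => ["message"]
    }
    date {
      match => ["timestamp", "dd MMM yyyy HH:mm:ss.SSS"]
    }
    # Redis has no ERROR level; "#" (warning) is the most severe level it writes
    if [level] == "#" {
      mutate { add_tag => ["alert"] }
    }
  }
}
output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "redis-%{+YYYY.MM.dd}"
  }
}
Alert Rules (alertmanager-rules.yml)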
# alertmanager-rules.yml
# These are Prometheus alerting rules, loaded through rule_files in prometheus.yml.
# Alertmanager receives the alerts that fire and handles routing and notification.
groups:
  - name: redis.rules
    rules:
      - alert: RedisDown
        expr: redis_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Redis instance is down"
          description: "Redis instance {{ $labels.instance }} is down"
      - alert: RedisHighMemoryUsage
        # The second condition skips instances with maxmemory 0 (no limit), where the ratio is +Inf
        expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.9 and redis_memory_max_bytes > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory usage is high"
          description: "Redis memory usage is {{ $value | humanizePercentage }}"
      - alert: RedisHighConnectionCount
        expr: redis_connected_clients > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis connection count is high"
          description: "Redis has {{ $value }} connections"
      - alert: RedisClusterSlotsFail
        # cluster_slots_fail from CLUSTER INFO: slots served by a node marked FAIL
        expr: redis_cluster_slots_fail > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Redis cluster node is down"
          description: "{{ $value }} hash slots are served by a node in FAIL state"
Queue Settings and Management
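The List commands in this section behave like a double-ended queue. LPUSH adds at the head and RPOP/BRPOP removes from the tail, so messages leave in arrival order. A plain RPOP also removes a message before the consumer has processed it. The usual remedy is to move the message atomically into a per-consumer processing list with LMOVE (Redis 6.2+, formerly RPOPLPUSH) and delete it with LREM once the work is done. The in-memory model below only illustrates those semantics with `collections.deque`. `ListQueueModel` and the `myqueue:processing` key name are hypothetical, and the class is not a Redis client.

```python
from collections import deque
from typing import Optional


class ListQueueModel:
    """In-memory stand-in for a Redis list queue plus a processing list (illustration only)."""

    def __init__(self) -> None:
        self.queue = deque()       # models the key "myqueue"; index 0 is the list head
        self.processing = deque()  # models a hypothetical "myqueue:processing" list

    def lpush(self, message: str) -> None:
        self.queue.appendleft(message)  # LPUSH myqueue message

    def rpop(self) -> Optional[str]:
        return self.queue.pop() if self.queue else None  # RPOP myqueue

    def lmove_right_left(self) -> Optional[str]:
        # LMOVE myqueue myqueue:processing RIGHT LEFT: pop the tail, push onto processing head
        if not self.queue:
            return None
        message = self.queue.pop()
        self.processing.appendleft(message)
        return message

    def ack(self, message: str) -> None:
        self.processing.remove(message)  # LREM myqueue:processing 1 message
```

If a consumer crashes between the move and the acknowledgement, the message stays in the processing list, where a recovery job can push it back onto the queue.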
Redis Queue Modes
List Queue Implementation
# Simple queue using List
LPUSH myqueue "message1"
LPUSH myqueue "message2"
# Consumer (LPUSH at the head + RPOP from the tail = FIFO order)
RPOP myqueue
# Blocking consumption (timeout 0 = wait indefinitely)
BRPOP myqueue 0
Stream Queue Implementation
# Create a stream
XADD mystream * field1 value1 field2 value2
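# Consumer group ("0" = start from the first entry; use "$" for new entries only)
XGROUP CREATE mystream mygroup 0 MKSTREAM
# Consume messages (">" = entries never delivered to this group)
XREADGROUP GROUP mygroup consumer1 COUNT 1 STREAMS mystream >
# Acknowledge (message_id is the entry ID returned by XREADGROUP)
XACK mystream mygroup message_id
Enterprise Queue Configuration (redis-queue.conf)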
# Basic settings
port 6379
bind 0.0.0.0
daemonize yes
pidfile /var/run/redis-queue.pid
logfile /usr/local/redis/logs/redis-queue.log
dir /usr/local/redis/data
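# Memory (queues need more)
maxmemory 4gb
# Never evict queue data. With noeviction, writes fail with an OOM error once
# maxmemory is reached, instead of pending messages being silently dropped.
maxmemory-policy noeviction
# Persistence (AOF with everysec: a crash can lose about one second of writes)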
appendonly yes
appendfilename "appendonly-queue.aof"
appendfsync everysec
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
# Security (the instance listens on all interfaces)
requirepass "your_redis_password"
# Network
timeout 0
tcp-keepalive 300
tcp-backlog 511
# Client limits
maxclients 10000
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit replica 256mb 64mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
# Queue‑specific settings
list-max-ziplist-size -2
list-compress-depth 0
stream-node-max-bytes 4096
stream-node-max-entries 100
Queue Monitoring Script (Python)
#!/usr/bin/env python3
import json
import logging
import time
from datetime import datetime

import redis

logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')


class RedisQueueMonitor:
    def __init__(self, host='localhost', port=6379, password=None):
        self.redis_client = redis.Redis(host=host, port=port, password=password, decode_responses=True)

    def _keys(self, pattern, key_type):
        # SCAN instead of KEYS: KEYS blocks the server while it walks the whole keyspace.
        # The TYPE filter (Redis 6.0+) skips matching keys of other types (no WRONGTYPE errors).
        return self.redis_client.scan_iter(match=pattern, count=1000, _type=key_type)

    def monitor_list_queues(self, patterns):
        stats = {}
        for pattern in patterns:
            for q in self._keys(pattern, 'list'):
                length = self.redis_client.llen(q)
                stats[q] = {'type': 'list', 'length': length, 'timestamp': datetime.now().isoformat()}
                if length > 10000:
                    logging.warning(f"Queue {q} has {length} items")
        return stats

    def monitor_stream_queues(self, patterns):
        stats = {}
        for pattern in patterns:
            for s in self._keys(pattern, 'stream'):
                try:
                    length = self.redis_client.xlen(s)
                    info = self.redis_client.xinfo_stream(s)
                    groups = self.redis_client.xinfo_groups(s)
                    stats[s] = {'type': 'stream', 'length': length,
                                'first_entry': info['first-entry'], 'last_entry': info['last-entry'],
                                'groups': len(groups), 'timestamp': datetime.now().isoformat()}
                    for g in groups:
                        # 'lag' is reported only by Redis 7.0+ and may be None; 'pending'
                        # counts messages delivered to consumers but not yet acknowledged
                        lag = g.get('lag')
                        if lag is not None and lag > 1000:
                            logging.warning(f"Stream {s} group {g['name']} lag {lag}")
                        if g['pending'] > 1000:
                            logging.warning(f"Stream {s} group {g['name']} pending {g['pending']}")
                except redis.RedisError as e:
                    logging.error(f"Error monitoring stream {s}: {e}")
        return stats

    def get_memory_usage(self):
        info = self.redis_client.info('memory')
        return {k: info[k] for k in ('used_memory', 'used_memory_human', 'used_memory_peak', 'used_memory_peak_human')}

    def run_monitoring(self):
        queue_patterns = ['task:*', 'job:*', 'message:*']
        stream_patterns = ['stream:*', 'events:*']
        while True:
            try:
                list_stats = self.monitor_list_queues(queue_patterns)
                stream_stats = self.monitor_stream_queues(stream_patterns)
                mem = self.get_memory_usage()
                logging.info(json.dumps({'timestamp': datetime.now().isoformat(), 'list_queues': list_stats,
                                         'stream_queues': stream_stats, 'memory': mem}, indent=2))
                time.sleep(60)
            except Exception as e:
                logging.error(f"Monitoring error: {e}")
                time.sleep(10)


if __name__ == '__main__':
    monitor = RedisQueueMonitor(host='localhost', port=6379, password='your_redis_password')
    monitor.run_monitoring()
Queue Optimization Configuration
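The list settings in this section follow the convention documented in redis.conf. For `list-max-ziplist-size`, a positive value caps the number of entries per quicklist node, while -1 to -5 cap each node's size at 4, 8, 16, 32 or 64 KB. For `list-compress-depth`, N leaves N nodes at each end of the list uncompressed, and 0 disables compression. The helpers below simply spell that out. `describe_list_node_limit` and `compressed_nodes` are illustrative names.

```python
from typing import Tuple

# list-max-ziplist-size negative values -> maximum node size in bytes (per redis.conf)
NEGATIVE_SIZE_LIMITS = {-1: 4096, -2: 8192, -3: 16384, -4: 32768, -5: 65536}


def describe_list_node_limit(value: int) -> Tuple[str, int]:
    """Return ("entries", n) or ("bytes", n) for a list-max-ziplist-size setting."""
    if value > 0:
        return ("entries", value)
    if value in NEGATIVE_SIZE_LIMITS:
        return ("bytes", NEGATIVE_SIZE_LIMITS[value])
    raise ValueError("list-max-ziplist-size must be positive or between -1 and -5")


def compressed_nodes(total_nodes: int, compress_depth: int) -> int:
    """How many quicklist nodes a given list-compress-depth leaves LZF-compressed."""
    if compress_depth == 0:
        return 0  # 0 disables compression entirely
    return max(0, total_nodes - 2 * compress_depth)
```

With `list-compress-depth 1`, as in the memory optimization settings, a 10-node list keeps its head and tail nodes uncompressed and compresses the 8 in between.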
Memory Optimization for Queues
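# Quicklist node size: -2 caps each node at 8 KB (a positive value caps entries per node)
list-max-ziplist-size -2
# LZF-compress interior list nodes, leaving 1 node uncompressed at each end
list-compress-depth 1
# Stream optimization
stream-node-max-bytes 4096
stream-node-max-entries 100
# Eviction policy: never evict queue data
maxmemory-policy noeviction
Persistence Optimization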
# Disable RDB, enable AOF only
save ""
appendonly yes
appendfilename "appendonly-queue.aof"
appendfsync everysec
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
aof-rewrite-incremental-fsync yes
Performance Optimization and Tuning
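The resharding command in this section moves a fixed number of slots from one master to another. When slots end up unevenly spread, for example after adding a master, the values to pass to `--cluster-from`, `--cluster-to` and `--cluster-slots` can be derived from per-master slot counts. The greedy sketch below computes one possible plan that brings every master within one slot of an even share. `rebalance_plan` is a hypothetical helper. redis-cli also offers `--cluster rebalance`, which computes and executes its own plan.

```python
from typing import Dict, List, Tuple


def rebalance_plan(slot_counts: Dict[str, int]) -> List[Tuple[str, str, int]]:
    """Return (source_node, target_node, slots) moves that even out slot ownership."""
    base, extra = divmod(sum(slot_counts.values()), len(slot_counts))
    # Nodes that already hold the most slots keep the `extra` leftover slots,
    # which avoids moving slots back and forth
    order = sorted(slot_counts, key=lambda n: (-slot_counts[n], n))
    target = {n: base + (1 if i < extra else 0) for i, n in enumerate(order)}
    nodes = sorted(slot_counts)  # deterministic order by node ID
    surplus = [[n, slot_counts[n] - target[n]] for n in nodes if slot_counts[n] > target[n]]
    deficit = [[n, target[n] - slot_counts[n]] for n in nodes if slot_counts[n] < target[n]]
    moves = []
    i = j = 0
    while i < len(surplus) and j < len(deficit):
        amount = min(surplus[i][1], deficit[j][1])
        moves.append((surplus[i][0], deficit[j][0], amount))
        surplus[i][1] -= amount
        deficit[j][1] -= amount
        if surplus[i][1] == 0:
            i += 1
        if deficit[j][1] == 0:
            j += 1
    return moves


if __name__ == "__main__":
    # e.g. a new, empty master "d" joined a three-master cluster
    counts = {"a": 5461, "b": 5462, "c": 5461, "d": 0}
    for src, dst, n in rebalance_plan(counts):
        print(f"--cluster-from {src} --cluster-to {dst} --cluster-slots {n}")
```

In practice the keys would be the node IDs shown by `cluster nodes`, and each move becomes one `--cluster reshard` run.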
Cluster Performance Optimization
Slot Distribution Optimization
# Check slot distribution
redis-cli -c -h 192.168.1.10 -p 7000 -a your_redis_password cluster slots
# Move 1000 slots from one master to another (node IDs come from "cluster nodes")
redis-cli --cluster reshard 192.168.1.10:7000 \
--cluster-from source_node_id \
--cluster-to target_node_id \
--cluster-slots 1000 \
--cluster-yes -a your_redis_password
Read-Write Separation
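# Replicas reject writes (the default)
replica-read-only yes
# In Redis Cluster, a replica serves reads only on connections that have sent the
# READONLY command; otherwise it redirects the client to the master with MOVED.
# Routing reads to replicas is therefore a client-side setting (application level),
# and replica reads may return slightly stale data.
Fault Handling and Operations Practice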
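When the health check in this section reports a problem, `cluster nodes` shows which node is affected. Each line of its output has the form `<id> <ip:port@cport> <flags> <master-id> <ping-sent> <pong-recv> <config-epoch> <link-state> <slots...>`. The sketch below parses that output. It lists nodes flagged `fail` or `fail?`, plus masters that no longer have a connected, healthy replica to promote. The function names are illustrative, and the sample node IDs are shortened for readability.

```python
from typing import Dict, List

FAIL_FLAGS = {"fail", "fail?"}


def parse_cluster_nodes(text: str) -> List[Dict]:
    """Parse CLUSTER NODES output into one dict per node."""
    nodes = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 8:
            continue
        nodes.append({
            "id": parts[0],
            "addr": parts[1].split("@")[0],  # drop the cluster-bus port
            "flags": set(parts[2].split(",")),
            "master_id": None if parts[3] == "-" else parts[3],
            "link": parts[7],
            "slots": parts[8:],
        })
    return nodes


def find_problems(nodes: List[Dict]) -> List[str]:
    problems = []
    for n in nodes:
        bad = n["flags"] & FAIL_FLAGS
        if bad:
            problems.append(f"{n['addr']} ({n['id']}) flagged {','.join(sorted(bad))}")
    for m in nodes:
        if "master" not in m["flags"] or not m["slots"]:
            continue
        healthy = [r for r in nodes
                   if r["master_id"] == m["id"] and r["link"] == "connected"
                   and not r["flags"] & FAIL_FLAGS]
        if not healthy:
            problems.append(f"master {m['addr']} has no healthy replica")
    return problems


SAMPLE = """\
a1 192.168.1.10:7000@17000 myself,master - 0 0 1 connected 0-5460
b2 192.168.1.12:7000@17000 master - 0 1700000000000 2 connected 5461-10922
c3 192.168.1.14:7000@17000 master - 0 1700000000000 3 connected 10923-16383
d4 192.168.1.11:7000@17000 slave a1 0 1700000000000 1 connected
e5 192.168.1.13:7000@17000 slave,fail b2 1700000000000 1700000000000 2 disconnected
f6 192.168.1.15:7000@17000 slave c3 0 1700000000000 3 connected
"""

if __name__ == "__main__":
    for problem in find_problems(parse_cluster_nodes(SAMPLE)):
        print(problem)
```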
Automatic Failover Script
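#!/bin/bash
# Redis Cluster promotes replicas on its own. This check only detects an unhealthy
# cluster so that follow-up actions (alerting, manual recovery) can be triggered.
check_cluster_health() {
  local host=$1 port=$2 password=$3
  redis-cli -h "$host" -p "$port" -a "$password" --no-auth-warning cluster info 2>/dev/null \
    | grep -q "^cluster_state:ok"
}
if ! check_cluster_health "192.168.1.10" "7000" "your_redis_password"; then
  echo "Cluster unhealthy, triggering failover procedures..."
  # Insert failover logic here, e.g. send CLUSTER FAILOVER to a healthy replica
  # of the failed master (the command must be run on that replica)
fi
Backup and Restore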
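# Backup script: run it on the node being backed up (here 192.168.1.10), because it copies local files
BACKUP_DIR="/backup/redis/$(date +%Y%m%d)"
mkdir -p "$BACKUP_DIR"
# Trigger an RDB snapshot; BGSAVE returns immediately, so wait for the background save to finish
redis-cli -h 192.168.1.10 -p 7000 -a your_redis_password --no-auth-warning BGSAVE
while redis-cli -h 192.168.1.10 -p 7000 -a your_redis_password --no-auth-warning INFO persistence \
  | grep -q "^rdb_bgsave_in_progress:1"; do
  sleep 1
done
# Copy the RDB snapshot (default dbfilename dump.rdb) and the AOF files
cp /usr/local/redis/data/dump.rdb /usr/local/redis/data/appendonly*.aof "$BACKUP_DIR"/
# Backup cluster node info
redis-cli -h 192.168.1.10 -p 7000 -a your_redis_password --no-auth-warning cluster nodes > "$BACKUP_DIR/cluster-nodes.txt"
Summary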
This article provides a complete, enterprise‑grade solution for operating Redis clusters on Linux, covering architecture design, standardized configuration management, comprehensive monitoring, automated operational scripts, and performance tuning to build a highly available and high‑performance Redis service.
Key Points
Architecture: A standard 3-master, 3-slave cluster survives the failure of any single node.
Configuration Management: Template-based, version-controlled configurations keep all nodes consistent.
Monitoring System: Full metric collection, log analysis, and alerting enable proactive operations.
Queue Management: Choose the appropriate queue model (List or Stream) for each scenario.
Performance Tuning: Ongoing monitoring and parameter adjustments keep the system performing well.
Operational Recommendations
Regularly perform health checks and performance evaluations.
Establish robust backup and recovery mechanisms.
Define detailed fault‑handling procedures.
Continuously optimize configuration parameters and monitoring metrics.
Stay updated with new Redis features and best practices.
By following the best practices outlined in this guide, operations engineers can build and maintain a stable, efficient Redis cluster that reliably supports critical business workloads.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
MaGe Linux Operations
Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.