How to Build Redis Master‑Slave Replication with Sentinel and Enable Automatic Failover
Learn step‑by‑step how to configure Redis master‑slave replication, set up Sentinel for health monitoring, automate failover, secure the deployment, tune performance, troubleshoot common issues, and integrate monitoring with Prometheus and Grafana to achieve high availability for production workloads.
Overview
This guide explains how to turn a single‑node Redis instance into a production‑grade high‑availability service using master‑slave replication and Sentinel for automatic failover. It covers the underlying replication protocol, required configuration files, client integration, monitoring, troubleshooting, performance tuning, security hardening, and operational runbooks.
Redis Replication Mechanics
Redis replication is a hybrid push‑pull model. A slave connects to the master and sends PSYNC. The master decides whether to perform a full sync (RDB snapshot + command replay) or a partial sync (send only the missing entries from the repl_backlog buffer).
Key fields:
replid – the replication ID. A master that restarts typically gets a new replid, which forces its replicas into a full sync.
offset – the replication offset. The difference master_repl_offset - slave_repl_offset indicates how many bytes the slave lags behind the master.
Typical INFO replication output shows role, master_link_status, master_last_io_seconds_ago, and the offsets.
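To make the offset arithmetic concrete, here is a minimal Python sketch that parses INFO replication text and computes the lag. The sample outputs and the replid value are hypothetical and trimmed to the fields discussed above; the helper names parse_info and replica_lag_bytes are illustrative, not part of any Redis client.

```python
def parse_info(text: str) -> dict[str, str]:
    """Parse the 'key:value' lines of a Redis INFO section into a dict."""
    fields: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers such as "# Replication"
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def replica_lag_bytes(master_info: dict[str, str], replica_info: dict[str, str]) -> int:
    """Lag = the master's master_repl_offset minus the replica's slave_repl_offset."""
    return int(master_info["master_repl_offset"]) - int(replica_info["slave_repl_offset"])


# Hypothetical sample output (not captured from a real server).
MASTER_SAMPLE = """\
# Replication
role:master
connected_slaves:2
master_replid:0123456789abcdef0123456789abcdef01234567
master_repl_offset:1500
"""

REPLICA_SAMPLE = """\
# Replication
role:slave
master_link_status:up
master_last_io_seconds_ago:1
slave_repl_offset:1200
"""

master = parse_info(MASTER_SAMPLE)
replica = parse_info(REPLICA_SAMPLE)
print(replica["master_link_status"], replica_lag_bytes(master, replica))  # prints: up 300
```

In practice the two text blobs would come from running redis-cli INFO replication against the master and against each slave.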
Sentinel Architecture
Sentinel runs as an independent process (usually three or five instances). Its responsibilities are:
Continuously PING master and slaves to detect failures (monitoring).
Notify administrators via scripts (notification).
Perform automatic failover when a quorum of sentinels agrees the master is down (ODOWN).
Provide the current master address to clients (configuration provider).
Sentinel distinguishes two failure states:
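Subjective down (SDOWN) – a single sentinel marks the master as down after receiving no valid reply for down-after-milliseconds.
Objective down (ODOWN) – at least quorum sentinels (the number set in sentinel monitor) agree that the master is down. This triggers a failover, which must then be authorized by a majority of all sentinels.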
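In Sentinel, the quorum configured in sentinel monitor only decides when the master is flagged ODOWN. Actually running the failover also needs authorization from a majority of all sentinels. The sketch below models that rule in Python; the function names are illustrative and this is a simplified model, not Sentinel's implementation.

```python
def is_odown(sdown_votes: int, quorum: int) -> bool:
    """The master is objectively down once `quorum` sentinels report it as down."""
    return sdown_votes >= quorum


def failover_votes_needed(total_sentinels: int, quorum: int) -> int:
    """Votes a sentinel must collect to lead the failover.

    It needs a majority of all known sentinels, and never fewer than the quorum.
    """
    majority = total_sentinels // 2 + 1
    return max(majority, quorum)


# Five sentinels with quorum 2: two SDOWN reports are enough to flag ODOWN,
# but three sentinels must still be reachable to authorize the failover.
print(is_odown(2, quorum=2))                # True
print(failover_votes_needed(5, quorum=2))   # 3
```

This is why a minority partition cannot promote a replica, even when its sentinels agree among themselves that the master is down.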
Deployment Steps
Install Redis (e.g., yum install -y redis or compile from source).
Create a shared configuration fragment redis-common.conf with common settings (bind, port, persistence, memory limits, security).
Configure the master (port 6379) without replicaof. Still set replica-read-only yes and replica-serve-stale-data yes, since these take effect if the node is later demoted to a replica.
Configure each slave (ports 6380 and 6381) with replicaof 127.0.0.1 6379 and the same security settings.
Write a Sentinel configuration file (sentinel.conf) for each instance, e.g.:
bind 127.0.0.1
port 26379
daemonize yes
logfile /var/log/redis/sentinel.log
dir /var/lib/redis
sentinel monitor redis-ha 127.0.0.1 6379 2
sentinel auth-pass redis-ha <PASSWORD>
sentinel down-after-milliseconds redis-ha 5000
sentinel parallel-syncs redis-ha 1
sentinel failover-timeout redis-ha 60000
Start all Redis instances and Sentinel processes. Verify with redis-cli INFO replication and redis-cli -p 26379 SENTINEL get-master-addr-by-name redis-ha.
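When several sentinels share one host, each needs its own port and its own file. The following Python sketch renders one configuration per port using only the directives shown above. The ports 26380 and 26381 and the per-port log file name are assumptions made for illustration.

```python
def render_sentinel_conf(port: int, master_name: str, master_host: str,
                         master_port: int, quorum: int, password: str) -> str:
    """Render a sentinel.conf body for one Sentinel instance."""
    lines = [
        "bind 127.0.0.1",
        f"port {port}",
        "daemonize yes",
        # Assumption: one log file per instance so local sentinels do not share a log.
        f"logfile /var/log/redis/sentinel-{port}.log",
        "dir /var/lib/redis",
        f"sentinel monitor {master_name} {master_host} {master_port} {quorum}",
        f"sentinel auth-pass {master_name} {password}",
        f"sentinel down-after-milliseconds {master_name} 5000",
        f"sentinel parallel-syncs {master_name} 1",
        f"sentinel failover-timeout {master_name} 60000",
    ]
    return "\n".join(lines) + "\n"


# Three local sentinels; <PASSWORD> stays a placeholder, as in the guide.
configs = {
    port: render_sentinel_conf(port, "redis-ha", "127.0.0.1", 6379, 2, "<PASSWORD>")
    for port in (26379, 26380, 26381)
}
print(configs[26380])
```

Each rendered string can be written to its own file and passed to redis-sentinel. Sentinel rewrites its configuration file at runtime, so the files must be writable by the Sentinel process.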
Client Integration
Never hard‑code the master's IP/port in clients. Use the Sentinel API to discover the current master. Example snippets:
// Java (Lettuce)
RedisURI uri = RedisURI.builder()
    .withSentinel("10.20.0.11", 26379)
    .withSentinel("10.20.0.12", 26380)
    .withSentinel("10.20.0.13", 26381)
    .withSentinelMasterId("redis-ha")
    .withPassword("<PASSWORD>".toCharArray())
    .build();
RedisClient client = RedisClient.create(uri);
// connect() resolves the current master through the sentinels listed in the URI
StatefulRedisConnection<String, String> conn = client.connect();
RedisCommands<String, String> cmd = conn.sync();
cmd.set("key", "value");
# Python (redis‑py)
from redis.sentinel import Sentinel
sentinel = Sentinel([('10.20.0.11', 26379), ('10.20.0.12', 26380)], password='<PASSWORD>')
# master_for() returns a client that looks up the current master via Sentinel
master = sentinel.master_for('redis-ha')
master.set('foo', 'bar')
// Go (go‑redis v9)
ctx := context.Background()
client := redis.NewFailoverClient(&redis.FailoverOptions{
    MasterName:    "redis-ha",
    SentinelAddrs: []string{"10.20.0.11:26379", "10.20.0.12:26380", "10.20.0.13:26381"},
    Password:      "<PASSWORD>",
})
client.Set(ctx, "foo", "bar", 0)
All three libraries automatically reconnect to the new master after a failover.
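Even with a Sentinel‑aware library, commands issued while the failover is in progress can fail, so applications often wrap writes in a short retry loop. The sketch below is a library‑agnostic Python helper. The name call_with_failover_retry, its parameters, and the fake operation in the demo are all hypothetical.

```python
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_failover_retry(
    get_master: Callable[[], object],
    operation: Callable[[object], T],
    retriable: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    attempts: int = 5,
    delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation(master), re-resolving the master before every attempt."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    last_error: Optional[BaseException] = None
    for attempt in range(attempts):
        try:
            return operation(get_master())
        except retriable as exc:
            last_error = exc
            if attempt < attempts - 1:
                sleep(delay_s)  # give Sentinel time to promote a replica
    assert last_error is not None
    raise last_error


# Demo with a fake operation that fails twice, as if a failover were in progress.
calls = {"n": 0}

def flaky_set(_master: object) -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("master unreachable")
    return "OK"

result = call_with_failover_retry(lambda: None, flaky_set, sleep=lambda s: None)
print(result, calls["n"])  # prints: OK 3
```

With redis‑py, get_master could be lambda: sentinel.master_for('redis-ha'), and retriable would need to list the library's own exception classes (for example those in redis.exceptions), since they do not derive from the built‑in ConnectionError. Check the exact class names against the installed version.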
Monitoring & Alerting
Deploy redis_exporter on each node and scrape it with Prometheus. Essential metrics include:
redis_up – instance health.
redis_connected_slaves – number of online replicas.
redis_master_repl_offset and redis_slave_repl_offset – replication offsets, whose difference is the replication lag.
used_memory and mem_fragmentation_ratio – memory usage.
rejected_connections – connections refused because maxclients was reached.
Sample Prometheus alerting rules (simplified), with alerts routed through Alertmanager:
- alert: RedisDown
  expr: redis_up == 0
  for: 1m
  labels:
    severity: critical
  annotations:
    summary: "Redis instance {{ $labels.instance }} is down"
- alert: RedisReplicationLag
  expr: (redis_master_repl_offset - redis_slave_repl_offset) > 1048576
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Replication lag exceeds 1 MiB on {{ $labels.instance }}"
Grafana dashboards (e.g., ID 11835) visualize these metrics.
Troubleshooting Scenarios
Slave disconnect – check master_link_status and master_link_down_since_seconds. Increase repl-backlog-size (≥1 GiB) and verify network/firewall rules.
Sentinel cannot elect a new master – ensure at least three sentinels, correct quorum (N/2 + 1), and that all sentinels can ping each other (open port 26379). Use SENTINEL reset redis-ha to clear stale state.
Data loss after split‑brain – enable min-replicas-to-write 1 and min-replicas-max-lag 10 so the master rejects writes when it cannot confirm a healthy replica.
Clients still connected to old master after failover – use client libraries that support Sentinel‑driven reconnection (Lettuce, go‑redis, redis‑py). For custom clients, treat READONLY errors on writes (the old master has been demoted to a replica) as a signal to re‑resolve the master address, or subscribe to Sentinel's +switch-master Pub/Sub channel.
In every case, start by inspecting INFO replication, SENTINEL masters, and the Redis and Sentinel logs with redis-cli.
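The first checks in the scenarios above can be condensed into a small triage function over the INFO replication fields. This is a rough Python sketch: the function name and the returned labels are hypothetical, and the 1 MiB default threshold simply mirrors the alert rule shown earlier.

```python
def triage_replica(info: dict[str, str], master_offset: int,
                   max_lag_bytes: int = 1_048_576) -> str:
    """Classify a replica from its parsed INFO replication fields."""
    if info.get("role") != "slave":
        return "not-a-replica"
    if info.get("master_link_status") != "up":
        # master_link_down_since_seconds is only reported while the link is down.
        down_for = info.get("master_link_down_since_seconds", "?")
        return f"link-down ({down_for}s)"
    lag = master_offset - int(info.get("slave_repl_offset", "0"))
    if lag > max_lag_bytes:
        return f"lagging ({lag} bytes)"
    return "healthy"


# Hypothetical field values for three replicas.
print(triage_replica({"role": "slave", "master_link_status": "up",
                      "slave_repl_offset": "9000"}, master_offset=9100))
print(triage_replica({"role": "slave", "master_link_status": "down",
                      "master_link_down_since_seconds": "42"}, master_offset=9100))
print(triage_replica({"role": "slave", "master_link_status": "up",
                      "slave_repl_offset": "0"}, master_offset=5_000_000))
```

A "link-down" result points to the network and backlog checks, while "lagging" with an healthy link suggests a slow replica or a saturated link.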
Performance & Capacity Planning
Run redis-benchmark to obtain baseline QPS. In production, provision only 25‑50 % of the measured maximum to leave headroom for latency spikes and background tasks.
Memory sizing steps:
Estimate data growth (current size × 1.5 for growth × 1.5 as a safety margin).
Reserve 30 % of RAM for repl-backlog, client buffers, and lazy‑free overhead.
Set maxmemory to ≤ 80 % of physical RAM (e.g., 50 GiB on a 64 GiB machine).
Network bandwidth: each client at 100 bytes per command and 10 K QPS consumes ~1 MiB/s. Ensure NICs of at least 10 Gbps and separate VLANs for master‑slave traffic.
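The sizing arithmetic above can be sketched in a few lines of Python. The function names are illustrative, the 20 GiB dataset is a made‑up input, and reading the two 1.5 factors as "growth" and "safety margin" is an interpretation of the memory sizing step.

```python
def plan_memory(current_data_gib: float, physical_ram_gib: float,
                growth: float = 1.5, safety: float = 1.5,
                maxmemory_ratio: float = 0.8) -> dict[str, object]:
    """Project the dataset size and compare it with the maxmemory ceiling."""
    projected = current_data_gib * growth * safety
    ceiling = physical_ram_gib * maxmemory_ratio
    return {
        "projected_gib": projected,
        "maxmemory_ceiling_gib": ceiling,
        "fits": projected <= ceiling,
    }


def bandwidth_mib_per_s(bytes_per_command: int, qps: int) -> float:
    """Payload bandwidth for one client stream, ignoring protocol overhead."""
    return bytes_per_command * qps / (1024 * 1024)


plan = plan_memory(current_data_gib=20, physical_ram_gib=64)
print(plan)  # projected 45 GiB against a ceiling of about 51.2 GiB
print(round(bandwidth_mib_per_s(100, 10_000), 2))  # prints: 0.95
```

A 20 GiB dataset therefore still fits a 64 GiB machine after the 2.25× projection, which is consistent with the 50 GiB maxmemory example.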
Security Hardening
Use ACLs instead of the legacy requirepass:
# Create a read‑only user
ACL SETUSER readonly on >readonlypass ~* &* +@read
# Create a write user limited to the application keyspace
ACL SETUSER appuser on >apppass ~app:* &* +@read +@write -@dangerous
# Disable the default user
ACL SETUSER default off
Network hardening:
Bind Redis to internal IPs only (e.g., bind 10.20.0.11).
Use firewall rules to allow traffic only from trusted subnets.
Optionally enable TLS (port 6380) with a self‑signed certificate.
Rename or disable dangerous commands ( FLUSHALL, CONFIG, DEBUG, SHUTDOWN, KEYS).
Upgrade & Rollback Procedures
ACLs, RESP3, and multi‑threaded I/O were introduced in Redis 6 and carry over into Redis 7. Upgrade path:
Upgrade to Redis 6.x first (preserves RDB compatibility).
Migrate requirepass to ACL users.
Upgrade to Redis 7.x, test INFO output, and verify that Sentinel still discovers the master.
Rollback steps (e.g., from 7.0.5 to 7.0.4): stop the affected instance, restore data directory from backup, start the older binary, and run SENTINEL reset redis-ha so sentinels rediscover the node.
Operational Runbook & Checklist
Typical runbook actions (monthly):
Announce maintenance window.
Backup redis.conf and data directory.
Execute redis-cli -p 26379 SENTINEL failover redis-ha to test automatic promotion.
Verify new master with SENTINEL get-master-addr-by-name and check INFO replication on all slaves.
Monitor redis_up and redis_connected_slaves for at least 5 minutes.
Document any deviations and update the checklist.
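The verification step of the runbook can be automated by polling Sentinel until the reported master address changes. The Python sketch below takes the lookup as an injected callable, so it runs without a live cluster. The function name, its parameters, and the addresses in the demo are hypothetical.

```python
import time
from typing import Callable, Optional, Tuple

Address = Tuple[str, int]


def wait_for_new_master(
    resolve: Callable[[], Address],
    old_master: Address,
    timeout_s: float = 30.0,
    interval_s: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Address]:
    """Poll resolve() until it returns an address other than old_master.

    Returns the new address, or None if the timeout expires first.
    """
    deadline = clock() + timeout_s
    while True:
        current = resolve()
        if current != old_master:
            return current
        if clock() >= deadline:
            return None
        sleep(interval_s)


# Demo: the fake resolver reports the old master twice, then the promoted replica.
answers = iter([("127.0.0.1", 6379), ("127.0.0.1", 6379), ("127.0.0.1", 6380)])
new_master = wait_for_new_master(lambda: next(answers), ("127.0.0.1", 6379),
                                 sleep=lambda s: None)
print(new_master)  # prints: ('127.0.0.1', 6380)
```

With redis‑py, resolve could be lambda: sentinel.discover_master('redis-ha'), which returns a (host, port) pair. A None result after SENTINEL failover means promotion did not complete within the timeout and the Sentinel logs should be checked.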
Pre‑deployment checklist (excerpt):
- repl-backlog-size >= 1GiB
- min-replicas-to-write 1 & min-replicas-max-lag 10
- requirepass and masterauth are identical
- Deploy 3 or 5 Sentinel nodes with quorum = N/2 + 1
- Clients use Lettuce / go‑redis v9 / redis‑py (Sentinel aware)
- Prometheus + redis_exporter and Sentinel metrics are scraped
- Alert thresholds tuned to business SLAs
- Perform at least one manual failover test
- Verify firewall allows 6379 (Redis) and 26379 (Sentinel) traffic only between trusted hosts
- Disable transparent_hugepage and set vm.overcommit_memory=1
- Enable AOF everysec and ensure disk has 2× free space
Following this guide ensures a robust, observable, and secure Redis HA deployment ready for production workloads.
MaGe Linux Operations