How to Build a Highly Available Redis Service with Sentinel – Step‑by‑Step Guide
This article explains why Redis needs high availability, defines the failure scenarios to plan for, compares common HA solutions, and walks through four deployment patterns, from a single instance to a three-Sentinel architecture, with their trade-offs and practical implementation details.
Redis is the most widely used in‑memory key‑value store for web applications, commonly handling session storage, caching hot data, simple message queues (LPUSH/BRPOP), and Pub/Sub systems.
Large internet companies typically expose Redis as a foundational service for internal teams.
Any provider must answer whether the service is highly available (HA); that is, it should continue serving or recover quickly after failures. Three typical failure types are:
Exception 1: A Redis process on a node crashes unexpectedly.
Exception 2: An entire node goes down (power loss, hardware failure).
Exception 3: Network communication between two nodes is broken.
Each of these exceptions is a low-probability event, so the HA design treats the chance of several occurring at the same time as negligible. The goal is to tolerate any single failure for a short period.
Common HA solutions include Keepalived, Codis, Twemproxy, and Redis Sentinel. For modest data volumes, the author chose the official Redis Sentinel solution.
Redis Sentinel monitors Redis instances and automatically promotes a replica to master when the current master fails, making the failure transparent to clients.
Solution 1: Single‑Node Redis (No Sentinel)
A single Redis instance is suitable for personal projects or development environments where the client connects directly to the server. This setup has a single point of failure: if the process or host crashes, the service becomes unavailable and any data not yet persisted to disk is lost.
Solution 2: Master‑Slave with One Sentinel
To eliminate the single‑point failure, a replica (slave) is added on a second server, and a Sentinel process monitors both instances. If the master fails, Sentinel promotes the slave to master. Clients query Sentinel to discover the current master before issuing commands. However, the Sentinel itself is a single point of failure; if it crashes, clients cannot obtain master information.
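The discovery step can be sketched as follows. This is a minimal illustration, not the author's client code: the `ask` callable, the addresses, and the master name `mymaster` are all assumptions made for the example. In a real client, `ask` would send `SENTINEL get-master-addr-by-name <master-name>` to the Sentinel; redis-py exposes the same idea through `redis.sentinel.Sentinel` and its `discover_master` method.

```python
from typing import Callable, Iterable, Optional, Tuple

Address = Tuple[str, int]


def discover_master(
    sentinels: Iterable[Address],
    ask: Callable[[Address, str], Optional[Address]],
    master_name: str = "mymaster",  # hypothetical master name
) -> Address:
    """Return the master address reported by the first Sentinel that answers."""
    for sentinel in sentinels:
        try:
            master = ask(sentinel, master_name)
        except OSError:
            # This Sentinel is unreachable; try the next one in the list.
            continue
        if master is not None:
            return master
    raise ConnectionError(
        f"no Sentinel could report a master for {master_name!r}"
    )
```

With only one Sentinel in the list, a single unreachable Sentinel makes discovery fail, which is exactly the weakness of this solution.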
Solution 3: Master‑Slave with Two Sentinels
Running two Sentinel processes on separate machines lets clients contact either one, so losing a single Sentinel process no longer blocks master discovery. Failover, however, requires more than 50% of the Sentinels (a strict majority) to be reachable. If one whole server, and the Sentinel on it, goes down or is cut off from the network, only one of the two Sentinels remains. That is exactly 50%, not a majority, so no failover occurs. Relaxing the rule to allow failover at 50% or less would let each side of a partition promote its own master, causing split-brain and data inconsistency.
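The majority rule is simple arithmetic, shown here as a small sketch. The function name is made up for illustration and is not part of Redis or Sentinel.

```python
def has_failover_majority(total_sentinels: int, reachable_sentinels: int) -> bool:
    """True only when strictly more than half of all Sentinels are reachable."""
    return 2 * reachable_sentinels > total_sentinels


# Two Sentinels, one lost: 1 of 2 is exactly half, so no failover.
two_sentinel_case = has_failover_majority(total_sentinels=2, reachable_sentinels=1)

# Three Sentinels, one lost: 2 of 3 is a majority, so failover can proceed.
three_sentinel_case = has_failover_majority(total_sentinels=3, reachable_sentinels=2)
```

Going from one Sentinel to two therefore improves master discovery but not failover: both Sentinels must still be up for a promotion to happen.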
Solution 4: Master‑Slave with Three Sentinels
Adding a third server with an additional Sentinel yields three Sentinels managing two Redis instances. This configuration keeps serving through the failure of any single process, the failure of any single machine, or a broken network link between any two machines. Optionally, a third Redis instance can be added to form a 1-master + 2-slave topology for extra redundancy, though each additional slave increases replication overhead.
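The single-machine tolerance claim can be checked by enumerating the failures. The sketch below assumes a layout in which machines A and B each run one Redis instance plus one Sentinel and machine C runs a Sentinel only; the machine names and the helper function are hypothetical.

```python
from typing import Dict

Layout = Dict[str, Dict[str, bool]]

# Assumed layout for Solution 4: three Sentinels, two Redis instances.
THREE_SENTINELS: Layout = {
    "A": {"redis": True, "sentinel": True},   # initial master
    "B": {"redis": True, "sentinel": True},   # slave
    "C": {"redis": False, "sentinel": True},  # Sentinel only
}

# Layout for Solution 3, for comparison: two Sentinels, two Redis instances.
TWO_SENTINELS: Layout = {
    "A": {"redis": True, "sentinel": True},
    "B": {"redis": True, "sentinel": True},
}


def survives_loss_of(lost: str, layout: Layout) -> bool:
    """Check that a Sentinel majority and one Redis instance outlive `lost`."""
    alive = [roles for name, roles in layout.items() if name != lost]
    total_sentinels = sum(roles["sentinel"] for roles in layout.values())
    alive_sentinels = sum(roles["sentinel"] for roles in alive)
    redis_left = any(roles["redis"] for roles in alive)
    return 2 * alive_sentinels > total_sentinels and redis_left


three = {name: survives_loss_of(name, THREE_SENTINELS) for name in THREE_SENTINELS}
two = {name: survives_loss_of(name, TWO_SENTINELS) for name in TWO_SENTINELS}
```

Under these assumptions every entry in `three` is true and every entry in `two` is false, which is why the third Sentinel matters even though it hosts no data.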
When the server hosting the master loses network connectivity, the remaining Sentinels promote the surviving slave, so two masters exist temporarily. Any writes the isolated old master accepts during the outage are discarded when it rejoins as a slave. To limit this loss, Redis can be configured with min-slaves-to-write and min-slaves-max-lag (renamed min-replicas-to-write and min-replicas-max-lag in Redis 5) so that a master stops accepting writes if it cannot confirm enough healthy replicas.
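The effect of the two settings can be modeled in a few lines. This is a simplified sketch of the decision rule, not Redis source code; the function and parameter names are invented for the example.

```python
from typing import Sequence


def master_accepts_writes(
    replica_lags_seconds: Sequence[float],
    min_replicas_to_write: int,
    min_replicas_max_lag: float,
) -> bool:
    """Model the write gate: enough replicas must have reported recently."""
    # A replica counts as healthy if its last acknowledgement is recent enough.
    healthy = sum(
        1 for lag in replica_lags_seconds if lag <= min_replicas_max_lag
    )
    return healthy >= min_replicas_to_write


# Normal operation: one replica acknowledged 1 second ago.
normal = master_accepts_writes([1.0], min_replicas_to_write=1, min_replicas_max_lag=10)

# Isolated old master: its only replica was last heard from 30 seconds ago.
isolated = master_accepts_writes([30.0], min_replicas_to_write=1, min_replicas_max_lag=10)
```

With values like these (chosen only for illustration), the isolated master starts rejecting writes roughly ten seconds into the partition, which bounds how much data can be lost.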
For client simplicity, a virtual IP (VIP) can be assigned to the current master. A failover script moves the VIP to the new master, allowing clients to continue using a single IP and port as if they were connecting to a standalone Redis instance.
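One way to build such a script is on top of Sentinel's `client-reconfig-script` hook, which calls the script with the arguments `<master-name> <role> <state> <from-ip> <from-port> <to-ip> <to-port>`. The sketch below only decides which `ip addr` command the local machine should run; the VIP, the interface name, and the assumption that a Sentinel runs on each Redis host are illustrative and not taken from the author's setup.

```python
from typing import List

VIP_CIDR = "10.0.0.100/24"  # hypothetical virtual IP
INTERFACE = "eth0"          # hypothetical network interface


def vip_commands(script_args: List[str], local_ip: str) -> List[List[str]]:
    """Return the commands this host should run after a failover."""
    # Argument order as passed by Sentinel to client-reconfig-script.
    master_name, role, state, from_ip, from_port, to_ip, to_port = script_args
    if to_ip == local_ip:
        # This host holds the new master: claim the VIP.
        return [["ip", "addr", "add", VIP_CIDR, "dev", INTERFACE]]
    if from_ip == local_ip:
        # This host held the old master: release the VIP.
        return [["ip", "addr", "del", VIP_CIDR, "dev", INTERFACE]]
    return []


example_args = ["mymaster", "leader", "failover",
                "10.0.0.1", "6379", "10.0.0.2", "6379"]
on_new_master = vip_commands(example_args, local_ip="10.0.0.2")
on_old_master = vip_commands(example_args, local_ip="10.0.0.1")
```

A real script would pass each returned command to `subprocess.run(cmd, check=True)` and would usually also announce the move to the network (for example with a gratuitous ARP) so that switches and clients update promptly.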
In production, the author also runs supervisor to monitor Redis and Sentinel processes, automatically restarting them on unexpected exits.
Overall, building a highly available Redis service requires moving from a single instance to a multi‑Sentinel architecture, carefully considering quorum rules, network partitions, and client connection simplicity.
