Designing a High‑Availability Redis Service with Sentinel: Architecture and Implementation
This article explains how to build a highly available Redis service using Redis Sentinel, covering failure scenarios, design principles, four architectural options, the final three‑Sentinel solution with virtual IP, and practical operational tips for ensuring continuous service.
In modern web development, Redis, an in‑memory key‑value store, is widely used for session storage, hot‑data caching, simple message queues, and Pub/Sub messaging. Large internet companies typically expose Redis as a foundational service to internal applications.
When providing such a service, callers inevitably ask whether it is highly available. High availability (HA) means the service can continue operating or recover quickly after various failures, such as a single process crash, an entire node outage, or a network partition.
Consider three typical failure scenarios:
Exception 1: A Redis process on a node is killed.
Exception 2: An entire server goes down (power loss or hardware failure).
Exception 3: Communication between two nodes is broken.
HA design assumes that the probability of multiple independent failures occurring simultaneously is negligible; the system should tolerate a single‑point failure for a short period.
Several HA solutions exist (Keepalived, Codis, Twemproxy, Redis Sentinel). For a small‑scale deployment the author chose the official Redis Sentinel solution.
Solution 1: Single‑instance Redis without Sentinel
This setup is suitable only for learning or personal projects; a single point of failure means the service becomes unavailable if the Redis process or host crashes, and data may be lost without persistence.
Solution 2: Master‑Slave Redis with a single Sentinel
A master provides service while a slave replicates data, and a Sentinel monitors both instances to promote the slave if the master fails. Clients query Sentinel to discover the current master. However, the Sentinel itself is a single point of failure, so this architecture does not achieve true HA.
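The client-side discovery step can be sketched as follows. This is a minimal, library-free illustration of the lookup logic (in practice a client library such as redis-py issues SENTINEL get-master-addr-by-name over the Redis protocol); the addresses and the injected query function are hypothetical:

```python
def discover_master(sentinels, query):
    """Ask each Sentinel in turn for the current master's address.

    `sentinels` is a list of (host, port) pairs for the Sentinel
    processes; `query` performs SENTINEL get-master-addr-by-name
    against one Sentinel and returns (host, port), or raises
    ConnectionError if that Sentinel is unreachable.
    """
    for addr in sentinels:
        try:
            return query(addr)
        except ConnectionError:
            continue  # this Sentinel is down; fall back to the next one
    raise RuntimeError("no Sentinel reachable")


# Simulated deployment: the first Sentinel is down, the second answers.
def fake_query(addr):
    if addr == ("10.0.0.1", 26379):
        raise ConnectionError("sentinel unreachable")
    return ("10.0.0.2", 6379)  # current master as reported by Sentinel

sentinels = [("10.0.0.1", 26379), ("10.0.0.2", 26379)]
print(discover_master(sentinels, fake_query))  # → ('10.0.0.2', 6379)
```

With a single Sentinel, the loop has only one entry, which is exactly why Solution 2 still has a single point of failure.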
Solution 3: Master‑Slave Redis with two Sentinel instances
Adding a second Sentinel allows clients to fall back to the other Sentinel if one fails. Nevertheless, if an entire server goes down (e.g., server 1, hosting both the master and one Sentinel), only one Sentinel remains. A failover must be authorized by a majority of all known Sentinels, and one out of two is not a majority, so the surviving Sentinel cannot elect a new master and the service stays unavailable.
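The majority arithmetic behind this limitation is simple enough to write down. The sketch below is illustrative only, but it captures why two Sentinels are no better than one when a whole server dies, while three Sentinels survive the same failure:

```python
def majority(total_sentinels: int) -> int:
    """Smallest count that is strictly more than half of all Sentinels."""
    return total_sentinels // 2 + 1

def failover_possible(reachable: int, total: int) -> bool:
    # A Sentinel must be voted leader by a majority of ALL known
    # Sentinels (not just the currently reachable ones) before it
    # is allowed to promote a slave.
    return reachable >= majority(total)

print(failover_possible(1, 2))  # False: Solution 3 after server 1 dies
print(failover_possible(2, 3))  # True:  Solution 4 after any one node dies
```

This is why the final architecture uses an odd number of Sentinels spread across three machines.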
Solution 4: Master‑Slave Redis with three Sentinel instances (final architecture)
By introducing a third server running an additional Sentinel, the system can tolerate any single‑node failure, any two‑node network partition, or any single Sentinel crash while still providing service. Optionally, a third Redis instance can be added to form a 1‑master + 2‑slave topology for extra data redundancy.
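A minimal sentinel.conf for this topology might look like the fragment below; every value (master name, addresses, timeouts) is illustrative and would be tuned per deployment. Each of the three Sentinel nodes runs an identical copy:

```
# sentinel.conf (one copy per Sentinel node; values are illustrative)
port 26379
# Monitor the master named "mymaster"; the trailing 2 is the quorum:
# at least 2 Sentinels must agree the master is subjectively down
# before it is marked objectively down and a failover can start.
sentinel monitor mymaster 10.0.0.1 6379 2
# How long the master must be unresponsive before it is considered down.
sentinel down-after-milliseconds mymaster 5000
# Abort and retry a failover that has not completed within this window.
sentinel failover-timeout mymaster 60000
# Resynchronize slaves with the new master one at a time.
sentinel parallel-syncs mymaster 1
```

Sentinels discover each other and the slaves automatically through the master, so only the master's address needs to be listed.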
To make client usage as simple as a single‑node Redis, a virtual IP (VIP) can be assigned to the current master. When a failover occurs, a callback script moves the VIP to the new master, allowing clients to continue connecting to the same IP and port.
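One way to implement the VIP handover is Sentinel's client-reconfig-script hook, which Sentinel invokes on every node after a failover with the arguments <master-name> <role> <state> <from-ip> <from-port> <to-ip> <to-port>. The script below is a hedged sketch, not a hardened implementation: the VIP, interface, and script path are hypothetical, and it assumes the ip and arping utilities are available with sufficient privileges.

```shell
#!/bin/sh
# Registered in sentinel.conf as (path is illustrative):
#   sentinel client-reconfig-script mymaster /opt/redis/failover.sh
VIP=10.0.0.100/24      # virtual IP clients connect to (assumption)
IFACE=eth0             # interface carrying the VIP (assumption)
NEW_MASTER_IP=$6       # <to-ip>: address of the newly promoted master

if hostname -I | grep -qw "$NEW_MASTER_IP"; then
    # This node is the new master: claim the VIP and announce it.
    ip addr add "$VIP" dev "$IFACE"
    arping -c 3 -U -I "$IFACE" "${VIP%/*}"  # refresh neighbors' ARP caches
else
    # This node is no longer the master: release the VIP if held.
    ip addr del "$VIP" dev "$IFACE" 2>/dev/null
fi
```

The gratuitous ARP step matters in practice: without it, peers on the same LAN may keep routing the VIP to the dead master until their ARP entries expire.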
In practice, the author also runs a supervisor to monitor processes and automatically restart them on crash, ensuring the HA Redis service remains operational.
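The process supervision described above can be expressed as a short supervisord fragment; program names and paths here are illustrative:

```
; supervisord.conf snippet (paths and names are illustrative)
[program:redis]
command=/usr/local/bin/redis-server /etc/redis/redis.conf
autostart=true
autorestart=true

[program:sentinel]
command=/usr/local/bin/redis-server /etc/redis/sentinel.conf --sentinel
autostart=true
autorestart=true
```

Note that supervisor handles process crashes (Exception 1), while Sentinel handles node and network failures (Exceptions 2 and 3); the two mechanisms are complementary, not redundant.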
Overall, building a highly available Redis service involves understanding failure modes, selecting the appropriate Sentinel quorum, and optionally using a virtual IP to hide the complexity from clients.