How Redis Cluster Uses Gossip to Keep Nodes in Sync
This article explains why Redis clusters need distributed metadata, compares centralized and decentralized metadata storage, introduces the epidemic‑style Gossip protocol, and details how Redis Cluster implements Gossip messaging, node discovery, failure detection, and state synchronization through concrete code examples.
Cluster Mode and Gossip Overview
When a data store’s traffic or dataset exceeds the capacity of a single instance, a distributed solution is required. Redis adopts clustering to provide high availability, higher QPS, larger data volumes, and to avoid NIC bandwidth limits.
A single node cannot guarantee high availability; multiple instances are needed.
A single node handles roughly 80K QPS (the exact figure depends on hardware and workload); higher loads require more instances.
Data size per instance is limited; scaling out needs additional nodes.
Network traffic may exceed a server’s NIC capacity, requiring sharding.
Cluster metadata (instance IPs, slot mappings, etc.) can be stored either centrally (e.g., in ZooKeeper) or in a decentralized fashion, where each node holds part or all of the metadata and continuously exchanges updates with its peers. Redis Cluster uses the decentralized model.
Decentralized metadata can be synchronized with Paxos, Raft, or Gossip. Gossip differs from Paxos/Raft because it does not require a majority of nodes to be alive; instead, nodes randomly exchange state information until the whole cluster converges.
In a bounded network, if every node randomly exchanges a piece of information with another node, the knowledge across the cluster eventually converges to a consistent view.
Gossip (epidemic) spreads state, slot assignments, and master‑slave relationships via random, infection‑style exchanges, providing eventual consistency, elastic scaling, fast failure detection, and load‑balanced communication.
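The convergence claim above can be illustrated with a toy simulation. This is only a sketch of generic push-style gossip, not Redis code: the function name gossip_rounds_to_converge, the node limit, and the small built-in random generator are all assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SIM_MAX_NODES 1024   /* arbitrary limit for this sketch */

/* Tiny deterministic PRNG (linear congruential) so runs are reproducible. */
static uint32_t sim_next(uint32_t *state) {
    *state = *state * 1664525u + 1013904223u;
    return *state >> 16;     /* use the higher, better-mixed bits */
}

/* Push-style gossip: in every round each informed node tells one random
 * peer. Returns the number of rounds until all n nodes are informed, or
 * -1 if max_rounds is reached first. */
int gossip_rounds_to_converge(int n, uint32_t seed, int max_rounds) {
    unsigned char informed[SIM_MAX_NODES] = {0};
    unsigned char next[SIM_MAX_NODES];
    assert(n >= 1 && n <= SIM_MAX_NODES);

    informed[0] = 1;         /* node 0 starts with the update */
    int count = 1;
    if (count == n) return 0;

    for (int round = 1; round <= max_rounds; round++) {
        memcpy(next, informed, sizeof(next));
        for (int i = 0; i < n; i++) {
            if (!informed[i]) continue;
            int peer = (int)(sim_next(&seed) % (uint32_t)n);
            if (!next[peer]) { next[peer] = 1; count++; }
        }
        memcpy(informed, next, sizeof(informed));
        if (count == n) return round;
    }
    return -1;
}
```

Because the informed set can at most double per round, a 64-node run needs at least six rounds; in practice the count stays close to a small multiple of log2(n), which is why gossip spreads quickly even without a coordinator.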
Redis Cluster Gossip Communication Mechanism
Redis Cluster (introduced in version 3.0) uses Gossip so that all nodes learn each other’s status. Each node maintains a local view containing:
Current cluster state.
Slot ownership and migration status.
Master‑slave relationships.
Node liveness and suspected FAIL status.
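Slot ownership in this local view is compact: Redis Cluster has 16384 hash slots, and the slots[CLUSTER_SLOTS/8] fields shown later in this article store one bit per slot. The following is a minimal sketch of such a bitmap; the slot_bitmap type and its helper names are hypothetical and are not taken from the Redis source.

```c
#include <assert.h>
#include <string.h>

#define SLOT_COUNT 16384                   /* hash slots in Redis Cluster */

/* Hypothetical wrapper around a one-bit-per-slot ownership map (2 KB). */
typedef struct {
    unsigned char bits[SLOT_COUNT / 8];
} slot_bitmap;

void slot_bitmap_clear(slot_bitmap *b) {
    memset(b->bits, 0, sizeof(b->bits));
}

/* Mark a slot as owned. Setting an already-set bit is a no-op. */
void slot_bitmap_set(slot_bitmap *b, int slot) {
    assert(slot >= 0 && slot < SLOT_COUNT);
    b->bits[slot >> 3] |= (unsigned char)(1u << (slot & 7));
}

/* Return 1 if the slot is owned, 0 otherwise. */
int slot_bitmap_test(const slot_bitmap *b, int slot) {
    assert(slot >= 0 && slot < SLOT_COUNT);
    return (b->bits[slot >> 3] >> (slot & 7)) & 1;
}

/* Count owned slots (the role numslots plays in clusterNode). */
int slot_bitmap_count(const slot_bitmap *b) {
    int n = 0;
    for (int s = 0; s < SLOT_COUNT; s++) n += slot_bitmap_test(b, s);
    return n;
}
```

A bitmap keeps the slot map at a fixed 2 KB per node, which is small enough to ship in every heartbeat header.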
Message types exchanged are defined in cluster.h (Redis 4.0):
#define CLUSTERMSG_TYPE_PING 0 /* Ping message */
#define CLUSTERMSG_TYPE_PONG 1 /* Pong reply */
#define CLUSTERMSG_TYPE_MEET 2 /* Meet request */
#define CLUSTERMSG_TYPE_FAIL 3 /* Fail notification */
#define CLUSTERMSG_TYPE_PUBLISH 4 /* Pub/Sub broadcast */
#define CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST 5
#define CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK 6
#define CLUSTERMSG_TYPE_UPDATE 7
#define CLUSTERMSG_TYPE_MFSTART 8
#define CLUSTERMSG_TYPE_COUNT 9
Through these messages, every instance can learn about new nodes, node failures, and slot changes, keeping the cluster state consistent.
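When reading logs or packet dumps it helps to turn these numeric types back into names and to know which ones carry gossip entries (PING, PONG, and MEET share the same payload layout). The helpers below are a sketch for illustration; msg_type_name and msg_type_has_gossip are hypothetical names, and the macros are repeated only to keep the example self-contained.

```c
#define CLUSTERMSG_TYPE_PING 0
#define CLUSTERMSG_TYPE_PONG 1
#define CLUSTERMSG_TYPE_MEET 2
#define CLUSTERMSG_TYPE_FAIL 3
#define CLUSTERMSG_TYPE_PUBLISH 4
#define CLUSTERMSG_TYPE_FAILOVER_AUTH_REQUEST 5
#define CLUSTERMSG_TYPE_FAILOVER_AUTH_ACK 6
#define CLUSTERMSG_TYPE_UPDATE 7
#define CLUSTERMSG_TYPE_MFSTART 8
#define CLUSTERMSG_TYPE_COUNT 9

/* Labels chosen for this sketch; indexed by message type. */
static const char *const msg_type_names[CLUSTERMSG_TYPE_COUNT] = {
    "ping", "pong", "meet", "fail", "publish",
    "failover-auth-request", "failover-auth-ack", "update", "mfstart"
};

/* Map a numeric type to a label, or "unknown" if out of range. */
const char *msg_type_name(int type) {
    if (type < 0 || type >= CLUSTERMSG_TYPE_COUNT) return "unknown";
    return msg_type_names[type];
}

/* PING, PONG and MEET are the heartbeat messages that carry gossip entries. */
int msg_type_has_gossip(int type) {
    return type == CLUSTERMSG_TYPE_PING ||
           type == CLUSTERMSG_TYPE_PONG ||
           type == CLUSTERMSG_TYPE_MEET;
}
```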
Periodic PING/PONG
Each node periodically selects a random peer and sends a PING. The receiver replies with a PONG, allowing both sides to update their view of each other’s liveness and slot map.
Node Join Procedure
When CLUSTER MEET ip port is issued to an existing node, that node creates a clusterNode entry for the newcomer and sends it a MEET message. The new node replies with a PONG, after which the original node sends a regular PING to complete the handshake. Subsequent periodic PINGs propagate the newcomer’s presence to the rest of the cluster.
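Seen from the existing node, the join sequence described above is a two-step state machine. The sketch below models only that description; the state, event, and action names are hypothetical, and the real implementation tracks the handshake through node flags rather than an explicit enum.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { HS_IDLE, HS_MEET_SENT, HS_ESTABLISHED } hs_state;
typedef enum { EV_MEET_COMMAND, EV_PONG_RECEIVED } hs_event;
typedef enum { ACT_NONE, ACT_SEND_MEET, ACT_SEND_PING } hs_action;

/* Advance the handshake for one event. The message to send, if any, is
 * written to *action. Events that do not fit the current state are ignored. */
hs_state handshake_step(hs_state s, hs_event ev, hs_action *action) {
    assert(action != NULL);
    *action = ACT_NONE;
    if (s == HS_IDLE && ev == EV_MEET_COMMAND) {
        *action = ACT_SEND_MEET;     /* CLUSTER MEET: introduce ourselves */
        return HS_MEET_SENT;
    }
    if (s == HS_MEET_SENT && ev == EV_PONG_RECEIVED) {
        *action = ACT_SEND_PING;     /* newcomer answered: finish with a PING */
        return HS_ESTABLISHED;
    }
    return s;
}
```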
Failure Detection and Marking
If a node has been waiting for a PONG for longer than cluster-node-timeout/2, it closes the TCP link and reconnects so that a fresh PING can be sent. If no PONG arrives within the full cluster-node-timeout, the peer is marked PFAIL (possibly failing). PFAIL reports spread through gossip; once a majority of master nodes report the same peer as PFAIL, it is promoted to FAIL and the failure is broadcast to the cluster.
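The three thresholds in this paragraph reduce to simple comparisons. The following is a sketch under stated assumptions: the function names are hypothetical, ping_sent == 0 is taken to mean that no PING is outstanding, and the quorum is computed as a strict majority of the masters that vote.

```c
#include <assert.h>
#include <stdint.h>

typedef int64_t ms_time;

/* Waiting longer than half the timeout: drop the link and reconnect. */
int link_should_reconnect(ms_time now, ms_time ping_sent, ms_time node_timeout) {
    return ping_sent != 0 && now - ping_sent > node_timeout / 2;
}

/* Waiting longer than the full timeout: mark the peer PFAIL locally. */
int node_is_pfail(ms_time now, ms_time ping_sent, ms_time node_timeout) {
    return ping_sent != 0 && now - ping_sent > node_timeout;
}

/* Strict majority of the voting masters. */
int fail_quorum(int voting_masters) {
    assert(voting_masters > 0);
    return voting_masters / 2 + 1;
}

/* Promote PFAIL to FAIL once enough masters agree. The local node's own
 * opinion counts only if it is a master itself. */
int node_should_be_failed(int reports_from_other_masters,
                          int self_is_master, int voting_masters) {
    int failures = reports_from_other_masters + (self_is_master ? 1 : 0);
    return failures >= fail_quorum(voting_masters);
}
```

Separating the local PFAIL decision from the quorum-based FAIL decision keeps one node's network trouble from evicting a healthy peer.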
Key Data Structures
Each instance holds a clusterState structure:
typedef struct clusterState {
    clusterNode *myself;
    dict *nodes; /* name → clusterNode */
    clusterNode *slots[CLUSTER_SLOTS];
    /* … other fields … */
} clusterState;
Important fields of clusterNode include creation time, name, flags, config epoch, slot bitmap, master/slave pointers, ping/pong timestamps, IP/port, and a clusterLink representing the TCP connection.
typedef struct clusterNode {
    mstime_t ctime; /* creation time */
    char name[CLUSTER_NAMELEN];
    int flags; /* role & status */
    uint64_t configEpoch; /* node’s epoch */
    unsigned char slots[CLUSTER_SLOTS/8];
    int numslots;
    int numslaves;
    struct clusterNode **slaves;
    struct clusterNode *slaveof;
    mstime_t ping_sent;
    mstime_t pong_received;
    mstime_t fail_time;
    char ip[NET_IP_STR_LEN];
    int port;
    int cport;
    clusterLink *link;
    list *fail_reports;
} clusterNode;
The wire format is defined by clusterMsg, which contains a header (signature, version, type, sender ID, IP, ports, flags, epochs, offset) and a union clusterMsgData that holds the specific payload (PING, PONG, MEET, FAIL, etc.).
typedef struct {
    char sig[4]; /* "RCmb" */
    uint32_t totlen;
    uint16_t ver;
    uint16_t port;
    uint16_t type;
    uint16_t count;
    uint64_t currentEpoch;
    uint64_t configEpoch;
    uint64_t offset;
    char sender[CLUSTER_NAMELEN];
    unsigned char myslots[CLUSTER_SLOTS/8];
    char slaveof[CLUSTER_NAMELEN];
    char myip[NET_IP_STR_LEN];
    uint16_t cport;
    uint16_t flags;
    unsigned char state;
    unsigned char mflags[3];
    union clusterMsgData data;
} clusterMsg;
Core Functions
clusterCron() runs ten times per second (every 100 ms). Every ten iterations, that is once per second, it samples up to five random nodes, picks the one whose last PONG is oldest, and calls clusterSendPing() with CLUSTERMSG_TYPE_PING. It also creates links for nodes without a TCP connection, sends MEET / PING to new nodes, and checks for stale connections or timeouts to trigger failure handling.
if (!(iteration % 10)) {
    /* pick the sampled node that hasn’t replied for the longest time */
    if (min_pong_node)
        clusterSendPing(min_pong_node->link, CLUSTERMSG_TYPE_PING);
}
clusterSendPing() builds a clusterMsg header, selects a random subset of nodes to include as gossip entries (one tenth of the known nodes, but at least three), fills the payload, records the send time in the target node’s ping_sent field, and writes the packet to the socket.
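void clusterSendPing(clusterLink *link, int type) {
    int gossipcount = 0;
    /* candidates exclude this node and the receiver */
    int freshnodes = dictSize(server.cluster->nodes) - 2;
    int wanted = floor(dictSize(server.cluster->nodes) / 10);
    if (wanted < 3) wanted = 3;
    if (wanted > freshnodes) wanted = freshnodes;
    int maxiterations = wanted * 3; /* bound the random sampling loop */
    /* … allocation of buf and hdr = (clusterMsg *)buf elided … */
    /* remember when the PING was sent to this peer */
    if (link->node && type == CLUSTERMSG_TYPE_PING)
        link->node->ping_sent = mstime();
    /* build header */
    clusterBuildMessageHdr(hdr, type);
    /* select gossip nodes */
    while (freshnodes > 0 && gossipcount < wanted && maxiterations--) {
        dictEntry *de = dictGetRandomKey(server.cluster->nodes);
        clusterNode *this = dictGetVal(de);
        if (this == myself || (this->flags & CLUSTER_NODE_PFAIL)) continue;
        if (clusterNodeIsInGossipSection(hdr, gossipcount, this)) continue;
        clusterSetGossipEntry(hdr, gossipcount, this);
        gossipcount++;
        freshnodes--;
    }
    /* … hdr->count and totlen are filled in here (elided) … */
    /* send */
    clusterSendMessage(link, buf, totlen);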
}
clusterBuildMessageHdr() fills the fields common to every message: signature, version, type, sender ID, IP, ports, flags, epochs, and replication offset. It does not touch ping_sent: the function has no access to the link, so the caller, clusterSendPing(), records that timestamp for PING messages.
void clusterBuildMessageHdr(clusterMsg *hdr, int type) {
    memset(hdr, 0, sizeof(*hdr));
    hdr->sig[0] = 'R'; hdr->sig[1] = 'C'; hdr->sig[2] = 'm'; hdr->sig[3] = 'b';
    hdr->ver = htons(CLUSTER_PROTO_VER);
    hdr->type = htons(type);
    memcpy(hdr->sender, myself->name, CLUSTER_NAMELEN);
    /* other fields such as IP, ports, slots, flags are set here */
}
Putting It All Together
The combination of periodic PING/PONG heartbeats, random gossip dissemination, and explicit MEET / FAIL messages enables Redis Cluster to maintain a consistent view of node liveness and slot ownership without a central coordinator. Because each heartbeat carries only a bounded sample of gossip entries (roughly one tenth of the known nodes) rather than the full node table, the cost per message grows slowly with cluster size; the Redis Cluster documentation cites scalability up to about 1,000 nodes.
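The per-message gossip budget that clusterSendPing() computes can be isolated into a small helper to see how it grows with cluster size. This is a sketch rather than the Redis function itself: gossip_wanted and gossip_payload_bytes are hypothetical names, the clamp to zero for tiny clusters is an addition for safety, and the entry size is left as a parameter because it depends on the Redis version.

```c
#include <assert.h>
#include <stddef.h>

/* Number of gossip entries to attach to one PING/PONG/MEET, given how
 * many nodes the sender knows about (itself included). */
int gossip_wanted(int known_nodes) {
    assert(known_nodes >= 0);
    int freshnodes = known_nodes - 2;   /* exclude sender and receiver */
    if (freshnodes < 0) freshnodes = 0; /* clamp added for this sketch */
    int wanted = known_nodes / 10;      /* one tenth of the cluster... */
    if (wanted < 3) wanted = 3;         /* ...but at least three... */
    if (wanted > freshnodes) wanted = freshnodes; /* ...if that many exist */
    return wanted;
}

/* Bytes of gossip payload for one message, for a caller-supplied entry size. */
size_t gossip_payload_bytes(int known_nodes, size_t entry_bytes) {
    return (size_t)gossip_wanted(known_nodes) * entry_bytes;
}
```

With 1,000 known nodes the helper asks for 100 entries per heartbeat, so the payload scales with a tenth of the cluster rather than with the whole node table.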
ITPUB
