
Why Did Microservices Drop After ZooKeeper Restart? Session Mechanics & Fixes

A mistaken ZooKeeper restart caused a 30-minute outage of all microservices. This article analyzes the ZooKeeper (ZK) session mechanism, explains why the temporary (ephemeral) nodes were not recreated, and presents two concrete solutions and best-practice recommendations to prevent similar failures.

ITPUB

1. Symptom

At around 19:43, a mistaken operation stopped six of the seven ZooKeeper nodes, which left the cluster without a quorum and halted it. After the nodes were restarted at 19:51, the services initially kept running, but at 19:56 alarms reported massive call failures. The RPC framework attempted automatic recovery, but nothing recovered for about eight minutes.

2. Initial Analysis

Our RPC framework uses a typical registry center + provider + consumer model, in which services register themselves as temporary (ephemeral) ZNodes in ZooKeeper.

Stage 1: While the ZK cluster was stopped, consumers could not reach ZK, so they kept using their cached provider list and no errors occurred.

Stage 2: After the cluster started, consumers immediately performed service discovery, but providers had not re‑registered yet, resulting in an empty address list and batch errors.

Stage 3: For about 40 seconds after the cluster recovered, providers still did not “auto‑reconnect” to ZK, so consumers kept failing until all pods were restarted, which forced providers to re‑register.

The core problem is that after ZK recovery the RPC client does not retry creating temporary nodes, and ZK later deletes those nodes when the session expires.

3. Deep Investigation

3.1 Reproducing the Issue

Testing showed that a ZK session can expire in two ways: on the server side or on the client side. In the client-side case, recovering the cluster leads to the loss of temporary nodes, and nothing recovers automatically.

3.2 Root‑Cause Analysis

When the cluster recovers, the RPC client immediately reconnects and tries to recreate the provider and consumer temporary nodes.

The recreation fails with NodeExistsException because the nodes already exist from the previous session; the client swallows the exception, treating the operation as successful, so no retry occurs.

About 40 seconds after recovery, ZK removes the expired session’s temporary nodes (the server binds temporary nodes to the session ID and deletes them when the session times out).

Consumers receive an empty node list, clear their local provider cache, and the outage occurs.
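
The sequence above can be replayed with a small in-memory model. This is only a sketch: `ToyRegistry`, `NodeExistsError`, `register_swallowing_errors`, the session IDs, and the provider path are hypothetical stand-ins for the ZK server, ZooKeeper's NodeExistsException, and the RPC client's registration code, not the real implementation.

```python
class NodeExistsError(Exception):
    """Stand-in for ZooKeeper's NodeExistsException."""


class ToyRegistry:
    """In-memory stand-in for a ZK server: path -> owning session ID."""

    def __init__(self):
        self.owner_by_path = {}

    def create_ephemeral(self, session_id, path):
        if path in self.owner_by_path:
            raise NodeExistsError(path)
        self.owner_by_path[path] = session_id

    def expire_session(self, session_id):
        # The server deletes every temporary node bound to the expired session.
        self.owner_by_path = {
            p: s for p, s in self.owner_by_path.items() if s != session_id
        }

    def children(self, prefix):
        return sorted(p for p in self.owner_by_path if p.startswith(prefix))


def register_swallowing_errors(registry, session_id, path):
    """Buggy registration: an existing node is treated as success."""
    try:
        registry.create_ephemeral(session_id, path)
    except NodeExistsError:
        pass  # the node still belongs to the OLD session, but nobody checks


registry = ToyRegistry()
prefix = "/services/demo/providers"   # hypothetical path layout
provider = prefix + "/10.0.0.1:20880"

registry.create_ephemeral(1, provider)             # registered before the outage
register_swallowing_errors(registry, 2, provider)  # reconnect with a new session
before_expiry = registry.children(prefix)

registry.expire_session(1)                         # old session times out later
after_expiry = registry.children(prefix)

print(before_expiry)  # ['/services/demo/providers/10.0.0.1:20880']
print(after_expiry)   # []
```

The node looks healthy right after the reconnect, which is why the swallowed exception goes unnoticed until the old session expires.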

3.3 Zookeeper Session Mechanism

Both the client and the server track the session ID and keep the session alive with heartbeats. If a client sends a session ID that the server does not know, the server creates a new session. The client library (Curator) resets its local session ID to 0 when the session times out, so a new session is created on reconnection.

The server’s session manager periodically checks for expired sessions and deletes the associated temporary nodes.
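
A minimal sketch of that periodic check, assuming a single timeout for every session and plain integer seconds for time. `ToySessionTracker` and its methods are hypothetical names used for illustration and do not mirror ZooKeeper's internal classes.

```python
class ToySessionTracker:
    """Toy model of server-side session expiry; times are plain seconds."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.deadline_by_session = {}  # session ID -> expiry deadline
        self.paths_by_session = {}     # session ID -> temporary node paths

    def touch(self, session_id, now):
        # A heartbeat (or restoring the session at startup) extends the deadline.
        self.deadline_by_session[session_id] = now + self.timeout_s
        self.paths_by_session.setdefault(session_id, set())

    def add_ephemeral(self, session_id, path):
        self.paths_by_session.setdefault(session_id, set()).add(path)

    def expire(self, now):
        """Drop sessions whose deadline has passed; return the deleted paths."""
        deleted = []
        for sid, deadline in list(self.deadline_by_session.items()):
            if deadline <= now:
                deleted.extend(sorted(self.paths_by_session.pop(sid)))
                del self.deadline_by_session[sid]
        return deleted


tracker = ToySessionTracker(timeout_s=40)

tracker.touch(1, now=0)                  # old session restored at recovery
tracker.add_ephemeral(1, "/providers/a")
tracker.touch(2, now=0)                  # new client session
tracker.add_ephemeral(2, "/consumers/b")
tracker.touch(2, now=30)                 # only the new session keeps heartbeating

deleted_at_39 = tracker.expire(now=39)
deleted_at_40 = tracker.expire(now=40)

print(deleted_at_39)  # []
print(deleted_at_40)  # ['/providers/a']
```

The old session receives no heartbeats because its client has already moved to a new session ID, so its nodes vanish as soon as the timeout window closes.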

4. Summary of Root Causes

The recovered cluster reads the old snapshot, recreates the previous session, and renews it for 40 seconds.

The client’s session had already expired, so it reconnects with a new session ID; the RPC framework ignores the NodeExistsException, assuming registration succeeded.

After 40 seconds the server’s session timeout expires, and all temporary nodes bound to the old session are removed.

Consumers detect the empty node list and clear their provider cache, causing the failure.

5. Solutions

Solution 1: Increase the client (Curator) session timeout or disable expiration so the original session ID remains valid during the first 40 seconds after recovery.

Solution 2: Handle NodeExistsException on reconnection by deleting the stale node and recreating it, ensuring the new session registers successfully.

Dubbo adopts the second solution; we also apply a delete‑then‑create strategy to replace the stale ZNode.
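
The delete-then-create idea can be sketched against a toy in-memory registry. `ToyRegistry`, `NodeExistsError`, and `register_with_replace` are hypothetical names, and the ownership check is an added assumption so that a node already owned by the current session is left alone; it is not taken from Dubbo or from our framework's code. With a real client, the owner would be read from the node's stat (the ephemeral owner field).

```python
class NodeExistsError(Exception):
    """Stand-in for ZooKeeper's NodeExistsException."""


class ToyRegistry:
    """In-memory stand-in for a ZK server: path -> owning session ID."""

    def __init__(self):
        self.owner_by_path = {}

    def create_ephemeral(self, session_id, path):
        if path in self.owner_by_path:
            raise NodeExistsError(path)
        self.owner_by_path[path] = session_id

    def delete(self, path):
        self.owner_by_path.pop(path, None)

    def owner_of(self, path):
        return self.owner_by_path.get(path)

    def expire_session(self, session_id):
        self.owner_by_path = {
            p: s for p, s in self.owner_by_path.items() if s != session_id
        }


def register_with_replace(registry, session_id, path):
    """Create the node; if a stale node from another session exists, replace it."""
    try:
        registry.create_ephemeral(session_id, path)
    except NodeExistsError:
        if registry.owner_of(path) != session_id:  # assumption: check ownership
            registry.delete(path)
            registry.create_ephemeral(session_id, path)


registry = ToyRegistry()
provider = "/services/demo/providers/10.0.0.1:20880"  # hypothetical path

registry.create_ephemeral(1, provider)        # left over from the old session
register_with_replace(registry, 2, provider)  # reconnect with the new session
registry.expire_session(1)                    # old session times out later

print(registry.owner_of(provider))  # 2
```

Because the node is now bound to the new session, the later expiry of the old session no longer removes it.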

6. Best Practices

In addition to improving exception handling, RPC frameworks should implement “empty‑push protection”: when service discovery receives an empty node list, keep the previous cached list instead of clearing it.
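
A minimal sketch of empty-push protection on the consumer side. `ProviderCache` and `on_push` are hypothetical names, and the counter exists only so that ignored pushes can be surfaced in monitoring.

```python
class ProviderCache:
    """Consumer-side provider list that refuses to be emptied by a push."""

    def __init__(self):
        self.providers = []
        self.ignored_empty_pushes = 0

    def on_push(self, addresses):
        """Apply a discovery update, keeping the old list if the update is empty."""
        if not addresses and self.providers:
            self.ignored_empty_pushes += 1  # worth an alert, not a cache wipe
            return self.providers
        self.providers = list(addresses)
        return self.providers


cache = ProviderCache()
cache.on_push(["10.0.0.1:20880", "10.0.0.2:20880"])
after_empty = cache.on_push([])                   # registry reports no providers
after_update = cache.on_push(["10.0.0.3:20880"])  # normal updates still apply

print(after_empty)   # ['10.0.0.1:20880', '10.0.0.2:20880']
print(after_update)  # ['10.0.0.3:20880']
```

The trade-off is that a service whose providers have all been taken down on purpose keeps stale addresses in the cache, so this protection works best alongside connection-level failure detection.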

