Handling Connection Loss and Session Expiry in ZooKeeper-based Distributed Locks
When using ZooKeeper‑based distributed locks, applications must detect CONNECTIONLOSS and SESSIONEXPIRED events, verify the master znode to confirm leadership, relinquish duties on session expiry, and design idempotent tasks to avoid split‑brain or duplicate execution caused by network partitions or quorum delays.
Distributed locks are increasingly used to coordinate concurrent tasks, but unsafe usage can lead to multiple masters acting in parallel, causing side effects such as duplicate computations.
To acquire a distributed lock correctly and track changes of ownership as network conditions change, the application must handle events such as session expiration and connection loss.
When a connection loss occurs, the create request may or may not have been applied on the server, so the application must check the state of the master znode to determine leadership. The following code shows how the runForMaster method handles the CONNECTIONLOSS case by invoking checkMaster, which retries until a definitive state is known:

    protected void runForMaster() {
        logger.info("master:run for master.");
        AsyncCallback.StringCallback createCallback = (rc, path, ctx, name) -> {
            switch (KeeperException.Code.get(rc)) {
                case CONNECTIONLOSS:
                    // Connection lost: check whether the znode was actually created.
                    checkMaster();
                    return;
                case OK:
                    isLeader = true;
                    logger.info("master:I'm the leader serverId:" + serverId);
                    addMasterWatcher();    // watch the master znode
                    this.takeLeadership(); // take on leader duties
                    break;
                case NODEEXISTS:
                    isLeader = false;
                    // Renamed from serverId so the local no longer shadows the field.
                    String masterId = this.getMasterServerId();
                    this.takeBackup(masterId);
                    break;
            }
        };
        zk.create(rootPath + "/master", serverId.getBytes(), ZooDefs.Ids.OPEN_ACL_UNSAFE,
                CreateMode.EPHEMERAL, createCallback, null);
    }

    private void checkMaster() {
        AsyncCallback.DataCallback masterCheckCallback = (rc, path, ctx, data, stat) -> {
            switch (KeeperException.Code.get(rc)) {
                case CONNECTIONLOSS:
                    checkMaster();  // still no definitive answer: retry
                    return;
                case NONODE:
                    runForMaster(); // no master znode exists: run for master again
                    return;
                default: {
                    String masterId = this.getMasterServerId();
                    isLeader = this.serverId.equals(masterId);
                    if (!BooleanUtils.isTrue(isLeader)) {
                        this.takeBackup(masterId);
                    } else {
                        this.takeLeadership();
                    }
                }
                return;
            }
        };
        zk.getData(masterZnode, false, masterCheckCallback, null);
    }

Session expiry triggers a SESSIONEXPIRED event, requiring a stop‑point strategy to relinquish master duties and stop any ongoing work:

    case SESSIONEXPIRED:
        // Issue the stop-point notification.
        this.stopPoint();
        break;

Some implementations bypass ZooKeeper for state notification, but this is unsafe because ZooKeeper’s internal quorum and replication delays can cause the client’s view of lock ownership to diverge from the cluster’s actual state.
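The article does not show what stopPoint() does internally. As a minimal sketch of one possible stop-point strategy, the leader's work can be split into small steps that each check a shared flag before running. The names StopPoint, LeaderWork, and StopPointDemo are hypothetical and are not part of ZooKeeper or of the original code; no ZooKeeper client is involved here, and stop() stands in for the call made from the SESSIONEXPIRED branch.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical helper (not a ZooKeeper class): a flag that leader work checks between steps.
final class StopPoint {
    private final AtomicBoolean stopped = new AtomicBoolean(false);

    // Intended to be called from the SESSIONEXPIRED branch.
    void stop() {
        stopped.set(true);
    }

    boolean isStopped() {
        return stopped.get();
    }
}

// Hypothetical unit of leader work, split into steps so it can be halted between them.
final class LeaderWork {
    private final StopPoint stopPoint;
    private int completedSteps;

    LeaderWork(StopPoint stopPoint) {
        this.stopPoint = stopPoint;
    }

    // Runs one step only if leadership has not been relinquished.
    boolean runStep() {
        if (stopPoint.isStopped()) {
            return false; // refuse to act as master after the stop point
        }
        completedSteps++;
        return true;
    }

    int completedSteps() {
        return completedSteps;
    }
}

final class StopPointDemo {
    public static void main(String[] args) {
        StopPoint stopPoint = new StopPoint();
        LeaderWork work = new LeaderWork(stopPoint);
        work.runStep();
        work.runStep();
        stopPoint.stop();   // simulates the SESSIONEXPIRED branch firing
        work.runStep();     // refused
        System.out.println(work.completedSteps()); // prints 2
    }
}
```

A flag like this only bounds how much work runs after the notification arrives; a step already in progress is not interrupted, so the steps themselves still need to be safe to repeat.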
Leader election and ZK node disconnections can cause session loss across data centers. For example, a seven‑node cluster spread across Shanghai, Beijing, and Shenzhen may lose its leader (zid=100) due to a network partition; the new leader (zid=80) holds no session information, causing all sessions to expire and leading to potential split‑brain scenarios.
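To make the partition scenario concrete, the sketch below works through the majority arithmetic for a seven-node ensemble, assuming the default majority quorum. The 3/2/2 placement across the three cities is a hypothetical example, since the article does not say how the nodes are distributed, and the Quorum and QuorumDemo classes are illustrative only.

```java
// Hypothetical helper illustrating majority-quorum arithmetic; not a ZooKeeper API.
final class Quorum {
    // A majority quorum needs strictly more than half of the ensemble.
    static int majority(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    // True if a group of mutually reachable nodes is large enough to elect a leader.
    static boolean canElectLeader(int reachableNodes, int ensembleSize) {
        return reachableNodes >= majority(ensembleSize);
    }
}

final class QuorumDemo {
    public static void main(String[] args) {
        int ensembleSize = 7;
        // Assumed placement: 3 nodes in one city, 2 in each of the other two.
        int isolatedSite = 3;
        int remainingSites = 2 + 2;

        System.out.println(Quorum.majority(ensembleSize));                       // prints 4
        System.out.println(Quorum.canElectLeader(isolatedSite, ensembleSize));   // prints false
        System.out.println(Quorum.canElectLeader(remainingSites, ensembleSize)); // prints true
    }
}
```

Under this assumed placement, a partition that isolates the three-node site leaves it unable to form a quorum, while the other four nodes can elect a new leader among themselves.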
Both static and dynamic cluster expansion carry risks: static expansion may overwrite logs if a lagging node restarts slowly, while dynamic expansion can produce a new epoch when a subset of nodes forms a quorum, again risking data inconsistency.
To mitigate these issues, applications should design tasks to be idempotent, using version numbers, timestamps, or other checks, so that repeated execution does not corrupt data or produce incorrect results.
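One way to apply the version-number idea is to have the component that owns the data reject any write whose version is not newer than the one already stored. The sketch below is a minimal in-memory illustration; VersionedStore and its methods are hypothetical names, and a real system would perform the same check inside the database or service that holds the data.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory store that accepts a write only if its version is newer.
final class VersionedStore {
    private static final class Entry {
        final long version;
        final String value;

        Entry(long version, String value) {
            this.version = version;
            this.value = value;
        }
    }

    private final Map<String, Entry> entries = new HashMap<>();

    // Returns true if the write was applied, false if it was a duplicate or stale.
    synchronized boolean apply(String key, long version, String value) {
        Entry current = entries.get(key);
        if (current != null && version <= current.version) {
            return false; // already applied, or sent by an outdated master
        }
        entries.put(key, new Entry(version, value));
        return true;
    }

    synchronized String get(String key) {
        Entry entry = entries.get(key);
        return entry == null ? null : entry.value;
    }
}

final class VersionedStoreDemo {
    public static void main(String[] args) {
        VersionedStore store = new VersionedStore();
        System.out.println(store.apply("job-1", 1, "result-a")); // prints true
        System.out.println(store.apply("job-1", 1, "result-a")); // prints false (duplicate)
        System.out.println(store.apply("job-1", 2, "result-b")); // prints true
        System.out.println(store.apply("job-1", 1, "result-x")); // prints false (stale)
        System.out.println(store.get("job-1"));                  // prints result-b
    }
}
```

With a check like this, two masters that briefly act in parallel can both submit the same task without the second write overwriting or duplicating the first.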
Tencent Cloud Developer
