
How ByConity Achieves Leader Election with Shared Storage and CAS

This article explains how ByConity uses a highly available shared KV store and CAS operations to implement a lightweight, fault-tolerant leader election mechanism. The approach removes the need for external services such as ZooKeeper, simplifies node management, and ensures safe leader transitions in a cloud-native data warehouse.


Background

In traditional shared-nothing microservice architectures, DNS handles service discovery while components such as ZooKeeper, etcd, or Consul provide leader-follower election, which adds configuration and operational complexity.

With the rise of shared-everything, storage-compute-separated architectures, a highly available KV store can serve as shared memory: each compute node is treated like a thread in a single-machine model, so discovery and synchronization no longer require extra services.

ByConity Basic Architecture

See the earlier article describing ByConity's ClickHouse-based storage-compute separation architecture.

Problems with ClickHouse Keeper

At least three keeper nodes are required for fault tolerance.

Node addition/removal and service discovery are complex, requiring configuration changes on all nodes.

Container restarts change IP/port, making rapid recovery difficult because server_id and address are tightly bound.

Design Goals

The election component should be embeddable as a library, similar to a pthread mutex.

Support an arbitrary number of replica nodes.

Adding or removing nodes requires no extra operations.

Changing a node's listening address requires no extra operations.

A leader can be elected as long as at least one replica is available.

Replicas do not need to communicate with each other or synchronize their clocks.

Leader Election Based on Shared Storage

Terminology

Replica: one of several equal instances of a service.

Business: the service logic other than election.

Follower: a replica that cannot currently serve business requests.

Leader: the replica that can serve business requests.

Client: a node that needs to access the leader.

Design Idea

The design mimics a pthread mutex: the lock lives in shared memory, CAS provides the atomic write, and the OS supplies futex-like wait/notify. By treating each election node as a thread and the KV store as shared memory, the node whose CAS succeeds becomes the leader.
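As a toy illustration of that analogy, the sketch below (not ByConity code) uses an in-process atomic compare-and-swap to claim a leader slot; in the real design the KV store's CAS plays the role of this atomic instruction, and the competing "threads" are replicas running on different machines.

```cpp
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    std::atomic<int> leader_slot{-1};  // -1 means "no leader elected yet"
    std::vector<std::thread> replicas;

    for (int id = 0; id < 3; ++id) {
        replicas.emplace_back([&leader_slot, id] {
            int expected = -1;
            // Only the replica whose CAS succeeds becomes leader; the others
            // observe the winner's id, just as followers read the KV key.
            if (leader_slot.compare_exchange_strong(expected, id))
                std::printf("replica %d won the election\n", id);
            else
                std::printf("replica %d follows leader %d\n", id, expected);
        });
    }
    for (auto& t : replicas) t.join();
}
```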

Basic Rules

Each node is either a follower or the leader; at any moment at most one node considers itself the leader.

Any node can read the well-known KV key to discover the current leader; if the key does not exist, no leader has been elected yet.

The leader periodically CAS-updates lease.last_refresh_time to extend its term.

The leader can voluntarily step down by setting lease.status to Yield.

Followers periodically GET the key; if the lease has expired or is missing, they attempt a CAS to become the leader (see the sketch after this list).
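To make the rules concrete, here is a minimal sketch of the lease record and the leader-discovery read. The field names, the Lease layout, and the KvStore interface are illustrative assumptions for this sketch, not ByConity's actual schema or client API.

```cpp
#include <cstdint>
#include <optional>
#include <string>

enum class LeaseStatus { Normal, Yield };

struct Lease {
    std::string leader_id;              // replica currently holding the lease
    std::string leader_address;         // where clients reach the leader
    LeaseStatus status = LeaseStatus::Normal;
    int64_t     last_refresh_time = 0;  // timestamp written by the leader
    int64_t     lease_duration_ms = 0;  // published term length
};

// Hypothetical KV interface: a linearizable GET plus compare-and-swap on the
// whole stored value.
struct KvStore {
    virtual std::optional<Lease> get(const std::string& key) = 0;
    virtual bool cas(const std::string& key,
                     const std::optional<Lease>& expected,
                     const Lease& desired) = 0;
    virtual ~KvStore() = default;
};

// Rule: any node can discover the leader by reading one key; a missing key
// means no leader has been elected yet.
std::optional<std::string> current_leader(KvStore& kv, const std::string& key) {
    auto lease = kv.get(key);
    if (!lease || lease->status == LeaseStatus::Yield)
        return std::nullopt;
    return lease->leader_address;
}
```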

Election Process

The process breaks down into candidate selection, winning the election, taking office, lease renewal, and voluntary or passive resignation, each step with its own preconditions and actions; a sketch of the resulting loop follows below.
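The sketch below maps those steps onto one election round, repeating the Lease/KvStore shapes from the previous sketch. The timings, the value-based CAS semantics, and the retry policy are assumptions for illustration, not ByConity's implementation.

```cpp
#include <chrono>
#include <cstdint>
#include <optional>
#include <string>
#include <thread>

enum class LeaseStatus { Normal, Yield };

struct Lease {
    std::string leader_id;
    LeaseStatus status = LeaseStatus::Normal;
    int64_t     last_refresh_time = 0;
    int64_t     lease_duration_ms = 10'000;
};

struct KvStore {
    virtual std::optional<Lease> get(const std::string& key) = 0;
    // Writes `desired` only if the stored value still equals `expected`
    // (std::nullopt meaning "the key does not exist yet").
    virtual bool cas(const std::string& key,
                     const std::optional<Lease>& expected,
                     const Lease& desired) = 0;
    virtual ~KvStore() = default;
};

int64_t now_ms() {
    using namespace std::chrono;
    return duration_cast<milliseconds>(
        steady_clock::now().time_since_epoch()).count();
}

class Elector {
public:
    Elector(KvStore& kv, std::string key, std::string my_id)
        : kv_(kv), key_(std::move(key)), my_id_(std::move(my_id)) {}

    // Candidate selection and winning: watch the lease that was first read;
    // only if it is missing, yielded, or not refreshed for one full lease
    // duration (measured on this node's own clock, counted from that first
    // read) does the follower attempt the CAS and take office on success.
    bool try_become_leader() {
        std::optional<Lease> observed = kv_.get(key_);
        if (observed && observed->status == LeaseStatus::Normal) {
            std::this_thread::sleep_for(
                std::chrono::milliseconds(observed->lease_duration_ms));
            auto again = kv_.get(key_);
            if (again && again->last_refresh_time != observed->last_refresh_time)
                return false;  // the holder renewed its lease: stay a follower
            observed = again;
        }
        Lease desired{my_id_, LeaseStatus::Normal, now_ms(),
                      observed ? observed->lease_duration_ms : int64_t{10'000}};
        is_leader_ = kv_.cas(key_, observed, desired);
        if (is_leader_) current_ = desired;
        return is_leader_;
    }

    // Renewal and passive resignation: the leader CAS-updates
    // last_refresh_time; a failed renewal means the lease may have been
    // taken over, so the node demotes itself immediately.
    bool renew() {
        if (!is_leader_) return false;
        Lease refreshed = current_;
        refreshed.last_refresh_time = now_ms();
        is_leader_ = kv_.cas(key_, current_, refreshed);
        if (is_leader_) current_ = refreshed;
        return is_leader_;
    }

    // Voluntary resignation: mark the lease as yielded so followers can take
    // over without waiting for it to expire.
    void yield() {
        if (!is_leader_) return;
        Lease yielded = current_;
        yielded.status = LeaseStatus::Yield;
        kv_.cas(key_, current_, yielded);
        is_leader_ = false;
    }

private:
    KvStore&    kv_;
    std::string key_;
    std::string my_id_;
    Lease       current_;
    bool        is_leader_ = false;
};
```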

Safety Analysis

By defining the lease interval from the moment a follower first reads the lease, the terms of successive leaders never overlap, even without synchronized physical clocks; the analysis derives the required bound on clock drift.

T_w0 < T_w1
T_r0 < T_r1 < T_w2 < T_w3
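The two inequality chains above record only the event ordering; the following spells out one way the non-overlap argument can go. The lease duration D, the drift bound ρ, and the waiting rule are assumptions introduced here for illustration and may differ from the article's exact derivation.

```latex
% Assumed reading of the events:
%   T_{w0} < T_{w1} : the old leader's lease writes (last refresh at T_{w1})
%   T_{r0} < T_{r1} : a follower's first and second reads of that lease
%   T_{w2} < T_{w3} : the new leader's winning CAS and its first renewal
\begin{align*}
  &T_{w1} \le T_{r0}
    && \text{a read only observes an already-completed write} \\
  &\text{old term ends by } T_{w1} + D(1+\rho)
    && \text{$D$ measured on the old leader's clock, rate drift} \le \rho \\
  &T_{r1} \ge T_{r0} + D(1+\rho)
    && \text{follower waits $D(1+\rho)^2$ on its own clock after $T_{r0}$} \\
  &T_{w1} + D(1+\rho) \le T_{r0} + D(1+\rho) \le T_{r1} < T_{w2}
    && \text{so successive terms never overlap}
\end{align*}
```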

Implementation Details

Physical timestamps are also stored in the lease: they let a restarted leader recover quickly and help handle writes that time out without a known outcome.
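A small sketch of how that timestamp might be used; the Lease fields shown here (only the ones needed), the CasResult values, and the freshness check are illustrative assumptions, not ByConity's implementation.

```cpp
#include <cstdint>
#include <optional>
#include <string>

struct Lease {
    std::string leader_id;
    int64_t     physical_time_ms  = 0;  // wall-clock time of the last write
    int64_t     lease_duration_ms = 0;
};

enum class CasResult { Applied, Rejected, TimedOut };

// Fast recovery: a restarted replica that still appears in a fresh lease can
// resume as leader at once instead of waiting out a full lease term.
bool can_resume_as_leader(const std::optional<Lease>& stored,
                          const std::string& my_id, int64_t wall_now_ms) {
    return stored && stored->leader_id == my_id &&
           wall_now_ms - stored->physical_time_ms < stored->lease_duration_ms;
}

// Write-through timeout: a CAS that timed out may or may not have been
// applied, so the caller must re-read the key before assuming either outcome.
bool must_reread_before_acting(CasResult result) {
    return result == CasResult::TimedOut;
}
```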

Parameter Consensus

The leader publishes the heartbeat interval and lease duration to the shared store; the next leader must adopt these parameters, which makes it safe to hot-update the election settings.
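A minimal sketch of that adoption rule; the field names, default values, and the function are hypothetical and only illustrate the idea.

```cpp
#include <cstdint>
#include <optional>

struct ElectionParams {
    int64_t heartbeat_interval_ms = 2'000;   // local defaults, only used
    int64_t lease_duration_ms     = 10'000;  // when no lease exists yet
};

// Called before a replica campaigns or takes office: whatever parameters the
// previous leader published in the shared store take precedence over the
// local configuration, so both values can be hot-updated by having the
// current leader write new ones.
ElectionParams effective_params(const ElectionParams& local_defaults,
                                const std::optional<ElectionParams>& published) {
    return published ? *published : local_defaults;
}
```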

Client Discovery

Clients read the leader's address from the KV key and cache it; if a response indicates the node is no longer the leader, they discard the cached address and retry.
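A client-side sketch of this cache-and-retry loop; the KvStore and Rpc interfaces, the status codes, and the retry budget are assumptions for illustration, not ByConity's client API.

```cpp
#include <optional>
#include <string>

struct KvStore {
    // Returns the leader address stored under the election key, if any.
    virtual std::optional<std::string> get_leader_address(const std::string& key) = 0;
    virtual ~KvStore() = default;
};

enum class RpcStatus { Ok, NotLeader, Unreachable };

struct Rpc {
    virtual RpcStatus call(const std::string& address, const std::string& request) = 0;
    virtual ~Rpc() = default;
};

class LeaderClient {
public:
    LeaderClient(KvStore& kv, Rpc& rpc, std::string key)
        : kv_(kv), rpc_(rpc), key_(std::move(key)) {}

    RpcStatus send(const std::string& request, int max_attempts = 3) {
        RpcStatus last = RpcStatus::Unreachable;
        for (int attempt = 0; attempt < max_attempts; ++attempt) {
            if (!cached_leader_)
                cached_leader_ = kv_.get_leader_address(key_);
            if (!cached_leader_)
                continue;  // no leader elected yet; resolve again next attempt
            last = rpc_.call(*cached_leader_, request);
            if (last == RpcStatus::Ok)
                return last;
            // "Not leader" or unreachable: the cached address is stale, so
            // discard it and re-read the key on the next attempt.
            cached_leader_.reset();
        }
        return last;
    }

private:
    KvStore&    kv_;
    Rpc&        rpc_;
    std::string key_;
    std::optional<std::string> cached_leader_;
};
```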

Use in ByConity

The election scheme can be applied to ByConity services such as the Resource Manager and the Timestamp Oracle. When deployed on Kubernetes, scaling the replica count adds or removes election participants automatically, with no additional configuration.

Conclusion

The article presents a generic leader‑election solution based on shared storage and CAS, simplifying operation and configuration for ByConity and any stateless service.

Tags: cloud-native, high availability, leader election, distributed consensus, shared-storage, ByConity
Written by

Volcano Engine Developer Services

The Volcano Engine Developer Community, Volcano Engine's TOD community, connects the platform with developers, offering cutting-edge tech content and diverse events, nurturing a vibrant developer culture, and co-building an open-source ecosystem.
