How ByConity Achieves Leader Election with Shared Storage and CAS
This article explains how ByConity uses a highly available shared KV store together with CAS (compare-and-swap) operations to implement a lightweight, fault-tolerant leader election mechanism. The approach eliminates the need for external services such as ZooKeeper, simplifies node management, and ensures safe leader transitions in a cloud-native data warehouse.
Background
In traditional shared-nothing microservice architectures, DNS is used for service discovery, while components such as ZooKeeper, etcd, or Consul provide leader election; both add configuration and operational complexity.
With the rise of shared-everything architectures that separate storage from compute, a highly available KV store can serve as shared memory: each compute node is treated like a thread in a single-machine program, enabling discovery and synchronization without extra services.
ByConity Basic Architecture
Reference to the article describing ByConity’s storage‑compute separation architecture based on ClickHouse.
Problems with ClickHouse Keeper
At least three keeper nodes are required for fault tolerance.
Node addition/removal and service discovery are complex, requiring configuration changes on all nodes.
Container restarts change a node's IP/port; because server_id and address are tightly bound, rapid recovery is difficult.
Design Goals
The election component should be embeddable as a library, similar to a pthread mutex.
Support arbitrary number of replica nodes.
Node addition/removal requires no extra operations.
Changing a node’s listening address requires no extra operations.
Election succeeds as long as one replica is available.
Replicas do not need to communicate with each other or synchronize clocks.
Leader Election Based on Shared Storage
Terminology
Replica: one of multiple equal instances of a service.
Business: the service logic other than election.
Follower: a replica that cannot currently serve business requests.
Leader: the replica that serves business requests.
Client: a node that needs to access the leader.
Design Idea
The design mimics a pthread mutex: the lock is placed in shared memory, CAS is used for atomic writes, and the OS provides futex‑like wait/notify. By treating each election node as a thread and the KV store as shared memory, the node that wins the CAS becomes the leader.
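The mutex analogy can be made concrete with a toy sketch. The `SharedKV` class below is a stand-in for the real highly available shared store, and all names are illustrative, not ByConity's actual API; the point is that a single atomic compare-and-swap decides the winner when replicas race for the leader key.

```python
import threading

class SharedKV:
    """Toy stand-in for a highly available shared KV store.
    Illustrative only; not ByConity's actual storage API."""
    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def compare_and_swap(self, key, expected, new):
        # Atomically write `new` only if the current value equals `expected`.
        with self._lock:
            if self._data.get(key) == expected:
                self._data[key] = new
                return True
            return False

    def get(self, key):
        with self._lock:
            return self._data.get(key)

kv = SharedKV()
# Two replicas race to claim the leader key; CAS guarantees exactly one winner,
# just as a mutex in shared memory admits exactly one thread.
won_a = kv.compare_and_swap("leader", None, "replica-a")
won_b = kv.compare_and_swap("leader", None, "replica-b")
print(won_a, won_b)  # True False
```

The second CAS fails because the key no longer holds the expected value (`None`), which is precisely the property that makes the election safe without any replica-to-replica communication.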
Basic Rules
Each node is either follower or leader; at any moment only one node considers itself leader.
Any node can read a KV key to discover the current leader; if the key does not exist, no leader has been elected.
The leader periodically CAS-updates lease.last_refresh_time to extend its term.
The leader can voluntarily yield by setting lease.status to Yield.
Followers periodically GET the key; if the lease is expired or missing, they attempt a CAS to become leader.
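The follower-side rules above can be sketched as follows. The lease layout and field names are assumptions for illustration; the essential trick is that the takeover CAS compares against the exact lease the follower read, so a concurrent renewal by the sitting leader makes the takeover fail.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Lease:
    holder: str
    last_refresh_time: float  # physical timestamp written by the leader
    duration: float           # lease length in seconds
    status: str = "Normal"    # "Yield" when the leader steps down voluntarily

class KV:
    """Minimal in-memory stand-in for the shared store (illustrative)."""
    def __init__(self): self._d = {}
    def get(self, k): return self._d.get(k)
    def compare_and_swap(self, k, expected, new):
        if self._d.get(k) == expected:
            self._d[k] = new
            return True
        return False

def try_become_leader(kv, me, now):
    """Follower-side takeover attempt, per the rules above (sketch)."""
    current = kv.get("lease")
    if current is None:
        # No leader has ever been elected: claim the key from scratch.
        return kv.compare_and_swap("lease", None, Lease(me, now, 10.0))
    expired = now - current.last_refresh_time > current.duration
    if expired or current.status == "Yield":
        # CAS against the lease we read; if the leader renewed meanwhile,
        # the stored value changed and this attempt loses the race.
        return kv.compare_and_swap("lease", current, Lease(me, now, current.duration))
    return False

kv = KV()
assert try_become_leader(kv, "r1", now=100.0)      # empty key: r1 wins
assert not try_become_leader(kv, "r2", now=105.0)  # lease still fresh
assert try_become_leader(kv, "r2", now=111.0)      # lease expired: r2 takes over
```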
Election Process
The process covers candidate selection, winning the election, taking office, lease renewal, and both voluntary and passive resignation; each step is defined by its pre-conditions and actions.
Safety Analysis
By defining the lease interval based on the time a follower first reads the lease, the start times of successive leaders never overlap, even without synchronized physical clocks. The analysis shows the required clock drift bounds.
T_w0 < T_w1
T_r0 < T_r1 < T_w2 < T_w3
Implementation Details
Physical timestamps are stored in the lease to allow fast recovery after a leader restart and to handle possible write‑through timeouts.
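The fast-restart path can be sketched as a single check. The field names and the stable replica id are assumptions for illustration: if the lease in the shared store still belongs to this replica and has not expired, the restarted node resumes leadership immediately instead of waiting out a full term.

```python
def can_resume(lease, my_id, now):
    """Restart path (sketch): resume leadership at once if the shared store
    still holds our own, unexpired lease, identified by a stable replica id.
    Field names are illustrative."""
    if lease is None:
        return False
    unexpired = now - lease["last_refresh_time"] <= lease["duration"]
    return lease["holder"] == my_id and unexpired

lease = {"holder": "r1", "last_refresh_time": 100.0, "duration": 10.0}
assert can_resume(lease, "r1", now=104.0)       # our lease, still valid: resume
assert not can_resume(lease, "r2", now=104.0)   # someone else's lease
assert not can_resume(lease, "r1", now=120.0)   # expired: go through normal election
```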
Parameter Consensus
Leaders publish heartbeat interval and lease duration to the shared store; the next leader must use these parameters, enabling safe hot‑updates of election settings.
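A minimal sketch of this hand-off, with hypothetical field names: a successor first adopts whatever parameters it reads from the previous lease, and only once in office may it publish new values, which then govern future terms.

```python
def take_office(me, previous_lease, new_params=None):
    """Sketch of parameter consensus (illustrative field names).
    The incoming leader inherits the published election parameters;
    only after taking office may it hot-update them for later terms."""
    params = {
        "heartbeat_interval": previous_lease["heartbeat_interval"],
        "lease_duration": previous_lease["lease_duration"],
    }
    if new_params:
        params.update(new_params)  # published for this and subsequent terms
    return {"holder": me, **params}

old_lease = {"holder": "r1", "heartbeat_interval": 2.0, "lease_duration": 10.0}
lease = take_office("r2", old_lease)
assert lease["lease_duration"] == 10.0  # successor first honors the old settings
lease = take_office("r2", lease, new_params={"lease_duration": 15.0})
assert lease["lease_duration"] == 15.0  # then hot-updates take effect
```

Because every replica reads its parameters from the store rather than from local config, a rolling change never leaves two leaders judging lease expiry by different rules.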
Client Discovery
Clients read the address from the KV key; if the response indicates the node is not leader, they discard the cache and retry.
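Client-side discovery reduces to a small caching loop. The sketch below assumes a lease record carrying the leader's address (illustrative names): the client serves from its cache until a callee answers "not leader", at which point it discards the cache and re-reads the key.

```python
class Client:
    """Client-side leader discovery with a local cache (sketch)."""
    def __init__(self, store):
        self.store = store   # any mapping exposing the shared lease key
        self.cached = None

    def leader_address(self):
        # Serve from cache when possible; otherwise read the shared store.
        if self.cached is None:
            lease = self.store.get("lease")
            self.cached = lease["address"] if lease else None
        return self.cached

    def on_not_leader(self):
        # The node we called has stepped down: drop the stale cache entry.
        self.cached = None

store = {"lease": {"holder": "r1", "address": "10.0.0.5:9000"}}
c = Client(store)
assert c.leader_address() == "10.0.0.5:9000"
store["lease"] = {"holder": "r2", "address": "10.0.0.6:9000"}  # leadership moved
assert c.leader_address() == "10.0.0.5:9000"  # still serving from stale cache
c.on_not_leader()                             # old leader replied "not leader"
assert c.leader_address() == "10.0.0.6:9000"  # re-read picks up the new leader
```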
Usage in ByConity
The election scheme can be applied to services such as Resource Manager or Timestamp Oracle. When deployed on Kubernetes, scaling the replica count automatically adds or removes nodes without additional configuration.
Conclusion
The article presents a generic leader‑election solution based on shared storage and CAS, simplifying operation and configuration for ByConity and any stateless service.