
How Fencing Tokens Ensure Safety and Liveness in Distributed Lock Services

This article explores how fencing tokens can provide safety and liveness guarantees in distributed lock services, illustrating fault scenarios, token-based conflict resolution, and abstract system models that help engineers prioritize correctness while tolerating temporary unavailability.

Xiaokun's Architecture Exploration Notes

In many distributed systems we trade reliability for resource utilization and cost, often building on cheap but unreliable components that introduce unbounded latency, clock drift, and process pauses. Building a fault‑tolerant distributed system requires mechanisms that can handle these realities.

Fencing Tokens Example

Consider a previous example in which a client suffered a stop‑the‑world (STW) pause after acquiring a lease, leading to conflicting writes and file corruption. Two clients believed they held the write lock on the same file and both wrote successfully, but the storage layer had no way to decide which write should prevail.

To solve this, we introduce a fencing token mechanism. Whenever the lock is acquired, the lock service also issues a unique, monotonically increasing token, which can be implemented with, for example, Google's TrueTime timestamps or a logical clock. Clients include this token with each write; the storage layer accepts a write only if its token is greater than the last persisted token, rejecting writes carrying older tokens.
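The rejection logic described above can be sketched as follows. This is a minimal single‑process illustration, not a real storage service: the class name `FencedStorage` and its API are hypothetical, and the check here stands in for validation that would happen inside the storage layer.

```python
import threading

class FencedStorage:
    """Toy storage layer that accepts a write only if its fencing
    token is strictly greater than the last token it has seen."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._last_token = -1
        self._data = {}

    def write(self, key, value, token):
        with self._mutex:
            if token <= self._last_token:
                return False  # stale token: a newer lock holder already wrote
            self._last_token = token
            self._data[key] = value
            return True

storage = FencedStorage()
storage.write("file", "client A v1", token=33)  # accepted
storage.write("file", "client B v1", token=34)  # accepted: newer token
storage.write("file", "client A v2", token=33)  # rejected: A was paused,
                                                # its token is now stale
```

After the third call, the file still holds client B's write: the paused client A cannot corrupt it, because its old token fences it out.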

File corruption due to concurrent writes (figure)
Fencing token flow diagram (figure)

Safety & Liveness

From the fencing token example we can derive the following properties for a lock service:

Uniqueness: any two token requests return different values.

Monotonic ordering: if request X precedes request Y, then token Tx < Ty.

Availability: a non‑crashed node that requests a token will eventually receive a response.

Uniqueness and monotonic ordering correspond to the safety property of a distributed system, while availability corresponds to liveness. Safety means “nothing bad happens”; liveness means “something good eventually happens.”
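On a single node, the two safety properties are easy to satisfy: a counter behind a mutex trivially gives uniqueness and monotonic ordering. The sketch below (class name `TokenIssuer` is hypothetical) shows that baseline; a real lock service must additionally survive restarts and replication, which is where the hard work lies.

```python
import threading

class TokenIssuer:
    """In-memory issuer: each call returns a strictly larger token,
    so any two requests get different values (uniqueness) and an
    earlier request gets a smaller token (monotonic ordering)."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._next = 0

    def issue(self):
        with self._mutex:  # serializes concurrent requests
            self._next += 1
            return self._next

issuer = TokenIssuer()
t1 = issuer.issue()
t2 = issuer.issue()  # t2 > t1, and t1 != t2
```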

Design Implications

When designing a distributed lock service we must first guarantee safety—ensure tokens are unique and ordered—before considering liveness. The service may be unavailable for short periods (e.g., during a crash‑recovery), but as long as safety is preserved, temporary unavailability is tolerable.
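One common way to preserve safety across a crash‑recovery is to persist the last issued token before handing it out, so a restarted node never re-issues an old value. The sketch below assumes a hypothetical state file named `token_state.json`; while the durable write is in flight the service is briefly unavailable, which is exactly the liveness-for-safety trade described above.

```python
import json
import os

STATE_FILE = "token_state.json"  # hypothetical state file

def load_last_token():
    try:
        with open(STATE_FILE) as f:
            return json.load(f)["last"]
    except FileNotFoundError:
        return 0  # fresh node: no token issued yet

def issue_token():
    # Persist *before* returning: after a crash we resume past the
    # highest token ever issued, preserving uniqueness and ordering
    # (safety) at the cost of a slower, occasionally blocked issue
    # path (liveness).
    token = load_last_token() + 1
    tmp = STATE_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"last": token}, f)
        f.flush()
        os.fsync(f.fileno())   # force the write to stable storage
    os.replace(tmp, STATE_FILE)  # atomic rename on POSIX
    return token
```

Writing to a temporary file and atomically renaming it avoids a half-written state file if the node crashes mid-write.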

Distributed Abstract System Models

To reason about fault tolerance we abstract distributed systems into models. Two common models are:

Time‑based system model
Time‑based system model diagram (figure)
Node‑failure system model
Node‑failure system model diagram (figure)

Models help us formalize faults and design algorithms that satisfy safety and liveness under assumed conditions. For example, storing data in memory means a node crash loses data; persisting to disk (using RAID, SAN, NAS) mitigates that, but introduces consistency challenges. Strong consistency maps to safety, while eventual consistency maps to liveness.

Conclusion

Distinguishing safety and liveness is essential when building fault‑tolerant distributed systems. Engineers must identify which failures are unacceptable at the business level (safety) and which can be tolerated temporarily (liveness), then make architectural trade‑offs accordingly.

Distributed Systems · fault tolerance · fencing tokens · lock service · safety & liveness
Written by

Xiaokun's Architecture Exploration Notes

10 years of backend architecture design | AI engineering infrastructure, storage architecture design, and performance optimization | Former senior developer at NetEase, Douyu, Inke, etc.
