
How Bond Evolved into a Robust Distributed‑Lock Middleware for Scalable Services

This article chronicles the design, evolution, and performance evaluation of Bond, a distributed‑lock SDK used internally at Youzan, covering its multi‑phase roadmap, storage choices, timeout‑retry strategies, lock‑TTL recommendations, and practical pitfalls for reliable operations.

Youzan Coder

Background

Initially each business unit at Youzan built its own simple distributed‑lock solution using shared services such as Codis, Zookeeper, or custom Redis clusters, leading to duplicated effort, inconsistent detail control, storage selection uncertainty, poor scenario sharing, and incomplete monitoring.

To address these issues, a standardized distributed‑lock SDK named Bond was created, evolving through five distinct phases.

Phase Overview

Phase 1 – Availability: Provide a usable basic lock component.

Phase 2 – Solution: Offer a suite of scenario‑specific solutions, elevating the product from a simple lock to a full solution.

Phase 3 – Productization: Add middleware monitoring, logging, and lock‑trace capabilities.

Phase 4 – Internal Consolidation: Automate ticket integration and feedback loops.

Phase 5 – External Offering: Expose the component as a reusable service for Youzan Cloud.

Phase 1 Solution

The first phase focused on strong consistency (CP in the CAP model) and evaluated Zookeeper versus Etcd based on performance and lock‑support features.

Performance Comparison

Benchmarks measured CPU usage, memory consumption, write‑node speed, and latency for both candidates (images omitted for brevity).

Lock‑Support Evaluation

Zookeeper implements locks with ephemeral sequential (temporary ordered) nodes and client‑side logic, so each lock acquisition takes multiple network requests, and network jitter can cause the node to be released prematurely.

Etcd uses ordered nodes with server‑side handling, exposing a simple Lock command to the client.

Conclusion

Etcd was selected as the primary dependency for Phase 1, deployed across three data centers to ensure fault tolerance. However, the team noted that Etcd could become a bottleneck under heavy load, requiring thorough stress testing.

Phase 2 Solution

After Phase 1 had been in operation, performance requirements grew, and many business scenarios could tolerate occasional inconsistency. Consequently, Bond switched its storage backend to Aerospike, leveraging its replica capabilities and existing internal deployment experience.

The codebase adopted the Adapter pattern to simplify storage swapping, illustrated by an architecture diagram (image omitted).
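As a rough illustration of how an Adapter layer can make the storage backend swappable, the sketch below defines a small storage interface and an in-memory stand-in. The names `LockStore`, `InMemoryLockStore`, and `LockClient` are hypothetical; the article does not show Bond's real interfaces, and a production adapter would wrap an Etcd or Aerospike client instead.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical storage abstraction; Bond's actual interface is not shown in the article.
interface LockStore {
    boolean tryAcquire(String key, String owner, long ttlMillis);
    boolean release(String key, String owner);
}

// In-memory stand-in for an Etcd or Aerospike adapter, for illustration only.
class InMemoryLockStore implements LockStore {
    private static final class Entry {
        final String owner;
        final long expiresAt;
        Entry(String owner, long expiresAt) {
            this.owner = owner;
            this.expiresAt = expiresAt;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();

    @Override
    public boolean tryAcquire(String key, String owner, long ttlMillis) {
        long now = System.currentTimeMillis();
        Entry fresh = new Entry(owner, now + ttlMillis);
        // compute() is atomic per key: keep a live entry, replace a missing or expired one.
        Entry result = entries.compute(key,
                (k, old) -> (old == null || old.expiresAt <= now) ? fresh : old);
        return result == fresh;
    }

    @Override
    public boolean release(String key, String owner) {
        boolean[] removed = {false};
        // Remove the entry only if the caller still owns it.
        entries.computeIfPresent(key, (k, old) -> {
            if (old.owner.equals(owner)) {
                removed[0] = true;
                return null; // returning null deletes the mapping
            }
            return old;
        });
        return removed[0];
    }
}

// Client code depends only on LockStore, so the backend can be swapped freely.
class LockClient {
    private final LockStore store;
    LockClient(LockStore store) { this.store = store; }
    boolean lock(String key, String owner, long ttlMillis) {
        return store.tryAcquire(key, owner, ttlMillis);
    }
    boolean unlock(String key, String owner) {
        return store.release(key, owner);
    }
}
```

Because `LockClient` only sees the interface, moving from one backend to another would mean writing a new `LockStore` implementation rather than touching callers.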

Bond also began providing higher‑level distributed‑lock solutions rather than just basic lock primitives.

Key Scenario Implementations (Phase 2 & 3)

3.1 Timeout‑Retry Re‑entry

Scenario: A non‑blocking lock request times out, leaving uncertainty about lock acquisition.

Solution: On SocketTimeoutException, retry the lock request. Use a composite value of node IP, random number, and timestamp as the lock value; on timeout, read the value to determine ownership and decide whether to retry or fail.

Pros: Increases the chance of successful lock acquisition.

Cons: Adds extra latency when a timeout occurs.
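The ownership check after a timeout can be sketched as follows. This is a minimal sketch under assumptions: `KvClient` and `TimeoutAwareLocker` are hypothetical names, and the KV store is assumed to offer an atomic put-if-absent and a read. The lock value combines node IP, a random number, and a timestamp as described above.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical minimal KV client; not Bond's real storage API.
interface KvClient {
    boolean putIfAbsent(String key, String value, long ttlMillis) throws SocketTimeoutException;
    String get(String key) throws SocketTimeoutException;
}

class TimeoutAwareLocker {
    private final KvClient kv;
    private final String nodeIp;
    private final int maxRetries;

    TimeoutAwareLocker(KvClient kv, String nodeIp, int maxRetries) {
        this.kv = kv;
        this.nodeIp = nodeIp;
        this.maxRetries = maxRetries;
    }

    // Composite value: node IP + random number + timestamp.
    String newLockValue() {
        return nodeIp + ":" + ThreadLocalRandom.current().nextLong(Long.MAX_VALUE)
                + ":" + System.currentTimeMillis();
    }

    boolean tryLock(String key, String value, long ttlMillis) {
        boolean uncertain = false; // true once any request has timed out
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                if (kv.putIfAbsent(key, value, ttlMillis)) {
                    return true;
                }
                if (!uncertain) {
                    return false; // clean rejection: someone else holds the lock
                }
                // Otherwise an earlier timed-out write may have landed; check below.
            } catch (SocketTimeoutException writeTimedOut) {
                uncertain = true;
            }
            try {
                String current = kv.get(key);
                if (current != null) {
                    return value.equals(current); // ours if the stored value matches
                }
                // Key absent: the write never landed, so loop and retry.
            } catch (SocketTimeoutException readTimedOut) {
                // Outcome still unknown; loop and retry.
            }
        }
        return false;
    }
}
```

If every attempt times out, the sketch reports failure even though the lock may in fact be held until its TTL expires; how to treat that case is a business decision.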

3.2 Unlock Thread‑Isolation

Scenario: Threads A and B may attempt to unlock each other’s locks.

Solution: Record the request timestamp T before sending a lock. Allow unlock only if the current time ≤ (T + leaseTime − 150 ms). Outside this window, either do nothing or raise an alert.

Pros: Prevents cross‑thread unlocks and provides alerts for edge cases.

Cons: In extreme cases, up to two threads may contend concurrently, requiring manual handling.
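The unlock window can be expressed as a single comparison. The sketch below assumes a hypothetical `UnlockGuard` helper that captures the request timestamp T and the lease time when the lock request is sent; the 150 ms margin comes from the text above.

```java
// Hypothetical helper illustrating the unlock window from section 3.2.
class UnlockGuard {
    static final long SAFETY_MARGIN_MS = 150;

    private final long requestTimeMs; // T, recorded before the lock request is sent
    private final long leaseTimeMs;   // lock lease (TTL) in milliseconds

    UnlockGuard(long requestTimeMs, long leaseTimeMs) {
        this.requestTimeMs = requestTimeMs;
        this.leaseTimeMs = leaseTimeMs;
    }

    // Unlock is allowed only while now <= T + leaseTime - 150 ms.
    boolean mayUnlock(long nowMs) {
        return nowMs <= requestTimeMs + leaseTimeMs - SAFETY_MARGIN_MS;
    }
}
```

Outside the window the lease may already have expired and the key may belong to another thread, so the caller would skip the unlock or raise an alert instead.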

3.3 KV‑Based Blocking Lock

Scenario: Implement a non‑fair blocking lock on a pure KV store (e.g., Redis, KVDS).

Solution: Wrap lock acquisition in a Future with a configurable waitTime. On failure, sleep for half the average business latency and retry once. Recommended waitTime values: 500 ms for server‑side issues, 30‑50 ms for typical user requests (<3 ms average).

Pros: Simple, no extra middleware required.

Cons: Unsuitable for high‑contention scenarios.
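A blocking lock built on a non-blocking primitive might look like the sketch below. It is an approximation under assumptions: `BlockingKvLock` is a hypothetical name, the non-blocking attempt is passed in as a `BooleanSupplier`, and the loop keeps retrying until waitTime elapses, which may differ from the single retry described above.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

// Hypothetical wrapper that turns a non-blocking tryLock into a blocking one.
class BlockingKvLock {
    private final ExecutorService pool = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "blocking-lock-worker");
        t.setDaemon(true); // do not keep the JVM alive for lock workers
        return t;
    });

    boolean lock(BooleanSupplier tryLockOnce, long waitTimeMs, long avgBusinessLatencyMs) {
        Future<Boolean> attempt = pool.submit(() -> {
            while (!tryLockOnce.getAsBoolean()) {
                // Back off for half the average business latency before retrying.
                Thread.sleep(Math.max(1, avgBusinessLatencyMs / 2));
            }
            return true;
        });
        try {
            return attempt.get(waitTimeMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            return false; // waitTime elapsed or the attempt itself failed
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            // Interrupts the retry loop; has no effect if the task already finished.
            // Caveat: if the worker acquires the lock just as the wait expires,
            // the lock stays held until its TTL runs out.
            attempt.cancel(true);
        }
    }
}
```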

3.4 Local Competition (Fast‑Fail)

Scenario: High lock contention on a single JVM (hot key).

Solution: Maintain a local cache mapping lock keys to thread IDs. Use putIfAbsent to acquire; if the key already maps to the current thread, treat as re‑entry; otherwise, abort. On unlock, verify ownership before removal. Cache entries expire based on the lock’s TTL.

Pros: Immediate fast‑fail without network latency.

Cons: Relies on cache reliability; edge‑case states may become stale.
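The local competition step can be sketched with a `ConcurrentHashMap`. The class name `LocalCompetition` and its methods are hypothetical, and expiry is handled here by storing an expiry time next to the thread ID rather than by a cache library, which is an assumption.

```java
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-JVM gate placed in front of the remote lock call.
class LocalCompetition {
    private static final class Holder {
        final long threadId;
        final long expiresAt;
        Holder(long threadId, long expiresAt) {
            this.threadId = threadId;
            this.expiresAt = expiresAt;
        }
    }

    private final ConcurrentHashMap<String, Holder> holders = new ConcurrentHashMap<>();

    // Returns true if the current thread may go on to the remote lock request.
    boolean tryEnter(String key, long ttlMillis) {
        long now = System.currentTimeMillis();
        long me = Thread.currentThread().getId();
        Holder mine = new Holder(me, now + ttlMillis);
        Holder existing = holders.putIfAbsent(key, mine);
        if (existing == null) {
            return true; // won the local race
        }
        if (existing.expiresAt <= now) {
            // Stale entry: take over only if nobody replaced it in the meantime.
            return holders.replace(key, existing, mine);
        }
        return existing.threadId == me; // same thread: re-entry; otherwise fast-fail
    }

    // Removes the entry only if the current thread owns it.
    boolean exit(String key) {
        Holder existing = holders.get(key);
        if (existing == null || existing.threadId != Thread.currentThread().getId()) {
            return false;
        }
        return holders.remove(key, existing);
    }
}
```

A caller would invoke `tryEnter` before the remote lock request and `exit` after the remote unlock, so threads that lose the local race never reach the network.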

3.5 Graceful Shutdown

Scenario: Locks may remain held when an application restarts, causing long TTL‑driven dead time.

Solution: Keep an in‑memory map of key→expiredTime. On successful lock, insert into the map; on unlock, remove it. A background thread periodically cleans expired entries. During Spring container shutdown, iterate the map and explicitly unlock any non‑expired locks.

Constraints: Only works for synchronous unlocks; unsuitable for asynchronous cross‑instance unlocks. Recommended for locks with TTL > 15 s.
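The bookkeeping behind graceful shutdown can be sketched as below. `HeldLockRegistry` is a hypothetical name, the real unlock call is injected as a `Consumer<String>`, and the sketch avoids Spring dependencies; in a Spring application, `releaseAll()` could be called from a container shutdown callback.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical registry of locks held by this process (key -> expiry time).
class HeldLockRegistry {
    private final Map<String, Long> held = new ConcurrentHashMap<>();
    private final Consumer<String> unlocker; // performs the real remote unlock
    private final ScheduledExecutorService cleaner =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "held-lock-cleaner");
                t.setDaemon(true);
                return t;
            });

    HeldLockRegistry(Consumer<String> unlocker, long cleanIntervalMs) {
        this.unlocker = unlocker;
        cleaner.scheduleAtFixedRate(this::evictExpired,
                cleanIntervalMs, cleanIntervalMs, TimeUnit.MILLISECONDS);
    }

    void onLocked(String key, long ttlMillis) {
        held.put(key, System.currentTimeMillis() + ttlMillis);
    }

    void onUnlocked(String key) {
        held.remove(key);
    }

    // Drops entries whose TTL has passed; the store has already released them.
    void evictExpired() {
        long now = System.currentTimeMillis();
        held.values().removeIf(expiresAt -> expiresAt <= now);
    }

    // Call during shutdown: explicitly unlock everything that has not expired.
    int releaseAll() {
        cleaner.shutdownNow();
        long now = System.currentTimeMillis();
        int released = 0;
        for (Map.Entry<String, Long> e : held.entrySet()) {
            if (e.getValue() > now) {
                unlocker.accept(e.getKey());
                released++;
            }
        }
        held.clear();
        return released;
    }
}
```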

3.6 Re‑entrant Lock

Scenario: The same thread invokes lock acquisition from multiple classes.

Solution: Store a thread‑local counter per lock key. On first acquisition, perform the real lock; subsequent calls increment the counter. Unlock decrements the counter and only performs the real unlock when the counter reaches zero.

Pros: Lightweight implementation.

Cons: In rare edge cases, identical keys used by different logical locks cause incorrect behavior.
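The thread-local counter can be sketched as follows. `ReentrantWrapper` is a hypothetical name, and the real lock and unlock operations are injected as functions so the example stays self-contained.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Hypothetical re-entrancy layer on top of a real (remote) lock.
class ReentrantWrapper {
    // Per-thread map of lock key -> current re-entry depth.
    private final ThreadLocal<Map<String, Integer>> counts =
            ThreadLocal.withInitial(HashMap::new);
    private final Predicate<String> realLock;
    private final Consumer<String> realUnlock;

    ReentrantWrapper(Predicate<String> realLock, Consumer<String> realUnlock) {
        this.realLock = realLock;
        this.realUnlock = realUnlock;
    }

    boolean lock(String key) {
        Map<String, Integer> mine = counts.get();
        Integer depth = mine.get(key);
        if (depth != null) {
            mine.put(key, depth + 1); // already held by this thread: just count
            return true;
        }
        if (!realLock.test(key)) {
            return false; // first acquisition failed
        }
        mine.put(key, 1);
        return true;
    }

    void unlock(String key) {
        Map<String, Integer> mine = counts.get();
        Integer depth = mine.get(key);
        if (depth == null) {
            return; // this thread does not hold the lock
        }
        if (depth > 1) {
            mine.put(key, depth - 1);
            return;
        }
        mine.remove(key);
        realUnlock.accept(key); // counter reached zero: perform the real unlock
    }
}
```

Because the counter is keyed only by the lock key, two unrelated logical locks that happen to share a key would be treated as one, which is the edge case noted above.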

Phase 3 – Monitoring

Bond added centralized logging and monitoring to aid troubleshooting. Relying solely on local logs proved insufficient, especially in Youzan’s SC environment.

Detailed Discussions

Reasonable Lock TTL

TTL balances two risks: a TTL that is too short releases the lock prematurely under network jitter or GC pauses, while one that is too long leaves orphaned locks in place after a server crash. Recommended defaults: for fast‑path services, set the TTL to 2 s; for longer jobs, use a conservative value equal to the observed maximum latency plus a 20 % safety margin.
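For the longer-job case, the arithmetic is simple enough to capture in a helper. `TtlAdvice` is a hypothetical name, and rounding the margin up to a whole millisecond is an assumption.

```java
// Hypothetical helper: observed maximum latency plus a 20 % safety margin.
class TtlAdvice {
    static long ttlMillisFor(long observedMaxLatencyMs) {
        long margin = (observedMaxLatencyMs * 20 + 99) / 100; // 20 %, rounded up
        return observedMaxLatencyMs + margin;
    }
}
```

For example, a job whose observed maximum latency is 5000 ms would get a TTL of 6000 ms.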

Lock Retry Count

Assuming a 100 ms RPC timeout with a 1 % failure rate, and assuming failures are independent, a single retry reduces the error rate to 0.01 % (1 % × 1 %) but doubles the worst‑case latency to 200 ms. Youzan’s default is one retry, adjustable per business tolerance.
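The trade-off generalizes to any retry count. The sketch below uses a hypothetical `RetryMath` helper and assumes that attempts fail independently and that every failed attempt consumes the full timeout.

```java
// Hypothetical helper for reasoning about retry counts.
class RetryMath {
    // Probability that the first call and all retries fail, assuming independence.
    static double residualFailureRate(double perCallFailureRate, int retries) {
        return Math.pow(perCallFailureRate, retries + 1);
    }

    // Worst case: every attempt runs into the timeout.
    static long worstCaseLatencyMs(long timeoutMs, int retries) {
        return timeoutMs * (retries + 1);
    }
}
```

With a 1 % per-call failure rate and a 100 ms timeout, one retry gives a residual rate of 0.0001 (0.01 %) and a 200 ms worst case; a second retry would lower the rate to 0.000001 at a 300 ms worst case.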

Pitfalls Encountered

Lock Management

Early versions isolated locks per application, causing difficulty in managing mixed blocking and non‑blocking lock workloads. Phase 2 introduced an API to register business‑key‑based lock scenarios, enabling finer‑grained capacity planning and fail‑over strategies, though it restricts a scenario to a single lock model.

Hotspot Locks

High contention on a single key can saturate Etcd or Aerospike. The recommended mitigation is client‑side local competition (see 3.4) or inserting a lightweight middle‑layer to limit concurrent RPCs.
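One way to limit concurrent RPCs is a semaphore in front of the lock call. This is only a sketch of the idea, implemented in-process: `LockRpcLimiter` is a hypothetical name, and the article does not describe how the middle layer is actually built.

```java
import java.util.Optional;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical limiter that caps in-flight lock RPCs from one process.
class LockRpcLimiter {
    private final Semaphore permits;

    LockRpcLimiter(int maxConcurrentRpcs) {
        this.permits = new Semaphore(maxConcurrentRpcs);
    }

    // Runs the RPC if a permit is free; otherwise fails fast with an empty result.
    <T> Optional<T> call(Supplier<T> rpc) {
        if (!permits.tryAcquire()) {
            return Optional.empty();
        }
        try {
            return Optional.ofNullable(rpc.get());
        } finally {
            permits.release();
        }
    }
}
```

Requests that find no free permit return immediately, so a hot key produces at most `maxConcurrentRpcs` simultaneous calls to the storage backend from each process.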

Conclusion

Bond has become a core distributed‑lock solution within Youzan, continuously refined through real‑world scenarios. The team invites feedback and plans to open Bond to external developers via Youzan Cloud.
