How Bond Evolved into a Robust Distributed‑Lock Middleware for Scalable Services
This article chronicles the design, evolution, and performance evaluation of Bond, a distributed‑lock SDK used internally at Youzan, covering its multi‑phase roadmap, storage choices, timeout‑retry strategies, lock‑TTL recommendations, and practical pitfalls for reliable operations.
Background
Initially, each business unit at Youzan built its own simple distributed lock on shared infrastructure such as Codis, Zookeeper, or custom Redis clusters. The result was duplicated effort, inconsistent handling of details, uncertainty about which storage to choose, little sharing of scenario‑specific solutions, and incomplete monitoring.
To address these issues, a standardized distributed‑lock SDK named Bond was created, evolving through five distinct phases.
Phase Overview
Phase 1 – Availability: Provide a usable basic lock component.
Phase 2 – Solution: Offer a suite of scenario‑specific solutions, elevating the product from a simple lock to a full solution.
Phase 3 – Productization: Add middleware monitoring, logging, and lock‑trace capabilities.
Phase 4 – Internal Consolidation: Automate ticket integration and feedback loops.
Phase 5 – External Offering: Expose the component as a reusable service for Youzan Cloud.
Phase 1 Solution
The first phase focused on strong consistency (CP in the CAP model) and evaluated Zookeeper versus Etcd based on performance and lock‑support features.
Performance Comparison
Benchmarks measured CPU usage, memory consumption, write‑node speed, and latency for both candidates (images omitted for brevity).
Lock‑Support Evaluation
Zookeeper implements locks with ephemeral sequential nodes and client‑side logic, so each lock acquisition takes multiple network requests, and network jitter can cause the node to be released prematurely.
Etcd uses ordered nodes with server‑side handling, exposing a simple Lock command to the client.
Conclusion
Etcd was selected as the primary dependency for Phase 1, deployed across three data centers to ensure fault tolerance. However, the team noted that Etcd could become a bottleneck under heavy load, requiring thorough stress testing.
Phase 2 Solution
After operating Phase 1, performance requirements grew, and many business scenarios tolerated occasional inconsistency. Consequently, Bond switched its storage backend to Aerospike, leveraging its replica capabilities and existing internal deployment experience.
The codebase adopted the Adapter pattern to simplify storage swapping, illustrated by an architecture diagram (image omitted).
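A minimal sketch of how such an adapter boundary might look. The article does not show Bond's real interfaces, so `LockStore`, `InMemoryLockStore`, and `BondLockClient` are hypothetical names, and the in-memory store only stands in for a real Etcd or Aerospike adapter.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical storage-facing interface; Bond's actual interface is not shown in the article.
interface LockStore {
    /** Tries to create the key; returns true only if the caller now owns it. */
    boolean tryAcquire(String key, String owner, long ttlMillis);

    /** Deletes the key only if it is still owned by the given owner. */
    boolean release(String key, String owner);
}

// In-memory stand-in for a real adapter (Etcd, Aerospike, ...), for illustration only.
class InMemoryLockStore implements LockStore {
    private static final class Entry {
        final String owner;
        final long expiresAtMillis;

        Entry(String owner, long expiresAtMillis) {
            this.owner = owner;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final ConcurrentMap<String, Entry> entries = new ConcurrentHashMap<>();

    @Override
    public boolean tryAcquire(String key, String owner, long ttlMillis) {
        long now = System.currentTimeMillis();
        Entry fresh = new Entry(owner, now + ttlMillis);
        // compute() is atomic per key: keep a live entry, replace an absent or expired one.
        Entry result = entries.compute(key,
                (k, old) -> (old == null || old.expiresAtMillis <= now) ? fresh : old);
        return result == fresh;
    }

    @Override
    public boolean release(String key, String owner) {
        Entry current = entries.get(key);
        return current != null && current.owner.equals(owner) && entries.remove(key, current);
    }
}

// The SDK-facing class depends only on the interface, so the backend can be swapped.
class BondLockClient {
    private final LockStore store;

    BondLockClient(LockStore store) {
        this.store = store;
    }

    boolean lock(String key, String owner, long ttlMillis) {
        return store.tryAcquire(key, owner, ttlMillis);
    }

    boolean unlock(String key, String owner) {
        return store.release(key, owner);
    }
}
```

Under this design, moving from one backend to another would mean writing a new `LockStore` implementation while callers of `BondLockClient` stay unchanged.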
Bond also began providing higher‑level distributed‑lock solutions rather than just basic lock primitives.
Key Scenario Implementations (Phase 2 & 3)
3.1 Timeout‑Retry Re‑entry
Scenario: A non‑blocking lock request times out, leaving uncertainty about lock acquisition.
Solution: Use a composite of node IP, random number, and timestamp as the lock value. On a SocketTimeoutException, read the stored value to determine ownership, then decide whether to retry the lock request or fail.
Pros: Increases the chance of successful lock acquisition.
Cons: Adds extra latency when a timeout occurs.
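The timeout‑retry flow can be sketched roughly as follows. `TimeoutProneKv` is a hypothetical minimal client interface, not the API of Bond or of any real store, and the exact retry policy (retry only when the key is absent or ownership is unknown) is an assumption.

```java
import java.net.SocketTimeoutException;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical minimal KV client; both calls may time out.
interface TimeoutProneKv {
    boolean putIfAbsent(String key, String value, long ttlMillis) throws SocketTimeoutException;

    String get(String key) throws SocketTimeoutException;
}

class TimeoutRetryLock {
    private final TimeoutProneKv kv;
    private final String nodeIp;

    TimeoutRetryLock(TimeoutProneKv kv, String nodeIp) {
        this.kv = kv;
        this.nodeIp = nodeIp;
    }

    /** Composite lock value: node IP, random number, timestamp. */
    String newLockValue() {
        return nodeIp + ":" + ThreadLocalRandom.current().nextLong() + ":" + System.currentTimeMillis();
    }

    /** Returns true if the caller holds the lock under lockValue after at most maxRetries retries. */
    boolean tryLock(String key, String lockValue, long ttlMillis, int maxRetries) {
        boolean sawTimeout = false;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                if (kv.putIfAbsent(key, lockValue, ttlMillis)) {
                    return true;
                }
                // Rejected: final, unless an earlier timed-out write may have landed.
                return sawTimeout && lockValue.equals(kv.get(key));
            } catch (SocketTimeoutException timedOut) {
                sawTimeout = true;
            }
            try {
                String stored = kv.get(key);
                if (stored != null) {
                    // Our value means the write landed; any other value means someone else owns it.
                    return lockValue.equals(stored);
                }
                // Key absent: the write did not land, so loop and retry.
            } catch (SocketTimeoutException readTimedOut) {
                // Ownership still unknown: loop and retry.
            }
        }
        return false;
    }
}
```

Because the lock value is unique per request, reading it back is enough to tell "my write landed" apart from "another node holds the lock".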
3.2 Unlock Thread‑Isolation
Scenario: Threads A and B may attempt to unlock each other’s locks.
Solution: Record the request timestamp T before sending the lock request. Allow the unlock only while the current time ≤ (T + leaseTime − 150 ms), since after the lease expires the key may already belong to another holder. Outside this window, either do nothing or raise an alert.
Pros: Prevents cross‑thread unlocks and provides alerts for edge cases.
Cons: In extreme cases, up to two threads may contend concurrently, requiring manual handling.
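The window check itself is small. A sketch, with `UnlockWindow` and `mayUnlock` as hypothetical names and the 150 ms margin taken from the text:

```java
class UnlockWindow {
    static final long SAFETY_MARGIN_MILLIS = 150; // margin quoted in the text

    /** True while nowMillis <= requestMillis + leaseMillis - 150 ms. */
    static boolean mayUnlock(long requestMillis, long leaseMillis, long nowMillis) {
        return nowMillis <= requestMillis + leaseMillis - SAFETY_MARGIN_MILLIS;
    }
}
```

A caller would store `System.currentTimeMillis()` as `requestMillis` just before sending the lock request, and skip the remote unlock (or raise an alert) when `mayUnlock` returns false.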
3.3 KV‑Based Blocking Lock
Scenario: Implement a non‑fair blocking lock on a pure KV store (e.g., Redis, KVDS).
Solution: Wrap lock acquisition in a Future with a configurable waitTime. On failure, sleep for half the average business latency and retry once. Recommended waitTime values: 500 ms for server‑side issues, 30‑50 ms for typical user requests (<3 ms average).
Pros: Simple, no extra middleware required.
Cons: Unsuitable for high‑contention scenarios.
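One way to read this design is sketched below. The `tryOnce` supplier stands in for a single non‑blocking KV attempt (for example a set‑if‑absent call) and is an assumption, as are the class and method names. The sketch keeps retrying until `waitTimeMillis` elapses; a stricter single‑retry policy would replace the loop with one extra attempt.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

class KvBlockingLock {
    // Daemon threads so a stuck attempt never keeps the JVM alive.
    private final ExecutorService pool = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "kv-blocking-lock");
        t.setDaemon(true);
        return t;
    });

    /** Blocks for at most waitTimeMillis; returns true if tryOnce succeeded in time. */
    boolean lock(BooleanSupplier tryOnce, long waitTimeMillis, long avgBusinessLatencyMillis) {
        long pauseMillis = Math.max(1, avgBusinessLatencyMillis / 2);
        Future<Boolean> attempt = pool.submit(() -> {
            while (!tryOnce.getAsBoolean()) {
                Thread.sleep(pauseMillis); // half the average business latency between attempts
            }
            return Boolean.TRUE;
        });
        try {
            return attempt.get(waitTimeMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // cancel(true) interrupts the sleep. An attempt already in flight may still
            // succeed, in which case the lock is only freed by its TTL.
            attempt.cancel(true);
            return false;
        } catch (ExecutionException e) {
            return false;
        } catch (InterruptedException e) {
            attempt.cancel(true);
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Since every waiter polls the store, the number of requests grows with the number of contenders, which is why this approach fits low‑contention keys only.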
3.4 Local Competition (Fast‑Fail)
Scenario: High lock contention on a single JVM (hot key).
Solution: Maintain a local cache mapping lock keys to thread IDs. Use putIfAbsent to acquire; if the key already maps to the current thread, treat as re‑entry; otherwise, abort. On unlock, verify ownership before removal. Cache entries expire based on the lock’s TTL.
Pros: Immediate fast‑fail without network latency.
Cons: Relies on cache reliability; edge‑case states may become stale.
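A minimal sketch of the local gate, assuming a plain `ConcurrentHashMap` as the cache and lazy expiry on access; `LocalLockGate`, `tryEnter`, and `exit` are hypothetical names.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

class LocalLockGate {
    private static final class Holder {
        final long threadId;
        final long expiresAtMillis;

        Holder(long threadId, long expiresAtMillis) {
            this.threadId = threadId;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final ConcurrentMap<String, Holder> holders = new ConcurrentHashMap<>();

    /** Returns true if the current thread may go on to the remote lock call. */
    boolean tryEnter(String key, long ttlMillis) {
        long now = System.currentTimeMillis();
        long me = Thread.currentThread().getId();
        Holder mine = new Holder(me, now + ttlMillis);
        Holder existing = holders.putIfAbsent(key, mine);
        if (existing == null) {
            return true;                                  // first local contender
        }
        if (existing.threadId == me) {
            return true;                                  // re-entry by the same thread
        }
        if (existing.expiresAtMillis <= now) {
            return holders.replace(key, existing, mine);  // expired entry: take over atomically
        }
        return false;                                     // fast-fail without any network call
    }

    /** Removes the entry only if the current thread owns it. */
    boolean exit(String key) {
        Holder existing = holders.get(key);
        return existing != null
                && existing.threadId == Thread.currentThread().getId()
                && holders.remove(key, existing);
    }
}
```

Only the thread that passes `tryEnter` would issue the remote lock request, so a hot key produces at most one in‑flight RPC per JVM.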
3.5 Graceful Shutdown
Scenario: Locks may remain held when an application restarts, causing long TTL‑driven dead time.
Solution: Keep an in‑memory map of key→expiredTime. On successful lock, insert into the map; on unlock, remove it. A background thread periodically cleans expired entries. During Spring container shutdown, iterate the map and explicitly unlock any non‑expired locks.
Constraints: Only works for synchronous unlocks; unsuitable for asynchronous cross‑instance unlocks. Recommended for locks with TTL > 15 s.
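The bookkeeping described above might look like the sketch below. It is plain Java with no Spring dependency: `releaseAll()` is meant to be called from whatever shutdown callback the container offers, and `remoteUnlock` is a hypothetical callback that performs the real unlock. All names are assumptions.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

class HeldLockRegistry {
    private final Map<String, Long> expiryByKey = new ConcurrentHashMap<>();
    private final Consumer<String> remoteUnlock; // hypothetical: performs the real unlock
    private final ScheduledExecutorService cleaner =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "held-lock-cleaner");
                t.setDaemon(true);
                return t;
            });

    HeldLockRegistry(Consumer<String> remoteUnlock, long cleanIntervalMillis) {
        this.remoteUnlock = remoteUnlock;
        cleaner.scheduleWithFixedDelay(this::purgeExpired,
                cleanIntervalMillis, cleanIntervalMillis, TimeUnit.MILLISECONDS);
    }

    /** Call after a successful lock. */
    void onLocked(String key, long ttlMillis) {
        expiryByKey.put(key, System.currentTimeMillis() + ttlMillis);
    }

    /** Call after a successful unlock. */
    void onUnlocked(String key) {
        expiryByKey.remove(key);
    }

    /** Drops entries whose TTL has already passed; the store has released them anyway. */
    void purgeExpired() {
        long now = System.currentTimeMillis();
        expiryByKey.values().removeIf(expiry -> expiry <= now);
    }

    /** Call on shutdown: unlocks every lock that has not yet expired; returns how many. */
    int releaseAll() {
        cleaner.shutdownNow();
        long now = System.currentTimeMillis();
        int released = 0;
        for (Map.Entry<String, Long> e : expiryByKey.entrySet()) {
            if (e.getValue() > now) {
                remoteUnlock.accept(e.getKey());
                released++;
            }
        }
        expiryByKey.clear();
        return released;
    }
}
```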
3.6 Re‑entrant Lock
Scenario: The same thread invokes lock acquisition from multiple classes.
Solution: Store a thread‑local counter per lock key. On first acquisition, perform the real lock; subsequent calls increment the counter. Unlock decrements the counter and only performs the real unlock when the counter reaches zero.
Pros: Lightweight implementation.
Cons: In rare edge cases, different logical locks that use an identical key are counted as one lock, which causes incorrect behavior.
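A sketch of the per‑thread counter, assuming the real lock and unlock calls are supplied as callbacks; `remoteLock`, `remoteUnlock`, and the class name are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Predicate;

class ReentrantLockCounter {
    // One map per thread: lock key -> current re-entry depth.
    private final ThreadLocal<Map<String, Integer>> holds = ThreadLocal.withInitial(HashMap::new);
    private final Predicate<String> remoteLock;   // hypothetical: performs the real lock
    private final Consumer<String> remoteUnlock;  // hypothetical: performs the real unlock

    ReentrantLockCounter(Predicate<String> remoteLock, Consumer<String> remoteUnlock) {
        this.remoteLock = remoteLock;
        this.remoteUnlock = remoteUnlock;
    }

    boolean lock(String key) {
        Map<String, Integer> counts = holds.get();
        Integer depth = counts.get(key);
        if (depth != null) {
            counts.put(key, depth + 1);   // already held by this thread: just count
            return true;
        }
        if (!remoteLock.test(key)) {
            return false;                 // real lock failed, nothing recorded
        }
        counts.put(key, 1);
        return true;
    }

    void unlock(String key) {
        Map<String, Integer> counts = holds.get();
        Integer depth = counts.get(key);
        if (depth == null) {
            return;                       // not held by this thread
        }
        if (depth > 1) {
            counts.put(key, depth - 1);
            return;
        }
        counts.remove(key);
        remoteUnlock.accept(key);         // depth reached zero: real unlock
    }
}
```

The counter is keyed by the lock key alone, which is exactly why two logical locks sharing one key would be merged.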
Phase 3 – Monitoring
Bond added centralized logging and monitoring to aid troubleshooting. Relying solely on local logs proved insufficient, especially in Youzan’s SC environment.
Detailed Discussions
Reasonable Lock TTL
TTL balances two risks: too short leads to premature release under network jitter or GC pauses; too long leaves orphaned locks after server crashes. Recommended defaults: for fast‑path services, set TTL to 2 s; for longer jobs, use a conservative value based on observed max latency plus a 20 % safety margin.
Lock Retry Count
Assuming a 100 ms RPC timeout with a 1 % failure rate, and failures that are independent between attempts, a single retry reduces the error rate to 0.01 % (1 % × 1 %) but doubles the worst‑case latency to 200 ms. Youzan’s default is one retry, adjustable per business tolerance.
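The trade‑off generalizes to any retry count. In the worked form below, the symbols $p$ (per‑attempt failure probability), $t$ (RPC timeout), and $r$ (number of retries) are introduced here for illustration, and attempts are assumed to fail independently.

```latex
P_{\text{fail}} = p^{\,r+1}, \qquad T_{\text{worst}} = (r+1)\,t

% With p = 0.01, t = 100\,\text{ms}, r = 1:
P_{\text{fail}} = 0.01^{2} = 10^{-4} = 0.01\,\%, \qquad T_{\text{worst}} = 2 \times 100\,\text{ms} = 200\,\text{ms}
```

Each extra retry multiplies the failure probability by $p$ but adds a full timeout to the worst case, so the gain from a second retry is usually small compared with its latency cost.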
Pitfalls Encountered
Lock Management
Early versions isolated locks per application, causing difficulty in managing mixed blocking and non‑blocking lock workloads. Phase 2 introduced an API to register business‑key‑based lock scenarios, enabling finer‑grained capacity planning and fail‑over strategies, though it restricts a scenario to a single lock model.
Hotspot Locks
High contention on a single key can saturate Etcd or Aerospike. The recommended mitigation is client‑side local competition (see 3.4) or inserting a lightweight middle‑layer to limit concurrent RPCs.
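As one possible shape for such a middle layer, the sketch below caps concurrent remote lock calls per key with a `Semaphore`. This is an illustrative assumption, not Bond's implementation; `HotKeyLimiter` and `remoteTryLock` are hypothetical names.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.Semaphore;
import java.util.function.BooleanSupplier;

class HotKeyLimiter {
    // Note: this map grows with the number of distinct keys; a real layer would evict idle entries.
    private final ConcurrentMap<String, Semaphore> permitsByKey = new ConcurrentHashMap<>();
    private final int maxConcurrentPerKey;

    HotKeyLimiter(int maxConcurrentPerKey) {
        this.maxConcurrentPerKey = maxConcurrentPerKey;
    }

    /** Runs remoteTryLock only if a permit is free; otherwise fails fast without an RPC. */
    boolean tryLock(String key, BooleanSupplier remoteTryLock) {
        Semaphore permits = permitsByKey.computeIfAbsent(key, k -> new Semaphore(maxConcurrentPerKey));
        if (!permits.tryAcquire()) {
            return false;
        }
        try {
            return remoteTryLock.getAsBoolean();
        } finally {
            permits.release();
        }
    }
}
```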
Conclusion
Bond has become a core distributed‑lock solution within Youzan, continuously refined through real‑world scenarios. The team invites feedback and plans to open Bond to external developers via Youzan Cloud.
Youzan Coder
Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.
