
Async Mutex: Eliminating Blocking in High‑Performance Concurrent Java Programs

This article analyses the performance challenges of high‑concurrency Java applications, explains how misuse of atomic operations and blocking degrade throughput, introduces an asynchronous monitor concept and a concrete AsyncMutex implementation, and presents experimental results showing its scalability advantages over traditional ReentrantLock‑based locking.

Qunar Tech Salon

As internet services grow rapidly, concurrency is ubiquitous, and writing high‑performance concurrent programs has become a key skill for developers. Yet many programs suffer severe performance loss from improper use of atomic operations and concurrent data structures and, above all, from blocking.

The author first points out two main sources of performance degradation: over‑use of AtomicXXX and ConcurrentXXX classes without evaluating their impact, and blocking operations that force threads to wait.

Two typical blocking scenarios are illustrated: (1) merging results of multiple unrelated asynchronous calls, analogous to assembling a computer from parts; (2) handling shared critical sections, exemplified by a bank transfer where concurrent modifications must be avoided.
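The first scenario can be illustrated with a minimal sketch. It assumes the standard CompletableFuture API, which the article does not name, and the part suppliers fetchCpu and fetchMemory are hypothetical stand-ins for independent remote calls. The blocking variant parks the calling thread until both parts arrive, whereas the non-blocking variant only registers a continuation.

```java
import java.util.concurrent.CompletableFuture;

final class AssembleComputerSketch {
    // Hypothetical suppliers: each stands in for an unrelated asynchronous call.
    static CompletableFuture<String> fetchCpu() {
        return CompletableFuture.supplyAsync(() -> "cpu");
    }

    static CompletableFuture<String> fetchMemory() {
        return CompletableFuture.supplyAsync(() -> "memory");
    }

    // Blocking merge: the calling thread waits inside join() for each part.
    static String assembleBlocking() {
        CompletableFuture<String> cpu = fetchCpu();
        CompletableFuture<String> memory = fetchMemory();
        return cpu.join() + "+" + memory.join();
    }

    // Non-blocking merge: thenCombine registers the merge step and returns at once;
    // the merge runs when both parts have completed.
    static CompletableFuture<String> assembleAsync() {
        return fetchCpu().thenCombine(fetchMemory(), (cpu, memory) -> cpu + "+" + memory);
    }

    public static void main(String[] args) {
        // join() here only keeps the demo JVM alive until the result is printed.
        assembleAsync().thenAccept(System.out::println).join();
    }
}
```

The second scenario, a shared critical section, has no such ready-made combinator, which is the gap the rest of the article addresses.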

A simple mathematical model shows that, given t threads, g groups, c tasks per group, and each task taking x milliseconds, the theoretical execution time is ((g * c * x) / t) ms, but real measurements with Java ReentrantLock reveal far larger times due to blocking.
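As a worked instance of the model, plugging in the configuration used in the article's experiments (t = 100, g = 100, c = 100, x = 10 ms) gives:

```latex
T_{\text{ideal}} = \frac{g \cdot c \cdot x}{t}
                 = \frac{100 \cdot 100 \cdot 10\ \text{ms}}{100}
                 = 1000\ \text{ms} = 1\ \text{s}
```

This value is a lower bound that assumes every thread is busy with useful work the whole time; any time a thread spends waiting on a lock pushes the measured figure above it.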

To address these issues, the article revisits the classic “event loop” idea and presents an asynchronous monitor (AsyncMonitor) implementation in Java, derived from Herb Sutter’s C++ example. The core idea is to replace a mutex with a message queue, allowing the critical section to be executed asynchronously while the caller proceeds without blocking.

public final class AsyncMonitor<T> { ... }
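The article's AsyncMonitor body is elided above. To make the idea concrete, here is a minimal sketch of the same principle, not the author's code: the class name AsyncMonitorSketch, the submit method, and the use of one single-thread executor per monitor as the message queue are all assumptions made for illustration.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

final class AsyncMonitorSketch<T> implements AutoCloseable {
    private final T state; // touched only by the worker thread, so no lock is needed
    // Assumption: one worker thread per monitor; its task queue plays the role of the mutex.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    AsyncMonitorSketch(T state) {
        this.state = state;
    }

    // Enqueues the critical section and returns immediately; the caller never blocks.
    <R> CompletableFuture<R> submit(Function<? super T, ? extends R> action) {
        return CompletableFuture.supplyAsync(() -> action.apply(state), worker);
    }

    @Override
    public void close() {
        worker.shutdown(); // already queued actions still run
    }

    // Usage: two updates and one read against a shared balance, executed in FIFO order.
    static int demo() {
        try (AsyncMonitorSketch<int[]> balance = new AsyncMonitorSketch<>(new int[] {100})) {
            balance.submit(b -> b[0] -= 30);
            balance.submit(b -> b[0] += 5);
            return balance.submit(b -> b[0]).join(); // join() only to read the demo result
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because the single worker processes actions in submission order, demo() returns 75. The sketch also makes the cost visible: every monitor instance owns a thread.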

Although the AsyncMonitor works, it has two drawbacks: (1) creating many threads for each monitor can add overhead; (2) the API does not support post‑critical‑section logic such as logging or monitoring.

Consequently, the author proposes the “Async Mutex” concept, a Java class that provides a non‑blocking lock‑like interface while allowing users to specify both the critical‑section work (a Callable) and the follow‑up action (a Consumer). The API exposes a single attach method:

public <T> void attach(Executor executor, Callable<T> callable, Consumer<? super T> callback)

The implementation uses a lock‑free linked list of InvocationNode objects, an AtomicReferenceFieldUpdater to manage the forward list, and an AtomicInteger to track pending tasks. When the first task is attached, it becomes the head of the list and is immediately invoked; subsequent tasks are enqueued and processed in order without blocking the caller.

public final class AsyncMutex { ... }
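Since the AsyncMutex body is elided above, the following is a simplified sketch of the attach contract rather than the author's implementation. It swaps in a ConcurrentLinkedQueue for the custom linked list and keeps only the AtomicInteger pending counter; the class name AsyncMutexSketch, the choice to run the callback outside the critical section, and the handling of exceptions are assumptions.

```java
import java.util.Queue;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

final class AsyncMutexSketch {
    private final Queue<Runnable> queue = new ConcurrentLinkedQueue<>();
    private final AtomicInteger pending = new AtomicInteger(); // tasks attached but not yet finished

    public <T> void attach(Executor executor, Callable<T> callable, Consumer<? super T> callback) {
        queue.add(() -> {
            T result;
            try {
                result = callable.call(); // critical section: at most one runs at a time
            } catch (Exception e) {
                e.printStackTrace();      // assumption: a failed task skips its callback
                return;
            }
            // Assumption: the follow-up runs outside the critical section.
            executor.execute(() -> callback.accept(result));
        });
        // A 0 -> 1 transition means nobody is draining, so this caller schedules the drain.
        if (pending.getAndIncrement() == 0) {
            executor.execute(this::drain);
        }
    }

    // Runs queued critical sections one by one until the pending count drops to zero.
    private void drain() {
        do {
            queue.poll().run(); // non-null: every increment is preceded by an add
        } while (pending.decrementAndGet() > 0);
    }

    // Usage: 1000 deposits from a 4-thread pool into a plain, unsynchronized counter.
    static int demo() {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        AsyncMutexSketch mutex = new AsyncMutexSketch();
        int[] balance = {0};
        int tasks = 1000;
        CountDownLatch done = new CountDownLatch(tasks);
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> mutex.attach(pool, () -> ++balance[0], r -> done.countDown()));
        }
        try {
            done.await(); // the demo waits only to read the final value
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        pool.shutdown();
        return balance[0];
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

No thread ever waits for the mutex: a caller that finds it busy simply leaves its task in the queue and returns. In this simplified version all critical sections of one drain pass run on the executor of the caller that started the pass.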

Performance experiments compare AsyncMutex with the traditional ReentrantLock under identical workloads. For a configuration of t=100, g=100, c=100, x=10 ms, the theoretical time is 1 s. The ReentrantLock version takes about 34.5 s, while AsyncMutex completes in roughly 1.09 s, which is about 32 times faster (34.5 / 1.09 ≈ 31.7) and within 10% of the theoretical bound.

Further tests varying the number of groups show that AsyncMutex consistently outperforms ReentrantLock in scalability, especially when many tasks share a critical section. The author also notes that using ConcurrentLinkedQueue introduced unnecessary overhead, and replacing it with a custom lock‑free list using AtomicReferenceFieldUpdater improved task‑addition performance by about 50%.
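To illustrate what a hand-rolled list driven by AtomicReferenceFieldUpdater can look like, here is one possible shape, offered as an assumption and not as the author's data structure. The names LinkedTaskChainSketch, Node, and enqueue are hypothetical. Producers perform a single atomic swap on the tail and never wait; the one draining thread may yield briefly if a producer has swapped the tail but not yet linked its node.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicReferenceFieldUpdater;

final class LinkedTaskChainSketch {
    private static final class Node {
        final Runnable task;
        volatile Node next; // set by the producer that enqueues the successor
        Node(Runnable task) { this.task = task; }
    }

    // The updater avoids allocating a separate AtomicReference object per chain.
    private static final AtomicReferenceFieldUpdater<LinkedTaskChainSketch, Node> TAIL =
        AtomicReferenceFieldUpdater.newUpdater(LinkedTaskChainSketch.class, Node.class, "tail");

    private volatile Node tail; // last enqueued node; null while the chain is idle

    void enqueue(Executor executor, Runnable task) {
        Node node = new Node(task);
        Node prev = TAIL.getAndSet(this, node);   // one atomic swap per enqueue
        if (prev == null) {
            executor.execute(() -> drain(node));  // chain was idle: start draining at this node
        } else {
            prev.next = node;                     // chain is busy: link behind the previous tail
        }
    }

    // Assumption: tasks do not throw; an exception here would stall the chain.
    private void drain(Node node) {
        for (;;) {
            node.task.run();
            Node next = node.next;
            if (next == null) {
                // No successor visible: try to mark the chain idle and stop.
                if (TAIL.compareAndSet(this, node, null)) return;
                // A producer swapped the tail but has not linked yet: wait for the link.
                while ((next = node.next) == null) Thread.yield();
            }
            node = next;
        }
    }

    // Usage: 1000 increments of a plain counter from a 4-thread pool.
    static int demo() {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        LinkedTaskChainSketch chain = new LinkedTaskChainSketch();
        int[] counter = {0};
        int tasks = 1000;
        CountDownLatch done = new CountDownLatch(tasks);
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> chain.enqueue(pool, () -> { counter[0]++; done.countDown(); }));
        }
        try {
            done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        pool.shutdown();
        return counter[0];
    }
}
```

Compared with a general-purpose queue, each enqueue here costs one node allocation and one atomic swap, which is the kind of saving the article attributes to its custom list; the figure of about 50% is the author's measurement and is not reproduced by this sketch.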

Finally, the article discusses the applicability of the async‑mutex idea beyond shared‑memory environments. In distributed systems (e.g., Redis‑backed services), similar blocking problems arise when using distributed locks; the author suggests that the async‑mutex principle could be adapted, although implementation details would differ due to the lack of CAS primitives and reliance on Lua scripts for atomicity.

The test environment is a Windows 7 x64 machine with an Intel i7‑4700HQ CPU, 12 GB RAM, and Java 1.8.0_144 runtime.

Tags: Java, Performance, concurrency, lock-free, blocking, async mutex
Written by

Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
