How to Build an Ultra‑Fast Ring Buffer for Producer‑Consumer in Java

This article explains a high‑performance ring buffer implementation for a multi‑threaded producer‑consumer model in Java, covering design choices, atomic index handling, benchmark results, and further optimizations such as cache‑line padding and multi‑buffer sharding.

Xiao Lou's Tech Notes

Background

In a multi‑threaded producer‑consumer model, the requirements are:

Very high performance for producers delivering data.

Multiple producers and one or more consumers.

When the consumer cannot keep up, a small amount of data loss is tolerable.

Producers generate data one item at a time.

For example, in log collection, many threads produce log entries faster than the consumer can drain them. Dropping some entries is acceptable as long as logging adds minimal overhead to the producing threads, which calls for an ultra‑fast buffering queue.

Implementation Details

Multiple producers submit messages to a shared buffer. Skipping thread safety entirely would be fastest, but concurrent producers would then overwrite each other's data even when the consumer is keeping up. Data loss is acceptable only when the consumer lags, so index allocation must be thread‑safe and the buffer must be bounded. When a bounded buffer is full, the typical strategies are:

Block until consumption.

Overwrite old data.

Discard new data.

Overwriting is usually chosen because it keeps producer overhead lowest: the producer never waits for the consumer.
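The three strategies can be sketched in a few lines. This is only an illustration: the class `FullBufferPolicies` and its method names are hypothetical, and `ArrayBlockingQueue` stands in for a generic bounded buffer rather than for the implementation developed below.

```java
import java.util.concurrent.ArrayBlockingQueue;

// Hypothetical helper class, used only to contrast the three full-buffer strategies.
final class FullBufferPolicies {

    // Block until consumption: put() waits until the consumer frees a slot.
    static void block(ArrayBlockingQueue<String> queue, String item) throws InterruptedException {
        queue.put(item);
    }

    // Discard new data: offer() returns false immediately when the queue is full.
    static boolean discardNew(ArrayBlockingQueue<String> queue, String item) {
        return queue.offer(item);
    }

    // Overwrite old data: the slot is reassigned without checking whether it was consumed.
    // Returns the value that was lost, or null if the slot was empty.
    // Assumes position is non-negative.
    static String overwrite(String[] slots, int position, String item) {
        int i = position % slots.length;
        String lost = slots[i];
        slots[i] = item;
        return lost;
    }
}
```

Only the overwrite variant does no waiting and no fullness check on the producer side, which is the property the rest of the article builds on.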

Ring Buffer

A ring buffer (circular queue) backed by an array provides both the bound and the overwrite strategy: the array is allocated once as contiguous memory, and writes wrap around to the start when they reach the end. The only part that must be thread‑safe is how a producer acquires its index; with that in place, the pre‑allocated array yields excellent performance.

Ring Buffer Diagram

AtomicInteger

To obtain a thread‑safe index in the ring buffer, AtomicInteger is used, leading to the following helper class:

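import java.util.concurrent.atomic.AtomicInteger;

// An atomic counter whose returned values stay within [startValue, endValue], both inclusive.
public class AtomicRangeInteger extends Number {
    private final AtomicInteger value;
    private final int startValue;
    private final int endValue;

    public AtomicRangeInteger(int startValue, int endValue) {
        this.startValue = startValue;
        this.endValue = endValue;
        this.value = new AtomicInteger(startValue);
    }

    // Atomically advances the counter and returns the new value, wrapping to startValue
    // once endValue has been passed.
    public final int incrementAndGet() {
        int next;
        do {
            next = value.incrementAndGet();
            // Past the end: the thread whose CAS succeeds resets the counter and takes startValue.
            if (next > endValue && value.compareAndSet(next, startValue)) {
                return startValue;
            }
            // Past the end but the CAS failed: another thread moved the counter, so retry.
        } while (next > endValue);
        return next;
    }

    public final int get() { return value.get(); }
    @Override public int intValue() { return value.intValue(); }
    @Override public long longValue() { return value.longValue(); }
    @Override public float floatValue() { return value.floatValue(); }
    @Override public double doubleValue() { return value.doubleValue(); }
}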

The core ring buffer implementation is:

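public final class RingBuffer<T> {
    private final int bufferSize;
    private final AtomicRangeInteger index;
    private final T[] buffer;

    @SuppressWarnings("unchecked")
    public RingBuffer(int bufferSize) {
        this.bufferSize = bufferSize;
        // The range is inclusive, so it must end at bufferSize - 1;
        // ending at bufferSize would eventually index one slot past the array.
        this.index = new AtomicRangeInteger(0, bufferSize - 1);
        this.buffer = (T[]) new Object[bufferSize];
    }

    // Producer side: claim an index atomically, then overwrite whatever is in that slot.
    public final void offer(final T data) {
        buffer[index.incrementAndGet()] = data;
    }

    // Consumer side: take the item at the given slot (null if empty) and clear the slot.
    // The read and the clear are not atomic; an item written in between is lost,
    // which this design tolerates.
    public final T poll(int index) {
        T tmp = buffer[index];
        buffer[index] = null;
        return tmp;
    }

    public int getBufferSize() { return bufferSize; }
}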
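A usage sketch may help show how producers and a consumer interact with this design. The classes below are compact restatements of the two classes above so that the example compiles on its own; the names `DemoRangeInteger`, `DemoRingBuffer`, `RingBufferUsage` and `produceThenDrain` are hypothetical, and the index range is assumed to end at bufferSize - 1.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Compact restatement of the ranged counter so this sketch is self-contained.
final class DemoRangeInteger {
    private final AtomicInteger value;
    private final int startValue;
    private final int endValue;

    DemoRangeInteger(int startValue, int endValue) {
        this.startValue = startValue;
        this.endValue = endValue;
        this.value = new AtomicInteger(startValue);
    }

    int incrementAndGet() {
        int next;
        do {
            next = value.incrementAndGet();
            if (next > endValue && value.compareAndSet(next, startValue)) {
                return startValue;
            }
        } while (next > endValue);
        return next;
    }
}

final class DemoRingBuffer<T> {
    private final T[] buffer;
    private final DemoRangeInteger index;

    @SuppressWarnings("unchecked")
    DemoRingBuffer(int bufferSize) {
        this.buffer = (T[]) new Object[bufferSize];
        // Inclusive upper bound, so the largest index handed out is bufferSize - 1.
        this.index = new DemoRangeInteger(0, bufferSize - 1);
    }

    void offer(T data) { buffer[index.incrementAndGet()] = data; }

    T poll(int i) {
        T tmp = buffer[i];
        buffer[i] = null;
        return tmp;
    }

    int size() { return buffer.length; }
}

// Hypothetical driver: several producers write, then one consumer sweeps every slot.
final class RingBufferUsage {
    static int produceThenDrain(int producers, int itemsPerProducer, int bufferSize) {
        DemoRingBuffer<String> ring = new DemoRingBuffer<>(bufferSize);
        Thread[] threads = new Thread[producers];
        for (int p = 0; p < producers; p++) {
            final int id = p;
            threads[p] = new Thread(() -> {
                for (int n = 0; n < itemsPerProducer; n++) {
                    ring.offer("producer-" + id + "-item-" + n);
                }
            });
            threads[p].start();
        }
        try {
            for (Thread t : threads) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        int drained = 0;
        for (int i = 0; i < ring.size(); i++) {
            if (ring.poll(i) != null) drained++;
        }
        return drained; // never more than bufferSize: older items were overwritten
    }
}
```

With 4 producers offering 1000 items each into 16 slots, the sweep finds at most 16 items; everything else was overwritten, which is the accepted trade‑off when the consumer lags.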

The essential method for index acquisition is:

public final int incrementAndGet() {
    int next;
    do {
        next = value.incrementAndGet();
        if (next > endValue && value.compareAndSet(next, startValue)) {
            return startValue;
        }
    } while (next > endValue);
    return next;
}

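The producer atomically claims the next index via incrementAndGet. The slot at that index is not necessarily free; any unconsumed item in it is overwritten.

If the incremented value exceeds endValue, the thread tries to reset the counter to startValue with compareAndSet. The thread whose compareAndSet succeeds returns startValue; a thread whose compareAndSet fails because another thread changed the counter in the meantime loops and increments again.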
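To see the order in which indexes are handed out, the loop can be traced on a single thread. In this sketch the class `WrapTrace` and its methods are hypothetical; `next` is the same loop as incrementAndGet above, written as a static helper over a plain AtomicInteger.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical tracing helper, not part of the article's implementation.
final class WrapTrace {

    // Same logic as AtomicRangeInteger.incrementAndGet, parameterised for tracing.
    static int next(AtomicInteger value, int startValue, int endValue) {
        int next;
        do {
            next = value.incrementAndGet();
            if (next > endValue && value.compareAndSet(next, startValue)) {
                return startValue;
            }
        } while (next > endValue);
        return next;
    }

    // Collects the first `count` values handed out for the range [startValue, endValue].
    static List<Integer> trace(int startValue, int endValue, int count) {
        AtomicInteger value = new AtomicInteger(startValue);
        List<Integer> seen = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            seen.add(next(value, startValue, endValue));
        }
        return seen;
    }
}
```

On one thread, `WrapTrace.trace(0, 3, 8)` yields 1, 2, 3, 0, 1, 2, 3, 0. The first value handed out is startValue + 1, and endValue itself is handed out, so for an array of length n the range should run from 0 to n - 1.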

Why It’s Ultra‑Fast

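Disruptor, an open‑source ring buffer implementation, uses batch insertion and a compare‑and‑set strategy, blocks when the buffer is full, and requires the capacity to be a power of two. The implementation here claims indexes with incrementAndGet instead of a compareAndSet loop (compareAndSet remains only on the wrap‑around path), which gave roughly 2.6 times the throughput in the benchmark (≈40 M ops/s vs ≈15.5 M ops/s).

Benchmark results (testV0 appears to be the incrementAndGet version, since it is also the baseline in the later comparison; testV1 is then the CAS‑based one):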

Benchmark                     Mode  Cnt          Score   Error  Units
RingBufferBenchmark.testV0   thrpt    2  39969002.156          ops/s
RingBufferBenchmark.testV1   thrpt    2  15533576.961          ops/s

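The difference stems from how incrementAndGet is implemented. On JDK 8 and later it delegates to Unsafe.getAndAddInt, which can execute as a native fetch‑and‑add CPU instruction when the platform has one. A fetch‑and‑add completes in one step, whereas a Java‑level CAS loop has to retry every time another thread wins the race, so it degrades under contention.

Unsafe.getAndAddInt is specially handled by the JVM: if the platform supports fetch‑and‑add, it executes a native instruction; otherwise it falls back to a CAS‑based loop.

On JDK 7 the performance gap disappears, because AtomicInteger.incrementAndGet is itself a Java‑level CAS loop there and fetch‑and‑add is not used.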
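The two increment styles can be put side by side. This is a sketch for comparison only: the class `IncrementStyles` and its methods are hypothetical, `casIncrement` mirrors the shape of the JDK 7 loop, and the example checks correctness under contention rather than measuring speed.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical comparison class, not taken from the benchmark in the article.
final class IncrementStyles {

    // CAS loop: read, compute, compareAndSet, and retry whenever another thread got in first.
    static int casIncrement(AtomicInteger value) {
        for (;;) {
            int current = value.get();
            int next = current + 1;
            if (value.compareAndSet(current, next)) {
                return next;
            }
        }
    }

    // Fetch-and-add: a single atomic add with no retry loop in Java code.
    static int fetchAndAddIncrement(AtomicInteger value) {
        return value.incrementAndGet();
    }

    // Runs `threads` workers that each increment a shared counter and returns the final count.
    static int countWith(boolean useCasLoop, int threads, int incrementsPerThread) {
        AtomicInteger value = new AtomicInteger();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < incrementsPerThread; i++) {
                    if (useCasLoop) {
                        casIncrement(value);
                    } else {
                        fetchAndAddIncrement(value);
                    }
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) w.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return value.get();
    }
}
```

Both styles produce the same final count; they differ only in how much work a thread repeats when it loses a race. Measuring that difference reliably needs a harness such as JMH rather than a hand‑rolled timing loop.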

Further Optimization Opportunities

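Cache‑line padding: The three fields in AtomicRangeInteger can cause false sharing, because the constantly updated value may sit on the same cache line as its neighbours. Adding the @Contended annotation pads the frequently updated value field, improving throughput. On HotSpot, @Contended is ignored for application classes unless the JVM is started with -XX:-RestrictContended.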

public class AtomicRangeIntegerV2 extends Number {
    @Contended
    protected final AtomicInteger value;
    protected final int startValue;
    protected final int endValue;
    ...
}

The benchmark with @Contended (testV2) shows a further increase of about 1.6 times over testV0 (≈72 M ops/s vs ≈44 M ops/s):

Benchmark                     Mode  Cnt          Score   Error  Units
RingBufferBenchmark.testV2   thrpt    2  72095754.040          ops/s
RingBufferBenchmark.testV0   thrpt    2  44360926.943          ops/s
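Padding can also be done by hand, without the annotation. The sketch below is a different technique from the @Contended version benchmarked above and is not claimed to reproduce its numbers: it keeps the hot counter in the middle slot of an AtomicIntegerArray so that unused ints surround it. The class name `PaddedRangeInteger` and the constant `VALUE_OFFSET` are hypothetical, and the sizing assumes 64‑byte cache lines.

```java
import java.util.concurrent.atomic.AtomicIntegerArray;

// Hypothetical variant: manual padding around the hot counter instead of @Contended.
final class PaddedRangeInteger {
    // 15 unused ints (60 bytes) on each side of the counter, assuming 64-byte cache lines.
    private static final int VALUE_OFFSET = 15;
    private final AtomicIntegerArray values = new AtomicIntegerArray(31);
    private final int startValue;
    private final int endValue;

    PaddedRangeInteger(int startValue, int endValue) {
        this.startValue = startValue;
        this.endValue = endValue;
        values.set(VALUE_OFFSET, startValue);
    }

    // Same wrap-around loop as AtomicRangeInteger, applied to the padded slot.
    int incrementAndGet() {
        int next;
        do {
            next = values.incrementAndGet(VALUE_OFFSET);
            if (next > endValue && values.compareAndSet(VALUE_OFFSET, next, startValue)) {
                return startValue;
            }
        } while (next > endValue);
        return next;
    }

    int get() {
        return values.get(VALUE_OFFSET);
    }
}
```

The trade‑off is about 120 bytes of extra memory per counter in exchange for not depending on a JVM flag or an internal annotation.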

Multiple ring buffers to reduce contention: Distribute producers across several buffers, ideally one buffer per thread, similar to techniques used in high‑performance counters.

Details on sharding strategies can be found in the author’s earlier article about building a faster counter than LongAdder.
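One possible shape for the sharded variant is sketched here. It is an assumption‑laden illustration rather than the author's implementation: the class `ShardedRingBuffer` is hypothetical, threads are assigned to shards round‑robin on first use through a ThreadLocal, and wrap‑around uses `Math.floorMod` on a plain counter instead of the ranged counter shown earlier.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sharded buffer: each producer thread sticks to one shard to reduce contention.
final class ShardedRingBuffer<T> {
    private final Object[][] slots;
    private final AtomicInteger[] cursors;
    private final AtomicInteger nextShard = new AtomicInteger();
    private final ThreadLocal<Integer> shardOfThread;

    ShardedRingBuffer(int shardCount, int shardSize) {
        this.slots = new Object[shardCount][shardSize];
        this.cursors = new AtomicInteger[shardCount];
        for (int i = 0; i < shardCount; i++) {
            cursors[i] = new AtomicInteger();
        }
        // The first call from a thread assigns it the next shard in round-robin order.
        this.shardOfThread = ThreadLocal.withInitial(
                () -> Math.floorMod(nextShard.getAndIncrement(), shardCount));
    }

    // Producer side: write into the calling thread's own shard, overwriting old data.
    void offer(T data) {
        int shard = shardOfThread.get();
        int i = Math.floorMod(cursors[shard].getAndIncrement(), slots[shard].length);
        slots[shard][i] = data;
    }

    // Consumer side: take and clear one slot of one shard (null if empty).
    @SuppressWarnings("unchecked")
    T poll(int shard, int i) {
        T tmp = (T) slots[shard][i];
        slots[shard][i] = null;
        return tmp;
    }

    int shardCount() { return slots.length; }

    int shardSize() { return slots[0].length; }
}
```

A consumer would sweep every shard in turn. With at least as many shards as producer threads, each counter is touched by a single producer, so the atomic increment no longer contends with other producers.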

Easter Egg

The ring buffer implementation was inspired by SkyWalking’s version, which originally used CAS and did not meet performance expectations. After applying the optimizations above, the author contributed the improved code to SkyWalking (see GitHub pull requests 2874 and 2930), and it is now part of the project.
