Backend Development 16 min read

Can You Build a Faster Counter Than Java’s LongAdder? A Deep Dive

An in‑depth Java performance study explores LongAdder, compares it with AtomicLong and lock‑based counters using JMH, and walks through successive custom implementations (V0‑V5) that apply striping, modulo optimization, false‑sharing elimination, and advanced hash probing to approach or surpass LongAdder’s throughput.

Xiao Lou's Tech Notes

May 19, 2020

Can You Build a Faster Counter Than Java’s LongAdder? A Deep Dive

Powerful LongAdder

LongAdder, introduced in JDK 8, is a thread‑safe counter designed for high‑concurrency statistical scenarios.

Before LongAdder, developers used either locking (poor performance) or AtomicLong (struggles under heavy contention). LongAdder provides add and sum methods and uses striped cells to reduce contention; sum is suitable only for approximate statistics.

Performance is measured with JMH (fork 1, 4 threads, 2 warm‑up, 2 measurement iterations on a 4‑core machine). The benchmark code is provided below.

private final AtomicLong atomicLong = new AtomicLong();
    private final LongAdder longAdder = new LongAdder();
    private long counter = 0;

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(LongAdderTest.class.getSimpleName())
                .forks(1)
                .threads(4)
                .warmupIterations(2)
                .measurementIterations(2)
                .mode(Mode.Throughput)
                .syncIterations(false)
                .build();

        new Runner(opt).run();
    }

    @Benchmark
    public void testAtomic() {
        atomicLong.incrementAndGet();
    }

    @Benchmark
    public void testLongAdder() {
        longAdder.increment();
    }

    @Benchmark
    public synchronized void testLockAdder() {
        counter++;
    }

Benchmark results show LongAdder far outperforms the locked counter and AtomicLong.

Benchmark                          Mode  Cnt          Score   Error  Units
LongAdderTest.testAtomic          thrpt    2   73520672.658          ops/s
LongAdderTest.testLockAdder       thrpt    2   23456856.867          ops/s
LongAdderTest.testLongAdder       thrpt    2  300013067.345          ops/s

AtomicLong Striped (V0)

Implementing a striped counter using an AtomicLong array (coreSize ≈ CPU cores) yields only a modest improvement over plain AtomicLong.

public class MyLongAdderV0 {

    private final int coreSize;
    private final AtomicLong[] counts;

    public MyLongAdderV0(int coreSize) {
        this.coreSize = coreSize;
        this.counts = new AtomicLong[coreSize];
        for (int i = 0; i < coreSize; i++) {
            this.counts[i] = new AtomicLong();
        }
    }

    public void increment() {
        int index = (int) (Thread.currentThread().getId() % coreSize);
        counts[index].incrementAndGet();
    }
}

V0 performance varies with coreSize and thread distribution.

Benchmark                         Mode  Cnt          Score   Error  Units
LongAdderTest.testAtomic         thrpt    2   73391661.579          ops/s
LongAdderTest.testLongAdder      thrpt    2  309539056.885          ops/s
LongAdderTest.testMyLongAdderV0  thrpt    2   83737867.380          ops/s

Testing different coreSize values (4, 8, 16, 32) shows inconsistent results, suggesting the simple modulo of thread ID does not distribute load evenly.

Benchmark                        (coreSize)   Mode  Cnt          Score   Error  Units
LongAdderTest.testMyLongAdderV0           4  thrpt    2   62328997.667          ops/s
LongAdderTest.testMyLongAdderV0           8  thrpt    2  124725716.902          ops/s
LongAdderTest.testMyLongAdderV0          16  thrpt    2   84718415.566          ops/s
LongAdderTest.testMyLongAdderV0          32  thrpt    2   85321816.442          ops/s

Modulo Optimization (V1)

Replacing the modulo operation with a bit‑mask (coreSize must be a power of two) slightly improves V0.

public class MyLongAdderV1 {

    private final int coreSize;
    private final AtomicLong[] counts;

    public MyLongAdderV1(int coreSize) {
        this.coreSize = coreSize;
        this.counts = new AtomicLong[coreSize];
        for (int i = 0; i < coreSize; i++) {
            this.counts[i] = new AtomicLong();
        }
    }

    public void increment() {
        int index = (int) (Thread.currentThread().getId() & (coreSize - 1));
        counts[index].incrementAndGet();
    }

}

Benchmark results:

Benchmark                         Mode  Cnt          Score   Error  Units
LongAdderTest.testLongAdder      thrpt    2  312683635.190          ops/s
LongAdderTest.testMyLongAdderV0  thrpt    2   60641758.648          ops/s
LongAdderTest.testMyLongAdderV1  thrpt    2  100887869.829          ops/s

Still far below LongAdder.

Eliminating False Sharing (V2)

False sharing occurs when multiple threads update variables that reside on the same cache line, causing frequent cache invalidations. Padding each AtomicLong with @Contended isolates them.

abstract class RingBufferPad {
    protected long p1, p2, p3, p4, p5, p6, p7;
}

abstract class RingBufferFields<E> extends RingBufferPad {
    protected long value;
}

public final class RingBuffer<E> extends RingBufferFields<E> {
    protected long p1, p2, p3, p4, p5, p6, p7;
}

V2 implementation:

public class MyLongAdderV2 {

    private static class AtomicLongWrap {
        @Contended
        private final AtomicLong value = new AtomicLong();
    }

    private final int coreSize;
    private final AtomicLongWrap[] counts;

    public MyLongAdderV2(int coreSize) {
        this.coreSize = coreSize;
        this.counts = new AtomicLongWrap[coreSize];
        for (int i = 0; i < coreSize; i++) {
            this.counts[i] = new AtomicLongWrap();
        }
    }

    public void increment() {
        int index = (int) (Thread.currentThread().getId() & (coreSize - 1));
        counts[index].value.incrementAndGet();
    }

}

Benchmark results (4 threads):

Benchmark                         Mode  Cnt          Score   Error  Units
LongAdderTest.testLongAdder      thrpt    2  272733686.330          ops/s
LongAdderTest.testMyLongAdderV2  thrpt    2  307754425.667          ops/s

V2 can even beat LongAdder in this configuration, but its advantage diminishes as thread count increases.

Benchmark                         Mode  Cnt          Score   Error  Units
LongAdderTest.testLongAdder      thrpt    2  260909722.754          ops/s
LongAdderTest.testMyLongAdderV2  thrpt    2  215785206.276          ops/s

Hash Algorithm Improvements

Various hash strategies were explored.

Using Thread.hashCode() (V3) – performance degraded.

Using ThreadLocalRandom (V4) – still slower than LongAdder.

Custom probe‑based hashing with compareAndSet and fallback increment (V5) – achieved stable throughput comparable to LongAdder across thread counts.

V3 increment method:

public void increment() {
    int index = Thread.currentThread().hashCode() & (coreSize - 1);
    counts[index].incrementAndGet();
}

V4 increment method:

public void increment() {
    counts[ThreadLocalRandom.current().nextInt(coreSize)].value.incrementAndGet();
}

V5 full implementation:

public class MyLongAdderV5 {

    private static sun.misc.Unsafe UNSAFE = null;
    private static final long PROBE;
    static {
        try {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
        } catch (Exception e) {
        }
        try {
            Class<?> tk = Thread.class;
            PROBE = UNSAFE.objectFieldOffset(tk.getDeclaredField("threadLocalRandomProbe"));
        } catch (Exception e) {
            throw new Error(e);
        }
    }

    static final int getProbe() {
        return UNSAFE.getInt(Thread.currentThread(), PROBE);
    }

    static final int advanceProbe(int probe) {
        probe ^= probe << 13;   // xorshift
        probe ^= probe >>> 17;
        probe ^= probe << 5;
        UNSAFE.putInt(Thread.currentThread(), PROBE, probe);
        return probe;
    }

    private static class AtomicLongWrap {
        @Contended
        private final AtomicLong value = new AtomicLong();
    }

    private final int coreSize;
    private final AtomicLongWrap[] counts;

    public MyLongAdderV5(int coreSize) {
        this.coreSize = coreSize;
        this.counts = new AtomicLongWrap[coreSize];
        for (int i = 0; i < coreSize; i++) {
            this.counts[i] = new AtomicLongWrap();
        }
    }

    public void increment() {
        int h = getProbe();
        int index = getProbe() & (coreSize - 1);
        long r;
        if (!counts[index].value.compareAndSet(r = counts[index].value.get(), r + 1)) {
            if (h == 0) {
                ThreadLocalRandom.current();
                h = getProbe();
            }
            advanceProbe(h);
            counts[index].value.getAndIncrement();
        }
    }

}

Benchmark results for V5:

Benchmark                         Mode  Cnt          Score   Error  Units
LongAdderTest.testLongAdder      thrpt    2  274131797.300          ops/s
LongAdderTest.testMyLongAdderV5  thrpt    2  298402832.456          ops/s

8 threads:

Benchmark                         Mode  Cnt          Score   Error  Units
LongAdderTest.testLongAdder      thrpt    2  324982482.774          ops/s
LongAdderTest.testMyLongAdderV5  thrpt    2  290476796.289          ops/s

16 threads:

Benchmark                         Mode  Cnt          Score   Error  Units
LongAdderTest.testLongAdder      thrpt    2  291180444.998          ops/s
LongAdderTest.testMyLongAdderV5  thrpt    2  282745610.470          ops/s

32 threads:

Benchmark                         Mode  Cnt          Score   Error  Units
LongAdderTest.testLongAdder      thrpt    2  294237473.396          ops/s
LongAdderTest.testMyLongAdderV5  thrpt    2  301187346.873          ops/s

Conclusion

Creating a counter that consistently outperforms LongAdder is difficult. The most impactful optimizations are:

Striping – basic but essential.

Modulo (bit‑mask) optimization – simple improvement.

False‑sharing elimination – large performance gain.

Robust hash algorithm – the "black‑tech" that ensures stable high throughput.

All test code is available at https://github.com/lkxiaolou/all-in-one/tree/master/src/main/java/org/newboo/longadder . Recommended reading: “Benchmarking with JMH – 36 official examples”.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

longadder false sharing Java concurrency performance benchmarking JMH striped counter

Written by

Xiao Lou's Tech Notes

Backend technology sharing, architecture design, performance optimization, source code reading, troubleshooting, and pitfall practices

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.