Can You Build a Faster Counter Than Java’s LongAdder? A Deep Dive
An in‑depth Java performance study explores LongAdder, compares it with AtomicLong and lock‑based counters using JMH, and walks through successive custom implementations (V0‑V5) that apply striping, modulo optimization, false‑sharing elimination, and advanced hash probing to approach or surpass LongAdder’s throughput.
Powerful LongAdder
LongAdder, introduced in JDK 8, is a thread‑safe counter designed for high‑concurrency statistical scenarios.
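Before LongAdder, developers used either locking (poor performance) or AtomicLong (whose CAS retries pile up under heavy contention). LongAdder provides add and sum methods and stripes updates across multiple cells to reduce contention. Because sum reads the cells one by one without locking, it is not an atomic snapshot while updates are in flight, so it is best suited to statistics rather than to exact, fine-grained coordination.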
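As a quick illustration of the add/sum API described above, here is a minimal sketch. The class and method names are made up for this example and are not part of the article's benchmark code. Several threads call increment, and sum is read only after every writer has finished, which is the situation in which it is exact.

```java
import java.util.concurrent.atomic.LongAdder;

// Hypothetical demo class, not part of the article's benchmark code.
class LongAdderUsage {

    // Starts `threads` workers that each call increment() `perThread` times,
    // waits for all of them, and returns the final sum.
    static long countWithThreads(int threads, int perThread) {
        LongAdder adder = new LongAdder();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    adder.increment(); // equivalent to add(1)
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        // All writers have finished, so sum() is exact at this point.
        return adder.sum();
    }

    public static void main(String[] args) {
        System.out.println(countWithThreads(4, 100_000)); // prints 400000
    }
}
```

Calling sum while the workers are still running would also be legal, but the result could lag behind increments that are in flight.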
Performance is measured with JMH (fork 1, 4 threads, 2 warm‑up, 2 measurement iterations on a 4‑core machine). The benchmark code is provided below.
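import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.LongAdder;
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

// JMH requires @State on a benchmark class that holds instance fields;
// Scope.Benchmark shares one instance (and so one counter) across all threads.
@State(Scope.Benchmark)
public class LongAdderTest {
    private final AtomicLong atomicLong = new AtomicLong();
    private final LongAdder longAdder = new LongAdder();
    private long counter = 0;

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(LongAdderTest.class.getSimpleName())
                .forks(1)
                .threads(4)
                .warmupIterations(2)
                .measurementIterations(2)
                .mode(Mode.Throughput)
                .syncIterations(false)
                .build();
        new Runner(opt).run();
    }

    @Benchmark
    public void testAtomic() {
        atomicLong.incrementAndGet();
    }

    @Benchmark
    public void testLongAdder() {
        longAdder.increment();
    }

    // Lock-based baseline: every increment takes the monitor of the shared instance.
    @Benchmark
    public synchronized void testLockAdder() {
        counter++;
    }
}
Benchmark results show that LongAdder far outperforms both alternatives, with about 4 times the throughput of AtomicLong and about 13 times that of the synchronized counter.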
Benchmark                    Mode   Cnt  Score          Error  Units
LongAdderTest.testAtomic     thrpt  2    73520672.658          ops/s
LongAdderTest.testLockAdder  thrpt  2    23456856.867          ops/s
LongAdderTest.testLongAdder  thrpt  2    300013067.345         ops/s
AtomicLong Striped (V0)
Implementing a striped counter using an AtomicLong array (coreSize ≈ CPU cores) yields only a modest improvement over plain AtomicLong.
public class MyLongAdderV0 {
    private final int coreSize;
    private final AtomicLong[] counts;

    public MyLongAdderV0(int coreSize) {
        this.coreSize = coreSize;
        this.counts = new AtomicLong[coreSize];
        for (int i = 0; i < coreSize; i++) {
            this.counts[i] = new AtomicLong();
        }
    }

    public void increment() {
        int index = (int) (Thread.currentThread().getId() % coreSize);
        counts[index].incrementAndGet();
    }
}
V0 performance varies with coreSize and thread distribution.
Benchmark                        Mode   Cnt  Score          Error  Units
LongAdderTest.testAtomic         thrpt  2    73391661.579          ops/s
LongAdderTest.testLongAdder      thrpt  2    309539056.885         ops/s
LongAdderTest.testMyLongAdderV0  thrpt  2    83737867.380          ops/s
Testing different coreSize values (4, 8, 16, 32) shows inconsistent results, suggesting the simple modulo of thread ID does not distribute load evenly.
Benchmark                        (coreSize)  Mode   Cnt  Score          Error  Units
LongAdderTest.testMyLongAdderV0  4           thrpt  2    62328997.667          ops/s
LongAdderTest.testMyLongAdderV0  8           thrpt  2    124725716.902         ops/s
LongAdderTest.testMyLongAdderV0  16          thrpt  2    84718415.566          ops/s
LongAdderTest.testMyLongAdderV0  32          thrpt  2    85321816.442          ops/s
Modulo Optimization (V1)
Replacing the modulo operation with a bit‑mask (coreSize must be a power of two) slightly improves V0.
public class MyLongAdderV1 {
    private final int coreSize;
    private final AtomicLong[] counts;

    public MyLongAdderV1(int coreSize) {
        this.coreSize = coreSize;
        this.counts = new AtomicLong[coreSize];
        for (int i = 0; i < coreSize; i++) {
            this.counts[i] = new AtomicLong();
        }
    }

    public void increment() {
        int index = (int) (Thread.currentThread().getId() & (coreSize - 1));
        counts[index].incrementAndGet();
    }
}
Benchmark results:
Benchmark                        Mode   Cnt  Score          Error  Units
LongAdderTest.testLongAdder      thrpt  2    312683635.190         ops/s
LongAdderTest.testMyLongAdderV0  thrpt  2    60641758.648          ops/s
LongAdderTest.testMyLongAdderV1  thrpt  2    100887869.829         ops/s
Still far below LongAdder.
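The power-of-two requirement behind the bit-mask can be checked directly. The sketch below uses hypothetical class and method names that do not come from the article's repository. It compares x % size with x & (size - 1) over a range of values, and shows one possible way to round an arbitrary size up to a power of two before using the mask.

```java
// Hypothetical helper class for illustration only.
class MaskVsModulo {

    // Rounds n up to the next power of two (intended for 1 <= n <= 2^30).
    static int nextPowerOfTwo(int n) {
        return n <= 1 ? 1 : Integer.highestOneBit(n - 1) << 1;
    }

    // True when x % size and x & (size - 1) agree for every x in [0, limit).
    static boolean maskMatchesModulo(int size, long limit) {
        for (long x = 0; x < limit; x++) {
            if ((x % size) != (x & (size - 1))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(maskMatchesModulo(8, 1_000)); // true: 8 is a power of two
        System.out.println(maskMatchesModulo(6, 1_000)); // false: 2 % 6 = 2 but 2 & 5 = 0
        System.out.println(nextPowerOfTwo(6));           // 8
    }
}
```

With a size that is not a power of two, the mask silently skips some slots instead of failing, so validating or rounding the size in the constructor is a reasonable safeguard.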
Eliminating False Sharing (V2)
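False sharing occurs when multiple threads update variables that reside on the same cache line: each write invalidates that line in the other cores' caches, even though the threads never touch the same variable. The AtomicLong objects in V0 and V1 are small, so several of them can end up on one cache line. One remedy is manual padding, which surrounds the hot field with unused long fields so that nothing else shares its line, as the Disruptor's RingBuffer does (simplified below). Since JDK 8, the @Contended annotation asks the JVM to insert this padding automatically; for classes outside the JDK it only takes effect when the JVM is started with -XX:-RestrictContended.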
abstract class RingBufferPad {
    protected long p1, p2, p3, p4, p5, p6, p7;
}

abstract class RingBufferFields<E> extends RingBufferPad {
    protected long value;
}

public final class RingBuffer<E> extends RingBufferFields<E> {
    protected long p1, p2, p3, p4, p5, p6, p7;
}
V2 implementation:
public class MyLongAdderV2 {
    private static class AtomicLongWrap {
        @Contended
        private final AtomicLong value = new AtomicLong();
    }

    private final int coreSize;
    private final AtomicLongWrap[] counts;

    public MyLongAdderV2(int coreSize) {
        this.coreSize = coreSize;
        this.counts = new AtomicLongWrap[coreSize];
        for (int i = 0; i < coreSize; i++) {
            this.counts[i] = new AtomicLongWrap();
        }
    }

    public void increment() {
        int index = (int) (Thread.currentThread().getId() & (coreSize - 1));
        counts[index].value.incrementAndGet();
    }
}
Benchmark results (4 threads):
Benchmark                        Mode   Cnt  Score          Error  Units
LongAdderTest.testLongAdder      thrpt  2    272733686.330         ops/s
LongAdderTest.testMyLongAdderV2  thrpt  2    307754425.667         ops/s
V2 can even beat LongAdder in this configuration, but its advantage diminishes as thread count increases.
Benchmark                        Mode   Cnt  Score          Error  Units
LongAdderTest.testLongAdder      thrpt  2    260909722.754         ops/s
LongAdderTest.testMyLongAdderV2  thrpt  2    215785206.276         ops/s
Hash Algorithm Improvements
Various hash strategies were explored.
Using Thread.hashCode() (V3) – performance degraded.
Using ThreadLocalRandom (V4) – still slower than LongAdder.
Custom probe‑based hashing with compareAndSet and fallback increment (V5) – achieved stable throughput comparable to LongAdder across thread counts.
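To see why the choice of hash matters, the sketch below counts how many distinct slots a group of threads would hit when the thread ID is masked directly, as in V1 and V2. The IDs are invented for illustration; real thread IDs depend on how many threads the JVM and the benchmark harness created earlier. The class and method names are hypothetical as well.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class for illustration only.
class SlotSpread {

    // Number of distinct slots hit when each id is reduced with id & (size - 1).
    static int distinctSlots(long[] ids, int size) {
        Set<Integer> slots = new HashSet<>();
        for (long id : ids) {
            slots.add((int) (id & (size - 1)));
        }
        return slots.size();
    }

    public static void main(String[] args) {
        long[] consecutive = {20, 21, 22, 23}; // made-up thread IDs
        long[] strided = {20, 24, 28, 32};     // made-up thread IDs
        System.out.println(distinctSlots(consecutive, 4)); // 4: one slot per thread
        System.out.println(distinctSlots(strided, 4));     // 1: all threads collide
    }
}
```

In the second case every thread lands on the same slot, so the striped counter would behave like a single AtomicLong. A fixed mapping cannot recover from such a collision, which is the motivation for a probe that can be re-hashed when contention is detected.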
V3 increment method:
public void increment() {
    int index = Thread.currentThread().hashCode() & (coreSize - 1);
    counts[index].incrementAndGet();
}
V4 increment method:
public void increment() {
    counts[ThreadLocalRandom.current().nextInt(coreSize)].value.incrementAndGet();
}
V5 full implementation:
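import java.lang.reflect.Field;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicLong;
import sun.misc.Contended; // JDK 8 location (assumed here); JDK 9+ moved it to jdk.internal.vm.annotation
import sun.misc.Unsafe;

public class MyLongAdderV5 {
    private static final Unsafe UNSAFE;
    private static final long PROBE;

    static {
        try {
            // Unsafe has no public constructor; read its singleton field reflectively.
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            UNSAFE = (Unsafe) f.get(null);
            // Offset of the per-thread probe field that ThreadLocalRandom maintains.
            PROBE = UNSAFE.objectFieldOffset(Thread.class.getDeclaredField("threadLocalRandomProbe"));
        } catch (Exception e) {
            // Fail fast instead of swallowing the error and hitting a NullPointerException later.
            throw new Error(e);
        }
    }

    static final int getProbe() {
        return UNSAFE.getInt(Thread.currentThread(), PROBE);
    }

    static final int advanceProbe(int probe) {
        probe ^= probe << 13; // xorshift
        probe ^= probe >>> 17;
        probe ^= probe << 5;
        UNSAFE.putInt(Thread.currentThread(), PROBE, probe);
        return probe;
    }

    private static class AtomicLongWrap {
        @Contended
        private final AtomicLong value = new AtomicLong();
    }

    private final int coreSize;
    private final AtomicLongWrap[] counts;

    public MyLongAdderV5(int coreSize) {
        this.coreSize = coreSize;
        this.counts = new AtomicLongWrap[coreSize];
        for (int i = 0; i < coreSize; i++) {
            this.counts[i] = new AtomicLongWrap();
        }
    }

    public void increment() {
        int h = getProbe();
        int index = h & (coreSize - 1); // reuse h instead of reading the probe twice
        AtomicLong cell = counts[index].value;
        long r = cell.get();
        // Fast path: a single CAS attempt on the slot selected by the current probe.
        if (!cell.compareAndSet(r, r + 1)) {
            if (h == 0) {
                ThreadLocalRandom.current(); // forces the probe to be initialized
                h = getProbe();
            }
            // A failed CAS signals contention: re-hash so the next call picks another slot.
            advanceProbe(h);
            // Still record this increment, on the current slot.
            cell.getAndIncrement();
        }
    }
}
Benchmark results for V5: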
Benchmark                        Mode   Cnt  Score          Error  Units
LongAdderTest.testLongAdder      thrpt  2    274131797.300         ops/s
LongAdderTest.testMyLongAdderV5  thrpt  2    298402832.456         ops/s
8 threads:
Benchmark                        Mode   Cnt  Score          Error  Units
LongAdderTest.testLongAdder      thrpt  2    324982482.774         ops/s
LongAdderTest.testMyLongAdderV5  thrpt  2    290476796.289         ops/s
16 threads:
Benchmark                        Mode   Cnt  Score          Error  Units
LongAdderTest.testLongAdder      thrpt  2    291180444.998         ops/s
LongAdderTest.testMyLongAdderV5  thrpt  2    282745610.470         ops/s
32 threads:
Benchmark                        Mode   Cnt  Score          Error  Units
LongAdderTest.testLongAdder      thrpt  2    294237473.396         ops/s
LongAdderTest.testMyLongAdderV5  thrpt  2    301187346.873         ops/s
Conclusion
Creating a counter that consistently outperforms LongAdder is difficult. The most impactful optimizations are:
Striping – basic but essential.
Modulo (bit‑mask) optimization – simple improvement.
False‑sharing elimination – large performance gain.
Robust hash algorithm – the least obvious technique, and the one that keeps throughput stable as the thread count changes.
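The first three points can also be combined without @Contended or Unsafe. The following is a portable sketch under stated assumptions, not one of the article's V0-V5 versions and not benchmarked here. It stores all slots in one AtomicLongArray and spaces them 16 longs (128 bytes) apart, which assumes cache lines of at most 128 bytes. The class name StridedCounter and the helper countWithThreads are hypothetical. It also includes the sum method that the article's versions leave out.

```java
import java.util.concurrent.atomic.AtomicLongArray;

// Hypothetical class; a portable sketch, not one of the article's V0-V5 versions.
class StridedCounter {
    // 16 longs = 128 bytes between slots (assumes cache lines of at most 128 bytes).
    private static final int STRIDE = 16;

    private final int mask;
    private final AtomicLongArray cells;

    StridedCounter(int slots) {
        if (slots <= 0 || (slots & (slots - 1)) != 0) {
            throw new IllegalArgumentException("slots must be a power of two");
        }
        this.mask = slots - 1;
        // One extra stride before the first slot and one after the last slot
        // keep the slots away from the array header and from neighbouring objects.
        this.cells = new AtomicLongArray((slots + 2) * STRIDE);
    }

    private static int cellIndex(int slot) {
        return (slot + 1) * STRIDE;
    }

    void increment() {
        // Striping plus bit-mask: pick a slot from the thread ID, as in V1.
        int slot = (int) (Thread.currentThread().getId() & mask);
        cells.getAndIncrement(cellIndex(slot));
    }

    // Not an atomic snapshot: exact only when no increment is in flight.
    long sum() {
        long total = 0;
        for (int slot = 0; slot <= mask; slot++) {
            total += cells.get(cellIndex(slot));
        }
        return total;
    }

    // Runs `threads` workers, each incrementing `perThread` times, and returns the sum.
    static long countWithThreads(int slots, int threads, int perThread) {
        StridedCounter counter = new StridedCounter(slots);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    counter.increment();
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter.sum();
    }

    public static void main(String[] args) {
        System.out.println(countWithThreads(4, 8, 100_000)); // prints 800000
    }
}
```

This sketch keeps the fixed thread-ID mapping, so it inherits the collision problem that the probe-based hashing in V5 addresses. It checks that the count is correct and makes no claim about throughput.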
All test code is available at https://github.com/lkxiaolou/all-in-one/tree/master/src/main/java/org/newboo/longadder . Recommended reading: “Benchmarking with JMH – 36 official examples”.
Xiao Lou's Tech Notes
Backend technology sharing, architecture design, performance optimization, source code reading, troubleshooting, and pitfall practices
