How Netty’s ByteBuf Reference Counting Evolved: From Simple Counters to Parity‑Based Concurrency Safety
This article examines Netty 4.1.x’s ByteBuf reference‑counting mechanism, explains why reference counting was introduced, traces its original design, shows instruction‑level optimizations, reveals concurrency bugs in version 4.1.17, and details the clever even‑odd redesign that guarantees thread‑safe memory release while preserving high performance.
All examples are based on Netty 4.1.56.Final.
In the previous article the author introduced the ByteBuf architecture and highlighted Netty’s reference‑counting design, which is now examined in depth.
Netty adds reference counting to every ByteBuf by having it inherit AbstractReferenceCountedByteBuf, an implementation of the ReferenceCounted interface.
public interface ReferenceCounted {
    int refCnt();
    ReferenceCounted retain();
    ReferenceCounted retain(int increment);
    boolean release();
    boolean release(int decrement);
}
Each ByteBuf stores a refCnt field. refCnt() returns the current count, retain() increments it, and release() decrements it. When the count reaches zero, release() calls deallocate() to free the underlying memory and returns true; otherwise it returns false.
1. Why introduce reference counting?
When a ByteBuf is created in one thread and passed to other threads, each thread may retain or release it. Without a shared counter it is impossible to know when the buffer is no longer used, which can lead to delayed native‑memory release (because JDK DirectByteBuffer relies on GC) and eventual OOM.
Netty therefore requires explicit retain() and release() calls so that the buffer can be freed promptly. If a release is forgotten, Netty can detect the leak during GC and log an error via reportLeak().
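The retain-before-hand-off, release-when-done discipline can be sketched without Netty on the classpath. `ToyBuffer` and `HandOffDemo` below are hypothetical names invented for illustration; the class is a simplified stand-in for a ByteBuf, not Netty's implementation.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a ByteBuf; not part of Netty.
class ToyBuffer {
    private final AtomicInteger refCnt = new AtomicInteger(1); // creator owns one reference

    int refCnt() { return refCnt.get(); }

    ToyBuffer retain() {
        for (;;) {
            int cur = refCnt.get();
            if (cur <= 0) throw new IllegalStateException("already released");
            if (refCnt.compareAndSet(cur, cur + 1)) return this;
        }
    }

    // Returns true only for the call that drops the count to zero.
    boolean release() {
        for (;;) {
            int cur = refCnt.get();
            if (cur <= 0) throw new IllegalStateException("already released");
            if (refCnt.compareAndSet(cur, cur - 1)) return cur == 1;
        }
    }
}

class HandOffDemo {
    // Returns {worker's release() result, owner's release() result}.
    static boolean[] run() {
        ToyBuffer buf = new ToyBuffer();
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            buf.retain();                            // reference taken on behalf of the worker
            Callable<Boolean> task = buf::release;   // worker drops its reference when done
            boolean workerFreed = worker.submit(task).get();
            boolean ownerFreed = buf.release();      // owner drops the last reference
            return new boolean[] { workerFreed, ownerFreed };
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            worker.shutdown();
        }
    }

    public static void main(String[] args) {
        boolean[] r = run();
        System.out.println("worker freed: " + r[0] + ", owner freed: " + r[1]);
    }
}
```

Only the release() that takes the count to zero reports true, so exactly one party ends up responsible for freeing the memory, regardless of which thread finishes last.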
2. Original reference‑count design (4.1.16.Final)
The initial implementation used a simple integer that started at 1. Every retain() added 1, every release() subtracted 1, and a CAS loop ensured atomic updates.
public abstract class AbstractReferenceCountedByteBuf extends AbstractByteBuf {
    private static final AtomicIntegerFieldUpdater<AbstractReferenceCountedByteBuf> refCntUpdater =
            AtomicIntegerFieldUpdater.newUpdater(AbstractReferenceCountedByteBuf.class, "refCnt");
    private volatile int refCnt;

    protected AbstractReferenceCountedByteBuf(int maxCapacity) {
        super(maxCapacity);
        refCntUpdater.set(this, 1);
    }

    private ByteBuf retain0(int increment) {
        for (;;) {
            int refCnt = this.refCnt;
            int nextCnt = refCnt + increment;
            // Reject resurrection (refCnt was 0) and integer overflow.
            if (nextCnt <= increment) {
                throw new IllegalReferenceCountException(refCnt, increment);
            }
            if (refCntUpdater.compareAndSet(this, refCnt, nextCnt)) {
                break;
            }
        }
        return this;
    }

    private boolean release0(int decrement) {
        for (;;) {
            int refCnt = this.refCnt;
            // Reject over-release.
            if (refCnt < decrement) {
                throw new IllegalReferenceCountException(refCnt, -decrement);
            }
            if (refCntUpdater.compareAndSet(this, refCnt, refCnt - decrement)) {
                // Only the thread that takes the count to zero deallocates.
                if (refCnt == decrement) {
                    deallocate();
                    return true;
                }
                return false;
            }
        }
    }
}
This design is straightforward but uses the CMPXCHG instruction for every update.
3. Instruction‑level optimization (4.1.17.Final)
On x86, XADD tends to be faster than a CMPXCHG loop under contention, because an XADD always succeeds while a failed CMPXCHG must be retried. Netty replaced the CAS loop with getAndAdd, which is compiled to XADD on x86, to improve performance.
public abstract class AbstractReferenceCountedByteBuf extends AbstractByteBuf {
    // refCntUpdater is declared as in the 4.1.16.Final listing.
    private volatile int refCnt;

    protected AbstractReferenceCountedByteBuf(int maxCapacity) {
        super(maxCapacity);
        refCntUpdater.set(this, 1);
    }

    private ByteBuf retain0(final int increment) {
        // Add first, validate afterwards.
        int oldRef = refCntUpdater.getAndAdd(this, increment);
        if (oldRef <= 0 || oldRef + increment < oldRef) {
            // Resurrection or overflow: undo the add, then fail.
            refCntUpdater.getAndAdd(this, -increment);
            throw new IllegalReferenceCountException(oldRef, increment);
        }
        return this;
    }

    private boolean release0(int decrement) {
        int oldRef = refCntUpdater.getAndAdd(this, -decrement);
        if (oldRef == decrement) {
            deallocate();
            return true;
        } else if (oldRef < decrement || oldRef - decrement > oldRef) {
            // Over-release or underflow: undo the subtraction, then fail.
            refCntUpdater.getAndAdd(this, decrement);
            throw new IllegalReferenceCountException(oldRef, -decrement);
        }
        return false;
    }
}
The optimistic update improves throughput but introduces a subtle concurrency bug when multiple threads interleave retain() and release() around the moment the counter reaches zero.
4. Concurrency safety issue
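Consider a buffer with refCnt = 1 and three threads. Thread 1 calls release(): getAndAdd drops the counter to 0 and returns the old value 1, so Thread 1 proceeds to deallocate(). Thread 2 concurrently calls retain(): getAndAdd raises the counter to 1 and returns the old value 0, so Thread 2 is about to roll back and throw. Before that rollback happens, Thread 3 calls retain(): getAndAdd raises the counter to 2 and returns the old value 1, which passes the oldRef <= 0 check. Thread 3 now believes it holds a valid reference to a buffer that Thread 1 is freeing, which is a use-after-free. After Thread 2's rollback the counter is back at 1, so a later release() by Thread 3 would see oldRef == decrement and call deallocate() a second time.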
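The interleaving can be replayed deterministically on a single thread, with each step standing in for what one of the racing threads would do. This is a sketch using a plain AtomicInteger and the same checks as the 4.1.17.Final listing; the class name `XaddRaceReplay` is made up for illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Single-threaded replay of one unlucky interleaving; not Netty code.
class XaddRaceReplay {
    // Returns true if "Thread 3" has its retain() accepted after "Thread 1" decided to deallocate.
    static boolean replay() {
        AtomicInteger refCnt = new AtomicInteger(1);

        // Thread 1: release(). oldRef == decrement, so it goes on to deallocate().
        int t1Old = refCnt.getAndAdd(-1);            // refCnt: 1 -> 0
        boolean t1Deallocates = (t1Old == 1);

        // Thread 2: retain(). Sees oldRef == 0 and will roll back, but has not done so yet.
        int t2Old = refCnt.getAndAdd(1);             // refCnt: 0 -> 1
        boolean t2MustRollBack = (t2Old <= 0);

        // Thread 3: retain() lands inside Thread 2's window and sees oldRef == 1.
        int t3Old = refCnt.getAndAdd(1);             // refCnt: 1 -> 2
        boolean t3Accepted = !(t3Old <= 0 || t3Old + 1 < t3Old);

        // Thread 2 finally rolls back and throws.
        if (t2MustRollBack) {
            refCnt.getAndAdd(-1);                    // refCnt: 2 -> 1
        }
        return t1Deallocates && t3Accepted;
    }

    public static void main(String[] args) {
        System.out.println("retain accepted after deallocate: " + replay());
    }
}
```

The replay shows that the transient value written by a failing retain() is indistinguishable from a legitimate live count, which is exactly what the later redesign removes.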
5. Balancing performance and safety
Netty could revert to the older CMPXCHG‑based implementation (safe but slower) or keep the XADD optimization and fix the race. The final solution keeps the performance gain while eliminating the race.
6. Parity‑based reference counting (introduced in 4.1.32.Final)
Netty changes the semantics of the raw counter: an even value means "still referenced" and an odd value means "no references".
Even → buffer is alive; logical count = rawCnt >>> 1.
Odd → buffer is dead; logical count = 0.
The counter is initialized to 2 (even, logical count 1). retain() adds 2 and release() subtracts 2. When the logical count reaches zero, Netty sets the raw value to 1 (odd) instead of 0, preserving the "odd = dead" invariant.
public final int initialValue() { return 2; }
The parity design guarantees that once the raw counter becomes odd it stays odd: any concurrent retain() adds an even number, sees an odd previous value, and throws IllegalReferenceCountException. No transient value written by a failing retain() can be mistaken for a live count.
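The mapping between raw and logical counts is small enough to check by hand. The sketch below uses hypothetical helper names (`toRaw`, `toLogical`, `isDead`) that are not Netty's; it only illustrates the arithmetic described above.

```java
// Toy illustration of the even/odd encoding; class and method names are hypothetical.
class ParityEncoding {
    static final int DEAD = 1; // raw value written by the final release

    // Logical count n is stored as 2n, so live values are always even.
    static int toRaw(int logical) { return logical << 1; }

    // Every odd raw value reads as logical zero.
    static int toLogical(int raw) { return (raw & 1) != 0 ? 0 : raw >>> 1; }

    static boolean isDead(int raw) { return (raw & 1) != 0; }

    public static void main(String[] args) {
        System.out.println(toLogical(toRaw(3)));   // 3
        System.out.println(toLogical(DEAD));       // 0
        // Late retains add even amounts, so a dead counter stays dead.
        System.out.println(isDead(DEAD + 2));      // true
        System.out.println(isDead(DEAD + 2 * 5));  // true
    }
}
```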
7. Updated retain implementation
public final T retain(T instance) {
    return retain0(instance, 1, 2);
}

public final T retain(T instance, int increment) {
    // All changes to the raw count are twice the logical change.
    int rawIncrement = checkPositive(increment, "increment") << 1;
    return retain0(instance, increment, rawIncrement);
}

private T retain0(T instance, final int increment, final int rawIncrement) {
    int oldRef = updater().getAndAdd(instance, rawIncrement);
    // Odd previous value: the object is already dead. No rollback is needed.
    if (oldRef != 2 && oldRef != 4 && (oldRef & 1) != 0) {
        throw new IllegalReferenceCountException(0, increment);
    }
    // Overflow: undo the add, then fail.
    if ((oldRef <= 0 && oldRef + rawIncrement >= 0) ||
            (oldRef >= 0 && oldRef + rawIncrement < oldRef)) {
        updater().getAndAdd(instance, -rawIncrement);
        throw new IllegalReferenceCountException(realRefCnt(oldRef), increment);
    }
    return instance;
}
The method first adds the raw increment using XADD, then checks whether the previous value was odd (dead) and throws if so. Overflow is detected and rolled back.
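To see why the odd check closes the race, a final release followed by two late retain() calls can be replayed step by step against the parity scheme. As before, this is a single-threaded sketch on an AtomicInteger with an invented class name, not Netty code.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Single-threaded replay of a final release followed by two late retains; not Netty code.
class ParityRaceReplay {
    // Returns how many of the two late retain() calls are rejected, or -1 if the release failed.
    static int replay() {
        AtomicInteger raw = new AtomicInteger(2);    // even: logical count 1

        // Thread 1: final release moves 2 -> 1 (odd, dead) and would then deallocate.
        boolean released = raw.compareAndSet(2, 1);

        int rejected = 0;

        // Thread 2: retain(). Previous value is odd, so it throws; nothing to roll back.
        int t2Old = raw.getAndAdd(2);                // raw: 1 -> 3, still odd
        if ((t2Old & 1) != 0) rejected++;

        // Thread 3: retain(). It also sees an odd previous value and throws.
        int t3Old = raw.getAndAdd(2);                // raw: 3 -> 5, still odd
        if ((t3Old & 1) != 0) rejected++;

        return released ? rejected : -1;
    }

    public static void main(String[] args) {
        System.out.println("late retains rejected: " + replay());
    }
}
```

Unlike the 4.1.17.Final scheme, the value left behind by a rejected retain() is itself odd, so no later thread can read it as a live count.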
8. Updated release implementation
release() must atomically move the raw value from 2 to the odd dead value 1 rather than to 0, so the final step uses compareAndSet (CMPXCHG), with a retry loop as the fallback. The initial read of the counter is non-volatile to avoid memory-barrier overhead.
public final boolean release(T instance) {
    int rawCnt = nonVolatileRawCnt(instance);
    // Fast path: raw value 2 means this is probably the last reference.
    // nonFinalRelease0 and toLiveRealRefCnt are not shown in this excerpt.
    return rawCnt == 2 ?
            tryFinalRelease0(instance, 2) || retryRelease0(instance, 1) :
            nonFinalRelease0(instance, 1, rawCnt, toLiveRealRefCnt(rawCnt, 1));
}

private int nonVolatileRawCnt(T instance) {
    // Plain (non-volatile) read through Unsafe when a field offset is available.
    final long offset = unsafeOffset();
    return offset != -1 ? PlatformDependent.getInt(instance, offset) : updater().get(instance);
}

private boolean tryFinalRelease0(T instance, int expectRawCnt) {
    // Any odd value would do; 1 is used as the dead marker.
    return updater().compareAndSet(instance, expectRawCnt, 1);
}

private boolean retryRelease0(T instance, int decrement) {
    for (;;) {
        // Volatile read this time, so the loop works on a fresh value.
        int rawCnt = updater().get(instance);
        int realCnt = toLiveRealRefCnt(rawCnt, decrement);
        if (decrement == realCnt) {
            if (tryFinalRelease0(instance, rawCnt)) {
                return true;
            }
        } else if (decrement < realCnt) {
            if (updater().compareAndSet(instance, rawCnt, rawCnt - (decrement << 1))) {
                return false;
            }
        } else {
            throw new IllegalReferenceCountException(realCnt, -decrement);
        }
        Thread.yield();
    }
}
The first non-volatile read may be stale, but that is acceptable because it is only used as the expected value of a CAS. If the value was stale, the CAS fails and retryRelease0 re-reads the counter with a volatile get, looping until the update succeeds or the call is rejected.
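Putting retain and release together, the whole scheme fits in a small toy class. The sketch below is built on AtomicInteger, omits overflow handling and the non-volatile first read, and uses invented names (`ParityRefCount`, `ParityRefCountDemo`); it is meant to illustrate the technique, not to mirror Netty's ReferenceCountUpdater.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Toy parity-based counter; a sketch, not Netty's ReferenceCountUpdater.
class ParityRefCount {
    private final AtomicInteger raw = new AtomicInteger(2);          // even: logical count 1
    private final AtomicInteger deallocations = new AtomicInteger(); // counts simulated frees

    int refCnt() {
        int r = raw.get();
        return (r & 1) != 0 ? 0 : r >>> 1;
    }

    int deallocations() { return deallocations.get(); }

    ParityRefCount retain() {
        int old = raw.getAndAdd(2);                 // optimistic add
        if ((old & 1) != 0) {                       // odd: already dead, nothing to roll back
            throw new IllegalStateException("retain on released object");
        }
        return this;                                // overflow handling omitted in this sketch
    }

    boolean release() {
        for (;;) {
            int cur = raw.get();
            if ((cur & 1) != 0) {
                throw new IllegalStateException("release on released object");
            }
            if (cur == 2) {
                if (raw.compareAndSet(2, 1)) {      // final release: even -> odd
                    deallocations.incrementAndGet();
                    return true;
                }
            } else if (raw.compareAndSet(cur, cur - 2)) {
                return false;
            }
            Thread.yield();                         // lost a race; re-read and retry
        }
    }
}

class ParityRefCountDemo {
    // Returns {final release result (1 = true), final logical count, number of deallocations}.
    static int[] stress(int threads, int rounds) {
        ParityRefCount rc = new ParityRefCount();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int r = 0; r < rounds; r++) {
                    rc.retain();
                    rc.release();
                }
            });
            workers[i].start();
        }
        try {
            for (Thread t : workers) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        boolean last = rc.release();                // owner drops the initial reference
        return new int[] { last ? 1 : 0, rc.refCnt(), rc.deallocations() };
    }

    public static void main(String[] args) {
        int[] r = stress(4, 10_000);
        System.out.println("final release: " + (r[0] == 1)
                + ", refCnt: " + r[1] + ", deallocations: " + r[2]);
    }
}
```

Because the owner holds its reference for the whole run, the worker threads never observe a dead counter, and the single final release is the only call that deallocates.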
9. Summary of the four key optimizations
Replace CMPXCHG with the faster XADD instruction for most updates.
Introduce parity‑based reference counting to guarantee correct concurrent semantics.
Check the most common raw values (2 and 4) with == before falling back to the bitwise & parity test, to reduce CPU work.
Avoid memory‑barrier costs by performing the first refCnt read with a non‑volatile Unsafe access.
Bin's Tech Cabin
Original articles dissecting source code and sharing personal tech insights. A modest space for serious discussion, free from noise and bureaucracy.
