
How to Build a High‑Performance Object Pool for Multithreaded Systems

This article explores the motivation, design, implementation, and performance testing of a high‑performance object pool that reduces allocation overhead in multithreaded environments by using thread‑local storage, freelists, and lock‑optimized global resources.

FunTester

Oct 30, 2023

Background

Memory pools speed up allocation of frequently requested memory, but they hand back raw memory only, so objects with expensive constructors and destructors are still built and torn down on every use. When a system creates and destroys many objects rapidly, an object pool can eliminate the repeated malloc / free and constructor/destructor overhead.

Goals

Reuse objects to avoid frequent allocation/deallocation and reduce construction cost.

Achieve low latency allocation and release.

Provide thread‑safe access.

Support dynamic capacity growth.

Prefer returning already‑used objects.

Survey of Existing Object Pools

brpc object pool

Uses batch allocation and per‑thread free blocks. Allocation steps:

Check thread‑local free block; if present, pop an object.

If empty, try to obtain a block from the global pool.

If the global pool is empty, request a large memory chunk from the OS and carve the first object.

Release pushes the object into the thread‑local free array; when the array is full it is flushed to the global pool.

Go object pool

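Go's sync.Pool keeps its state per P (a logical processor in the Go scheduler) rather than per goroutine: each P has a private slot and a shared list. The private slot is accessed without locking; the shared list was guarded by a mutex in older Go releases and has been a lock‑free queue since Go 1.13. Allocation order is private → own shared list → other Ps' shared lists, finally falling back to a user‑provided New function.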

Netty recycler

Maintains a thread‑local Stack and a WeakOrderQueue for cross‑thread recycling. Each thread can hold up to 2 × CPU‑cores queues, and each queue stores up to 16 objects per link. Allocation checks the local stack, then the associated queues, and creates a new object only if none are available.

Overall Design

The proposed pool combines a freelist, thread‑local storage (TLS) and multiple global resource pools. The freelist stores only a head pointer, so push/pop operations modify a single pointer, giving very low latency and reducing lock contention.
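
To make the single-pointer push/pop concrete, here is a minimal sketch of an intrusive freelist. The names Node, FreeList, Push and Pop are illustrative and not taken from the pool's source, and the sketch is deliberately single-threaded.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical node type: a free slot only needs room for one pointer.
struct Node {
    Node* next = nullptr;
};

// Intrusive LIFO freelist: the list itself is just a head pointer and a count.
struct FreeList {
    Node* head = nullptr;
    std::size_t length = 0;

    // Push links the node in front of the current head.
    void Push(Node* n) {
        assert(n != nullptr);
        n->next = head;
        head = n;
        ++length;
    }

    // Pop returns the most recently pushed node, or nullptr when empty.
    Node* Pop() {
        if (head == nullptr) return nullptr;
        Node* n = head;
        head = n->next;
        n->next = nullptr;
        --length;
        return n;
    }
};
```

Because the link lives inside the free node itself, the list needs no storage of its own beyond the head pointer and the counter.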

Component Details

Local Pool

One instance per thread, accessed via TLS. Holds a pointer to the current Block and a FreeSlots list. When the free‑slot list reaches a threshold it is returned to the global pool.
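
A rough sketch of the per-thread access pattern, assuming C++11 thread_local is used for TLS. LocalPoolSketch, GetLocalPool, RecordAllocation and CountOnNewThread are hypothetical names, and the counter merely stands in for the real Block pointer and FreeSlots list.

```cpp
#include <cassert>
#include <thread>

// Hypothetical stand-in for the per-thread pool state.
struct LocalPoolSketch {
    int allocations = 0;
};

// Each thread that calls this gets its own instance, created on first use.
inline LocalPoolSketch& GetLocalPool() {
    thread_local LocalPoolSketch pool;
    return pool;
}

// Bumps the calling thread's counter and returns the new value.
// No lock is needed because no other thread can see this instance.
inline int RecordAllocation() {
    return ++GetLocalPool().allocations;
}

// Runs n allocations on a fresh thread and reports that thread's final count.
inline int CountOnNewThread(int n) {
    assert(n >= 0);
    int result = 0;
    std::thread t([&] {
        for (int i = 0; i < n; ++i) result = RecordAllocation();
    });
    t.join();
    return result;
}
```

The new thread starts counting from zero regardless of how many allocations the calling thread has recorded, which is the isolation the local pool relies on.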

Global Pool

Manages a BlockManager and a FreeSlotsManager. Several global pool instances can be created to spread lock contention. BlockManager tracks BlockChunks; FreeSlotsManager stores pointers to free‑slot chains.

Data Structures

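template <class T> union Slot { Slot* next_; T val_; Slot() : next_(nullptr) {} ~Slot() {} };  // a free slot holds the link, a live slot holds the object

template <class T> struct Block { Slot<T> slots_[kBlockSize]; size_t idx_ = 0; };  // idx_: next never-used slot

template <class T> struct BlockChunk { Block<T> blocks_[kBlockChunkSize]; size_t idx_ = 0; };  // idx_: next never-used block

template <class T> struct FreeSlots { Slot<T>* head_ = nullptr; size_t length_ = 0; };  // intrusive chain of recycled slots

template <class T> struct BlockManager { std::vector<BlockChunk<T>*> block_chunks_; };

template <class T> struct FreeSlotsManager { size_t free_num_ = 0; std::vector<Slot<T>*> freeslots_ptrs; };  // one head pointer per returned chain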
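
The union is what lets a slot cost no more than the object it holds: the freelist link and the object share the same bytes. The following sketch illustrates that overlay; SlotSketch, Payload and AsObject are hypothetical names introduced for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical slot: while free, the bytes hold the freelist link;
// while in use, the same bytes hold the object.
template <class T>
union SlotSketch {
    SlotSketch* next_;
    T val_;
    SlotSketch() : next_(nullptr) {}
    ~SlotSketch() {}  // the pool, not the union, decides when T is destroyed
};

// Example payload that is larger than a pointer.
struct Payload {
    std::uint64_t a;
    std::uint64_t b;
};

// The link adds no space on top of the payload.
static_assert(sizeof(SlotSketch<Payload>) == sizeof(Payload),
              "slot should be exactly as large as the payload");

// A slot and the object stored in it start at the same address.
template <class T>
T* AsObject(SlotSketch<T>* s) {
    assert(s != nullptr);
    return reinterpret_cast<T*>(s);
}
```

This shared address is why the pool can convert between a slot pointer and an object pointer with a plain cast.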

Allocation Flow

If the local FreeSlots has a free slot, pop it.

Otherwise try to pop a free‑slot chain from the global pool.

If that fails, allocate a slot from the local Block.

If the local block is exhausted, request a new block from the global pool.

If the global pool has no blocks, allocate a new BlockChunk and split a block for the local pool.

T* GetObject() {
    // Steps 1-2: reuse a recycled slot, from the local chain first,
    // then from a chain fetched out of the global pool.
    if (freeslots_.head_ != nullptr || global_pool_->PopFreeSlots(freeslots_)) {
        Slot<T>* res = freeslots_.head_;
        freeslots_.head_ = res->next_;
        --freeslots_.length_;
        return reinterpret_cast<T*>(res);
    }
    // Step 3: carve a never-used slot out of the current block.
    // The null check covers a thread that has not been given a block yet.
    if (block_ != nullptr && block_->idx_ < kBlockSize) {
        return reinterpret_cast<T*>(&block_->slots_[block_->idx_++]);
    }
    // Steps 4-5: take a fresh block. PopBlock presumably allocates a new
    // BlockChunk itself when the global pool has no blocks left.
    if (Block<T>* blk = global_pool_->PopBlock()) {
        block_ = blk;
        return reinterpret_cast<T*>(&block_->slots_[block_->idx_++]);
    }
    return nullptr;  // out of memory
}
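
The ordering of the first branches is what makes the pool prefer already-used objects. A stripped-down, single-threaded sketch of that ordering follows; IntSlot, MiniLocalPool and kSketchBlockSize are hypothetical, and the global pool is left out entirely.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kSketchBlockSize = 4;  // assumed; tiny for illustration

// Free slots hold a link, live slots hold an int.
union IntSlot {
    IntSlot* next_;
    int val_;
};

struct MiniLocalPool {
    IntSlot slots_[kSketchBlockSize];  // stands in for the current Block
    std::size_t idx_ = 0;              // next never-used slot
    IntSlot* free_head_ = nullptr;     // recycled slots, most recent first

    int* Get() {
        // Recycled slots are preferred over fresh ones.
        if (free_head_ != nullptr) {
            IntSlot* s = free_head_;
            free_head_ = s->next_;
            return &s->val_;
        }
        // Otherwise bump the index into the block.
        if (idx_ < kSketchBlockSize) return &slots_[idx_++].val_;
        return nullptr;  // block exhausted; the real pool would fetch a new one
    }

    void Return(int* p) {
        assert(p != nullptr);
        IntSlot* s = reinterpret_cast<IntSlot*>(p);
        s->next_ = free_head_;
        free_head_ = s;
    }
};
```

Returning a slot and asking again yields the same address, so hot objects stay hot in cache while the untouched tail of the block is only used on demand.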

Recycle Flow

Returned objects are inserted at the head of the local FreeSlots. When the list size reaches kFreeSlotsSize, the whole list is pushed back to the global pool.

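void ReturnObject(T* obj) {
    // obj must already have been destroyed: the slot's first bytes are
    // about to be overwritten by the freelist link.
    Slot<T>* s = reinterpret_cast<Slot<T>*>(obj);
    s->next_ = freeslots_.head_;
    freeslots_.head_ = s;
    ++freeslots_.length_;
    if (freeslots_.length_ == kFreeSlotsSize) {
        // Hand the whole chain to the global pool. PushFreeSlots is expected
        // to leave freeslots_ empty (head_ == nullptr, length_ == 0).
        global_pool_->PushFreeSlots(freeslots_);
    }
}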
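
The article does not show PushFreeSlots or PopFreeSlots. One way they could behave, sketched under the assumption that whole chains change hands and the local list is reset on flush, is shown here. ChainNode, Chain, GlobalChains, Recycle and kSketchFreeSlotsSize are hypothetical names, and a plain std::mutex is used for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

constexpr std::size_t kSketchFreeSlotsSize = 3;  // assumed flush threshold

struct ChainNode {
    ChainNode* next_ = nullptr;
};

struct Chain {
    ChainNode* head_ = nullptr;
    std::size_t length_ = 0;
};

// Hypothetical global side: stores whole chains, never single slots.
class GlobalChains {
public:
    // Takes the chain and leaves the caller's list empty.
    void PushFreeSlots(Chain& local) {
        std::lock_guard<std::mutex> guard(mtx_);
        chains_.push_back(local);
        local = Chain{};
    }

    // Moves one stored chain into local; returns false if none is stored.
    bool PopFreeSlots(Chain& local) {
        std::lock_guard<std::mutex> guard(mtx_);
        if (chains_.empty()) return false;
        local = chains_.back();
        chains_.pop_back();
        return true;
    }

private:
    std::mutex mtx_;
    std::vector<Chain> chains_;
};

// Local recycle step: push at the head, flush once the threshold is reached.
inline void Recycle(Chain& local, GlobalChains& global, ChainNode* n) {
    assert(n != nullptr);
    n->next_ = local.head_;
    local.head_ = n;
    if (++local.length_ == kSketchFreeSlotsSize) global.PushFreeSlots(local);
}
```

Moving a whole chain per lock acquisition means the global lock is taken once per kSketchFreeSlotsSize returns instead of once per object.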

Lock Optimisation

The global pool uses two separate locks: a spin lock for FreeSlotsManager (short critical sections) and a mutex for BlockManager (potentially long allocations). This separation reduces overall contention and yields roughly a 9 % latency reduction.
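
As an illustration of the two-lock split, the sketch below guards a free list with a spin lock and the slower growth path with a mutex. It uses std::atomic_flag and std::mutex instead of the pthread primitives, and SpinLock, TwoLockPool and the integer ids are hypothetical stand-ins for the real managers.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <vector>

// Minimal spin lock on std::atomic_flag; suited to very short critical sections.
class SpinLock {
public:
    void lock() {
        while (flag_.test_and_set(std::memory_order_acquire)) {
            // busy-wait until the holder clears the flag
        }
    }
    void unlock() { flag_.clear(std::memory_order_release); }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

class TwoLockPool {
public:
    // Fast path: a few pointer-sized operations under the spin lock.
    void PushFree(int id) {
        std::lock_guard<SpinLock> guard(free_lck_);
        free_ids_.push_back(id);
    }

    bool PopFree(int& id) {
        std::lock_guard<SpinLock> guard(free_lck_);
        if (free_ids_.empty()) return false;
        id = free_ids_.back();
        free_ids_.pop_back();
        return true;
    }

    // Slow path: growth may take a while, so waiters sleep on a mutex.
    // Returns the first id of the newly reserved range.
    int Grow(int count) {
        assert(count > 0);
        std::lock_guard<std::mutex> guard(block_mtx_);
        const int first = next_id_;
        next_id_ += count;
        return first;
    }

private:
    SpinLock free_lck_;
    std::mutex block_mtx_;
    std::vector<int> free_ids_;
    int next_id_ = 0;
};
```

With separate locks, a thread stuck in Grow does not block other threads that only want to push or pop free entries.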

Cache‑Line Alignment

Both LocalPool and GlobalPool are aligned to 64‑byte cache lines to avoid false sharing.

template <class T> class GlobalPool;  // forward declaration for LocalPool

template <class T>
struct alignas(64) LocalPool {
    GlobalPool<T>* global_pool_;
    Block<T>* block_;
    FreeSlots<T> freeslots_;
};

template <class T>
class alignas(64) GlobalPool {
    BlockManager<T> block_manager_;
    FreeSlotsManager<T> freeslots_manager_;
    pthread_spinlock_t freeslots_lck_;  // short critical sections
    pthread_mutex_t block_mtx_;         // may be held across a chunk allocation
};
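
The effect of the alignment can be checked at compile time. A small sketch, assuming a 64-byte cache line (typical on x86-64 but not universal); PaddedCounter, OnDifferentLines and kCacheLine are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumed cache-line size

// alignas rounds both the alignment and the size up to a full line.
struct alignas(kCacheLine) PaddedCounter {
    std::uint64_t value = 0;
};

static_assert(alignof(PaddedCounter) == kCacheLine, "starts on a line boundary");
static_assert(sizeof(PaddedCounter) == kCacheLine, "occupies a whole line");

// True when the two addresses fall on different cache lines.
inline bool OnDifferentLines(const void* a, const void* b) {
    assert(a != nullptr && b != nullptr);
    return reinterpret_cast<std::uintptr_t>(a) / kCacheLine !=
           reinterpret_cast<std::uintptr_t>(b) / kCacheLine;
}
```

Adjacent array elements of such a type never share a line, so two threads updating neighbouring elements do not invalidate each other's cache.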

Branch‑Prediction Hints

Using __builtin_expect for unlikely paths (e.g., allocation failure) gives a modest 2 % speed gain.

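#define unlikely(x) __builtin_expect(!!(x), 0)  // GCC/Clang hint: the condition is rarely true

BlockChunk<T>* new_chunk = new (std::nothrow) BlockChunk<T>;  // std::nothrow needs <new>
if (unlikely(new_chunk == nullptr)) {
    return false;  // allocation failed: the cold path
}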

Object Construction / Destruction on Reused Memory

Placement new and explicit destructor calls allow constructing objects in pooled memory without extra allocations.

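// Both helpers are assumed to be members of the pool template, so T is the
// pooled type. Requires <new> and <utility>.
template<class... Args>
void Construct(T* p, Args&&... args) {
    new (p) T(std::forward<Args>(args)...);  // run the constructor in pooled memory
}
void Destroy(T* p) {
    p->~T();  // run the destructor without releasing the memory
}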
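
A self-contained usage sketch of the same idea, with free functions instead of pool members. ConstructAt, DestroyAt and ReuseDemo are hypothetical names; the stack buffer stands in for a pooled slot.

```cpp
#include <cassert>
#include <new>
#include <string>
#include <utility>

// Builds a T in caller-provided memory and returns a typed pointer to it.
template <class T, class... Args>
T* ConstructAt(void* mem, Args&&... args) {
    assert(mem != nullptr);
    return new (mem) T(std::forward<Args>(args)...);
}

// Ends the object's lifetime; the memory itself stays available.
template <class T>
void DestroyAt(T* p) {
    p->~T();
}

// Reuses one suitably aligned buffer for two consecutive std::string objects.
inline std::string ReuseDemo() {
    alignas(std::string) unsigned char buf[sizeof(std::string)];
    std::string* first = ConstructAt<std::string>(buf, "first");
    DestroyAt(first);
    std::string* second = ConstructAt<std::string>(buf, 3, 'x');
    std::string copy = *second;
    DestroyAt(second);
    return copy;
}
```

Each construction must be paired with exactly one destruction before the memory is reused, otherwise resources owned by the object leak.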

Testing

Four test categories were executed:

Correctness – write and read values after allocation.

Reuse verification – allocate‑free‑allocate cycles while monitoring RSS to ensure memory is reused.

Leak detection – valgrind --tool=memcheck.

Performance – compare latency against glibc malloc/free, jemalloc and the brpc object pool using perf and custom benchmarks.

Results show >50 % latency reduction in single‑threaded tests and up to 60 % improvement over glibc in multi‑threaded scenarios, while memory consumption stays comparable to the brpc pool.
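
The article does not include its benchmark code. A minimal sketch of how a per-operation latency baseline could be measured is given below; AverageNanos and MallocFreeBaseline are hypothetical helpers, the 64-byte request size is an arbitrary choice, and a serious benchmark would add warm-up, multiple threads and percentile reporting.

```cpp
#include <cassert>
#include <chrono>
#include <cstdlib>

// Average wall-clock nanoseconds per call of op over the given number of runs.
template <class Op>
double AverageNanos(Op op, int iterations) {
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) op();
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / iterations;
}

// Baseline: one malloc/free pair per iteration.
inline double MallocFreeBaseline(int iterations) {
    return AverageNanos([] {
        // volatile discourages the compiler from removing the pair entirely
        void* volatile p = std::malloc(64);
        std::free(p);
    }, iterations);
}
```

The same AverageNanos call can wrap a pool's get/return pair, so both numbers come from an identical measurement loop.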

Conclusion

The pool demonstrates that a combination of thread‑local freelists, cache‑line‑aligned structures and fine‑grained locking can dramatically improve allocation performance in high‑concurrency back‑end services while keeping memory usage predictable and avoiding leaks.
