Fundamentals 57 min read

Design and Implementation of a High‑Concurrency Memory Pool in C++

This article presents a comprehensive design and implementation of a high‑concurrency memory pool in C++, covering concepts such as fixed‑size block allocation, thread‑local caches, central and page caches, lock‑free techniques, span management, and performance benchmarking against standard malloc.

Deepin Linux

Aug 18, 2023

Design and Implementation of a High‑Concurrency Memory Pool in C++

The article introduces memory pools (fixed‑size‑block allocation) as an alternative to direct new / malloc calls, explaining how pre‑allocating large memory chunks reduces fragmentation and improves performance.

Basic concepts describe the pool, dynamic allocation, internal and external fragmentation, and the role of malloc as a built‑in pool.

Fixed‑size memory pool design outlines three steps: opening memory with malloc or system page allocation, allocating objects by carving the large block, and releasing objects back into a free list stored in the first few bytes of each block.

#pragma once
#include <iostream>
#include <vector>
#include <time.h>
using std::cout;
using std::endl;

template <class T>
class myObject {
public:
    T* New() {
        T* obj = nullptr;
        if (_freeList) {
            void* next = *(void**)_freeList;
            obj = (T*)_freeList;
            _freeList = next;
        } else {
            if (_remainBytes < sizeof(T)) {
                _remainBytes = 1024 * 128;
                _memory = (char*)SystemAlloc(_remainBytes >> 13);
                if (_memory == nullptr) throw std::bad_alloc();
            }
            obj = (T*)_memory;
            size_t objSize = sizeof(T) < sizeof(void*) ? sizeof(void*) : sizeof(T);
            _memory += objSize;
            _remainBytes -= objSize;
        }
        new(obj) T;
        return obj;
    }
    void Delete(T* obj) {
        obj->~T();
        *(void**)obj = _freeList;
        _freeList = obj;
    }
private:
    char* _memory = nullptr;
    size_t _remainBytes = 0;
    void* _freeList = nullptr;
};

Thread‑local cache (ThreadCache) provides lock‑free allocation for small objects. Each thread owns a ThreadCache with an array of FreeList buckets indexed by size class. Allocation first checks the local free list; if empty, it fetches a batch from the central cache using a slow‑start algorithm.

void* ThreadCache::Allocate(size_t size) {
    assert(size <= MAX_BYTES);
    size_t alignSize = SizeClass::RoundUp(size);
    size_t index = SizeClass::Index(size);
    if (!_freeLists[index].Empty())
        return _freeLists[index].Pop();
    else
        return FetchFromCentralCache(index, alignSize);
}

void* ThreadCache::FetchFromCentralCache(size_t index, size_t size) {
    size_t batchNum = std::min(_freeLists[index].MaxSize(), SizeClass::NumMoveSize(size));
    if (batchNum == _freeLists[index].MaxSize())
        ++_freeLists[index].MaxSize();
    void* start = nullptr; void* end = nullptr;
    size_t actualNum = CentralCache::GetInstance()->FetchRangeObj(start, end, batchNum, size);
    if (actualNum > 1)
        _freeLists[index].PushRange(NextObj(start), end, actualNum - 1);
    return start;
}

Central cache (CentralCache) holds spans of memory obtained from the page cache. It supplies batches of objects to thread caches and receives returned objects for reuse. Span selection, locking, and batch fetching are handled here.

size_t CentralCache::FetchRangeObj(void*& start, void*& end, size_t batchNum, size_t size) {
    size_t index = SizeClass::Index(size);
    _spanLists[index]._mtx.lock();
    Span* span = GetOneSpan(_spanLists[index], size);
    assert(span && span->_freeList);
    start = span->_freeList;
    end = start;
    size_t i = 0, actualNum = 1;
    while (i < batchNum - 1 && NextObj(end) != nullptr) {
        end = NextObj(end);
        ++i; ++actualNum;
    }
    span->_freeList = NextObj(end);
    NextObj(end) = nullptr;
    span->_useCount += actualNum;
    _spanLists[index]._mtx.unlock();
    return actualNum;
}

Page cache (PageCache) manages memory at the page granularity (8 KB pages). It allocates large spans from the OS, splits them into smaller spans when needed, and merges adjacent free spans to reduce external fragmentation.

Span* PageCache::NewSpan(size_t k) {
    assert(k > 0);
    if (k > NPAGES - 1) {
        void* ptr = SystemAlloc(k);
        Span* span = _spanPool.New();
        span->_pageId = ((PAGE_ID)ptr >> PAGE_SHIFT);
        span->_n = k;
        _idSpanMap.set(span->_pageId, span);
        return span;
    }
    if (!_spanLists[k].Empty())
        return _spanLists[k].PopFront();
    // search larger buckets, split, or allocate a big span (128 pages) and recurse
    // ... (omitted for brevity) ...
    return NewSpan(k);
}

Global allocation interface replaces malloc / free with ConcurrentAlloc and ConcurrentFree. For requests larger than MAX_BYTES (256 KB) it allocates whole pages via PageCache; otherwise it uses the thread‑local cache.

static void* ConcurrentAlloc(size_t size) {
    if (size > MAX_BYTES) {
        size_t alignSize = SizeClass::RoundUp(size);
        size_t kPage = alignSize >> PAGE_SHIFT;
        Span* span = PageCache::GetInstance()->NewSpan(kPage);
        span->_objSize = size;
        return (void*)(span->_pageId << PAGE_SHIFT);
    } else {
        if (pTLSThreadCache == nullptr) {
            static myObject<ThreadCache> tcPool;
            pTLSThreadCache = tcPool.New();
        }
        return pTLSThreadCache->Allocate(size);
    }
}

static void ConcurrentFree(void* ptr) {
    Span* span = PageCache::GetInstance()->MapObjectToSpan(ptr);
    size_t size = span->_objSize;
    if (size > MAX_BYTES) {
        PageCache::GetInstance()->_pageMtx.lock();
        PageCache::GetInstance()->ReleaseSpanToPageCache(span);
        PageCache::GetInstance()->_pageMtx.unlock();
    } else {
        assert(pTLSThreadCache);
        pTLSThreadCache->Deallocate(ptr, size);
    }
}

The implementation also integrates a radix‑tree based page‑to‑span mapping (TCMalloc_PageMap2) to replace a hash map, reducing lock contention and improving scalability on multi‑core systems.

Performance tests compare the custom allocator against the system malloc under multi‑threaded workloads. Results show a significant reduction in allocation/deallocation time, demonstrating the efficiency gains of the three‑level cache design.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

performance concurrency Thread Cache C++memory pool

Written by

Deepin Linux

Research areas: Windows & Linux platforms, C/C++ backend development, embedded systems and Linux kernel, etc.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.