
Lock‑Free Queues: Use Cases, Implementations, and Performance Analysis

The article explains when lock‑free queues are needed, why they outperform locked alternatives, presents single‑producer and multi‑producer implementations with detailed code, benchmarks their throughput versus mutex‑based queues, and concludes they excel for low‑contention or dominant‑producer scenarios but may lag under high contention.

Tencent Cloud Developer

1. When to use a lock‑free queue

High‑frequency data streams (e.g., market data) that process millions of items per second may benefit from lock‑free queues. For low‑volume workloads, a simple mutex is sufficient.

2. Problems with locked queues

When many threads compete for the same lock, contention forces context switches and bounces the lock's cache line between cores, increasing CPU stalls; if the queue also performs memory allocation while holding the lock, that overhead compounds the problem.
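For contrast, a baseline locked queue can be as simple as the following sketch (a hypothetical MutexQueue, not code from the original article): every operation funnels through a single mutex, which is exactly the serialization point described above.

```cpp
#include <mutex>
#include <optional>
#include <queue>

// Baseline for comparison: every push/pop takes the same lock, so all
// threads serialize on one mutex and its cache line.
template <typename T>
class MutexQueue {
public:
    void push(const T &v) {
        std::lock_guard<std::mutex> g(m_);
        q_.push(v);
    }
    // Returns std::nullopt when the queue is empty.
    std::optional<T> pop() {
        std::lock_guard<std::mutex> g(m_);
        if (q_.empty())
            return std::nullopt;
        T v = q_.front();
        q_.pop();
        return v;
    }

private:
    std::mutex m_;
    std::queue<T> q_;
};
```

Under light contention this is perfectly adequate; the lock-free designs below only start to pay off at high message rates.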

3. One‑read‑one‑write implementation (yqueue/ypipe)

yqueue handles chunked memory allocation and element storage, while ypipe layers a single-producer/single-consumer interface on top of it. ypipe keeps three thread-local pointers: w (the first un-flushed item, touched only by the writer), r (the first un-prefetched item, touched only by the reader), and f (the first item to be flushed in the future). A single shared atomic pointer c is the only point of contention between the two threads.

template<typename T, int N>
struct chunk_t {
    T values[N];    // N elements stored per chunk
    chunk_t *prev;
    chunk_t *next;
};

// Advance the back of the queue by one slot, growing by a whole chunk
// when the current one fills up. A cached spare chunk is reused to
// avoid calling malloc on every chunk boundary.
inline void push() {
    back_chunk = end_chunk;
    back_pos = end_pos;
    if (++end_pos != N)
        return;
    // Current chunk is full: reuse the spare chunk if one is cached,
    // otherwise allocate a fresh one.
    chunk_t *sc = spare_chunk.xchg(NULL);
    if (sc) {
        end_chunk->next = sc;
        sc->prev = end_chunk;
    } else {
        end_chunk->next = (chunk_t *)malloc(sizeof(chunk_t));
        alloc_assert(end_chunk->next);
        end_chunk->next->prev = end_chunk;
    }
    end_chunk = end_chunk->next;
    end_pos = 0;
}

The flush function moves the write pointer w forward and updates the shared pointer c. It returns false when the reader thread is sleeping, prompting the writer to wake it.

// Publish all written-but-unflushed items to the reader. Returns false
// if the reader has gone to sleep, in which case the writer must wake it.
inline bool flush() {
    // Nothing new to flush.
    if (w == f)
        return true;
    // Try to move c from w to f. If c no longer equals w, the reader
    // found the pipe empty and went to sleep in the meantime.
    if (c.cas(w, f) != w) {
        c.set(f);
        w = f;
        return false;   // caller must wake the sleeping reader
    }
    w = f;
    return true;
}

The write function only updates f; the queue becomes readable only after flush changes w. The read function first tries to prefetch data via check_read and then consumes it.

// Ensure at least one flushed item is available to read.
inline bool check_read() {
    // Data already prefetched by an earlier call?
    if (&queue.front() != r && r)
        return true;
    // Prefetch: read c. If c still points at the front of the queue,
    // nothing has been flushed; the CAS then also writes NULL into c,
    // signalling the writer that the reader is going to sleep.
    r = c.cas(&queue.front(), NULL);
    if (&queue.front() == r || !r)
        return false;   // nothing readable
    return true;
}

// Pop one item; returns false if the pipe is currently empty.
inline bool read(T *value_) {
    if (!check_read())
        return false;
    *value_ = queue.front();
    queue.pop();
    return true;
}

4. Multi‑producer/multi‑consumer lock‑free queue (RingBuffer / ArrayLockFreeQueue)

The RingBuffer is a circular array coordinated by four counters: m_writeIndex (the next slot for a producer to reserve), m_readIndex (the next slot for a consumer to take), m_maximumReadIndex (the boundary below which slots are fully written and safe to read), and m_count. A producer first reserves a slot with a CAS on m_writeIndex, copies its element in, and then publishes it by advancing m_maximumReadIndex.

template<typename ELEM_T, QUEUE_INT Q_SIZE = ARRAY_LOCK_FREE_Q_DEFAULT_SIZE>
class ArrayLockFreeQueue {
public:
    bool enqueue(const ELEM_T &a_data) {
        QUEUE_INT curWrite, curRead;
        do {
            curWrite = m_writeIndex;
            curRead = m_readIndex;
            if (countToIndex(curWrite + 1) == countToIndex(curRead))
                return false; // full
            // Reserve slot curWrite by advancing m_writeIndex.
        } while (!CAS(&m_writeIndex, curWrite, curWrite + 1));
        m_thequeue[countToIndex(curWrite)] = a_data;
        // Publish in reservation order: wait until all earlier writers
        // have published, then move m_maximumReadIndex past our slot.
        while (!CAS(&m_maximumReadIndex, curWrite, curWrite + 1)) {
            sched_yield(); // give preempted writers a chance to finish
        }
        AtomicAdd(&m_count, 1);
        return true;
    }
    bool dequeue(ELEM_T &a_data) {
        QUEUE_INT curRead, curMaxRead;
        do {
            curRead = m_readIndex;
            curMaxRead = m_maximumReadIndex;
            // Only slots below m_maximumReadIndex are fully published.
            if (countToIndex(curRead) == countToIndex(curMaxRead))
                return false; // empty
            // Copy before the CAS: if the CAS fails, another consumer
            // took this slot and we must re-read.
            a_data = m_thequeue[countToIndex(curRead)];
        } while (!CAS(&m_readIndex, curRead, curRead + 1));
        AtomicSub(&m_count, 1);
        return true;
    }
private:
    ELEM_T m_thequeue[Q_SIZE];
    volatile QUEUE_INT m_writeIndex;       // next slot to reserve
    volatile QUEUE_INT m_readIndex;        // next slot to consume
    volatile QUEUE_INT m_maximumReadIndex; // publish boundary
    volatile QUEUE_INT m_count;            // optional element counter
    inline QUEUE_INT countToIndex(QUEUE_INT a) { return a % Q_SIZE; }
};
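The class above depends on helpers the article does not show (QUEUE_INT, CAS, AtomicAdd, AtomicSub). One plausible set of definitions, assuming GCC-style atomic builtins as such implementations commonly use, would be:

```cpp
#include <sched.h>    // sched_yield
#include <stdint.h>

// Assumed definitions; the original article does not include them.
typedef uint32_t QUEUE_INT;
#define ARRAY_LOCK_FREE_Q_DEFAULT_SIZE 65536

// CAS swaps *ptr from oldVal to newVal and returns true only if *ptr
// still equalled oldVal at the moment of the swap.
#define CAS(ptr, oldVal, newVal) \
    __sync_bool_compare_and_swap((ptr), (oldVal), (newVal))

#define AtomicAdd(ptr, v) __sync_fetch_and_add((ptr), (v))
#define AtomicSub(ptr, v) __sync_fetch_and_sub((ptr), (v))
```

With these in place the class compiles on GCC/Clang; a production version today would more likely use std::atomic and compare_exchange_weak.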

The implementation then examines size(): deriving it from m_writeIndex and m_readIndex can return a wrong value when the indices change between the two reads (the ABA-style race the article describes), so an atomic m_count member (enabled by the macro ARRAY_LOCK_FREE_Q_KEEP_REAL_SIZE) is maintained to report an accurate size.
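The two size() strategies can be sketched as follows (illustrative only; the member names mirror the article's, and std::atomic stands in for the volatile counters):

```cpp
#include <atomic>
#include <cstdint>

// Sketch contrasting a derived size() with a counted size().
struct QueueCounters {
    std::atomic<uint32_t> m_writeIndex{0};
    std::atomic<uint32_t> m_readIndex{0};
    std::atomic<uint32_t> m_count{0};

    // Derived size: the two indices are read at different instants, so
    // concurrent enqueues/dequeues between the reads can make the
    // result stale or inconsistent (the race described above).
    uint32_t derivedSize() const {
        uint32_t w = m_writeIndex.load();
        uint32_t r = m_readIndex.load();
        return w - r;
    }

    // Counted size: one atomic counter bumped on every enqueue and
    // dequeue is always internally consistent, at the cost of one more
    // contended atomic operation per queue operation.
    uint32_t countedSize() const { return m_count.load(); }
};
```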

5. Performance evaluation

Benchmarks compare lock‑free queues with mutex‑based queues, mutex + condition‑variable (BlockQueue), memory‑barrier linked‑list queues, and the RingBuffer. Results show:

Single‑producer/single‑consumer lock‑free queues outperform blocking queues.

With multiple producers, the second CAS (updating m_maximumReadIndex) becomes a bottleneck, reducing performance.

RingBuffer achieves the best throughput in 1‑writer‑4‑reader scenarios (up to 3× faster than memory‑barrier queues).

Using sched_yield() helps when the number of threads exceeds CPU cores, preventing busy‑waiting on the second CAS.

Additional observations include memory‑leak risks when storing smart pointers in the queue and the impact of cache line thrashing on multi‑producer performance.
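To reproduce comparisons like these, a minimal throughput harness for the mutex-based baseline might look like the sketch below (hypothetical code, not the article's benchmark; timings are machine-dependent, and a lock-free queue exposing the same push/pop shape can be swapped in for the std::queue + std::mutex pair):

```cpp
#include <chrono>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

// Push `items` integers through a mutex-protected queue with one
// producer and one consumer, and report the elapsed time. Returns the
// number of items consumed so callers can sanity-check the run.
static long long run_mutex_queue(int items) {
    std::queue<int> q;
    std::mutex m;
    long long consumed = 0;   // touched only by the consumer thread
    auto t0 = std::chrono::steady_clock::now();
    std::thread prod([&] {
        for (int i = 0; i < items; ++i) {
            std::lock_guard<std::mutex> g(m);
            q.push(i);
        }
    });
    std::thread cons([&] {
        while (consumed < items) {
            std::lock_guard<std::mutex> g(m);
            if (!q.empty()) {
                q.pop();
                ++consumed;
            }
        }
    });
    prod.join();
    cons.join();
    auto t1 = std::chrono::steady_clock::now();
    long long us =
        std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0)
            .count();
    std::printf("%d items in %lld us\n", items, us);
    return consumed;
}
```

The consumer busy-polls under the lock, which is deliberately pessimistic; a condition-variable BlockQueue variant and the lock-free queues would each get their own run with the same item count.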

6. Conclusions

Lock‑free queues avoid deadlocks and provide safe concurrent push/pop.

They are suitable when there is a single producer or when the producer is dominant and occasional extra producers exist.

For high‑contention multi‑producer workloads, traditional blocking queues may still be preferable.

The article also provides references to further reading on lock‑free data structures and related concurrency topics.


Tags: concurrency, C++, lock-free queue