
Understanding malloc/free: ptmalloc internals and high‑performance memory pooling on Linux

This article dissects glibc's ptmalloc implementation, explaining malloc_state, malloc_chunk, fastbins, bins, and the top chunk; it then analyzes lock contention and fragmentation in multithreaded scenarios through a custom stress test, and closes with recommendations for high-performance memory management.

Linux Kernel Journey

1. ptmalloc Architecture

glibc's memory allocator resides in the ptmalloc module (source: glibc/malloc/malloc.c). It manages memory via several core structures:

struct malloc_state: contains a mutex, flags, the fastbins, a pointer to the top chunk, the bins array, a binmap, and a linked-list pointer to the next arena's state.

struct malloc_chunk: represents an individual chunk, with fields mchunk_prev_size and mchunk_size, forward and backward free-list pointers (fd, bk), and largebin size-ordering links (fd_nextsize, bk_nextsize).

fastbins: ten singly-linked lists holding small free chunks of 16-160 bytes.

bins: an array of 128 entries split into the unsorted bin, smallbins (32-1008 bytes), and largebins (>1024 bytes).

top chunk: the large "super-chunk" from which new allocations are carved when no suitable free chunk exists.

At process start the main thread creates the primary arena (main_state) using brk or mmap. For multithreaded programs, additional thread_state arenas are created on demand; these are allocated only via mmap, which reduces lock contention by letting threads allocate from separate arenas.

2. malloc and free Flow

When a program calls malloc, the allocator first searches the appropriate free list (fastbins or bins). If no chunk fits, it carves the allocation from the top chunk. If the top chunk lacks sufficient space, the allocator expands the arena via brk or mmap. free returns a chunk to the corresponding free list; large chunks may be returned to the kernel with munmap, or the heap trimmed with brk.

3. High‑Concurrency Testing

The author provides a multithreaded stress test that repeatedly allocates memory, freeing large allocations and deliberately leaking small ones. Parameters:

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <pthread.h>
#include <stdatomic.h>

#define THREAD_NUM     (8)
#define SIZE_THRESHOLD (1024)  // upper bound on allocation size
#define LEAK_THRESHOLD (64)    // allocations below this size are leaked

atomic_int leak_size;

void *do_malloc(void *arg) {
    // Per-thread seed: rand_r is thread-safe, and seeding once avoids the
    // classic bug of calling srand(time(0)) before every draw.
    unsigned int seed = (unsigned int)time(NULL) ^ (unsigned int)(uintptr_t)pthread_self();
    while (1) {
        int size = rand_r(&seed) % SIZE_THRESHOLD;
        char *p = malloc(size);
        if (size >= LEAK_THRESHOLD) {
            free(p);
        } else {  // deliberately leak small allocations
            int total = atomic_fetch_add(&leak_size, size) + size;
            printf("leak size:%d KB\n", total / 1024);
        }
    }
    return NULL;
}

int main(int argc, char *argv[]) {
    atomic_init(&leak_size, 0);
    pthread_t th[THREAD_NUM];
    for (int i = 0; i < THREAD_NUM; i++) {
        pthread_create(&th[i], NULL, do_malloc, NULL);
    }
    do_malloc(NULL);  // the main thread allocates too; the test runs forever
    for (int i = 0; i < THREAD_NUM; i++) {
        pthread_join(th[i], NULL);
    }
    return 0;
}

Two test dimensions are measured:

Lock contention: observed with strace -tt -f -e trace=futex ./a.out, which shows frequent futex system calls, confirming that malloc/free acquire the arena mutex on each allocation.

Memory fragmentation: measured by tracking total memory used by the process, the leaked bytes counted by the test, and computing fragmentation = used - leaked. The test records:

Initially the leak counter reads 1.8 MB with 2724 MB of system memory free. After running, the leak grows to 162 MB, leaving 2474 MB free. The program therefore consumes 2724 - 2474 = 250 MB, of which 250 - 162 = 88 MB is fragmentation.

An additional strace -tt -f -e trace=mprotect,brk ./a.out shows frequent mprotect and brk calls, indicating that the allocator repeatedly expands the arena under load.

4. Conclusion

In multithreaded workloads, the default malloc / free suffer from frequent lock acquisition, degrading allocation throughput.

Long‑lived allocations cause significant heap fragmentation, as demonstrated by the 88 MB of fragmented memory after the test.

For low‑contention scenarios the standard allocator is acceptable, but high‑concurrency or long‑lived memory‑intensive applications should adopt a custom memory pool or arena‑based allocator to avoid lock contention and fragmentation.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Concurrency, malloc, high performance, memory allocation, glibc, ptmalloc, memory pool