Understanding the SLUB Memory Allocator: A Deep Dive into Linux Kernel Object Management
SLUB, the default Linux kernel memory allocator, reduces fragmentation and speeds up allocation of frequently created objects such as task_struct and inode by combining per‑CPU caches, object slabs, and NUMA‑aware node caches. This article walks through its data structures, allocation and free paths, tuning parameters, and real‑world case studies.
Introduction
The SLUB memory allocator is a core component of the Linux kernel memory‑management subsystem. It addresses the heavy fragmentation caused by frequent creation and destruction of small kernel objects such as struct task_struct (process descriptors) and struct inode (file nodes). By simplifying the traditional SLAB design, SLUB provides high efficiency with low overhead and is the default allocator in mainstream kernels.
1. SLUB Overview
1.1 What is SLUB?
SLUB (the "unqueued" slab allocator, successor to the original SLAB design) is a kernel‑level allocator optimized for small objects. The kernel's buddy system hands out memory at page granularity (typically 4 KB pages and power‑of‑two multiples), while SLUB carves those pages into fixed‑size object caches so that tiny structures do not each consume a whole page.
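To make the division of labor concrete, the following is a minimal, illustrative kernel snippet (the function name is hypothetical): four contiguous pages are requested straight from the buddy allocator, while a 64‑byte buffer goes through kmalloc() and is served by SLUB from its kmalloc-64 cache.
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/slab.h>
void buddy_vs_slub_example(void) {
    /* Buddy system: page-granular, power-of-two blocks (order 2 = 4 pages). */
    struct page *pages = alloc_pages(GFP_KERNEL, 2);
    /* SLUB: small fixed-size object, taken from the kmalloc-64 slab cache. */
    void *buf = kmalloc(64, GFP_KERNEL);
    if (pages)
        __free_pages(pages, 2);
    kfree(buf);   /* kfree(NULL) is a no-op */
}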
1.2 Why Use SLUB?
Reduced fragmentation: Objects are packed into size‑specific caches, preventing the “large box for small items” problem of the buddy system.
Higher allocation speed: Frequently requested objects are pre‑filled in per‑CPU caches, eliminating costly searches.
Lower overhead: Metadata is stored directly in the page descriptor, and per‑CPU queues avoid global lock contention.
Excellent scalability: Each CPU has its own local cache, minimizing lock competition in multi‑core environments.
2. SLUB Data Structures
2.1 kmem_cache Structure
The kmem_cache struct is the top‑level control unit for one object type. Key members include:
name: Identifier shown in /proc/slabinfo and under /sys/kernel/slab/.
object_size: Size of the object itself, as requested when the cache is created.
size: Memory actually consumed per object, i.e. object_size plus metadata such as red zones and alignment padding.
align: Alignment requirement for each object.
red_left_pad: Left red‑zone padding placed in front of each object when debugging is enabled, used to catch buffer underflows.
flags: Feature bits such as SLAB_HWCACHE_ALIGN and SLAB_RECLAIM_ACCOUNT.
ctor: Constructor callback run on the objects of a freshly populated slab (modern kernels no longer support per‑cache destructors).
cpu_slab: Pointer to the per‑CPU caches (kmem_cache_cpu).
node: Array of kmem_cache_node pointers, one per NUMA node.
The simplified model below illustrates these fields; the real structure is internal to the memory‑management subsystem and is set up through kmem_cache_create(), shown afterwards.
#include <linux/types.h>
#include <linux/slab.h>   /* kzalloc(), SLAB_* flags */
#include <linux/numa.h>   /* MAX_NUMNODES */
struct kmem_cache_cpu;    /* per-CPU cache, see section 2.2 */
struct kmem_cache_node;   /* per-node slab lists, see section 2.3 */
/* Simplified stand-in for the kernel's struct kmem_cache (the real structure
 * is internal to the mm subsystem and is built by kmem_cache_create()). */
struct my_kmem_cache {
    const char *name;                 /* shown in /proc/slabinfo */
    unsigned int object_size;         /* requested object size */
    unsigned int size;                /* per-object footprint incl. metadata */
    unsigned int align;               /* alignment requirement */
    unsigned int red_left_pad;        /* left red-zone padding (debug) */
    unsigned long flags;              /* SLAB_* feature bits */
    void (*ctor)(void *);             /* constructor for new slabs */
    struct kmem_cache_cpu *cpu_slab;  /* per-CPU caches */
    struct kmem_cache_node *node[MAX_NUMNODES]; /* per-NUMA-node lists */
};
struct my_object {
    int data;
    int flag;
};
static void obj_ctor(void *obj) { ((struct my_object *)obj)->flag = 0; }
struct my_kmem_cache *create_kmem_cache_example(void) {
    struct my_kmem_cache *s = kzalloc(sizeof(*s), GFP_KERNEL);
    if (!s)
        return NULL;
    s->name = "my_example_cache";
    s->object_size = sizeof(struct my_object);
    s->align = 16;
    s->red_left_pad = 8;
    s->flags = SLAB_HWCACHE_ALIGN | SLAB_RECLAIM_ACCOUNT;
    s->size = s->object_size + s->red_left_pad;
    s->ctor = obj_ctor;
    return s;
}
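In real kernel code you never fill in kmem_cache by hand; the cache is created and used through the slab API. A minimal sketch reusing struct my_object and obj_ctor from the example above (the cache name is illustrative):
#include <linux/slab.h>
#include <linux/errno.h>
static struct kmem_cache *example_cache;
int cache_api_example(void) {
    struct my_object *obj;
    /* The kernel computes size, order and metadata layout internally. */
    example_cache = kmem_cache_create("my_example_cache", sizeof(struct my_object),
                                      0, SLAB_HWCACHE_ALIGN | SLAB_RECLAIM_ACCOUNT, obj_ctor);
    if (!example_cache)
        return -ENOMEM;
    obj = kmem_cache_alloc(example_cache, GFP_KERNEL);  /* usually served from the per-CPU freelist */
    if (obj)
        kmem_cache_free(example_cache, obj);            /* returns the object to the per-CPU freelist */
    kmem_cache_destroy(example_cache);
    return 0;
}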
2.2 kmem_cache_cpu Structure
Each CPU owns a local cache so that the common allocation and free paths need no locks. Important fields:
freelist: Pointer to the head of the current slab's free‑object list.
tid: Transaction ID, bumped on every operation; the lock‑free fast path uses it to detect migration or preemption.
page: Slab page currently being allocated from.
partial: Per‑CPU list of partially used slabs kept as a refill reserve.
#include <linux/types.h>
struct page;
/* Simplified model of the per-CPU cache; the real fast path updates
 * freelist and tid together with a cmpxchg rather than plain stores. */
struct kmem_cache_cpu {
    void **freelist;       /* next free object in the current slab */
    unsigned long tid;     /* transaction ID, bumped on every operation */
    struct page *page;     /* slab currently being allocated from */
    struct page *partial;  /* per-CPU partial slab reserve */
};
void *kmem_cache_cpu_alloc_example(struct kmem_cache_cpu *c) {
    void *obj;
    if (!c->freelist)              /* fast path only; the slow path refills the list */
        return NULL;
    obj = c->freelist;
    c->freelist = *(void **)obj;   /* each free object stores a pointer to the next one */
    c->tid++;
    return obj;
}
void kmem_cache_cpu_free_example(struct kmem_cache_cpu *c, void *obj) {
    *(void **)obj = c->freelist;   /* push the object back onto the freelist */
    c->freelist = obj;
    c->tid++;
}
2.3 kmem_cache_node Structure
In NUMA systems, each memory node maintains its own slab lists so that allocations stay local to the node. Key fields:
list_lock: Spinlock protecting the node's lists.
nr_partial / nr_slabs: Counters of partial slabs and of all slabs on the node.
partial: List head for partially filled slabs.
full: List head for fully used slabs (tracked when slab debugging is enabled).
total_objects: Total number of objects managed by the node.
#include <linux/types.h>
#include <linux/spinlock.h>
#include <linux/list.h>
#include <linux/atomic.h>
#include <linux/mm_types.h>   /* struct page, needed for page->lru */
struct kmem_cache_node {
    raw_spinlock_t list_lock;      /* protects the lists below */
    unsigned long nr_partial;      /* number of partial slabs */
    struct list_head partial;      /* partially filled slabs */
    unsigned long nr_slabs;        /* all slabs on this node */
    atomic_long_t total_objects;   /* objects managed by this node */
    struct list_head full;         /* fully used slabs (debug) */
};
void add_partial_slab_example(struct kmem_cache_node *n, struct page *page) {
    raw_spin_lock(&n->list_lock);
    list_add(&page->lru, &n->partial);   /* link the slab into the partial list */
    n->nr_partial++;
    raw_spin_unlock(&n->list_lock);
}
void move_slab_to_full_example(struct kmem_cache_node *n, struct page *page) {
    raw_spin_lock(&n->list_lock);
    list_move(&page->lru, &n->full);     /* the slab has no free objects left */
    n->nr_partial--;
    raw_spin_unlock(&n->list_lock);
}
3. Allocation and Free Paths
3.1 Allocation Flow
When the kernel requests memory, SLUB first checks the per‑CPU freelist. If it is non‑empty, an object is taken directly (fast path). If empty, SLUB refills the local cache from the CPU’s partial list; if that is also empty, it pulls a slab from the node’s partial list; finally, if the node has no partial slabs, SLUB allocates new pages from the buddy system.
// Fast path: allocate from per‑CPU cache
void *slab_alloc_fastpath(struct kmem_cache_cpu *c) {
void *obj;
if (!c->freelist)
return NULL;
obj = c->freelist;
c->freelist = *(void **)obj;
c->tid++;
return obj;
}
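When the per‑CPU freelist is empty, the slow path walks the hierarchy described above. The following is a simplified, illustrative sketch, not the kernel's actual __slab_alloc(); the helpers get_cpu_partial_slab(), get_node_partial_slab(), new_slab_from_buddy() and slab_freelist() are placeholders for the corresponding refill steps.
/* Placeholder helpers standing in for the real refill steps (not kernel APIs). */
struct page *get_cpu_partial_slab(struct kmem_cache_cpu *c);
struct page *get_node_partial_slab(void);
struct page *new_slab_from_buddy(void);
void **slab_freelist(struct page *slab);
// Slow path (simplified): refill the per-CPU cache, then retry the fast path.
void *slab_alloc_slowpath(struct kmem_cache_cpu *c) {
    struct page *slab;
    slab = get_cpu_partial_slab(c);        /* 1. per-CPU partial list */
    if (!slab)
        slab = get_node_partial_slab();    /* 2. node partial list (takes list_lock) */
    if (!slab)
        slab = new_slab_from_buddy();      /* 3. fresh pages from the buddy system */
    if (!slab)
        return NULL;                       /* truly out of memory */
    c->page = slab;                        /* make it the CPU's active slab */
    c->freelist = slab_freelist(slab);     /* its free objects become the freelist */
    return slab_alloc_fastpath(c);         /* retry the lock-free fast path */
}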
3.2 Free Flow
When an object is freed into the slab the CPU is currently allocating from, it is simply pushed back onto the per‑CPU freelist (fast path). Freeing into any other slab takes the slow path: the object is returned to its own slab, which may be parked on the CPU's partial list; when that list overflows, slabs are flushed back to the node's partial list, and completely empty slabs can eventually be returned to the buddy system.
// Fast free: return to per‑CPU cache
void slab_free_fastpath(struct kmem_cache_cpu *c, void *obj) {
*(void **)obj = c->freelist;
c->freelist = obj;
c->tid++;
}
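When the freed object does not belong to the CPU's current slab, the slow path operates on the object's own slab. A simplified, illustrative sketch of that migration; the helpers slab_of(), cpu_partial_has_room(), put_on_cpu_partial() and flush_to_node_partial() are placeholders, not kernel functions.
#include <linux/types.h>   /* bool */
/* Placeholder helpers standing in for the real slow-path steps (not kernel APIs). */
struct page *slab_of(const void *obj);
bool cpu_partial_has_room(struct kmem_cache_cpu *c);
void put_on_cpu_partial(struct kmem_cache_cpu *c, struct page *slab);
void flush_to_node_partial(struct kmem_cache_node *n, struct page *slab);
// Slow free (simplified): the object returns to its own slab, and that slab may
// migrate from the per-CPU partial list back to the node's partial list.
void slab_free_slowpath(struct kmem_cache_cpu *c, struct kmem_cache_node *n, void *obj) {
    struct page *slab = slab_of(obj);      /* slab page that owns the object */
    /* ... link obj back into that slab's own freelist (details omitted) ... */
    if (cpu_partial_has_room(c))
        put_on_cpu_partial(c, slab);       /* keep the slab close for reuse */
    else
        flush_to_node_partial(n, slab);    /* hand it back to the NUMA node */
}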
3.3 Cache Mechanism Details
SLUB maintains three levels of caches:
Per‑CPU cache: Lock‑free, provides the fastest allocation for the owning CPU.
Node cache: Shared among CPUs on the same NUMA node, reduces cross‑node traffic.
Buddy system fallback: Supplies fresh pages when both local caches are exhausted.
This hierarchy dramatically reduces lock contention and improves data locality, which is critical for high‑concurrency workloads.
4. Tuning Strategies
4.1 Fragmentation Causes
Fragmentation appears as internal (unused space inside an allocated page) and external (scattered free blocks that cannot satisfy a larger request). Both degrade memory utilization and increase allocation latency.
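A back‑of‑the‑envelope illustration of internal fragmentation (the object size is hypothetical): packing 600‑byte objects into a single 4 KB slab fits 6 objects and leaves 496 bytes unusable, roughly 12 % of the page.
#include <linux/kernel.h>   /* pr_info() */
static void fragmentation_example(void) {
    unsigned int slab_bytes  = 4096;  /* one 4 KB page */
    unsigned int object_size = 600;   /* hypothetical object */
    unsigned int objects = slab_bytes / object_size;           /* 6 objects fit */
    unsigned int waste   = slab_bytes - objects * object_size; /* 496 bytes idle */
    pr_info("%u objects per slab, %u bytes of internal waste\n", objects, waste);
}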
4.2 Important Parameters
slub_min_objects: Minimum number of objects per slab (kernel boot parameter); too large forces higher‑order slabs and long free lists, too small causes frequent slow‑path refills.
slub_debug: Enables red zoning, poisoning, and other checks for overflow, use‑after‑free, and leak detection.
slub_max_order: Maximum page order a slab may span; lowering it reduces pressure for large contiguous page blocks.
Many of these knobs also have per‑cache counterparts exposed at runtime under /sys/kernel/slab/.
4.3 Optimization Techniques
Align objects to cache‑line boundaries so that hot objects do not straddle or falsely share cache lines (see the sketch after this list).
Partial‑slab reuse: moving partially‑filled slabs between CPU and node caches.
Dedicated per‑CPU caches to eliminate global lock contention.
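For the alignment point, a minimal sketch (cache name and object size are hypothetical): passing an explicit cache‑line alignment together with SLAB_HWCACHE_ALIGN keeps two hot objects from sharing, and contending for, the same cache line.
#include <linux/cache.h>   /* L1_CACHE_BYTES */
#include <linux/errno.h>
#include <linux/slab.h>
static struct kmem_cache *hot_cache;
int create_aligned_cache(void) {
    /* Round each object up to a cache-line multiple and align the first one. */
    hot_cache = kmem_cache_create("hot_objects", 48 /* hypothetical object size */,
                                  L1_CACHE_BYTES, SLAB_HWCACHE_ALIGN, NULL);
    return hot_cache ? 0 : -ENOMEM;
}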
5. Performance Benefits
5.1 Reducing Lock Contention
Per‑CPU caches allow allocation and free without acquiring any global lock. When a global resource is needed, SLUB only locks the specific node’s list, not the entire allocator.
5.2 Data Locality
Objects of the same type are stored contiguously, improving both spatial and temporal locality. Frequently accessed objects stay in the CPU’s cache, reducing memory‑access latency.
5.3 Comparison with Other Allocators
SLAB: Maintains separate metadata per slab, leading to higher memory overhead and heavier lock contention.
SLOB: Simpler but uses a first‑fit algorithm with O(n) search time, making it slower for high‑frequency small allocations.
Benchmarks in virtual‑machine workloads show SLUB can reduce average allocation latency by 30‑50 % compared with SLAB, and outperform SLOB by 2‑3× in embedded scenarios.
6. Real‑World Cases
6.1 Case 1 – Allocation‑Speed Bottleneck
A big‑data processing platform suffered from high allocation latency because slub_min_objects was set to 1024, creating very long free lists. Reducing the value to 128 (with an upper bound of 256 objects per slab) shortened the free lists and dropped average allocation time from 10 µs to 2 µs.
// Illustrative fields; on a real system the equivalent knob is the
// slub_min_objects kernel boot parameter.
// Before tuning
cache->min_objects = 1024; // excessive: very long free lists
// After tuning
cache->min_objects = 128;
cache->max_objects = 256;
6.2 Case 2 – Severe Fragmentation
A high‑concurrency web server experienced memory‑fragmentation‑induced allocation failures. Enabling SLAB_RED_ZONE and SLAB_POISON via slub_debug helped locate overflow bugs, while lowering slub_max_order from its default of 3 to 1 reduced the size of allocated page blocks, raising memory utilization from 60 % to 80 %.
// Enable debugging flags (normally selected with the slub_debug boot parameter)
cache->flags |= SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER;
// Reduce the maximum slab order (illustrative; the real knob is slub_max_order)
cache->max_order = 1; // default was 3
6.3 Leak Detection with slub_debug and kmemleak
Using slub_debug=Z (Red‑Zone) and the kernel’s kmemleak facility, developers can capture the allocation stack trace of leaked objects. The output shows the object address, size, allocating process, and backtrace, enabling precise pinpointing of leak sources.
# Add slub_debug=Z,my_cache to the kernel command line to red-zone this cache at boot
# echo scan > /sys/kernel/debug/kmemleak
# cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff880012345678 (size 128):
comm "my_program", pid 1234, jiffies 4294967295
backtrace:
[<ffffffffc0123456>] my_function+0x34/0x80 [my_module]
[<ffffffffc01234ab>] another_function+0x56/0x90 [my_module]
These diagnostics, combined with the tuning knobs described above, allow kernel engineers to maintain high allocation performance while keeping memory usage efficient and safe.