Why Linux Relies on Slab Allocators: Inside the Kernel’s Small Object Pool
This article explains how the Linux kernel supplements its buddy system with the slab allocator to efficiently manage tiny memory blocks, covering slab’s design, memory layout, red‑zone protection, poisoning, per‑CPU caches, NUMA node warehouses, allocation fast‑paths, slow‑paths, and release strategies.
1. Review
Earlier articles in this series walked through the buddy system, which manages physical memory in power-of-two blocks of pages; a high-level recollection is all we need here.
2. Why a Slab Is Still Needed
Although the buddy system can allocate contiguous pages, most kernel objects require only a few bytes, far less than a page. Repeatedly allocating whole pages for such tiny objects would waste memory and cache resources.
Therefore the kernel introduces a slab memory pool that obtains one or more pages from the buddy system and subdivides them into equal‑sized blocks for a specific object type.
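To make this concrete, here is a minimal sketch of how a kernel module might create and use such a pool. The object type my_conn and the init/exit function names are hypothetical, but kmem_cache_create(), kmem_cache_alloc(), kmem_cache_free(), and kmem_cache_destroy() are the real interfaces:
#include <linux/init.h>
#include <linux/module.h>
#include <linux/slab.h>

/* hypothetical object type; any small, frequently allocated structure
   is a candidate for its own cache */
struct my_conn {
	int id;
	unsigned long flags;
};

static struct kmem_cache *my_conn_cache;

static int __init my_conn_init(void)
{
	struct my_conn *c;

	/* name, object size, alignment (0 = default), flags, constructor */
	my_conn_cache = kmem_cache_create("my_conn", sizeof(struct my_conn),
					  0, SLAB_HWCACHE_ALIGN, NULL);
	if (!my_conn_cache)
		return -ENOMEM;

	/* objects now come from the pool instead of whole pages */
	c = kmem_cache_alloc(my_conn_cache, GFP_KERNEL);
	if (c)
		kmem_cache_free(my_conn_cache, c);
	return 0;
}

static void __exit my_conn_exit(void)
{
	kmem_cache_destroy(my_conn_cache);
}

module_init(my_conn_init);
module_exit(my_conn_exit);
MODULE_LICENSE("GPL");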
3. Slab Object Pool Use Cases in the Kernel
Allocation of task_struct during fork()
Allocation of mm_struct when creating a process address space
Allocation of struct page for the page cache
Allocation of struct file during open()
Allocation of struct socket for incoming connections
In practice, virtually every frequently used kernel object has a dedicated slab cache.
4. Slab, Slub, and Slob
Three implementations exist:
Slab – the original design, borrowed from Solaris; heavy on metadata and a poor fit for large NUMA systems.
Slub – introduced in Linux 2.6.22, simplifies the metadata, improves NUMA and multi-CPU performance, and is the default in mainline kernels today.
Slob – a minimal variant for memory-constrained embedded systems.
This article focuses on the slub implementation.
5. Starting from a Simple Page
Objects are placed in pages obtained from the buddy system. To avoid unaligned accesses, the kernel pads each object to the word size (8 bytes on 64‑bit CPUs) or to a cache‑line size when SLAB_HWCACHE_ALIGN is set.
Red zones are inserted before and after each object to detect out-of-bounds accesses; they are filled with 0xbb while the object is free and 0xcc while it is in use (SLUB_RED_INACTIVE and SLUB_RED_ACTIVE).
When poisoning is enabled (SLAB_POISON), freed objects are filled with 0x6b and terminated with 0xa5. Because this would overwrite the free pointer that normally lives inside the object, the kernel relocates the free pointer to a dedicated word outside the poisoned area.
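The arithmetic is easy to model in userspace. The sketch below is illustrative only; the real calculate_sizes() in mm/slub.c handles more cases (left red zones, user tracking, RCU), but the principle is the same: each debug feature costs extra words per object.
#include <stdio.h>

/* round x up to the next multiple of a (a power of two), like the
   kernel's ALIGN() macro */
#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((size_t)(a) - 1))

int main(void)
{
	size_t object_size = 13;        /* raw payload, e.g. a tiny struct */
	size_t word = sizeof(void *);   /* 8 bytes on a 64-bit CPU */
	size_t size;

	size = ALIGN_UP(object_size, word); /* word-align the payload: 16 */
	size += word;                       /* right red zone: 24 */
	size += word;                       /* poisoning pushes the free
	                                       pointer outside: 32 */

	printf("a %zu-byte object occupies %zu bytes per slot\n",
	       object_size, size);
	return 0;
}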
The final layout of a slab object therefore consists of the usable payload, optional red zones, optional tracking structures, and possible padding.
6. Overall Slab Architecture
6.1 Basic Information Management
The core structure is struct kmem_cache, which describes a slab cache:
struct kmem_cache {
slab_flags_t flags; /* features such as alignment, poison, red‑zone */
unsigned int size; /* real size of an object including metadata */
unsigned int object_size; /* size of the raw object */
unsigned int offset; /* offset of the free‑pointer inside the object */
struct kmem_cache_order_objects oo; /* high bits: page order of one slab, low 16 bits: objects per slab */
struct kmem_cache_order_objects max;
struct kmem_cache_order_objects min;
gfp_t allocflags; /* GFP flags used when requesting pages */
int refcount;
void (*ctor)(void *);
unsigned int inuse; /* aligned object size (includes red zone) */
unsigned int align; /* requested alignment */
const char *name; /* name shown in /proc/slabinfo */
struct list_head list; /* global list of all caches */
unsigned long min_partial; /* minimum partial slabs kept per node; empty slabs beyond this are freed */
unsigned int cpu_partial; /* limit for free objects cached per‑CPU */
struct kmem_cache_cpu __percpu *cpu_slab;
struct kmem_cache_node *node[MAX_NUMNODES];
};
Flag bits such as SLAB_RED_ZONE, SLAB_POISON, SLAB_HWCACHE_ALIGN, SLAB_STORE_USER, SLAB_CACHE_DMA, and SLAB_CACHE_DMA32 control the features described above.
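The packed oo field can be unpacked with two tiny accessors; the helpers below mirror the ones in mm/slub.c:
#define OO_SHIFT 16
#define OO_MASK  ((1 << OO_SHIFT) - 1)

struct kmem_cache_order_objects {
	unsigned int x;
};

/* page order of one slab (the slab spans 1 << order pages) */
static inline unsigned int oo_order(struct kmem_cache_order_objects x)
{
	return x.x >> OO_SHIFT;
}

/* number of objects that fit into one such slab */
static inline unsigned int oo_objects(struct kmem_cache_order_objects x)
{
	return x.x & OO_MASK;
}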
6.2 Organization
Each cache has a per-CPU structure struct kmem_cache_cpu that holds a pointer to the slab currently used for allocations (a struct page) and a freelist of its free objects:
struct kmem_cache_cpu {
void **freelist; /* next free object */
unsigned long tid; /* transaction id to detect CPU migration */
struct page *page; /* slab currently used for allocations */
#ifdef CONFIG_SLUB_CPU_PARTIAL
struct page *partial; /* list of partially used slabs */
#endif
#ifdef CONFIG_SLUB_STATS
unsigned int stat[NR_SLUB_STAT_ITEMS];
#endif
};
The slab itself is represented by struct page, which carries a freelist pointer and a frozen flag indicating that the slab is currently cached by a CPU.
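A heavily simplified view of those slab-related fields is sketched below; in the real kernel they live inside unions shared with the many other users of struct page, and the exact layout varies between versions:
struct list_head {
	struct list_head *next, *prev;
};

struct page {                  /* slab-related fields only, simplified */
	void *freelist;        /* first free object in this slab */
	unsigned inuse:16;     /* objects currently handed out */
	unsigned objects:15;   /* total objects this slab can hold */
	unsigned frozen:1;     /* slab is cached by a CPU; its freelist
	                          is used without taking the node lock */
	struct list_head slab_list; /* linkage on a partial/full list */
};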
At a higher level, each NUMA node has a struct kmem_cache_node that anchors a partial list of slabs and, when debugging is enabled, a full list; a spinlock protects these lists.
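Abridged from the kernel's definition, the per-node structure looks roughly like this:
struct kmem_cache_node {
	spinlock_t list_lock;       /* protects the lists below */
	unsigned long nr_partial;   /* number of slabs on the partial list */
	struct list_head partial;   /* slabs that are only partly in use */
#ifdef CONFIG_SLUB_DEBUG
	atomic_long_t nr_slabs;     /* debug statistics */
	atomic_long_t total_objects;
	struct list_head full;      /* fully allocated slabs (debug only) */
#endif
};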
7. Slab Allocation Paths
7.1 Fast Path – Allocate from the Per‑CPU Slab
If cpu_slab->freelist is non-NULL, the kernel returns the first free object and advances the freelist to the next one, whose address is stored inside the object just handed out; the tid counter lets this happen without taking any locks.
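Stripped of that lockless machinery, the fast path amounts to popping the head of a singly linked list. A simplified sketch (the real code performs the pop as a single atomic compare-and-exchange on the freelist/tid pair, so it needs neither locks nor disabled preemption):
static void *fastpath_alloc(struct kmem_cache *s, struct kmem_cache_cpu *c)
{
	void *object = c->freelist;

	if (!object)
		return NULL;    /* empty: fall through to the slow path */

	/* each free object stores the address of the next free object
	   at offset s->offset inside itself */
	c->freelist = *(void **)((char *)object + s->offset);
	return object;
}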
7.2 Slow Path – Use the Per‑CPU Partial List
When the current slab is exhausted, the kernel scans cpu_slab->partial for a slab with free objects, promotes it to cpu_slab->page, and allocates from it.
7.3 Slow Path – Pull from the NUMA Node Partial List
If the per-CPU partial list is empty, the kernel takes a slab from node->partial, makes it the per-CPU slab, and keeps pulling additional partial slabs into the per-CPU partial list until their free objects amount to roughly cpu_partial/2.
7.4 Slow Path – Allocate New Slabs from the Buddy System
When both per‑CPU and node lists are empty, the kernel requests new pages from the buddy system using the size described by oo. If memory is scarce, it falls back to min, which guarantees at least one object.
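Putting sections 7.1 through 7.4 together, the allocator degrades through four levels. The helper names in this sketch are illustrative placeholders, not the kernel's real function names, and all locking and retry logic is omitted:
static void *slab_alloc_sketch(struct kmem_cache *s, struct kmem_cache_cpu *c)
{
	void *obj;

	obj = fastpath_alloc(s, c);             /* 7.1 per-CPU freelist */
	if (obj)
		return obj;

	if (promote_cpu_partial_slab(c))        /* 7.2 per-CPU partials */
		return fastpath_alloc(s, c);

	if (pull_from_node_partial(s, c))       /* 7.3 node partial list */
		return fastpath_alloc(s, c);

	/* 7.4 buddy system: try the size in oo, fall back to min */
	return allocate_new_slab_from_buddy(s, c);
}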
8. Slab Freeing Paths
8.1 Return to the Current Per‑CPU Slab
If the object’s slab is the one cached in cpu_slab->page, the object is linked back onto cpu_slab->freelist (fast path).
8.2 Return to a Per‑CPU Partial Slab
If the slab resides in cpu_slab->partial, the object is added to that slab’s page->freelist.
8.3 Full → Partial Transition
When freeing an object from a full slab, the slab becomes partial. The kernel moves it into the per‑CPU partial list, unless the list would exceed cpu_partial. In that case all per‑CPU partial slabs are migrated to the node’s partial list first.
8.4 Partial → Empty Transition
If freeing the last used object makes a partial slab completely empty, the kernel keeps it on the node's partial list only while the node's nr_partial is below min_partial; otherwise the empty slab is returned straight to the buddy system.
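The freeing side mirrors the allocation cascade. Again, the helper names below are placeholders that only illustrate the decision flow of sections 8.1 through 8.4:
static void slab_free_sketch(struct kmem_cache *s, struct kmem_cache_cpu *c,
			     struct page *slab, void *object)
{
	if (slab == c->page) {                   /* 8.1 fast path */
		push_cpu_freelist(s, c, object);
		return;
	}

	push_slab_freelist(s, slab, object);     /* 8.2 back to its slab */

	if (was_full(slab)) {                    /* 8.3 full -> partial */
		add_to_cpu_partial(s, c, slab);  /* may spill to the node */
	} else if (now_empty(slab)) {            /* 8.4 partial -> empty */
		if (node_over_min_partial(s, slab))
			free_slab_to_buddy(s, slab);
		else
			keep_on_node_partial(s, slab);
	}
}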
Summary
The slab allocator builds on the buddy system to provide fast, cache‑friendly allocation of small kernel objects. Its multi‑level hierarchy—per‑CPU caches, per‑node warehouses, and the underlying buddy allocator—allows the kernel to keep hot objects close to the CPU while gracefully handling memory pressure.