Why Does Linux Need a Slab Allocator? Unveiling the Secrets of Kernel Memory Pools
This article revisits Linux memory allocation, then dives deep into the slab allocator—explaining its relationship with the buddy system, its internal layout, the differences between slab, slub and slob, and how the kernel uses slab caches, per‑CPU caches, and NUMA nodes to efficiently allocate and free small memory objects.
1. Review of Previous Content
In earlier articles on memory management, we gave a macro‑level overview of the entire Linux memory allocation chain. This article continues the theme of memory allocation, but now we explore the memory pool used for allocating small, fragmented memory blocks—the slab allocator.
We first briefly recap the core concepts of Linux memory allocation from a macro perspective to provide a smooth transition to the micro perspective.
The review below is deliberately brief; the focus is on the overall macro process rather than the details.
In " Deep Understanding of Linux Physical Memory Allocation and Release Full Chain Implementation , the author introduced the API‑based physical memory allocation and release process and the related kernel source code.
The full chain of physical memory allocation in the kernel is shown below:
The core of physical memory allocation in the Linux kernel is the buddy system . After understanding the overall physical memory allocation flow, we move on to the entry function get_page_from_freelist of the buddy system, whose complete flow is shown below:
The kernel iterates over each NUMA node's memory zones to find a zone with enough free pages. Once a suitable zone is found, the kernel calls rmqueue to allocate pages from that zone's buddy system.
Why does the kernel need a slab memory pool when it already has a buddy system? This question opens the main discussion of this article.
2. Why Do We Still Need Slab When the Buddy System Exists?
From the previous article "Deep Analysis of Linux Buddy System Design and Implementation", we learned that the buddy system manages memory in units of pages.
The buddy system divides free memory in a zone into blocks whose sizes are powers of two, ranging from 1 page up to 1024 pages.
These blocks are linked together in a struct free_area list called free_list:
struct free_area {
struct list_head free_list[MIGRATE_TYPES];
unsigned long nr_free;
};
Blocks of the same size are further classified by migration type (e.g., MIGRATE_UNMOVABLE, MIGRATE_MOVABLE, MIGRATE_RECLAIMABLE).
All memory blocks of the same size are organized in the same free_list. Different sizes are managed by different free_area structures.
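Conceptually, a free block of 2^order contiguous pages of a given migrate type hangs off exactly one of these per-zone lists. Below is a minimal sketch of that indexing using the real struct zone fields (the helper name itself is made up for illustration):
#include <linux/mmzone.h>

/* Hypothetical helper: locate the buddy free list that holds
 * free blocks of 2^order pages of the given migrate type. */
static struct list_head *buddy_free_list(struct zone *zone,
                                         unsigned int order,
                                         int migratetype)
{
    return &zone->free_area[order].free_list[migratetype];
}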
Memory blocks allocated by the buddy system are physically contiguous and can only be allocated in powers‑of‑two numbers of pages.
However, the smallest unit the buddy system can hand out is a single page (4 KB by default), so serving objects of a few dozen bytes directly from it would waste almost the entire page. This is why the kernel still needs a slab memory pool on top of the buddy system for allocating small objects efficiently.
When the kernel needs to allocate frequently used core objects such as task_struct, mm_struct, struct file, or struct socket, it creates a dedicated slab cache for each object type. The slab cache pre-allocates one or more whole pages from the buddy system and then subdivides those pages into equal-sized small blocks matching the object size.
Allocating and freeing objects through a slab cache avoids the long allocation chain of the buddy system, reduces cache line pollution, and improves performance.
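As a minimal sketch of this pattern, here is how a hypothetical kernel module might manage its own object type with the kmem_cache API (struct my_object and the function names are made up for illustration; the kmem_cache_* calls are the real kernel interface):
#include <linux/slab.h>
#include <linux/list.h>
#include <linux/init.h>
#include <linux/errno.h>

/* A hypothetical, frequently allocated kernel object. */
struct my_object {
    int id;
    struct list_head link;
};

static struct kmem_cache *my_object_cachep;

static int __init my_cache_init(void)
{
    /* Create a dedicated slab cache; objects are cache-line aligned. */
    my_object_cachep = kmem_cache_create("my_object",
                                         sizeof(struct my_object), 0,
                                         SLAB_HWCACHE_ALIGN, NULL);
    return my_object_cachep ? 0 : -ENOMEM;
}

static void my_object_example(void)
{
    /* Usually served straight from the per-CPU freelist (fast path). */
    struct my_object *obj = kmem_cache_alloc(my_object_cachep, GFP_KERNEL);

    if (obj)
        kmem_cache_free(my_object_cachep, obj);
}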
For more details on pooling ideas and object‑pool implementations, see the author's previous article "Detailed Design and Implementation of Netty Recycler Object Pool".
Benefits of using slab object pools include:
Higher CPU cache hit rate because recently freed objects stay hot in the cache.
Reduced cache pressure compared to allocating whole pages for tiny objects.
Less instruction‑cache and data‑cache pollution caused by long buddy‑system calls.
Better cache‑line utilization by avoiding false sharing.
3. Application Scenarios of Slab Object Pools in the Kernel
Below are several typical scenarios where the kernel uses slab caches:
This section provides an overview; detailed code is omitted.
When fork() creates a new process, the kernel allocates a task_struct from the task_struct slab cache.
When creating a virtual memory space, the kernel allocates an mm_struct from its dedicated slab cache.
When a file page is inserted into the page cache, the kernel allocates the xarray (radix-tree) nodes that index it from a dedicated slab cache.
When open() opens a file, the kernel allocates a struct file from its slab cache.
When a server accepts a new TCP connection, the kernel allocates a struct socket from its slab cache.
In addition to the objects listed above, virtually every kernel core object that is frequently created and destroyed is managed by a slab cache.
Examples include epoll items (struct epitem) and virtual memory area structures (struct vm_area_struct), as described in the author's other articles.
4. Slab, Slub, and Slob – Which One Is Which?
Linux provides three implementations of the slab allocator:
Slab – Introduced in Solaris 2.4 by Jeff Bonwick and adopted by Linux 2.0. It has many management queues and metadata, which can waste memory on large servers.
Slub – Added in Linux 2.6.22 (2007) by Christoph Lameter. It simplifies the design, removes many queues, and optimizes for SMP and NUMA, becoming the default implementation.
Slob – Introduced in Linux 2.6.16 (2006) for embedded systems with very limited memory; it is a very lightweight allocator.
All kernel memory‑pool APIs are named with the slab prefix, and the actual implementation can be switched via configuration. This article focuses on the slub implementation, which is the one used by most server‑grade Linux distributions.
5. From a Single Page to a Complete Slab
The kernel places frequently used core objects into slab caches. Each core object type gets its own slab cache, which improves allocation, access, and release performance.
A slab cache is built on top of the buddy system. It first requests one or more whole pages from the buddy system, then subdivides those pages into equal‑sized blocks whose size matches the object size.
If the object size is not naturally aligned to the CPU word size (8 bytes on 64‑bit CPUs), the kernel adds padding to achieve word‑size alignment, which reduces the number of memory accesses required for unaligned objects.
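As a quick illustration of word-size rounding (the macro below mirrors the kernel's ALIGN(); the 13-byte object size is a made-up example):
#include <stdio.h>

/* Round size up to the next multiple of align (a power of two),
 * mirroring the kernel's ALIGN() macro. */
#define ALIGN_UP(size, align) (((size) + (align) - 1) & ~((align) - 1))

int main(void)
{
    unsigned int object_size = 13;                   /* hypothetical object */
    unsigned int aligned = ALIGN_UP(object_size, 8); /* 8-byte words on 64-bit */

    printf("object_size=%u -> aligned=%u\n", object_size, aligned); /* 13 -> 16 */
    return 0;
}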
To detect out-of-bounds accesses, the kernel surrounds each object with a "red zone" filled with a known byte pattern (0xbb while the object is free, 0xcc while it is in use). If a later consistency check finds the pattern overwritten, the kernel reports an out-of-bounds access.
The kernel stores the free‑object pointer inside the object’s memory when the object is free, avoiding extra metadata allocation.
When SLAB_POISON is enabled, the kernel fills freed objects with the pattern 0x6b (and ends with 0xa5) to help detect use‑after‑free bugs. In that case the free‑pointer is stored separately.
If the kernel tracks allocation and free information (SLAB_STORE_USER), two struct track structures, one recording the last allocation and one the last free, are appended to each object's metadata.
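Putting these pieces together, a debug-enabled slab object has roughly the following layout (a simplified sketch based on the description above; exact offsets depend on which flags are enabled):
/*
 * Approximate layout of one object in a slab with full debugging enabled:
 *
 *   | red zone | object payload (poisoned 0x6b...0xa5 while free) | red zone |
 *   | free pointer (kept outside the payload when poisoning is on)           |
 *   | struct track (last alloc) | struct track (last free) | padding         |
 *
 * kmem_cache->object_size covers only the payload;
 * kmem_cache->size covers the payload plus all of the metadata above.
 */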
In the kernel, a slab has traditionally been described by its struct page; newer kernels (since v5.17) split the slab-specific fields out into a dedicated struct slab.
struct page {
// Flags indicate whether this is a compound page (head) or a tail page
unsigned long flags;
unsigned long compound_head;
unsigned char compound_dtor;
unsigned char compound_order;
atomic_t compound_mapcount;
atomic_t compound_pincount;
// Slab‑related fields (union omitted for brevity)
struct kmem_cache *slab_cache;
void *freelist; // first free object
unsigned frozen:1; // 1 = slab is currently owned by a per-CPU cache
};
The size field in struct kmem_cache represents the real memory occupied by a slab object, including metadata such as red zones and tracking structures.
struct kmem_cache {
slab_flags_t flags; // configuration flags (e.g., RED_ZONE, POISON)
unsigned int size; // object size including metadata
unsigned int object_size; // raw object size without metadata
unsigned int offset; // offset of the free‑pointer inside the object
struct kmem_cache_order_objects oo; // high 16 bits: page order of a slab, low 16 bits: objects per slab
struct kmem_cache_order_objects max;
struct kmem_cache_order_objects min;
gfp_t allocflags; // GFP flags for buddy allocation
int refcount;
void (*ctor)(void *);
unsigned int inuse; // object size after alignment (includes red zone if enabled)
unsigned int align; // alignment requested by the user
const char *name; // name shown in /proc/slabinfo
// Per‑CPU cache
struct kmem_cache_cpu __percpu *cpu_slab;
// NUMA node caches
struct kmem_cache_node *node[MAX_NUMNODES];
unsigned long min_partial; // minimum number of partial slabs to keep cached per node
unsigned int cpu_partial; // limit of free objects cached per CPU
};
The flag bits control features such as cache-line alignment (SLAB_HWCACHE_ALIGN), poisoning, red zones, DMA allocation, and user tracking.
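As an example of the packed kmem_cache_order_objects encoding, the value can be decoded with two small helpers mirroring the kernel's oo_order()/oo_objects() (the sample value below is made up):
#include <stdio.h>

#define OO_SHIFT 16
#define OO_MASK  ((1 << OO_SHIFT) - 1)

/* Mirror the kernel's oo_order()/oo_objects() helpers: the high 16 bits
 * hold the slab's page order, the low 16 bits the objects per slab. */
static unsigned int oo_order(unsigned int x)   { return x >> OO_SHIFT; }
static unsigned int oo_objects(unsigned int x) { return x & OO_MASK; }

int main(void)
{
    /* Hypothetical value: an order-1 slab (2 pages) holding 32 objects. */
    unsigned int oo = (1 << OO_SHIFT) | 32;

    printf("order=%u objects=%u\n", oo_order(oo), oo_objects(oo));
    return 0;
}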
Per-CPU caches (struct kmem_cache_cpu) hold a pointer to the current slab (page) and a freelist of objects. When the current slab is full, the CPU looks at its partial list or the node's partial list, and finally falls back to the buddy system.
struct kmem_cache_cpu {
void **freelist; // pointer to next free object in the current slab
unsigned long tid; // global transaction ID (used for preemption safety)
struct page *page; // the slab currently used for allocations
#ifdef CONFIG_SLUB_CPU_PARTIAL
struct page *partial; // list of partially‑filled slabs for this CPU
#endif
#ifdef CONFIG_SLUB_STATS
unsigned stat[NR_SLUB_STAT_ITEMS];
#endif
};
Each NUMA node has a struct kmem_cache_node that stores a partial list and, when debugging is enabled, a full list. The node also tracks the number of partial slabs (nr_partial) and uses min_partial to decide when completely empty slabs are handed back to the buddy system.
struct kmem_cache_node {
spinlock_t list_lock;
unsigned long nr_partial; // number of partial slabs cached on this node
struct list_head partial; // list of partially‑filled slabs
#ifdef CONFIG_SLUB_DEBUG
atomic_long_t nr_slabs; // total slabs on this node
atomic_long_t total_objects;// total objects cached on this node
struct list_head full; // list of fully‑used slabs
#endif
};
The overall architecture consists of the global kmem_cache, per-CPU caches, per-node caches, and the underlying buddy system.
6. Slab Allocation Principles
Allocation proceeds through fast‑path and slow‑path cases:
6.1 Allocate Directly From Per‑CPU Cache
If the current per‑CPU slab has free objects, the kernel returns the first object from cpu_slab->freelist and updates the pointer.
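In essence the fast path is just two pointer operations. Here is a simplified sketch using the struct fields shown above (the real slub code additionally uses this_cpu_cmpxchg together with tid to stay lock-free):
/* Simplified fast path: pop the head of the per-CPU freelist. The pointer
 * to the next free object is stored inside the free object itself, at
 * offset s->offset (see struct kmem_cache above). */
static void *fastpath_alloc(struct kmem_cache *s, struct kmem_cache_cpu *c)
{
    void *object = c->freelist;

    if (object)
        c->freelist = *(void **)((char *)object + s->offset);
    return object;
}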
6.2 Allocate From Per‑CPU Partial List
If the current slab is full, the kernel walks the per‑CPU partial list, picks a slab, makes it the current slab, and allocates from it.
6.3 Allocate From NUMA Node Cache
If both the current per-CPU slab and the per-CPU partial list are empty, the kernel takes a slab from the node's partial list, makes it the per-CPU slab, and may keep moving slabs into the per-CPU partial list until roughly cpu_partial/2 free objects are cached.
6.4 Allocate From Buddy System
If no slabs are available in the node cache, the kernel allocates new pages from the buddy system using the size described by oo (or falls back to min if memory is tight), initializes a new slab, and uses it as the per‑CPU slab.
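Putting the four paths together, the allocation cascade can be sketched as follows (all helper names are hypothetical; in the kernel this logic lives in slab_alloc_node() and ___slab_alloc()):
/* Hypothetical, heavily simplified sketch of the slub allocation cascade. */
static void *slab_alloc_sketch(struct kmem_cache *s)
{
    struct kmem_cache_cpu *c = this_cpu_ptr(s->cpu_slab);

    /* 6.1 fast path: pop the first object off the per-CPU freelist */
    if (c->freelist)
        return pop_object(&c->freelist, s->offset);

    /* 6.2 current slab exhausted: promote a slab from the per-CPU partial list */
    if (c->partial)
        return alloc_from(promote_cpu_partial(c));

    /* 6.3 per-CPU caches empty: take a slab from the NUMA node's partial list */
    if (node_has_partial(s))
        return alloc_from(take_partial_from_node(s, c));

    /* 6.4 nothing cached anywhere: get new pages from the buddy system,
     * using the order encoded in s->oo (falling back to s->min) */
    return alloc_from(new_slab_from_buddy(s, c));
}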
7. Slab Freeing Principles
7.1 Free Object Belongs to the Current Per‑CPU Slab
The fast‑path frees the object back to the current per‑CPU slab, updates cpu_slab->freelist, and stores the previous freelist pointer inside the freed object.
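Mirroring the allocation fast path, this case is again just two pointer operations (a simplified sketch using the struct fields shown earlier):
/* Simplified free fast path: push the object back onto the current
 * per-CPU slab's freelist. */
static void fastpath_free(struct kmem_cache *s, struct kmem_cache_cpu *c,
                          void *object)
{
    /* Store the old freelist head inside the freed object... */
    *(void **)((char *)object + s->offset) = c->freelist;
    /* ...and make the freed object the new head of the freelist. */
    c->freelist = object;
}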
7.2 Free Object Belongs to a Per‑CPU Partial Slab
The kernel adds the object to the slab’s freelist and updates the free‑pointer chain.
7.3 Freeing Turns a Full Slab Into a Partial Slab
The slab becomes partially free and is moved to the per‑CPU partial list. If the total number of free objects in the per‑CPU partial list exceeds cpu_partial, the kernel transfers all per‑CPU partial slabs to the node’s partial list.
7.4 Freeing Turns a Partial Slab Into an Empty Slab
When a slab becomes completely empty, it stays on the node's partial list only while the node's nr_partial is below min_partial; otherwise the slab is returned to the buddy system.
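The slab movement in cases 7.3 and 7.4 can be sketched as follows (all helper names are hypothetical; in the kernel this bookkeeping lives in slub's __slab_free()):
/* Hypothetical sketch of where a slab migrates after an object is freed
 * (cases 7.3 and 7.4 above). */
static void after_free_bookkeeping(struct kmem_cache *s,
                                   struct kmem_cache_node *n,
                                   struct page *slab)
{
    if (slab_was_full(slab)) {
        /* 7.3: full -> partial; cache the slab on this CPU first,
         * overflowing to the node when cpu_partial is exceeded */
        put_on_cpu_partial(s, slab);
        if (cpu_partial_objects(s) > s->cpu_partial)
            flush_cpu_partial_to_node(s, n);
    } else if (slab_is_empty(slab)) {
        /* 7.4: completely empty; keep it only while the node holds
         * fewer than min_partial partial slabs */
        if (n->nr_partial >= s->min_partial)
            free_slab_to_buddy(s, slab);
        else
            put_on_node_partial(n, slab);
    }
}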
8. Summary
This article built on the buddy system to introduce the slab allocator, explained why the kernel needs a dedicated pool for small objects, and described the internal layout of slab objects, including alignment, red zones, poisoning, and tracking.
We then walked through the complete slab architecture—from a single page to the full hierarchy of kmem_cache, per‑CPU caches, and NUMA node caches—illustrating how slabs are allocated and freed in various scenarios.
The four allocation paths are:
Direct allocation from the per‑CPU cache.
Allocation from the per‑CPU partial list.
Allocation from the NUMA node cache.
Allocation by requesting new pages from the buddy system.
The four freeing paths are:
Freeing to the current per‑CPU slab.
Freeing to a per‑CPU partial slab.
Turning a full slab into a partial slab and moving it to the per‑CPU partial list (or to the node cache if the per‑CPU limit is exceeded).
Turning a partial slab into an empty slab and moving it to the node cache (or back to the buddy system if the node limit is exceeded).
Future articles will verify these mechanisms against the actual Linux kernel source code.