Understanding the SLUB Memory Allocator: A Deep Dive into Linux Kernel Object Management
SLUB, the default Linux kernel memory allocator, reduces fragmentation and speeds up allocation of frequently created objects such as task_struct and inode by combining per‑CPU caches, object slabs, and NUMA‑aware node caches. This article walks through its data structures, allocation and free paths, tuning parameters, and real‑world case studies.
Introduction
The SLUB memory allocator is a core component of the Linux kernel memory‑management subsystem. It addresses the heavy fragmentation caused by frequent creation and destruction of small kernel objects such as struct task_struct (process descriptors) and struct inode (file nodes). By simplifying the traditional SLAB design, SLUB provides high efficiency with low overhead and is the default allocator in mainstream kernels.
1. SLUB Overview
1.1 What is SLUB?
SLUB (the "unqueued" slab allocator, successor to the original SLAB design) is a kernel‑level allocator optimized for small objects. The kernel's buddy system hands out memory at page granularity (typically 4 KB pages and power‑of‑two multiples), while SLUB carves those pages into fixed‑size object caches so that tiny structures do not each consume a whole page.
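To make the division of labor concrete, the following is a minimal, illustrative kernel snippet (the function name is hypothetical): four contiguous pages are requested straight from the buddy allocator, while a 64‑byte buffer goes through kmalloc() and is served by SLUB from its kmalloc-64 cache.
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/slab.h>
void buddy_vs_slub_example(void) {
    /* Buddy system: page-granular, power-of-two blocks (order 2 = 4 pages). */
    struct page *pages = alloc_pages(GFP_KERNEL, 2);
    /* SLUB: small fixed-size object, taken from the kmalloc-64 slab cache. */
    void *buf = kmalloc(64, GFP_KERNEL);
    if (pages)
        __free_pages(pages, 2);
    kfree(buf);   /* kfree(NULL) is a no-op */
}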
1.2 Why Use SLUB?
Reduced fragmentation: Objects are packed into size‑specific caches, preventing the “large box for small items” problem of the buddy system.
Higher allocation speed: Frequently requested objects are pre‑filled in per‑CPU caches, eliminating costly searches.
Lower overhead: Metadata is stored directly in the page descriptor, and per‑CPU queues avoid global lock contention.
Excellent scalability: Each CPU has its own local cache, minimizing lock competition in multi‑core environments.
2. SLUB Data Structures
2.1 kmem_cache Structure
The kmem_cache struct is the top‑level control unit for one object type. Key members include:
name: Identifier shown in /proc/slabinfo and under /sys/kernel/slab/.
object_size: Size of the object itself, as requested when the cache is created.
size: Memory actually consumed per object, i.e. object_size plus metadata such as red zones and alignment padding.
align: Alignment requirement for each object.
red_left_pad: Left red‑zone padding placed in front of each object when debugging is enabled, used to catch buffer underflows.
flags: Feature bits such as SLAB_HWCACHE_ALIGN and SLAB_RECLAIM_ACCOUNT.
ctor: Constructor callback run on the objects of a freshly populated slab (modern kernels no longer support per‑cache destructors).
cpu_slab: Pointer to the per‑CPU caches (kmem_cache_cpu).
node: Array of kmem_cache_node pointers, one per NUMA node.
The simplified model below illustrates these fields; the real structure is internal to the memory‑management subsystem and is set up through kmem_cache_create(), shown afterwards.
#include <linux/types.h>
#include <linux/slab.h>   /* kzalloc(), SLAB_* flags */
#include <linux/numa.h>   /* MAX_NUMNODES */
struct kmem_cache_cpu;    /* per-CPU cache, see section 2.2 */
struct kmem_cache_node;   /* per-node slab lists, see section 2.3 */
/* Simplified stand-in for the kernel's struct kmem_cache (the real structure
 * is internal to the mm subsystem and is built by kmem_cache_create()). */
struct my_kmem_cache {
    const char *name;                 /* shown in /proc/slabinfo */
    unsigned int object_size;         /* requested object size */
    unsigned int size;                /* per-object footprint incl. metadata */
    unsigned int align;               /* alignment requirement */
    unsigned int red_left_pad;        /* left red-zone padding (debug) */
    unsigned long flags;              /* SLAB_* feature bits */
    void (*ctor)(void *);             /* constructor for new slabs */
    struct kmem_cache_cpu *cpu_slab;  /* per-CPU caches */
    struct kmem_cache_node *node[MAX_NUMNODES]; /* per-NUMA-node lists */
};
struct my_object {
    int data;
    int flag;
};
static void obj_ctor(void *obj) { ((struct my_object *)obj)->flag = 0; }
struct my_kmem_cache *create_kmem_cache_example(void) {
    struct my_kmem_cache *s = kzalloc(sizeof(*s), GFP_KERNEL);
    if (!s)
        return NULL;
    s->name = "my_example_cache";
    s->object_size = sizeof(struct my_object);
    s->align = 16;
    s->red_left_pad = 8;
    s->flags = SLAB_HWCACHE_ALIGN | SLAB_RECLAIM_ACCOUNT;
    s->size = s->object_size + s->red_left_pad;
    s->ctor = obj_ctor;
    return s;
}
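In real kernel code you never fill in kmem_cache by hand; the cache is created and used through the slab API. A minimal sketch reusing struct my_object and obj_ctor from the example above (the cache name is illustrative):
#include <linux/slab.h>
#include <linux/errno.h>
static struct kmem_cache *example_cache;
int cache_api_example(void) {
    struct my_object *obj;
    /* The kernel computes size, order and metadata layout internally. */
    example_cache = kmem_cache_create("my_example_cache", sizeof(struct my_object),
                                      0, SLAB_HWCACHE_ALIGN | SLAB_RECLAIM_ACCOUNT, obj_ctor);
    if (!example_cache)
        return -ENOMEM;
    obj = kmem_cache_alloc(example_cache, GFP_KERNEL);  /* usually served from the per-CPU freelist */
    if (obj)
        kmem_cache_free(example_cache, obj);            /* returns the object to the per-CPU freelist */
    kmem_cache_destroy(example_cache);
    return 0;
}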
2.2 kmem_cache_cpu Structure
Each CPU owns a local cache so that the common allocation and free paths need no locks. Important fields:
freelist: Pointer to the head of the current slab's free‑object list.
tid: Transaction ID, bumped on every operation; the lock‑free fast path uses it to detect migration or preemption.
page: Slab page currently being allocated from.
partial: Per‑CPU list of partially used slabs kept as a refill reserve.
#include <linux/types.h>
struct page;
/* Simplified model of the per-CPU cache; the real fast path updates
 * freelist and tid together with a cmpxchg rather than plain stores. */
struct kmem_cache_cpu {
    void **freelist;       /* next free object in the current slab */
    unsigned long tid;     /* transaction ID, bumped on every operation */
    struct page *page;     /* slab currently being allocated from */
    struct page *partial;  /* per-CPU partial slab reserve */
};
void *kmem_cache_cpu_alloc_example(struct kmem_cache_cpu *c) {
    void *obj;
    if (!c->freelist)              /* fast path only; the slow path refills the list */
        return NULL;
    obj = c->freelist;
    c->freelist = *(void **)obj;   /* each free object stores a pointer to the next one */
    c->tid++;
    return obj;
}
void kmem_cache_cpu_free_example(struct kmem_cache_cpu *c, void *obj) {
    *(void **)obj = c->freelist;   /* push the object back onto the freelist */
    c->freelist = obj;
    c->tid++;
}
2.3 kmem_cache_node Structure
In NUMA systems, each memory node maintains its own slab lists so that allocations stay local to the node. Key fields:
list_lock: Spinlock protecting the node's lists.
nr_partial / nr_slabs: Counters of partial slabs and of all slabs on the node.
partial: List head for partially filled slabs.
full: List head for fully used slabs (tracked when slab debugging is enabled).
total_objects: Total number of objects managed by the node.
#include <linux/types.h>
#include <linux/spinlock.h>
#include <linux/list.h>
#include <linux/atomic.h>
#include <linux/mm_types.h>   /* struct page, needed for page->lru */
struct kmem_cache_node {
    raw_spinlock_t list_lock;      /* protects the lists below */
    unsigned long nr_partial;      /* number of partial slabs */
    struct list_head partial;      /* partially filled slabs */
    unsigned long nr_slabs;        /* all slabs on this node */
    atomic_long_t total_objects;   /* objects managed by this node */
    struct list_head full;         /* fully used slabs (debug) */
};
void add_partial_slab_example(struct kmem_cache_node *n, struct page *page) {
    raw_spin_lock(&n->list_lock);
    list_add(&page->lru, &n->partial);   /* link the slab into the partial list */
    n->nr_partial++;
    raw_spin_unlock(&n->list_lock);
}
void move_slab_to_full_example(struct kmem_cache_node *n, struct page *page) {
    raw_spin_lock(&n->list_lock);
    list_move(&page->lru, &n->full);     /* the slab has no free objects left */
    n->nr_partial--;
    raw_spin_unlock(&n->list_lock);
}
3. Allocation and Free Paths
3.1 Allocation Flow
When the kernel requests memory, SLUB first checks the per‑CPU freelist. If it is non‑empty, an object is taken directly (fast path). If empty, SLUB refills the local cache from the CPU’s partial list; if that is also empty, it pulls a slab from the node’s partial list; finally, if the node has no partial slabs, SLUB allocates new pages from the buddy system.
// Fast path: allocate from per‑CPU cache
void *slab_alloc_fastpath(struct kmem_cache_cpu *c) {
void *obj;
if (!c->freelist)
return NULL;
obj = c->freelist;
c->freelist = *(void **)obj;
c->tid++;
return obj;
}
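When the per‑CPU freelist is empty, the slow path walks the hierarchy described above. The following is a simplified, illustrative sketch, not the kernel's actual __slab_alloc(); the helpers get_cpu_partial_slab(), get_node_partial_slab(), new_slab_from_buddy() and slab_freelist() are placeholders for the corresponding refill steps.
/* Placeholder helpers standing in for the real refill steps (not kernel APIs). */
struct page *get_cpu_partial_slab(struct kmem_cache_cpu *c);
struct page *get_node_partial_slab(void);
struct page *new_slab_from_buddy(void);
void **slab_freelist(struct page *slab);
// Slow path (simplified): refill the per-CPU cache, then retry the fast path.
void *slab_alloc_slowpath(struct kmem_cache_cpu *c) {
    struct page *slab;
    slab = get_cpu_partial_slab(c);        /* 1. per-CPU partial list */
    if (!slab)
        slab = get_node_partial_slab();    /* 2. node partial list (takes list_lock) */
    if (!slab)
        slab = new_slab_from_buddy();      /* 3. fresh pages from the buddy system */
    if (!slab)
        return NULL;                       /* truly out of memory */
    c->page = slab;                        /* make it the CPU's active slab */
    c->freelist = slab_freelist(slab);     /* its free objects become the freelist */
    return slab_alloc_fastpath(c);         /* retry the lock-free fast path */
}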
3.2 Free Flow
When an object is freed into the slab the CPU is currently allocating from, it is simply pushed back onto the per‑CPU freelist (fast path). Freeing into any other slab takes the slow path: the object is returned to its own slab, which may be parked on the CPU's partial list; when that list overflows, slabs are flushed back to the node's partial list, and completely empty slabs can eventually be returned to the buddy system.
// Fast free: return to per‑CPU cache
void slab_free_fastpath(struct kmem_cache_cpu *c, void *obj) {
*(void **)obj = c->freelist;
c->freelist = obj;
c->tid++;
}
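When the freed object does not belong to the CPU's current slab, the slow path operates on the object's own slab. A simplified, illustrative sketch of that migration; the helpers slab_of(), cpu_partial_has_room(), put_on_cpu_partial() and flush_to_node_partial() are placeholders, not kernel functions.
#include <linux/types.h>   /* bool */
/* Placeholder helpers standing in for the real slow-path steps (not kernel APIs). */
struct page *slab_of(const void *obj);
bool cpu_partial_has_room(struct kmem_cache_cpu *c);
void put_on_cpu_partial(struct kmem_cache_cpu *c, struct page *slab);
void flush_to_node_partial(struct kmem_cache_node *n, struct page *slab);
// Slow free (simplified): the object returns to its own slab, and that slab may
// migrate from the per-CPU partial list back to the node's partial list.
void slab_free_slowpath(struct kmem_cache_cpu *c, struct kmem_cache_node *n, void *obj) {
    struct page *slab = slab_of(obj);      /* slab page that owns the object */
    /* ... link obj back into that slab's own freelist (details omitted) ... */
    if (cpu_partial_has_room(c))
        put_on_cpu_partial(c, slab);       /* keep the slab close for reuse */
    else
        flush_to_node_partial(n, slab);    /* hand it back to the NUMA node */
}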
3.3 Cache Mechanism Details
SLUB maintains three levels of caches:
Per‑CPU cache: Lock‑free, provides the fastest allocation for the owning CPU.
Node cache: Shared among CPUs on the same NUMA node, reduces cross‑node traffic.
Buddy system fallback: Supplies fresh pages when both local caches are exhausted.
This hierarchy dramatically reduces lock contention and improves data locality, which is critical for high‑concurrency workloads.
4. Tuning Strategies
4.1 Fragmentation Causes
Fragmentation appears as internal (unused space inside an allocated page) and external (scattered free blocks that cannot satisfy a larger request). Both degrade memory utilization and increase allocation latency.
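A back‑of‑the‑envelope illustration of internal fragmentation (the object size is hypothetical): packing 600‑byte objects into a single 4 KB slab fits 6 objects and leaves 496 bytes unusable, roughly 12 % of the page.
#include <linux/kernel.h>   /* pr_info() */
static void fragmentation_example(void) {
    unsigned int slab_bytes  = 4096;  /* one 4 KB page */
    unsigned int object_size = 600;   /* hypothetical object */
    unsigned int objects = slab_bytes / object_size;           /* 6 objects fit */
    unsigned int waste   = slab_bytes - objects * object_size; /* 496 bytes idle */
    pr_info("%u objects per slab, %u bytes of internal waste\n", objects, waste);
}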
4.2 Important Parameters
slub_min_objects: Minimum number of objects per slab (kernel boot parameter); too large forces higher‑order slabs and long free lists, too small causes frequent slow‑path refills.
slub_debug: Enables red zoning, poisoning, and other checks for overflow, use‑after‑free, and leak detection.
slub_max_order: Maximum page order a slab may span; lowering it reduces pressure for large contiguous page blocks.
Many of these knobs also have per‑cache counterparts exposed at runtime under /sys/kernel/slab/.
4.3 Optimization Techniques
Align objects to cache‑line boundaries so that hot objects do not straddle or falsely share cache lines (see the sketch after this list).
Partial‑slab reuse: moving partially‑filled slabs between CPU and node caches.
Dedicated per‑CPU caches to eliminate global lock contention.
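For the alignment point, a minimal sketch (cache name and object size are hypothetical): passing an explicit cache‑line alignment together with SLAB_HWCACHE_ALIGN keeps two hot objects from sharing, and contending for, the same cache line.
#include <linux/cache.h>   /* L1_CACHE_BYTES */
#include <linux/errno.h>
#include <linux/slab.h>
static struct kmem_cache *hot_cache;
int create_aligned_cache(void) {
    /* Round each object up to a cache-line multiple and align the first one. */
    hot_cache = kmem_cache_create("hot_objects", 48 /* hypothetical object size */,
                                  L1_CACHE_BYTES, SLAB_HWCACHE_ALIGN, NULL);
    return hot_cache ? 0 : -ENOMEM;
}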
5. Performance Benefits
5.1 Reducing Lock Contention
Per‑CPU caches allow allocation and free without acquiring any global lock. When a global resource is needed, SLUB only locks the specific node’s list, not the entire allocator.
5.2 Data Locality
Objects of the same type are stored contiguously, improving both spatial and temporal locality. Frequently accessed objects stay in the CPU’s cache, reducing memory‑access latency.
5.3 Comparison with Other Allocators
SLAB: Maintains separate metadata per slab, leading to higher memory overhead and heavier lock contention.
SLOB: Simpler but uses a first‑fit algorithm with O(n) search time, making it slower for high‑frequency small allocations.
Benchmarks in virtual‑machine workloads show SLUB can reduce average allocation latency by 30‑50 % compared with SLAB, and outperform SLOB by 2‑3× in embedded scenarios.
6. Real‑World Cases
6.1 Case 1 – Allocation‑Speed Bottleneck
A big‑data processing platform suffered from high allocation latency because slub_min_objects was set to 1024, creating very long free lists. Reducing the value to 128 (with an upper bound of 256 objects per slab) shortened the free lists and dropped average allocation time from 10 µs to 2 µs.
// Illustrative fields; on a real system the equivalent knob is the
// slub_min_objects kernel boot parameter.
// Before tuning
cache->min_objects = 1024; // excessive: very long free lists
// After tuning
cache->min_objects = 128;
cache->max_objects = 256;
6.2 Case 2 – Severe Fragmentation
A high‑concurrency web server experienced memory‑fragmentation‑induced allocation failures. Enabling SLAB_RED_ZONE and SLAB_POISON via slub_debug helped locate overflow bugs, while lowering slub_max_order from its default of 3 to 1 reduced the size of allocated page blocks, raising memory utilization from 60 % to 80 %.
// Enable debugging flags (normally selected with the slub_debug boot parameter)
cache->flags |= SLAB_RED_ZONE | SLAB_POISON | SLAB_STORE_USER;
// Reduce the maximum slab order (illustrative; the real knob is slub_max_order)
cache->max_order = 1; // default was 3
6.3 Leak Detection with slub_debug and kmemleak
Using slub_debug=Z (Red‑Zone) and the kernel’s kmemleak facility, developers can capture the allocation stack trace of leaked objects. The output shows the object address, size, allocating process, and backtrace, enabling precise pinpointing of leak sources.
# Add slub_debug=Z,my_cache to the kernel command line to red-zone this cache at boot
# echo scan > /sys/kernel/debug/kmemleak
# cat /sys/kernel/debug/kmemleak
unreferenced object 0xffff880012345678 (size 128):
comm "my_program", pid 1234, jiffies 4294967295
backtrace:
[<ffffffffc0123456>] my_function+0x34/0x80 [my_module]
[<ffffffffc01234ab>] another_function+0x56/0x90 [my_module]
These diagnostics, combined with the tuning knobs described above, allow kernel engineers to maintain high allocation performance while keeping memory usage efficient and safe.