Fundamentals · 17 min read

Physical Address Space Management and Memory Allocation in Linux (NUMA, Nodes, Zones, Pages, Slab, and Page Fault Handling)

This article explains how Linux manages physical address space using SMP and NUMA architectures, describes the node, zone, and page data structures, details page allocation via the buddy system and slab allocator, and outlines user‑ and kernel‑mode page‑fault handling, swapping, and address translation mechanisms.


In traditional x86 SMP systems, all CPUs share a single memory bus, which becomes a bottleneck as core counts grow; NUMA (Non‑Uniform Memory Access) addresses this by giving each CPU a local memory node that is tried first, improving performance and scalability.

Linux represents each NUMA node with pg_data_t:

typedef struct pglist_data {
	struct zone node_zones[MAX_NR_ZONES];
	struct zonelist node_zonelists[MAX_ZONELISTS];
	int nr_zones;
#ifdef CONFIG_FLAT_NODE_MEM_MAP
	struct page *node_mem_map;
#ifdef CONFIG_PAGE_EXTENSION
	...
#endif
#endif
	unsigned long node_start_pfn;
	unsigned long node_present_pages;
	unsigned long node_spanned_pages;
	int node_id;
	...
} pg_data_t;

Key fields include node_id, node_mem_map, node_start_pfn, node_spanned_pages, and node_present_pages. Each node contains an array of zones.

Zones are defined by enum zone_type:

enum zone_type {
#ifdef CONFIG_ZONE_DMA
	ZONE_DMA,
#endif
#ifdef CONFIG_ZONE_DMA32
	ZONE_DMA32,
#endif
	ZONE_NORMAL,
#ifdef CONFIG_HIGHMEM
	ZONE_HIGHMEM,
#endif
	ZONE_MOVABLE,
#ifdef CONFIG_ZONE_DEVICE
	ZONE_DEVICE,
#endif
	__MAX_NR_ZONES
};

covering DMA regions, normal memory, high memory, movable pages, and device memory.

Pages are the basic allocation unit; the kernel describes each physical page with struct page. Large allocations go through the buddy system: alloc_pages (which calls alloc_pages_current and then __alloc_pages_nodemask) obtains a contiguous block of 2^order pages from the free_area lists of a zone. Small allocations use the slab allocator, where each kernel object type has a kmem_cache created by kmem_cache_create. A slab consists of one or more pages; per-CPU caches (cpu_slab) provide fast allocation, falling back to __slab_alloc and new_slab_objects when the per-CPU freelist is exhausted. Slab states (full, partial, free) are tracked with list_head structures.

When a process accesses a virtual address without a backing physical page, a page fault occurs. The kernel distinguishes user-mode and kernel-mode faults. For user faults, do_user_addr_fault locates the vm_area_struct, then calls handle_mm_fault → __handle_mm_fault → handle_pte_fault. If the PTE is empty, do_anonymous_page allocates a new page via the buddy system and installs it with mk_pte and set_pte_at. If the PTE points to a swapped-out page, do_swap_page restores it from the swap area. For file-backed mappings, do_fault eventually calls filemap_fault, which may reuse a cached page or allocate a new one and fill it via the filesystem's readpage operation (e.g., ext4_readpage).

Swapping moves rarely used pages to a swap area on disk; the kernel’s kswapd daemon periodically scans LRU lists and invokes shrink_node_memcgs to reclaim pages. Address translation is accelerated by the TLB, which caches recent page‑table entries.

Kernel‑mode mappings are set up early during boot. The top‑level page directory (swapper_pg_dir) is initialized and loaded with load_cr3. Assembly code in arch/x86/kernel/head_64.S defines the initial page tables, for example:

#if defined(CONFIG_XEN_PV) || defined(CONFIG_PVH)
SYM_DATA_START_PTI_ALIGNED(init_top_pgt)
.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
.org    init_top_pgt + L4_PAGE_OFFSET*8, 0
.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
.org    init_top_pgt + L4_START_KERNEL*8, 0
.quad   level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
.fill  PTI_USER_PGD_FILL,8,0
SYM_DATA_END(init_top_pgt)
#else
SYM_DATA_START_PTI_ALIGNED(init_top_pgt)
.fill  512,8,0
.fill  PTI_USER_PGD_FILL,8,0
SYM_DATA_END(init_top_pgt)
#endif

In summary, Linux uses a multi‑level page‑table scheme to keep the virtual address space sparse, employs the buddy system for large page allocations and the slab allocator for small objects, organizes memory per NUMA node and zone, handles page faults by allocating or swapping pages, and relies on the TLB for fast address translation.

Memory Management · kernel · Linux · page fault · NUMA · Slab Allocator
Written by 360 Smart Cloud

Official service account of 360 Smart Cloud, dedicated to building a high-quality, secure, highly available, convenient, and stable one‑stop cloud service platform.
