
How Linux Allocates Physical Memory: Inside the Kernel’s Buddy Allocator

This article walks through Linux kernel physical memory allocation, explaining the hierarchy of allocation interfaces, the role of gfp_mask and ALLOC flags, the fast and slow allocation paths, memory watermarks, NUMA zone handling, and the complex fallback mechanisms including compaction, direct reclaim, and OOM, all illustrated with code snippets and diagrams.

Bin's Tech Cabin

Review of the Previous Article

In the previous article, "Deep Dive into Linux Physical Memory Management", the author introduced how the Linux kernel manages physical memory and the kernel data structures involved.

Before discussing physical memory management, three Linux physical memory models were presented: FLATMEM, DISCONTIGMEM, and SPARSEMEM.

The author then described two physical memory architectures from a CPU‑centric view: UMA (uniform memory access) and NUMA (non‑uniform memory access).

In a NUMA system, only the DISCONTIGMEM and SPARSEMEM models are usable; a UMA system can use any of the three.

The kernel uses the same data structures for both UMA and NUMA, treating a UMA machine as a pseudo-NUMA system with a single node.

The hierarchy of structures is struct pglist_data (NUMA node) → struct zone (physical memory region) → struct page (physical page).

[Figure: relationship between NUMA nodes, zones, and pages]

Physical memory is divided into nodes, each node into zones, and each zone into pages. A buddy system manages free pages within each zone.

The hierarchy is Node → Zone → Page.

The previous article also covered the struct page structure, which is the smallest unit of physical memory management.
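As a refresher, here is a heavily simplified sketch of that hierarchy. The field names come from include/linux/mmzone.h, but nearly all other members are omitted:

typedef struct pglist_data {
    struct zone node_zones[MAX_NR_ZONES];  // the zones of this NUMA node
    int node_id;                           // NUMA node number
    // ... many fields omitted ...
} pg_data_t;

struct zone {
    unsigned long _watermark[NR_WMARK];    // WMARK_MIN / WMARK_LOW / WMARK_HIGH
    struct free_area free_area[MAX_ORDER]; // one buddy free_area per order
    // ... many fields omitted ...
};

struct free_area {
    struct list_head free_list[MIGRATE_TYPES]; // free blocks, by migrate type
    unsigned long nr_free;                     // number of free blocks
};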

Having built the architectural picture, the next question is how the kernel allocates physical memory based on this hierarchy.

1. Kernel Physical Memory Allocation Interfaces

Before walking through the allocation flow itself, let us survey the core allocation interfaces the kernel provides, all of which are built on the buddy system. Allocations are made in power-of-two numbers of pages; the exponent is called the allocation order.

struct page *alloc_pages(gfp_t gfp, unsigned int order);
alloc_pages returns a pointer to the first struct page of the allocated block.

When only a single page is needed, the macro alloc_page(gfp_mask) expands to alloc_pages(gfp_mask, 0).

#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)

If the allocation fails, NULL is returned.
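For illustration (this example is mine, not from the original article), a typical call sequence with error handling looks like this:

// Allocate 4 physically contiguous pages (order = 2) and free them again.
struct page *page;

page = alloc_pages(GFP_KERNEL, 2);
if (!page)
    return -ENOMEM;    // out of memory

// ... use the pages ...

__free_pages(page, 2); // the order must match the allocation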

The vmalloc mechanism, for example, ultimately uses alloc_page to obtain its backing pages one at a time.

To obtain a virtual address for the allocated pages, the kernel provides __get_free_pages, which returns the virtual address of the first page.

unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order);

The function works like alloc_pages but returns a virtual address.

Internally it calls alloc_pages and then translates the physical page to a virtual address with page_address. The translation is only valid for pages in the direct‑mapped region.

unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)
{
    struct page *page;

    // cannot allocate from high memory because it cannot be directly mapped
    page = alloc_pages(gfp_mask & ~__GFP_HIGHMEM, order);
    if (!page)
        return 0;

    // convert to a virtual address in the direct-mapped area
    return (unsigned long) page_address(page);
}
page_address converts a struct page to its virtual address, but this only works for pages in the direct-mapped region. High-memory pages must first be mapped with kmap.

If you missed the “Permanent Mapping Area” section, see the earlier article.
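As a quick sketch of my own (not from the original article), working with a high-memory page therefore involves an explicit mapping step; note that recent kernels prefer kmap_local_page over kmap:

// Temporarily map a high-memory page so the kernel can address it.
struct page *page = alloc_pages(__GFP_HIGHMEM, 0);
void *vaddr;

if (!page)
    return -ENOMEM;

vaddr = kmap(page);          // map via the permanent mapping area
memset(vaddr, 0, PAGE_SIZE); // now the page can be accessed
kunmap(page);                // release the mapping

__free_page(page);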

There is also a macro __get_free_page for allocating a single page, which simply calls __get_free_pages(..., 0).

#define __get_free_page(gfp_mask) \
  __get_free_pages((gfp_mask), 0)

The pages returned by the functions above contain whatever data happened to be left in them. Where leaking stale contents would be a security problem, the kernel provides get_zeroed_page, which returns a zero-filled page.

unsigned long get_zeroed_page(gfp_t gfp_mask)
{
    return __get_free_pages(gfp_mask | __GFP_ZERO, 0);
}

Another helper, __get_dma_pages, allocates memory suitable for DMA by forcing GFP_DMA into the mask:

#define __get_dma_pages(gfp_mask, order) \
    __get_free_pages((gfp_mask) | GFP_DMA, (order))

Like __get_free_pages, all of these virtual-address helpers return 0 on failure.

All allocations from these functions are physically contiguous.

To free pages, the kernel provides __free_pages and free_pages (the latter takes a virtual address).

void __free_pages(struct page *page, unsigned int order);
void free_pages(unsigned long addr, unsigned int order);

__free_pages frees a block of 2^order pages starting at the given struct page pointer.

free_pages frees a block using its virtual address.

Passing a wrong pointer, address, or order here can corrupt or crash the system; in kernel space, the kernel trusts its callers.

There are also __free_page and free_page macros for single‑page deallocation.

#define __free_page(page) __free_pages((page), 0)
#define free_page(addr)   free_pages((addr), 0)

The gfp_t mask controls where pages are allocated from and how the allocation behaves. The next section explains these masks.

2. The gfp_mask that Governs Allocation Behaviour

The previous article introduced the four physical zones: ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, and ZONE_HIGHMEM. This article focuses on the corresponding zone modifiers defined in include/linux/gfp.h:

#define ___GFP_DMA      0x01u
#define ___GFP_HIGHMEM  0x02u
#define ___GFP_DMA32    0x04u
#define ___GFP_MOVABLE  0x08u

No ___GFP_NORMAL is defined because normal allocations default to ZONE_NORMAL, with fallback to ZONE_DMA if needed.

The kernel decides the target zone with the function gfp_zone(gfp_t flags), which returns the highest‑priority zone that satisfies the mask. If that zone is exhausted, the kernel falls back in the order ZONE_HIGHMEM → ZONE_NORMAL → ZONE_DMA.

static inline enum zone_type gfp_zone(gfp_t flags)
{
    enum zone_type z;
    int bit = (int)(flags & GFP_ZONEMASK);

    z = (GFP_ZONE_TABLE >> (bit * GFP_ZONES_SHIFT)) &
        ((1 << GFP_ZONES_SHIFT) - 1);
    VM_BUG_ON((GFP_ZONE_BAD >> bit) & 1);
    return z;
}

Older kernels (around 2.6.24) used a more readable, branch-based implementation.
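Reconstructed from the 2.6-era sources (with the CONFIG #ifdefs dropped for readability), it looked roughly like this:

// Pick the zone by testing the modifier bits directly, from most
// to least specific.
static inline enum zone_type gfp_zone(gfp_t flags)
{
    if (flags & __GFP_DMA)
        return ZONE_DMA;
    if (flags & __GFP_DMA32)
        return ZONE_DMA32;
    if ((flags & (__GFP_HIGHMEM | __GFP_MOVABLE)) ==
            (__GFP_HIGHMEM | __GFP_MOVABLE))
        return ZONE_MOVABLE;
    if (flags & __GFP_HIGHMEM)
        return ZONE_HIGHMEM;
    return ZONE_NORMAL;
}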

3. Core Allocation Function __alloc_pages

This function is the heart of the zoned buddy allocator.
struct page *__alloc_pages(gfp_t gfp, unsigned int order,
                           int preferred_nid, nodemask_t *nodemask)
{
    struct page *page;
    unsigned int alloc_flags = ALLOC_WMARK_LOW;
    gfp_t alloc_gfp;
    struct alloc_context ac = { };

    // bail out early if the requested order is out of bounds
    if (WARN_ON_ONCE_GFP(order >= MAX_ORDER, gfp))
        return NULL;

    // restrict the flags to what is currently allowed (e.g. during early boot)
    gfp &= gfp_allowed_mask;
    alloc_gfp = gfp;
    if (!prepare_alloc_pages(gfp, order, preferred_nid, nodemask,
                             &ac, &alloc_gfp, &alloc_flags))
        return NULL;

    // forbid the first pass from falling back in ways that fragment memory
    alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, gfp);

    // fast path: try to take a page straight from the free lists
    page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
    if (likely(page))
        goto out;

    // fast path failed -- fall back to the slow path
    alloc_gfp = gfp;
    ac.nodemask = nodemask;

    page = __alloc_pages_slowpath(alloc_gfp, order, &ac);

out:
    return page;
}

The fast path attempts allocation while the zone’s free pages are above the WMARK_LOW watermark. If it fails, the slow path performs more aggressive actions such as waking kswapd, compaction, direct reclaim, and finally OOM.
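Each zone actually carries three watermarks: WMARK_MIN, WMARK_LOW, and WMARK_HIGH. The real check is zone_watermark_ok, which also accounts for lowmem reserves and per-order free lists; the sketch below is my own simplification of the core idea, not the kernel code:

// Simplified idea: the fast path succeeds only if the zone would still
// have more than `mark` free pages after handing out 2^order pages.
static bool watermark_ok_sketch(struct zone *z, unsigned int order,
                                unsigned long mark)
{
    unsigned long free_pages = zone_page_state(z, NR_FREE_PAGES);

    return free_pages > mark + (1UL << order);
}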

3.1 prepare_alloc_pages

This helper fills an alloc_context structure that carries all information needed for the allocation.

static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
                                       int preferred_nid, nodemask_t *nodemask,
                                       struct alloc_context *ac,
                                       gfp_t *alloc_gfp, unsigned int *alloc_flags)
{
    ac->highest_zoneidx = gfp_zone(gfp_mask);
    ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
    ac->nodemask = nodemask;
    ac->migratetype = gfp_migratetype(gfp_mask);

    if (cpusets_enabled()) {
        *alloc_gfp |= __GFP_HARDWALL;
        if (in_task() && !ac->nodemask)
            ac->nodemask = &cpuset_current_mems_allowed;
        else
            *alloc_flags |= ALLOC_CPUSET;
    }

    might_sleep_if(gfp_mask & __GFP_DIRECT_RECLAIM);

    if (should_fail_alloc_page(gfp_mask, order))
        return false;

    ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
                                ac->highest_zoneidx, ac->nodemask);
    return true;
}

The function determines the highest‑priority zone, builds a zonelist that includes the preferred node and fallback nodes, records the migration type, and handles cgroup CPU‑set restrictions.

4. Slow Allocation Path __alloc_pages_slowpath

The slow path handles all corner cases. It repeatedly adjusts allocation flags, wakes kswapd, performs compaction, direct reclaim, and finally invokes the OOM killer if necessary.

static inline struct page *__alloc_pages_slowpath(gfp_t gfp_mask,
                                                  unsigned int order,
                                                  struct alloc_context *ac)
{
    bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
    const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER;
    struct page *page = NULL;
    unsigned int alloc_flags;
    unsigned long did_some_progress = 0;
    unsigned int cpuset_mems_cookie;
    enum compact_priority compact_priority;
    enum compact_result compact_result;
    int compaction_retries = 0;
    int no_progress_loops = 0;
    int reserve_flags = 0;

    if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) ==
                     (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
        gfp_mask &= ~__GFP_ATOMIC;

retry_cpuset:
    compact_priority = DEF_COMPACT_PRIORITY;
    cpuset_mems_cookie = read_mems_allowed_begin();
    alloc_flags = gfp_to_alloc_flags(gfp_mask);
    ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
                                                ac->highest_zoneidx,
                                                ac->nodemask);
    if (!ac->preferred_zoneref->zone)
        goto nopage;
    if (alloc_flags & ALLOC_KSWAPD)
        wake_all_kswapds(order, gfp_mask, ac);

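    // first slow-path attempt: retry the free lists with the relaxed flags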
    page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
    if (page)
        goto got_pg;

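    // for higher-order requests that may reclaim, try memory compaction first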
    if (can_direct_reclaim && (costly_order ||
        (order > 0 && ac->migratetype != MIGRATE_MOVABLE)) &&
        !gfp_pfmemalloc_allowed(gfp_mask)) {
        page = __alloc_pages_direct_compact(gfp_mask, order,
                                          alloc_flags, ac,
                                          INIT_COMPACT_PRIORITY,
                                          &compact_result);
        if (page)
            goto got_pg;
        if (costly_order && (gfp_mask & __GFP_NORETRY)) {
            if (compact_result == COMPACT_SKIPPED ||
                compact_result == COMPACT_DEFERRED)
                goto nopage;
            compact_priority = INIT_COMPACT_PRIORITY;
        }
    }

retry:
    if (alloc_flags & ALLOC_KSWAPD)
        wake_all_kswapds(order, gfp_mask, ac);

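    // callers entitled to it may dip into the emergency memory reserves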
    reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
    if (reserve_flags)
        alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, reserve_flags);
    if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
        ac->nodemask = NULL;
        ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
                                                    ac->highest_zoneidx,
                                                    ac->nodemask);
    }
    page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
    if (page)
        goto got_pg;
    if (!can_direct_reclaim)
        goto nopage;
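    // reclaim memory directly in this context and retry the allocation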
    page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
                                       &did_some_progress);
    if (page)
        goto got_pg;
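    // reclaim alone was not enough -- try compaction once more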
    page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
                                       compact_priority, &compact_result);
    if (page)
        goto got_pg;
    if (gfp_mask & __GFP_NORETRY)
        goto nopage;
    if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL))
        goto nopage;
    if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
                             did_some_progress > 0, &no_progress_loops))
        goto retry;
    if (did_some_progress > 0 &&
        should_compact_retry(ac, order, alloc_flags,
                            compact_result, &compact_priority,
                            &compaction_retries))
        goto retry;
    if (check_retry_cpuset(cpuset_mems_cookie, ac))
        goto retry_cpuset;
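    // every retry has failed -- as a last resort, invoke the OOM killer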
    page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
    if (page)
        goto got_pg;
    if (did_some_progress) {
        no_progress_loops = 0;
        goto retry;
    }

nopage:
    if (gfp_mask & __GFP_NOFAIL) {
        if (WARN_ON_ONCE_GFP(!can_direct_reclaim, gfp_mask))
            goto fail;
        WARN_ON_ONCE_GFP(current->flags & PF_MEMALLOC, gfp_mask);
        WARN_ON_ONCE_GFP(order > PAGE_ALLOC_COSTLY_ORDER, gfp_mask);
        page = __alloc_pages_cpuset_fallback(gfp_mask, order,
                                            ALLOC_HARDER, ac);
        if (page)
            goto got_pg;
        cond_resched();
        goto retry;
    }
fail:
    warn_alloc(gfp_mask, ac->nodemask,
               "page allocation failure: order:%u", order);
got_pg:
    return page;
}

The function first tries a more aggressive allocation (lower watermarks, wake kswapd). If that still fails, it may perform compaction, direct reclaim, and finally invoke the OOM killer. When the allocation is marked __GFP_NOFAIL, the kernel will keep retrying, possibly allocating from other NUMA nodes.
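Which of these stages an allocation may enter is encoded in its gfp_mask. The common composite masks, as defined in include/linux/gfp.h of this kernel generation (slightly abbreviated here), make the trade-offs concrete:

// May sleep, start I/O, and touch the filesystem, so the full slow
// path -- including direct reclaim and compaction -- is available.
#define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS)

// Must not sleep: may dig deeper into the watermarks but only wakes
// kswapd, never performing direct reclaim or compaction itself.
#define GFP_ATOMIC (__GFP_HIGH | __GFP_ATOMIC | __GFP_KSWAPD_RECLAIM)

// Must not sleep and gets no special access to reserves.
#define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM)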

5. Overall Allocation Flow

The complete flow can be visualised as follows (simplified):

1. The caller invokes alloc_pages or one of its variants.
2. alloc_pages calls alloc_pages_node, which eventually calls __alloc_pages (the wrapper chain is sketched below).
3. __alloc_pages attempts the fast path via get_page_from_freelist.
4. If the fast path fails, __alloc_pages_slowpath is entered, which may wake kswapd, compact memory, perform direct reclaim, or trigger OOM.
5. On success, a struct page * is returned; on failure, NULL is returned and a warning is printed.
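The wrapper chain in step 2 is thin. Simplified from include/linux/gfp.h for the non-NUMA case (a NUMA build routes alloc_pages through the mempolicy code instead, arriving at the same entry point):

static inline struct page *alloc_pages(gfp_t gfp_mask, unsigned int order)
{
    return alloc_pages_node(numa_node_id(), gfp_mask, order);
}

static inline struct page *alloc_pages_node(int nid, gfp_t gfp_mask,
                                            unsigned int order)
{
    if (nid == NUMA_NO_NODE)
        nid = numa_mem_id();

    return __alloc_pages_node(nid, gfp_mask, order);
}

static inline struct page *__alloc_pages_node(int nid, gfp_t gfp_mask,
                                              unsigned int order)
{
    // the core entry point discussed in section 3
    return __alloc_pages(gfp_mask, order, nid, NULL);
}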

The next article will dive into the buddy system implementation used by get_page_from_freelist and related functions.

Summary

The article started by listing the common kernel physical memory allocation interfaces and their use cases. It then traced the allocation flow in Linux kernel 5.19, showing how the fast path works above the WMARK_LOW watermark and how the slow path handles all corner cases using GFP and ALLOC flags, compaction, direct reclaim, and OOM. The accompanying diagrams illustrate the hierarchy of NUMA nodes, zones, and pages, as well as the overall allocation decision tree.
