How Linux’s Buddy System Manages Physical Memory: Deep Dive into Allocation and Reclamation

This article explains the Linux kernel's buddy memory allocator, covering its core data structures, the concept of buddies, allocation and fallback mechanisms, and the detailed process of freeing memory back to the buddy system, with annotated code examples.

1. Core Data Structures of the Buddy System

The kernel describes each NUMA node with a struct pglist_data and divides every node into zones (ZONE_DMA, ZONE_NORMAL, and so on), each managed by a struct zone. Every zone contains a struct free_area free_area[MAX_ORDER] array, where entry n holds the free blocks of order n, i.e. runs of 2^n physically contiguous pages. Each free_area in turn holds an array of list_head free_list[MIGRATE_TYPES], one list per page-migration type.

struct zone {
    // free_area[n] tracks free blocks of 2^n contiguous pages
    struct free_area free_area[MAX_ORDER];
    // other fields omitted for brevity
};

struct free_area {
    // one free list per migration type
    struct list_head free_list[MIGRATE_TYPES];
    // total free blocks of this order, summed across all lists
    unsigned long nr_free;
};

The migration types are defined as:

enum migratetype {
    MIGRATE_UNMOVABLE,   // pages that cannot be migrated (most kernel allocations)
    MIGRATE_MOVABLE,     // pages that can be migrated (most user-space pages)
    MIGRATE_RECLAIMABLE, // pages that can be reclaimed (e.g. file-backed caches)
    MIGRATE_PCPTYPES,    // number of types kept on the per-CPU page lists
    MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES, // reserve for high-order atomic allocations
#ifdef CONFIG_CMA
    MIGRATE_CMA,         // contiguous memory allocator
#endif
#ifdef CONFIG_MEMORY_ISOLATION
    MIGRATE_ISOLATE,     // isolated; cannot allocate from here
#endif
    MIGRATE_TYPES        // number of migrate types; also used as a sentinel
};

2. What Is a Buddy?

A block of order n is a run of 2^n physically contiguous pages: order 0 is a single page, order 1 is two pages, and order 10 (the largest with the default MAX_ORDER of 11) is 1024 pages. Two blocks are buddies when they have the same order, are physically adjacent, and together form a naturally aligned block of the next order. Only buddies can be merged, and splitting a block always produces two buddies of the order below.
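
Because of that alignment rule, the kernel can locate a block's buddy with a single XOR on its page frame number (pfn): flipping bit `order` of the pfn yields the buddy's pfn. A minimal user-space sketch of the same arithmetic as the kernel's __find_buddy_pfn() helper:

#include <stdio.h>

// Same arithmetic as the kernel's __find_buddy_pfn(): the buddy of the
// block starting at `pfn` differs only in bit `order`.
static unsigned long find_buddy_pfn(unsigned long pfn, unsigned int order)
{
    return pfn ^ (1UL << order);
}

int main(void)
{
    // pfn 8 at order 0 pairs with pfn 9; at order 1, block [8, 10)
    // pairs with [10, 12); at order 2, [8, 12) pairs with [12, 16).
    for (unsigned int order = 0; order < 3; order++)
        printf("order %u: buddy of pfn 8 is pfn %lu\n",
               order, find_buddy_pfn(8, order));
    return 0;
}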

3. Allocation Principle of the Buddy System

When a request for a block of order n arrives, the allocator first searches free_area[n]. If that list is empty, it moves to progressively higher orders, takes a larger block, splits it repeatedly (halving each time), and inserts the unused halves back into the appropriate lower-order lists until a block of the requested order remains.

The splitting process is implemented by expand():

static void expand(struct zone *zone, struct page *page,
                   unsigned int low, unsigned int high,
                   struct free_area *area, int migratetype)
{
    unsigned long size = 1UL << high;

    // Split the order-`high` block down until an order-`low` block
    // remains at `page`; each iteration frees the upper half.
    while (high > low) {
        area--;        // free_area of the next lower order
        high--;
        size >>= 1;

        // Debugging aid: guard pages are kept off the free lists.
        if (set_page_guard(zone, &page[size], high, migratetype))
            continue;

        // Put the upper half on the order-`high` free list ...
        add_to_free_area(&page[size], area, migratetype);
        set_page_order(&page[size], high);
        // ... and keep splitting the lower half.
    }
}
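
To make the loop concrete, the following stand-alone user-space simulation (not kernel code) serves an order-2 request from an order-5 block, printing which half lands on which free list at every split:

#include <stdio.h>

int main(void)
{
    unsigned int low = 2, high = 5;      // order-2 request, order-5 block
    unsigned long size = 1UL << high;    // 32 pages

    while (high > low) {
        high--;
        size >>= 1;
        printf("put pages [%lu..%lu) on the order-%u free list\n",
               size, 2 * size, high);
    }
    printf("pages [0..%lu) satisfy the order-%u request\n", size, low);
    return 0;
}

Running it shows the order-5 block shedding its upper halves onto the order-4, order-3, and order-2 lists, leaving pages [0..4) to satisfy the request.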

If no suitable block exists in any list, the allocator falls back to other migration types according to the fallbacks table:

static int fallbacks[MIGRATE_TYPES][3] = {
    [MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,   MIGRATE_TYPES },
    [MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_TYPES },
    [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE,   MIGRATE_TYPES },
};

The fallback algorithm scans the free lists of the alternative migration types from the highest order downwards: stealing one large block and splitting it fragments memory less than stealing many small ones. Once a suitable block is found, it is removed from its list (possibly converting the surrounding pageblock to the requested type) and then split down to the requested order, as sketched below.
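
A simplified sketch of that search, modelled loosely on the kernel's __rmqueue_fallback() (the pageblock-stealing heuristics and whole-block conversion are omitted here):

static struct page *fallback_search(struct zone *zone, unsigned int order,
                                    int start_migratetype)
{
    int current_order, i, fallback_mt;

    // Search from the largest order down: stealing one big block and
    // splitting it fragments memory less than stealing many small ones.
    for (current_order = MAX_ORDER - 1; current_order >= (int)order;
         current_order--) {
        struct free_area *area = &zone->free_area[current_order];

        for (i = 0;; i++) {
            fallback_mt = fallbacks[start_migratetype][i];
            if (fallback_mt == MIGRATE_TYPES)   // end of fallback list
                break;
            if (list_empty(&area->free_list[fallback_mt]))
                continue;
            // Found a block: take it off the fallback list; the caller
            // then splits it down to `order` with expand().
            return get_page_from_free_area(area, fallback_mt);
        }
    }
    return NULL;    // no fallback possible; enter reclaim/compaction
}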

4. Allocation Path in the Kernel

The high-level function get_page_from_freelist() iterates over the zonelist (all zones the current allocation is allowed to use), checks watermarks, optionally performs node reclaim, and finally calls rmqueue() to obtain a page from the buddy system. Simplified:

static struct page *get_page_from_freelist(gfp_t gfp_mask,
                                           unsigned int order,
                                           int alloc_flags,
                                           const struct alloc_context *ac)
{
    struct zoneref *z = ac->preferred_zoneref;
    struct zone *zone;
    struct page *page;

    // Walk the allowed zones in preference order, starting from the
    // preferred one.
    for_next_zone_zonelist_nodemask(zone, z, ac->highest_zoneidx,
                                    ac->nodemask) {
        unsigned long mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);

        // Skip zones whose free memory is below the watermark.
        if (!zone_watermark_fast(zone, order, mark, ac->highest_zoneidx,
                                 alloc_flags, gfp_mask))
            continue;

        // Try to pull a block of the requested order from this zone.
        page = rmqueue(ac->preferred_zoneref->zone, zone,
                       order, gfp_mask, alloc_flags,
                       ac->migratetype);
        if (page)
            return page;
    }
    return NULL;
}
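
The watermark test itself is worth spelling out. A simplified rendering of what the kernel's __zone_watermark_ok() checks (the real function also honours lowmem reserves, per-migratetype lists, CMA, and ALLOC_* flags):

static bool watermark_ok_sketch(struct zone *z, unsigned int order,
                                unsigned long mark, long free_pages)
{
    unsigned int o;

    // First: enough free pages overall, even after this allocation.
    if (free_pages - (1L << order) <= (long)mark)
        return false;

    // An order-0 request only needs one free page.
    if (order == 0)
        return true;

    // Higher orders additionally need at least one free block of
    // sufficient order somewhere in the zone.
    for (o = order; o < MAX_ORDER; o++) {
        if (z->free_area[o].nr_free)
            return true;
    }
    return false;
}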

4.1 Fast Path for Order‑0 (pcp‑list)

For a single page the kernel first tries the per-CPU page cache (pcp-list), avoiding the zone lock entirely in the common case. If the local CPU's pcp-list is empty, a batch of pages is taken from the buddy system and cached.

static struct page *rmqueue_pcplist(struct zone *preferred_zone,
                                    struct zone *zone, gfp_t gfp_flags,
                                    int migratetype, unsigned int alloc_flags)
{
    struct per_cpu_pages *pcp;
    struct list_head *list;
    struct page *page;
    unsigned long flags;

    // The per-CPU lists may only be touched with IRQs disabled.
    local_irq_save(flags);
    pcp = &this_cpu_ptr(zone->pageset)->pcp;
    list = &pcp->lists[migratetype];
    page = __rmqueue_pcplist(zone, migratetype, alloc_flags, pcp, list);
    if (page)
        zone_statistics(preferred_zone, zone);
    local_irq_restore(flags);
    return page;
}
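
The refill logic lives in __rmqueue_pcplist(): when the local list is empty, rmqueue_bulk() pulls a whole batch of order-0 pages out of the buddy system under a single zone-lock acquisition, amortizing the locking cost over pcp->batch pages. Lightly abridged:

static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
                                      unsigned int alloc_flags,
                                      struct per_cpu_pages *pcp,
                                      struct list_head *list)
{
    struct page *page;

    do {
        if (list_empty(list)) {
            // Refill: grab pcp->batch order-0 pages from the buddy
            // system under one zone->lock acquisition.
            pcp->count += rmqueue_bulk(zone, 0, pcp->batch, list,
                                       migratetype, alloc_flags);
            if (list_empty(list))
                return NULL;    // the zone is genuinely out of pages
        }

        // Pop the first cached page off the per-CPU list.
        page = list_first_entry(list, struct page, lru);
        list_del(&page->lru);
        pcp->count--;
    } while (check_new_pcp(page));  // skip pages failing sanity checks

    return page;
}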

5. Releasing Memory

Releasing memory goes through __free_pages(). For order 0 the page is returned to the pcp-list; for higher orders the page is handed back to the buddy system via free_one_page()/__free_one_page(), which may merge it with its buddy.

void __free_pages(struct page *page, unsigned int order)
{
    if (put_page_testzero(page))
        free_the_page(page, order);
}

static void free_the_page(struct page *page, unsigned int order)
{
    if (order == 0)
        free_unref_page(page);
    else
        __free_pages_ok(page, order);
}
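
For completeness, here is a minimal sketch of how a caller typically pairs these with alloc_pages(); the demo() function is hypothetical, but the API calls are the standard ones:

#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/string.h>

static int demo(void)
{
    // Ask the buddy system for 2^2 = 4 contiguous pages.
    struct page *page = alloc_pages(GFP_KERNEL, 2);

    if (!page)
        return -ENOMEM;

    // GFP_KERNEL pages live in the linear mapping, so they can be
    // used directly through page_address().
    memset(page_address(page), 0, 4 * PAGE_SIZE);

    // The order passed here must match the order used to allocate.
    __free_pages(page, 2);
    return 0;
}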

5.1 Returning a Single Page to the pcp‑list

void free_unref_page(struct page *page)
{
    unsigned long pfn = page_to_pfn(page);
    unsigned long flags;

    // Sanity checks; bail out if the page is not in a freeable state.
    if (!free_unref_page_prepare(page, pfn))
        return;

    local_irq_save(flags);
    free_unref_page_commit(page, pfn);
    local_irq_restore(flags);
}

static void free_unref_page_commit(struct page *page, unsigned long pfn)
{
    struct zone *zone = page_zone(page);
    struct per_cpu_pages *pcp = &this_cpu_ptr(zone->pageset)->pcp;
    int migratetype = get_pcppage_migratetype(page);

    // Only the first MIGRATE_PCPTYPES types have their own pcp lists.
    // Isolated pages must go straight back to the buddy system; other
    // special types are parked on the MOVABLE pcp list.
    if (migratetype >= MIGRATE_PCPTYPES) {
        if (is_migrate_isolate(migratetype)) {
            free_one_page(zone, page, pfn, 0, migratetype);
            return;
        }
        migratetype = MIGRATE_MOVABLE;
    }

    list_add(&page->lru, &pcp->lists[migratetype]);
    pcp->count++;

    // If the cache grew past its high watermark, drain a batch of
    // pages back to the buddy system.
    if (pcp->count >= pcp->high) {
        unsigned long batch = READ_ONCE(pcp->batch);
        free_pcppages_bulk(zone, batch, pcp);
    }
}

5.2 Returning a Multi‑Page Block to the Buddy System

free_one_page() takes the zone lock, which protects all of the zone's free lists, and delegates the merging itself to __free_one_page():

static void free_one_page(struct zone *zone, struct page *page,
                          unsigned long pfn, unsigned int order,
                          int migratetype)
{
    spin_lock(&zone->lock);
    __free_one_page(page, pfn, zone, order, migratetype);
    spin_unlock(&zone->lock);
}

static void __free_one_page(struct page *page, unsigned long pfn,
                           struct zone *zone, unsigned int order,
                           int migratetype)
{
    unsigned long combined_pfn;
    unsigned long buddy_pfn;
    struct page *buddy;
    unsigned int max_order = MAX_ORDER;

    while (order < max_order - 1) {
        // The buddy's pfn differs only in bit `order`: pfn ^ (1 << order).
        buddy_pfn = __find_buddy_pfn(pfn, order);
        if (!pfn_valid_within(buddy_pfn))
            break;
        buddy = page + (buddy_pfn - pfn);
        // Stop if the buddy is not a free block of the same order.
        if (!page_is_buddy(page, buddy, order))
            break;
        // Remove the buddy from its free list and merge: the combined
        // block starts at the lower of the two pfns.
        del_page_from_free_area(buddy, &zone->free_area[order]);
        combined_pfn = buddy_pfn & pfn;
        page = page + (combined_pfn - pfn);
        pfn = combined_pfn;
        order++;
    }
    // Put the (possibly merged) block on the list of its final order.
    set_page_order(page, order);
    add_to_free_area(page, &zone->free_area[order], migratetype);
}
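
A concrete trace helps here. The following user-space sketch replays the merge arithmetic for freeing pfn 100 at order 0, assuming every buddy it meets is free (in the kernel, page_is_buddy() would stop the loop at the first buddy that is not):

#include <stdio.h>

int main(void)
{
    unsigned long pfn = 100;        // page being freed
    unsigned int order = 0;
    unsigned int max_order = 11;    // default MAX_ORDER

    // Assume every buddy is free so the loop runs to completion.
    while (order < max_order - 1) {
        unsigned long buddy_pfn = pfn ^ (1UL << order);
        unsigned long combined_pfn = buddy_pfn & pfn;
        printf("order %u: block %lu merges with buddy %lu -> block %lu\n",
               order, pfn, buddy_pfn, combined_pfn);
        pfn = combined_pfn;
        order++;
    }
    return 0;
}

The first merges pair 100 with 101 and then 102; at order 2 the buddy is 96, so the merged block's start moves below the freed page, exactly as combined_pfn = buddy_pfn & pfn dictates.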

6. Summary

The buddy allocator is built around struct zone and its free_area arrays. Allocation walks from the requested order upwards, splitting larger blocks as needed. If no block of the requested migration type is available, the allocator falls back to other migration types according to a predefined table. Freeing does the opposite: a freed block is merged with its buddy repeatedly, as long as the buddy is itself free and of the same order, and the result is placed back on the free list of its final order. The per-CPU page cache (pcp-list) speeds up the common case of allocating or freeing a single page.
