Unlocking Linux Memory Management: From CPU Access to CMA and Page Allocation
This comprehensive guide walks through Linux memory management, explaining CPU memory access, virtual‑to‑physical address translation, page‑table initialization, zone organization, the buddy allocator, slab allocator, vmalloc, page‑fault handling, and CMA, providing code examples and diagrams to form a complete understanding.
Linux Memory Management Overview
Linux memory management is a core topic for mastering the Linux kernel. This article consolidates fragmented knowledge into a single, coherent guide that covers the entire memory‑management stack, from CPU memory access to high‑level allocators such as CMA.
CPU Access to Memory
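The CPU does not address physical memory directly. Each virtual address is translated by the MMU, and caches sit between the core and main memory. Key concepts include:
TLB : a small, fast cache of recent virtual-to-physical translations, so that most accesses avoid a full page-table walk.
Caches : the L1 and L2 caches (on ARMv8), which hold recently used instructions and data and speed up CPU-memory communication.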
Virtual‑to‑Physical Address Translation
On ARM64 the virtual address space size is controlled by CONFIG_ARM64_VA_BITS (commonly 48 bits). Kernel space resides in the high half (0xFFFF0000_00000000‑0xFFFFFFFF_FFFFFFFF) and user space in the low half (0x00000000_00000000‑0x0000FFFF_FFFFFFFF). The kernel uses a four‑level page table hierarchy (PGD → PUD → PMD → PTE). The translation proceeds as follows:
Read the top-level table base address from the translation table base register (TTBR0_EL1 for user addresses, TTBR1_EL1 for kernel addresses; CR3 plays this role on x86) and use the highest index field of the virtual address to select the PGD entry.
Read the pgd_t entry to obtain the physical base of the next-level table.
Repeat the indexing process for PUD, PMD, and finally PTE, each time adding the index taken from the virtual address to the base address of the current table.
Combine the page-frame base from the PTE with the page offset from the virtual address to obtain the final physical address.
Each step fetches a physical page that contains the next level of the page table, making the process mechanical and index‑driven.
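The index arithmetic behind these steps can be sketched in user-space C. This is only an illustration, assuming 4 KB pages and a 48-bit virtual address (9 index bits per level plus a 12-bit offset); the SKETCH_* constants and the function names are hypothetical and are not kernel macros.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: 4 KB pages, 48-bit VA, 9 index bits per table level. */
#define SKETCH_PAGE_SHIFT     12
#define SKETCH_BITS_PER_LEVEL 9
#define SKETCH_LEVELS         4

/* level 0 = PGD, 1 = PUD, 2 = PMD, 3 = PTE */
static inline unsigned int va_table_index(uint64_t va, int level)
{
    assert(level >= 0 && level < SKETCH_LEVELS);
    int shift = SKETCH_PAGE_SHIFT +
                SKETCH_BITS_PER_LEVEL * (SKETCH_LEVELS - 1 - level);
    return (unsigned int)((va >> shift) & ((1u << SKETCH_BITS_PER_LEVEL) - 1));
}

/* Low 12 bits: byte offset inside the 4 KB page. */
static inline unsigned int va_page_offset(uint64_t va)
{
    return (unsigned int)(va & ((1u << SKETCH_PAGE_SHIFT) - 1));
}

/* Last step of the walk: page-frame base combined with the page offset. */
static inline uint64_t make_phys_addr(uint64_t pfn, uint64_t va)
{
    return (pfn << SKETCH_PAGE_SHIFT) | va_page_offset(va);
}
```

Calling va_table_index(va, 0) through va_table_index(va, 3) yields the PGD, PUD, PMD, and PTE indices in walk order; a real walk would read one table entry from memory between each step.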
Linux Memory Initialization
During early boot the kernel creates the initial page tables in arch/arm64/kernel/head.S. The function __create_page_tables performs two mappings:
Identity map : maps the idmap_text region so virtual and physical addresses are equal.
Kernel image map : maps the kernel’s text, rodata, data, and bss sections.
arch/arm64/kernel/head.S:
ENTRY(stext)
    bl   preserve_boot_args
    bl   el2_setup                      // Drop to EL1, w0=cpu_boot_mode
    adrp x23, __PHYS_OFFSET
    and  x23, x23, MIN_KIMG_ALIGN - 1   // KASLR offset, defaults to 0
    bl   set_cpu_boot_mode_flag
    bl   __create_page_tables           // build the identity map and kernel image map
    bl   __cpu_setup                    // initialise processor
    b    __primary_switch
ENDPROC(stext)
The __create_page_tables routine calls create_pgd_entry to build the PGD and intermediate levels, and create_block_map to populate the final-level block entries.
Physical Memory Organization
Linux classifies physical memory into several concepts:
Node : Represents a NUMA node (non‑uniform memory access) or a single UMA node.
Zone : Divides memory into ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM (32-bit systems only), etc., each with its own allocation policies.
Page : The basic unit of physical memory management (4 KB in the default configuration), represented by struct page.
Page frame : The physical storage unit for a page; the page‑frame number (PFN) is phys_addr >> PAGE_SHIFT.
Linux supports several memory models (e.g., CONFIG_FLATMEM, CONFIG_DISCONTIGMEM, CONFIG_SPARSEMEM_VMEMMAP), with ARM64 typically using the sparse model. The struct page array is mapped into the kernel virtual space vmemmap, so the address of a page descriptor is vmemmap + pfn.
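The PFN arithmetic and the vmemmap-style lookup can be mimicked in user space. The following is a minimal sketch, assuming 4 KB pages; struct sketch_page and the helper names are hypothetical stand-ins, not the kernel's struct page or its page_to_pfn/pfn_to_page macros.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12            /* assumption: 4 KB pages */

struct sketch_page {                    /* stand-in for struct page */
    unsigned long flags;
};

static inline uint64_t phys_to_pfn(uint64_t phys)
{
    return phys >> SKETCH_PAGE_SHIFT;
}

static inline uint64_t pfn_to_phys(uint64_t pfn)
{
    return pfn << SKETCH_PAGE_SHIFT;
}

/* vmemmap style: one descriptor per page frame, indexed directly by PFN. */
static inline struct sketch_page *sketch_pfn_to_page(struct sketch_page *vmemmap_base,
                                                     uint64_t pfn)
{
    return vmemmap_base + pfn;
}

static inline uint64_t sketch_page_to_pfn(const struct sketch_page *vmemmap_base,
                                          const struct sketch_page *page)
{
    assert(page >= vmemmap_base);
    return (uint64_t)(page - vmemmap_base);
}
```

Because the descriptor array is virtually contiguous, both conversions reduce to pointer arithmetic, which is the main attraction of the vmemmap layout.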
Zoned Page‑Frame Allocator
The zoned page‑frame allocator manages all physical pages. Allocation follows a hierarchy:
If a request specifies ZONE_DMA, allocation is limited to that zone.
Otherwise the allocator tries ZONE_NORMAL → ZONE_DMA in order.
For ZONE_HIGHMEM requests it tries ZONE_HIGHMEM → ZONE_NORMAL → ZONE_DMA.
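The fallback order above can be expressed as a short loop. This is an illustrative sketch only: the enum values and zone_fallback_order are hypothetical, and the real kernel precomputes this order in per-node zonelists rather than deriving it on each allocation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical zone identifiers, ordered from most to least restrictive. */
enum sketch_zone {
    SK_ZONE_DMA = 0,
    SK_ZONE_NORMAL = 1,
    SK_ZONE_HIGHMEM = 2,
    SK_NR_ZONES
};

/*
 * Fill `order` with the zones to try for a request whose highest permitted
 * zone is `highest`, starting at that zone and falling back towards
 * ZONE_DMA. Returns the number of entries written.
 */
static size_t zone_fallback_order(enum sketch_zone highest,
                                  enum sketch_zone order[SK_NR_ZONES])
{
    size_t n = 0;

    assert(highest < SK_NR_ZONES);
    for (int z = (int)highest; z >= 0; z--)
        order[n++] = (enum sketch_zone)z;
    return n;
}
```

A request limited to ZONE_DMA therefore produces a single-entry list, which is why DMA allocations cannot fall back anywhere else.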
All allocation paths eventually call __alloc_pages_nodemask:
struct page *__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
                                    int preferred_nid, nodemask_t *nodemask)
{
    /* Fast path: take a page straight from the free lists. */
    page = get_page_from_freelist(alloc_mask, order, alloc_flags, &ac);
    ...
    /* Slow path: reached only when the fast path found nothing. */
    page = __alloc_pages_slowpath(alloc_mask, order, &ac);
    ...
}
Two allocation paths exist:
Fast path : tries per‑CPU caches and the buddy system.
Slow path : runs when fast allocation fails; it wakes kswapd and then tries direct reclaim, compaction, and, as a last resort, the OOM killer.
Buddy Allocation Algorithm
The buddy system maintains free lists for block sizes of 1, 2, 4, 8 … 1024 pages (orders 0 to 10; up to 4 MiB with 4 KB pages). When no free block of the requested size exists, a larger block is split in half repeatedly; when a block is freed, it is merged with its buddy if the buddy is also free.
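The splitting and merging rely on simple bit arithmetic on page-frame numbers: a block of 2^order pages and its buddy differ only in bit `order` of the PFN. The sketch below illustrates that arithmetic in user space; SKETCH_MAX_ORDER and the function names are assumptions chosen to match the sizes quoted above, not the kernel's own definitions.

```c
#include <assert.h>

#define SKETCH_MAX_ORDER 11     /* assumption: orders 0..10, i.e. 1..1024 pages */

/* PFN of the buddy of the block that starts at `pfn` with size 2^order. */
static unsigned long find_buddy_pfn(unsigned long pfn, unsigned int order)
{
    assert(order < SKETCH_MAX_ORDER);
    return pfn ^ (1UL << order);
}

/* Start PFN of the order+1 block formed when a block merges with its buddy. */
static unsigned long merged_block_pfn(unsigned long pfn, unsigned int order)
{
    assert(order < SKETCH_MAX_ORDER);
    return pfn & ~(1UL << order);
}

/* Smallest order whose block covers nr_pages, or -1 if the request is too big. */
static int order_for_pages(unsigned long nr_pages)
{
    unsigned int order = 0;

    if (nr_pages == 0)
        return -1;
    while ((1UL << order) < nr_pages) {
        if (++order >= SKETCH_MAX_ORDER)
            return -1;
    }
    return (int)order;
}
```

For example, a 5-page request is rounded up to an order-3 (8-page) block, which is one source of the internal fragmentation that higher-level allocators try to hide.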
static struct page *get_page_from_freelist(gfp_t gfp_mask, unsigned int order,
                int alloc_flags, const struct alloc_context *ac)
{
    /* Walk every zone allowed by the zonelist and nodemask. */
    for_next_zone_zonelist_nodemask(zone, z, ac->zonelist, ac->high_zoneidx, ac->nodemask) {
        /* Below the watermark: try a quick node reclaim before giving up on this zone. */
        if (!zone_watermark_fast(zone, order, mark, ac_classzone_idx(ac), alloc_flags)) {
            ret = node_reclaim(zone->zone_pgdat, gfp_mask, order);
            switch (ret) {
            case NODE_RECLAIM_NOSCAN:
            case NODE_RECLAIM_FULL:
                continue;
            default:
                /* Recheck the watermark after reclaim. */
                if (zone_watermark_ok(zone, order, mark, ac_classzone_idx(ac), alloc_flags))
                    goto try_this_zone;
                continue;
            }
        }
try_this_zone:
        page = rmqueue(ac->preferred_zoneref->zone, zone, order, gfp_mask, alloc_flags, ac->migratetype);
        if (page)
            return page;    /* page preparation elided in this excerpt */
    }
    return NULL;
}
On the fast path the allocator first checks each zone against its low watermark; if the zone falls short, it attempts a quick node reclaim and rechecks the watermark before moving on to the next zone.
Buddy Allocation Functions
static inline struct page *rmqueue(struct zone *preferred_zone,
                struct zone *zone, unsigned int order,
                gfp_t gfp_flags, unsigned int alloc_flags,
                int migratetype)
{
    /* Single pages are served from the per-CPU page lists. */
    if (likely(order == 0)) {
        page = rmqueue_pcplist(preferred_zone, zone, order, gfp_flags, migratetype);
        goto out;
    }
    do {
        page = NULL;
        /* High-priority requests may use the high-atomic reserve first. */
        if (alloc_flags & ALLOC_HARDER)
            page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
        if (!page)
            page = __rmqueue(zone, order, migratetype);
    } while (page && check_new_pages(page, order));
    ...
}
Watermark Management
Each zone defines three watermarks (min, low, high) with the ratio 4:5:6. The kernel computes them based on total memory and per‑zone proportions. The behavior is:
If free pages < min, the zone is critically low and direct reclamation occurs.
If free pages < low, the kswapd daemon is awakened to reclaim pages.
If free pages > high, the zone is healthy and kswapd sleeps.
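The watermark rules can be summarised in a few lines of C. This is a sketch under the 4:5:6 ratio stated above (low = min + min/4, high = min + min/2); the struct, enum, and function names are hypothetical, and the real kernel derives the min value from a tunable and scales it per zone.

```c
#include <assert.h>

struct sketch_watermarks {
    unsigned long min, low, high;       /* all in pages */
};

enum sketch_zone_state {
    SK_DIRECT_RECLAIM,          /* free < min: the allocating task reclaims itself */
    SK_WAKE_KSWAPD,             /* min <= free < low: background reclaim is woken */
    SK_BETWEEN_LOW_AND_HIGH,    /* low <= free <= high: no new action is triggered */
    SK_HEALTHY                  /* free > high: kswapd can sleep */
};

/* Derive low and high from min using the 4:5:6 ratio. */
static struct sketch_watermarks sketch_watermarks_from_min(unsigned long min)
{
    struct sketch_watermarks wm = { min, min + min / 4, min + min / 2 };
    return wm;
}

static enum sketch_zone_state sketch_classify(const struct sketch_watermarks *wm,
                                              unsigned long free_pages)
{
    assert(wm && wm->min <= wm->low && wm->low <= wm->high);
    if (free_pages < wm->min)
        return SK_DIRECT_RECLAIM;
    if (free_pages < wm->low)
        return SK_WAKE_KSWAPD;
    if (free_pages > wm->high)
        return SK_HEALTHY;
    return SK_BETWEEN_LOW_AND_HIGH;
}
```

With min set to 4000 pages, the derived low and high marks are 5000 and 6000 pages.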
Memory Fragmentation and Compaction
Linux distinguishes internal and external fragmentation. Internal fragmentation occurs when a 4 KB page is allocated for a request smaller than 4 KB, leaving unused bytes. External fragmentation occurs when free pages are scattered and cannot satisfy a larger contiguous request.
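Compaction works on one zone at a time with two scanners. A migration scanner starts at the low end of the zone and collects movable pages; a free scanner starts at the high end and collects free pages. The movable pages are then migrated into those free slots, and the scanners stop when they meet, leaving a contiguous free region at the low end of the zone.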
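The two-ended scan can be modelled on a plain array. The following toy model is only a sketch: it treats a zone as an array of page states and moves one page at a time, whereas the kernel works on batches of pages per pageblock. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum sketch_page_state { SK_PAGE_FREE, SK_PAGE_MOVABLE, SK_PAGE_UNMOVABLE };

/*
 * Migrate movable pages found from the low end into free slots found from
 * the high end until the two scanners meet. Returns the number of pages moved.
 */
static size_t sketch_compact(enum sketch_page_state *pages, size_t n)
{
    size_t migrate = 0;     /* migration scanner position (low end) */
    size_t free_end = n;    /* one past the free scanner position (high end) */
    size_t moved = 0;

    assert(pages != NULL || n == 0);
    for (;;) {
        while (migrate < free_end && pages[migrate] != SK_PAGE_MOVABLE)
            migrate++;
        while (free_end > migrate && pages[free_end - 1] != SK_PAGE_FREE)
            free_end--;
        if (migrate >= free_end)
            break;                          /* scanners met */
        pages[free_end - 1] = SK_PAGE_MOVABLE;  /* page now lives in the high slot */
        pages[migrate] = SK_PAGE_FREE;          /* low slot becomes free */
        moved++;
        migrate++;
        free_end--;
    }
    return moved;
}
```

Unmovable pages stay where they are, which is why a single unmovable page in the middle of a range can still prevent a large contiguous allocation.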
Compaction Methods
static struct page *__alloc_pages_direct_compact(gfp_t gfp_mask, unsigned int order,
                unsigned int alloc_flags,
                const struct alloc_context *ac,
                enum compact_priority prio,
                enum compact_result *compact_result)
{
    struct page *page;
    unsigned int noreclaim_flag;

    /* Order-0 requests never need compaction. */
    if (!order)
        return NULL;
    noreclaim_flag = memalloc_noreclaim_save();
    *compact_result = try_to_compact_pages(gfp_mask, order, alloc_flags, ac, prio);
    memalloc_noreclaim_restore(noreclaim_flag);
    if (*compact_result <= COMPACT_INACTIVE)
        return NULL;
    count_vm_event(COMPACTSTALL);
    /* Compaction ran: retry the allocation from the free lists. */
    page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
    if (page) {
        struct zone *zone = page_zone(page);
        zone->compact_blockskip_flush = false;
        compaction_defer_reset(zone, order, true);
        count_vm_event(COMPACTSUCCESS);
        return page;
    }
    count_vm_event(COMPACTFAIL);
    cond_resched();
    return NULL;
}
Slab Allocator
The slab allocator provides byte‑level allocations on top of the page‑based buddy system. Allocation proceeds through four steps:
Try the per‑CPU cache freelist.
If empty, try the per‑CPU partial list.
If still empty, try the node‑wide partial list.
Allocate a new slab from the buddy system.
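The four-step fallback can be modelled with a few counters. This is a deliberately simplified sketch, not the slab allocator's actual data structures: struct sketch_cache and its fields are hypothetical, and every refill is assumed to bring in a slab whose objects are all free.

```c
#include <assert.h>
#include <stddef.h>

enum sketch_slab_source {
    SK_FROM_CPU_FREELIST,
    SK_FROM_CPU_PARTIAL,
    SK_FROM_NODE_PARTIAL,
    SK_FROM_NEW_SLAB
};

struct sketch_cache {
    size_t cpu_free_objects;    /* objects left on the per-CPU freelist */
    size_t cpu_partial_slabs;   /* slabs on the per-CPU partial list */
    size_t node_partial_slabs;  /* slabs on the node-wide partial list */
    size_t objects_per_slab;
};

/* Allocate one object and report which level of the hierarchy supplied it. */
static enum sketch_slab_source sketch_slab_alloc(struct sketch_cache *c)
{
    assert(c != NULL && c->objects_per_slab > 0);

    if (c->cpu_free_objects > 0) {              /* step 1 */
        c->cpu_free_objects--;
        return SK_FROM_CPU_FREELIST;
    }
    if (c->cpu_partial_slabs > 0) {             /* step 2 */
        c->cpu_partial_slabs--;
        c->cpu_free_objects = c->objects_per_slab - 1;
        return SK_FROM_CPU_PARTIAL;
    }
    if (c->node_partial_slabs > 0) {            /* step 3 */
        c->node_partial_slabs--;
        c->cpu_free_objects = c->objects_per_slab - 1;
        return SK_FROM_NODE_PARTIAL;
    }
    /* step 4: a fresh slab would come from the buddy system */
    c->cpu_free_objects = c->objects_per_slab - 1;
    return SK_FROM_NEW_SLAB;
}
```

The ordering reflects cost: the first step needs no locking beyond the local CPU, while the last one goes all the way down to the page allocator.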
vmalloc
vmalloc maps a set of non-contiguous physical pages into a contiguous virtual address range. The process is:
Find a free virtual address hole between VMALLOC_START and VMALLOC_END.
Allocate the required number of pages with alloc_page.
Map each physical page into the selected virtual range.
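The result of these steps is a virtually contiguous range backed by scattered page frames. The sketch below models that mapping as an array of PFNs and shows how an address inside the range resolves to a physical address; struct sketch_vm_area and the function name are hypothetical, and 4 KB pages are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12                        /* assumption: 4 KB pages */
#define SKETCH_PAGE_SIZE  (1ULL << SKETCH_PAGE_SHIFT)

struct sketch_vm_area {
    uint64_t va_start;      /* start of the contiguous virtual range */
    size_t nr_pages;        /* number of pages mapped */
    const uint64_t *pfns;   /* backing page frames, not necessarily adjacent */
};

/* Resolve `va` to a physical address; returns 0 on success, -1 if out of range. */
static int sketch_vmalloc_to_phys(const struct sketch_vm_area *area,
                                  uint64_t va, uint64_t *phys)
{
    assert(area != NULL && phys != NULL);
    if (va < area->va_start)
        return -1;

    uint64_t off = va - area->va_start;
    uint64_t idx = off >> SKETCH_PAGE_SHIFT;
    if (idx >= area->nr_pages)
        return -1;

    *phys = (area->pfns[idx] << SKETCH_PAGE_SHIFT) | (off & (SKETCH_PAGE_SIZE - 1));
    return 0;
}
```

Two neighbouring virtual pages can land in unrelated physical frames, which is why vmalloc memory is unsuitable for devices that need physically contiguous buffers.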
Page‑Fault Handling
When a process accesses an unmapped virtual address, the CPU raises a synchronous exception. The ARM64 exception vectors dispatch to el0_sync (fault taken from user mode) or el1_sync (fault taken from kernel mode), which reads the ESR_EL1 register to determine the exception class and jumps to the appropriate handler (e.g., el1_da for a data abort taken in the kernel).
static int __do_page_fault(struct mm_struct *mm, unsigned long addr,
                unsigned int mm_flags, unsigned long vm_flags,
                struct task_struct *tsk)
{
    struct vm_area_struct *vma;
    int fault;

    /* Find the first VMA that ends above the faulting address. */
    vma = find_vma(mm, addr);
    fault = VM_FAULT_BADMAP;
    if (!vma)
        goto out;
    if (vma->vm_start > addr)
        goto check_stack;
good_area:
    // permission check
    if (!(vma->vm_flags & vm_flags)) {
        fault = VM_FAULT_BADACCESS;
        goto out;
    }
    return handle_mm_fault(vma, addr & PAGE_MASK, mm_flags);
check_stack:
    /* The address lies just below a stack VMA: try to grow the stack. */
    if (vma->vm_flags & VM_GROWSDOWN && !expand_stack(vma, addr))
        goto good_area;
out:
    return fault;
}
handle_mm_fault walks the four-level page table, allocating missing levels as needed, and finally calls handle_pte_fault to install the PTE.
Anonymous Page Faults
For anonymous mappings the first read maps the virtual page to the global zero page. A subsequent write triggers copy‑on‑write: a new physical page is allocated, zero‑filled, and the PTE is updated to point to the new page.
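The user-visible effect of this behaviour can be observed from a normal process. The sketch below assumes a Linux system with mmap and MAP_ANONYMOUS; it only demonstrates the result (a fresh anonymous page reads as zero and becomes writable on first write) and cannot show whether the kernel used the shared zero page internally. The function name is hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 0 if a fresh anonymous page reads as zero and then accepts a write. */
static int anon_page_zero_then_write(void)
{
    long page_size = sysconf(_SC_PAGESIZE);
    assert(page_size > 0);

    unsigned char *p = mmap(NULL, (size_t)page_size, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    volatile unsigned char *vp = p;
    int first_read_is_zero = (vp[0] == 0);  /* read fault on an untouched page */
    vp[0] = 0xAB;                           /* write fault: a private page is needed */
    int write_is_visible = (vp[0] == 0xAB);

    munmap(p, (size_t)page_size);
    return (first_read_is_zero && write_is_visible) ? 0 : -1;
}
```

No physical page has to be dedicated to the mapping until the write happens, which is why large untouched anonymous mappings are cheap.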
Swap‑in Faults
If the PTE points to a swap entry, do_swap_page looks up the page in the swap cache; if absent it allocates a new page, reads the data from swap, updates counters, and installs a new PTE.
Write‑Protect Faults (COW)
When a write‑protected page is written, do_wp_page handles copy‑on‑write by allocating a new page, copying the contents, and updating the PTE. Shared writable mappings skip the copy and simply adjust permissions.
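Copy-on-write is easiest to observe across fork, where parent and child initially share the same physical pages. The following sketch assumes a POSIX system; it demonstrates only the visible outcome (the child's write does not reach the parent), not the internal do_wp_page path, and the function name is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * The child writes to a private page shared with the parent after fork.
 * Returns the value the parent still sees afterwards, or -1 on error.
 */
static int cow_parent_value_after_child_write(void)
{
    int *value = malloc(sizeof *value);
    if (!value)
        return -1;
    *value = 1;

    pid_t pid = fork();
    if (pid < 0) {
        free(value);
        return -1;
    }
    if (pid == 0) {
        *value = 2;                     /* write fault: the child gets its own copy */
        _exit(*value == 2 ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        free(value);
        return -1;
    }
    int seen = *value;                  /* the parent's page was never modified */
    free(value);
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;
    return seen;
}
```

The parent still reads 1 after the child has written 2, because the write was applied to the child's private copy of the page.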
Contiguous Memory Allocator (CMA)
CMA reserves a region of memory for allocating large contiguous blocks, typically for DMA buffers. The region can be defined via device-tree reserved-memory nodes or the kernel command-line parameter cma=nn[M|G]@[start[-end]].
static int __init early_cma(char *p)
{
    /* Size, e.g. "64M". */
    size_cmdline = memparse(p, &p);
    if (*p != '@') {
        limit_cmdline = __pa(high_memory);
        return 0;
    }
    /* Optional "@start". */
    base_cmdline = memparse(p + 1, &p);
    if (*p != '-') {
        limit_cmdline = base_cmdline + size_cmdline;
        return 0;
    }
    /* Optional "-end". */
    limit_cmdline = memparse(p + 1, &p);
    return 0;
}
early_param("cma", early_cma);
During boot the CMA area is added to the buddy system with cma_init_reserved_areas, marking its pages with MIGRATE_CMA and freeing them back to the buddy allocator.
static int __init cma_init_reserved_areas(void)
{
    int i;

    /* Activate every CMA area registered during early boot. */
    for (i = 0; i < cma_area_count; i++) {
        int ret = cma_activate_area(&cma_areas[i]);
        if (ret)
            return ret;
    }
    return 0;
}
core_initcall(cma_init_reserved_areas);
Allocation uses cma_alloc, which ultimately calls alloc_contig_range(..., MIGRATE_CMA, ...). Because cma_alloc may migrate and reclaim pages, and can therefore sleep, it must not be called from atomic context.
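The cma= syntax described earlier in this section can be mimicked in user space to see how the three fields are derived. This is a sketch only: sketch_memparse is a simplified stand-in for the kernel's memparse that handles just the K, M, and G suffixes, and, unlike the excerpt above, it reports a missing start address with base and limit set to 0 rather than a platform-specific default. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct sketch_cma_args {
    uint64_t size, base, limit;     /* base/limit are 0 when not specified */
};

/* Simplified memparse stand-in: a number followed by an optional K/M/G suffix. */
static uint64_t sketch_memparse(const char *s, const char **end)
{
    char *e;
    uint64_t v = strtoull(s, &e, 0);

    switch (*e) {
    case 'G': case 'g': v <<= 10;   /* fall through */
    case 'M': case 'm': v <<= 10;   /* fall through */
    case 'K': case 'k': v <<= 10; e++; break;
    default: break;
    }
    *end = e;
    return v;
}

/* Parse "size[@start[-end]]" in the same order as the steps shown above. */
static void sketch_parse_cma(const char *arg, struct sketch_cma_args *out)
{
    const char *p = arg;

    assert(arg != NULL && out != NULL);
    out->size = sketch_memparse(p, &p);
    out->base = 0;
    out->limit = 0;
    if (*p != '@')
        return;
    out->base = sketch_memparse(p + 1, &p);
    if (*p != '-') {
        out->limit = out->base + out->size;
        return;
    }
    out->limit = sketch_memparse(p + 1, &p);
}
```

When only a start address is given, the limit collapses to start + size, which pins the region to exactly that address instead of letting the kernel choose a location.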
Conclusion
By following the CPU’s memory‑access path, understanding virtual‑to‑physical translation, mastering the zone and buddy allocators, learning how slab and vmalloc build on top of them, and finally seeing how CMA provides contiguous memory, readers obtain a complete, closed‑loop view of Linux memory management.
Liangxu Linux
Liangxu, a self‑taught IT professional now working as a Linux development engineer at a Fortune 500 multinational, shares extensive Linux knowledge—fundamentals, applications, tools, plus Git, databases, Raspberry Pi, etc. (Reply “Linux” to receive essential resources.)
