Deep Dive into Linux mmap: How the Kernel Allocates Virtual Memory

This article explains the Linux kernel's mmap implementation, covering how virtual memory is allocated, the role of get_unmapped_area, mmap_region, overcommit policies, VMA merging, and the underlying data structures such as mm_struct and vm_area_struct.

Bin's Tech Cabin

Oct 10, 2023

1. Preprocessing Huge-Page Mappings

The SYSCALL_DEFINE6(mmap, ...) entry point forwards to ksys_mmap_pgoff, which handles huge-page preprocessing. It checks whether the mapping is anonymous or file-backed, validates MAP_HUGETLB usage, aligns the length to the huge-page size for hugetlbfs files, and sets up the backing file for anonymous huge-page mappings. Invalid arguments are rejected with error codes such as EBADF or EINVAL.

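SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
                unsigned long, prot, unsigned long, flags,
                unsigned long, fd, unsigned long, off)
{
    /* The byte offset must be page-aligned before it becomes a page index. */
    if (off & ~PAGE_MASK)
        return -EINVAL;

    return ksys_mmap_pgoff(addr, len, prot, flags, fd, off >> PAGE_SHIFT);
}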
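Because the entry point turns the byte offset into a page index with off >> PAGE_SHIFT, an offset that is not a multiple of the page size cannot be expressed, and mmap(2) documents EINVAL for it. The following user-space sketch makes that observable. The helper name mmap_offset_result is hypothetical, and the sketch uses an anonymous mapping for brevity, on the assumption that the alignment check runs before the mapping type is examined.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map one anonymous page at the given byte offset.
 * Returns 0 on success, or the errno value reported by mmap(). */
static int mmap_offset_result(off_t offset)
{
    long page = sysconf(_SC_PAGESIZE);
    void *p;

    assert(page > 0);
    p = mmap(NULL, (size_t)page, PROT_READ,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, offset);
    if (p == MAP_FAILED)
        return errno;
    munmap(p, (size_t)page);
    return 0;
}
```

Calling mmap_offset_result(0) should succeed, while mmap_offset_result(1) should report EINVAL.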

The key points extracted from ksys_mmap_pgoff are:

A mapping without MAP_ANONYMOUS is treated as file-backed, so fd must refer to an open file; if it does not, the call fails with EBADF.

A file-backed huge-page mapping requires the file to reside on a hugetlbfs filesystem; passing MAP_HUGETLB with a file on any other filesystem fails with EINVAL. For an anonymous huge-page mapping, MAP_HUGETLB is combined with MAP_ANONYMOUS, and the kernel backs the mapping with an internal hugetlbfs file. The kernel reserves the required number of huge pages in the appropriate hstate pool before creating the mapping.

The reserved huge pages are visible in /proc/meminfo under the HugePages_Rsvd field.
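The EBADF rule is easy to observe from user space. The sketch below attempts a small private mapping with and without MAP_ANONYMOUS while passing an invalid descriptor. The helper name mmap_result and the fixed 4096-byte length are assumptions made for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: attempt a 4096-byte private mapping with the given
 * extra flags and file descriptor.
 * Returns 0 on success, or the errno value reported by mmap(). */
static int mmap_result(int extra_flags, int fd)
{
    const size_t len = 4096;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | extra_flags, fd, 0);

    if (p == MAP_FAILED)
        return errno;
    munmap(p, len);
    return 0;
}
```

With fd set to -1, mmap_result(0, -1) should report EBADF because the kernel tries to look up a file, whereas mmap_result(MAP_ANONYMOUS, -1) should succeed.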

2. Whether to Allocate Physical Memory Immediately

Normally mmap only reserves virtual address space; physical pages are allocated on first access via a page fault. If MAP_POPULATE or MAP_LOCKED is set, the kernel pre-populates the mapping by walking the VMA range and faulting in each page before mmap returns. MAP_POPULATE has no effect when it is combined with MAP_NONBLOCK.

int __mm_populate(unsigned long start, unsigned long len, int ignore_errors)
{
    // For each VMA overlapping [start, start + len), call
    // populate_vma_page_range to fault in the pages of that VMA.
}

The function populate_vma_page_range calls __get_user_pages to allocate physical pages for each virtual page.
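The difference between lazy and pre-populated mappings can be checked from user space with mincore(2), which reports whether each page of a range is resident. This is a minimal sketch: the helper name resident_after_mmap and the 64-page cap are assumptions, and the counts can differ under memory pressure.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: map `npages` anonymous pages with `extra_flags` and
 * return how many mincore() reports as resident, or -1 on error. */
static long resident_after_mmap(int extra_flags, size_t npages)
{
    long page = sysconf(_SC_PAGESIZE);
    unsigned char vec[64];
    long resident = 0;
    size_t len, i;
    void *p;

    assert(page > 0 && npages > 0 && npages <= sizeof(vec));
    len = npages * (size_t)page;
    p = mmap(NULL, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | extra_flags, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    if (mincore(p, len, vec) != 0) {
        munmap(p, len);
        return -1;
    }
    for (i = 0; i < npages; i++)
        resident += vec[i] & 1;        /* bit 0 set: page is in RAM */
    munmap(p, len);
    return resident;
}
```

Without extra flags the untouched mapping should have no resident pages; with MAP_POPULATE every page should already be resident when mmap returns.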

3. Overall Virtual Memory Mapping Process

The kernel first finds an unmapped area using get_unmapped_area, which may involve arch_get_unmapped_area (classic layout) or arch_get_unmapped_area_topdown (new layout). The layout depends on the /proc/sys/vm/legacy_va_layout setting.

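The search ensures the candidate region lies between low_limit and high_limit and that the gap is large enough for the requested length. In the classic layout the search runs upward from low_limit = mm->mmap_base to high_limit = TASK_SIZE; in the top-down layout it runs downward, with mm->mmap_base serving as high_limit.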
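To make the two search directions concrete, here is a simplified user-space sketch that scans a sorted array of mapped ranges instead of the kernel's tree. The names struct range, find_gap_bottom_up, find_gap_top_down, and NO_GAP are hypothetical; the kernel's real search also handles alignment and guard gaps, which are left out here.

```c
#include <assert.h>
#include <stddef.h>

struct range { unsigned long start, end; };    /* half-open [start, end) */

#define NO_GAP (~0UL)

/* Lowest address in [low, high) where `len` bytes fit between mappings.
 * `maps` must be sorted by address and non-overlapping. */
static unsigned long find_gap_bottom_up(const struct range *maps, size_t n,
                                        unsigned long low, unsigned long high,
                                        unsigned long len)
{
    unsigned long candidate = low;
    size_t i;

    assert(low <= high);
    for (i = 0; i < n; i++) {
        if (maps[i].end <= candidate)
            continue;                   /* mapping lies below the candidate */
        if (maps[i].start >= candidate && maps[i].start - candidate >= len)
            break;                      /* the gap before this mapping fits */
        candidate = maps[i].end;        /* otherwise skip past the mapping */
    }
    return (candidate <= high && high - candidate >= len) ? candidate : NO_GAP;
}

/* Highest address in [low, high) where `len` bytes fit between mappings. */
static unsigned long find_gap_top_down(const struct range *maps, size_t n,
                                       unsigned long low, unsigned long high,
                                       unsigned long len)
{
    unsigned long ceiling = high;       /* exclusive upper end of the gap */
    size_t i;

    assert(low <= high);
    for (i = n; i-- > 0; ) {
        if (maps[i].start >= ceiling)
            continue;                   /* mapping lies above the ceiling */
        if (maps[i].end <= ceiling && ceiling - maps[i].end >= len)
            break;                      /* the gap after this mapping fits */
        ceiling = maps[i].start;        /* otherwise drop below the mapping */
    }
    return (ceiling >= low && ceiling - low >= len) ? ceiling - len : NO_GAP;
}
```

The bottom-up variant corresponds to the classic layout and the top-down variant to the newer one; both return the start address of the chosen gap.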

4. Finding an Unmapped Area

The kernel stores all VMAs in a red-black tree (mm->mm_rb) for fast lookup; this applies to kernels before 6.1, which replaced the tree with a maple tree. find_vma searches this tree for the first VMA whose vm_end exceeds the requested address. If a suitable gap exists between a VMA and its predecessor, that gap becomes the unmapped area.

struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
{
    // Search the red‑black tree for the first VMA with vm_end > addr.
}
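The lookup rule (first VMA with vm_end > addr) can be illustrated without a tree. The sketch below applies the same rule to a sorted array using binary search; struct vma and find_vma_sorted are hypothetical stand-ins, not the kernel's types. Note that the returned VMA does not necessarily contain addr: it may start above it.

```c
#include <assert.h>
#include <stddef.h>

struct vma { unsigned long vm_start, vm_end; };    /* [vm_start, vm_end) */

/* Return the first VMA with vm_end > addr, or NULL if there is none.
 * `vmas` must be sorted by address and non-overlapping. */
static const struct vma *find_vma_sorted(const struct vma *vmas, size_t n,
                                         unsigned long addr)
{
    size_t lo = 0, hi = n;              /* the answer index lies in [lo, hi] */

    assert(vmas != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (vmas[mid].vm_end > addr)
            hi = mid;                   /* mid qualifies; try an earlier one */
        else
            lo = mid + 1;               /* mid ends at or below addr */
    }
    return lo < n ? &vmas[lo] : NULL;
}
```

A caller that needs to know whether addr is actually mapped must additionally check that the result's vm_start is less than or equal to addr.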

Each VMA node stores rb_subtree_gap, the largest free gap that precedes any VMA in its subtree, allowing the kernel to skip subtrees that cannot satisfy the request.
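The pruning idea behind rb_subtree_gap can be sketched with a plain binary search tree. Everything here is illustrative: struct node, update_gaps, lowest_gap, and NO_GAP are hypothetical names, the tree is not balanced, and prev_end stands in for the end address of the preceding VMA.

```c
#include <assert.h>
#include <stddef.h>

#define NO_GAP (~0UL)

struct node {
    unsigned long start, end;       /* VMA range [start, end) */
    unsigned long prev_end;         /* end of the preceding VMA, 0 if none */
    unsigned long subtree_gap;      /* largest gap before any VMA in subtree */
    struct node *left, *right;
};

/* Recompute subtree_gap bottom-up and return the value for `n`. */
static unsigned long update_gaps(struct node *n)
{
    unsigned long gap, child;

    if (!n)
        return 0;
    assert(n->start >= n->prev_end);
    gap = n->start - n->prev_end;   /* free space directly below this VMA */
    child = update_gaps(n->left);
    if (child > gap)
        gap = child;
    child = update_gaps(n->right);
    if (child > gap)
        gap = child;
    n->subtree_gap = gap;
    return gap;
}

/* Start of the lowest gap of at least `len` bytes, or NO_GAP. */
static unsigned long lowest_gap(const struct node *n, unsigned long len)
{
    unsigned long found;

    if (!n || n->subtree_gap < len)
        return NO_GAP;              /* prune: nothing in here is big enough */
    found = lowest_gap(n->left, len);   /* lower addresses come first */
    if (found != NO_GAP)
        return found;
    if (n->start - n->prev_end >= len)
        return n->prev_end;
    return lowest_gap(n->right, len);
}
```

The single comparison against subtree_gap is what lets the search discard a whole subtree without visiting any of its nodes.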

5. The Essence of Memory Mapping

After an address range [addr, addr+len) is selected, mmap_region creates a vm_area_struct (VMA) for it. The kernel first checks limits via may_expand_vm (the RLIMIT_AS limit on total virtual memory and the RLIMIT_DATA limit on data mappings). If the mapping is MAP_FIXED, existing VMAs that overlap the range are unmapped with do_munmap before proceeding.

bool may_expand_vm(struct mm_struct *mm, vm_flags_t flags, unsigned long npages)
{
    if (mm->total_vm + npages > rlimit(RLIMIT_AS) >> PAGE_SHIFT)
        return false;
    if (is_data_mapping(flags) &&
        mm->data_vm + npages > rlimit(RLIMIT_DATA) >> PAGE_SHIFT)
        return false;
    return true;
}
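The RLIMIT_AS check can be triggered from user space by lowering the soft limit and requesting a mapping larger than it. This is a sketch under assumptions: mmap_under_as_limit is a hypothetical helper, and PROT_NONE with MAP_NORESERVE is used so that overcommit accounting should not be the reason for a failure.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>

/* Hypothetical helper: temporarily lower the soft RLIMIT_AS to `limit`
 * bytes, try to map `len` bytes, then restore the previous limit.
 * Returns 0 if the mapping succeeded, otherwise the errno value. */
static int mmap_under_as_limit(rlim_t limit, size_t len)
{
    struct rlimit old, tmp;
    int err = 0;
    void *p;

    assert(len > 0);
    if (getrlimit(RLIMIT_AS, &old) != 0)
        return errno;
    tmp = old;
    /* The soft limit may not exceed the hard limit. */
    if (old.rlim_max != RLIM_INFINITY && limit > old.rlim_max)
        tmp.rlim_cur = old.rlim_max;
    else
        tmp.rlim_cur = limit;
    if (setrlimit(RLIMIT_AS, &tmp) != 0)
        return errno;

    p = mmap(NULL, len, PROT_NONE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED)
        err = errno;
    else
        munmap(p, len);

    setrlimit(RLIMIT_AS, &old);     /* restoring the soft limit needs no privilege */
    return err;
}
```

With a 1 GiB limit and a 2 GiB request, the call should fail with ENOMEM, since the mapping alone exceeds the allowed address-space size.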

Private writable mappings created without MAP_NORESERVE (accountable mappings) increase the vm_committed_as counter, which is reported in /proc/meminfo as Committed_AS. The kernel enforces the overcommit policy (/proc/sys/vm/overcommit_memory) when accounting for such mappings.

The overcommit policies are:

OVERCOMMIT_GUESS (0) : the default heuristic; in recent kernels a single request is refused only if it exceeds physical RAM + swap.

OVERCOMMIT_ALWAYS (1) : the kernel always allows the allocation.

OVERCOMMIT_NEVER (2) : the kernel limits total committed memory to vm_commit_limit(), which by default is 50% of RAM (excluding huge pages) plus swap; the percentage is set by /proc/sys/vm/overcommit_ratio.
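The committed total and the limit it is compared against can both be read from /proc/meminfo, where they appear as Committed_AS and CommitLimit. The sketch below parses one field from that file. The helper names meminfo_field_kb and read_meminfo_kb and the 8 KiB buffer are assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Find a line of the form "<field>: <value> kB" in `text` and return the
 * value, or -1 if the field is absent. */
static long meminfo_field_kb(const char *text, const char *field)
{
    size_t flen = strlen(field);
    const char *line = text;

    assert(flen > 0);
    while (line && *line) {
        if (strncmp(line, field, flen) == 0 && line[flen] == ':')
            return strtol(line + flen + 1, NULL, 10);
        line = strchr(line, '\n');
        if (line)
            line++;                 /* step past the newline */
    }
    return -1;
}

/* Read /proc/meminfo and look up one field; -1 if it is unavailable. */
static long read_meminfo_kb(const char *field)
{
    char buf[8192];
    size_t n;
    FILE *f = fopen("/proc/meminfo", "r");

    if (!f)
        return -1;
    n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    return meminfo_field_kb(buf, field);
}
```

Under OVERCOMMIT_NEVER, comparing read_meminfo_kb("Committed_AS") with read_meminfo_kb("CommitLimit") indicates how much headroom remains before new accountable mappings start to fail.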

5.1 Overcommit Limit Calculation

unsigned long vm_commit_limit(void)
{
    unsigned long allowed;
    if (sysctl_overcommit_kbytes)
        allowed = sysctl_overcommit_kbytes >> (PAGE_SHIFT - 10);
    else
        allowed = ((totalram_pages() - hugetlb_total_pages()) *
                   sysctl_overcommit_ratio / 100);
    allowed += total_swap_pages;
    return allowed;
}

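Two reserves are then subtracted from the limit when a request is checked: admin_reserve_kbytes is held back from processes without administrative privileges so that root can still recover the system, and user_reserve_kbytes (capped at 1/32 of the requesting process's virtual size) keeps a single process from consuming the entire limit.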
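A worked calculation helps to see the numbers involved. The following user-space restatement takes every kernel quantity as a parameter; the function and parameter names are hypothetical, and the reserve adjustment reflects an assumed shape of the OVERCOMMIT_NEVER check rather than a copy of the kernel source.

```c
#include <assert.h>

/* Commit limit in pages. RAM, huge-page, and swap sizes are in pages. */
static unsigned long commit_limit_pages(unsigned long totalram,
                                        unsigned long hugetlb,
                                        unsigned long ratio_percent,
                                        unsigned long overcommit_kbytes,
                                        unsigned long swap,
                                        unsigned int page_shift)
{
    unsigned long allowed;

    assert(page_shift >= 10 && hugetlb <= totalram);
    if (overcommit_kbytes)
        allowed = overcommit_kbytes >> (page_shift - 10);   /* kB to pages */
    else
        allowed = (totalram - hugetlb) * ratio_percent / 100;
    return allowed + swap;
}

/* Assumed per-request adjustment: subtract the admin reserve for
 * unprivileged callers and the smaller of total_vm / 32 and the user reserve. */
static unsigned long effective_limit_pages(unsigned long limit, int is_admin,
                                           unsigned long admin_reserve_kb,
                                           unsigned long user_reserve_kb,
                                           unsigned long task_total_vm,
                                           unsigned int page_shift)
{
    unsigned long admin = admin_reserve_kb >> (page_shift - 10);
    unsigned long user = user_reserve_kb >> (page_shift - 10);
    unsigned long share = task_total_vm / 32;
    unsigned long cut = (is_admin ? 0 : admin) + (share < user ? share : user);

    assert(page_shift >= 10);
    return cut < limit ? limit - cut : 0;
}
```

For example, with 4 GiB of RAM (1048576 pages of 4 KiB), no huge pages, a ratio of 50, and 2 GiB of swap (524288 pages), the limit is 524288 + 524288 = 1048576 pages, or 4 GiB.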

5.2 VMA Merging

Before allocating a new VMA, the kernel attempts to merge the new region with adjacent VMAs to avoid extra allocations. Merging is possible only when:

The VMAs share the same flags (ignoring VM_SOFTDIRTY).

File‑backed VMAs map the same file and have contiguous vm_pgoff values.

Anonymous VMAs share the same anon_vma.

NUMA policies match.

Neither VMA defines a close operation.

vma_merge distinguishes two basic situations:

The new area falls in an unmapped hole between prev and next, as with mmap, brk, or mremap.

The new area covers a range that is already mapped and whose flags are about to change, as with mprotect.

Together these produce eight merge cases. For example, in the first situation, if the new area starts exactly at prev->vm_end and ends exactly at next->vm_start, the kernel merges all three ranges into prev and removes next:

if (prev && prev->vm_end == addr &&
    next && end == next->vm_start &&
    can_vma_merge_after(prev, vm_flags, anon_vma, file, pgoff, ctx) &&
    can_vma_merge_before(next, vm_flags, anon_vma, file, pgoff+pglen, ctx)) {
    __vma_adjust(prev, prev->vm_start, next->vm_end,
                 prev->vm_pgoff, NULL, prev);
    // prev now represents the merged region.
    return prev;
}

Other cases handle partial overlaps, extending only prev or next, or creating a new VMA when no merge is possible.
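The three outcomes for a new area that sits in a hole (merge with both neighbors, with prev only, or with next only) can be sketched in user space. This is a simplified model, not the kernel's code: struct region, can_merge_after, merge_area, and the fixed 4 KiB page size are assumptions, and the anon_vma, NUMA policy, and close-operation checks are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12        /* assumed 4 KiB pages */

struct region {
    unsigned long start, end;       /* [start, end) in bytes */
    unsigned long flags;
    const void *file;               /* NULL for anonymous memory */
    unsigned long pgoff;            /* file offset of `start`, in pages */
};

/* True if `b` can be appended directly after `a`. */
static bool can_merge_after(const struct region *a, const struct region *b)
{
    if (a->end != b->start || a->flags != b->flags || a->file != b->file)
        return false;
    if (!a->file)
        return true;                /* anonymous: adjacency is enough here */
    /* File-backed: the file offsets must be contiguous as well. */
    return a->pgoff + ((a->end - a->start) >> SKETCH_PAGE_SHIFT) == b->pgoff;
}

/* Try to absorb `area` into prev and/or next. Returns the region that now
 * covers `area`, or NULL if a new region is needed. *next_removed is set
 * when next has been swallowed by prev. */
static struct region *merge_area(struct region *prev, struct region *next,
                                 const struct region *area, bool *next_removed)
{
    bool after = prev && can_merge_after(prev, area);
    bool before = next && can_merge_after(area, next);

    assert(area->start < area->end);
    *next_removed = false;
    if (after && before) {          /* prev + area + next become one region */
        prev->end = next->end;
        *next_removed = true;
        return prev;
    }
    if (after) {                    /* extend prev upward */
        prev->end = area->end;
        return prev;
    }
    if (before) {                   /* extend next downward */
        next->start = area->start;
        next->pgoff = area->pgoff;
        return next;
    }
    return NULL;
}
```

When merge_area returns NULL, the caller would allocate a fresh region for the area, which mirrors the "no merge possible" path described above.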

6. Summary

This article, together with the previous "From Kernel World to mmap Essence (Principles)" piece, provides a comprehensive view of how Linux implements memory mapping. It covers five usage scenarios (private anonymous, private file‑backed, shared file‑backed, shared anonymous, and huge‑page mappings), the core kernel functions get_unmapped_area and mmap_region, the overcommit policy, and the intricate VMA merging logic.

Understanding these mechanisms equips developers and system engineers with the knowledge to reason about virtual memory behavior, performance implications, and kernel‑level debugging of memory‑related issues.
