Unveiling Linux mmap: From Virtual Memory to Page Tables and Huge Pages
This article provides an in‑depth exploration of the Linux mmap system call, covering its role in virtual memory management, page table structures, various mapping types (anonymous, file‑backed, shared, private), flag options, and advanced concepts such as huge pages and transparent huge pages, with kernel‑level diagrams and code examples.
Based on Linux kernel 5.4 source, this article explains the complex mmap system call and its impact on the entire memory management subsystem, including page tables, file systems, and page‑fault handling.
Readers previously asked for a systematic introduction to memory‑mapping (mmap). Although the syscall looks simple, it actually triggers a cascade of operations across virtual memory, file systems, and page tables.
Before diving in, the author poses three questions to guide the discussion:
How are virtual and physical memory created, and how does the kernel allocate virtual memory?
What exactly is being mapped when we map anonymous pages versus file pages?
When does the kernel build the full four‑level page‑table hierarchy for a process?
1. Detailed mmap System Call
#include <sys/mman.h>
void* mmap(void* addr, size_t length, int prot, int flags, int fd, off_t offset);
// Kernel implementation:
SYSCALL_DEFINE6(mmap, unsigned long addr, unsigned long len, unsigned long prot, unsigned long flags, unsigned long fd, unsigned long off)
mmap maps a region of the process's virtual address space to either anonymous physical pages or a region of a file on disk. The virtual region resides in the file-mapping and anonymous-mapping area of the process's address space.
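Two parameters determine where the mapping is placed and how large it is:
addr : a hint for the starting virtual address; if NULL, the kernel chooses the address.
length : the size of the region in bytes; the kernel rounds it up to a multiple of PAGE_SIZE (4 KB).
With MAP_FIXED, addr itself must be aligned to PAGE_SIZE (4 KB); without it, addr is only a hint and the kernel picks a page-aligned address.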
For file mappings, fd specifies the file descriptor and offset specifies the offset within the file (also page‑aligned).
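As a rough illustration of the alignment rules, the sketch below rounds a requested length up to a page multiple before calling mmap. The helper names page_align_up and map_anon are hypothetical and exist only for this example; the kernel performs the same rounding internally, so the explicit helper is only there to make the arithmetic visible.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Round len up to the next multiple of the system page size.
 * (Hypothetical helper; relies on the page size being a power of two.) */
size_t page_align_up(size_t len)
{
    long ps = sysconf(_SC_PAGESIZE);
    assert(ps > 0);
    size_t page = (size_t)ps;
    return (len + page - 1) & ~(page - 1);
}

/* Map at least len bytes of private anonymous memory.
 * addr is NULL, so the kernel chooses the address. Returns NULL on failure. */
void *map_anon(size_t len)
{
    void *p = mmap(NULL, page_align_up(len), PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

The address returned by map_anon is always page-aligned, whatever length was requested.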
Virtual Memory Areas (VMA)
All mappings are represented by struct vm_area_struct (VMA) objects linked in two ways:
A doubly‑linked list ordered by increasing virtual address.
A red‑black tree for fast lookup.
struct mm_struct {
    struct vm_area_struct *mmap;   /* list of VMAs */
    struct rb_root mm_rb;          /* red-black tree of VMAs */
};
struct vm_area_struct {
    struct vm_area_struct *vm_next, *vm_prev; /* list links */
    struct rb_node vm_rb;          /* tree node */
    unsigned long vm_start;        /* start address */
    unsigned long vm_end;          /* end address (first byte after) */
    struct file *vm_file;          /* associated file, NULL for anonymous */
    unsigned long vm_pgoff;        /* offset within file (in pages) */
    pgprot_t vm_page_prot;         /* protection bits */
    unsigned long vm_flags;        /* mapping flags */
};
The kernel creates a VMA when mmap is called, fills the fields based on the arguments, and returns the chosen virtual address.
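The VMAs of a process can be observed from user space through /proc/self/maps, where each line corresponds to one vm_area_struct. The following is a minimal sketch, assuming /proc is mounted; find_vma here is a hypothetical user-space helper for this example, not the kernel function of the same name.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>

/* Scan /proc/self/maps for the region that contains addr.
 * On success stores the range [start, end) and returns 0; otherwise -1. */
int find_vma(const void *addr, unsigned long *start, unsigned long *end)
{
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f)
        return -1;

    unsigned long a = (unsigned long)addr, s, e;
    char line[512];
    int rc = -1;
    while (fgets(line, sizeof line, f)) {
        /* Each line begins with "start-end perms offset dev inode [path]". */
        if (sscanf(line, "%lx-%lx", &s, &e) == 2 && a >= s && a < e) {
            assert(s < e);          /* vm_start is always below vm_end */
            *start = s;
            *end = e;
            rc = 0;
            break;
        }
    }
    fclose(f);
    return rc;
}
```

The range reported for a fresh mapping can be larger than the requested length, because the kernel may merge adjacent VMAs that have identical attributes.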
Protection Flags (prot)
#define PROT_READ 0x1 /* page can be read */
#define PROT_WRITE 0x2 /* page can be written */
#define PROT_EXEC 0x4 /* page can be executed */
#define PROT_NONE 0x0 /* page cannot be accessed */
PROT_READ: the underlying physical memory is readable.
PROT_WRITE: the underlying physical memory is writable.
PROT_EXEC: the region contains executable code (e.g., the .text segment).
PROT_NONE: no access at all; used for guard pages or to reserve address space.
The mprotect system call can change the permissions of any virtual memory region in the process's address space at run time.
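A small sketch of changing protection after the fact: the page is mapped read-write, filled, and then downgraded to read-only with mprotect. The function name make_readonly_page is hypothetical and used only for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Copy msg into a fresh page, then drop write permission.
 * Returns the read-only page, or NULL on failure. */
char *make_readonly_page(const char *msg)
{
    long ps = sysconf(_SC_PAGESIZE);
    assert(ps > 0);
    size_t len = (size_t)ps;

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    /* The page starts zero-filled, so copying at most len - 1 bytes
     * leaves the string NUL-terminated. */
    strncpy(p, msg, len - 1);

    if (mprotect(p, len, PROT_READ) != 0) {
        munmap(p, len);
        return NULL;
    }
    return p;   /* reads still work; a write would now raise SIGSEGV */
}
```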
Mapping Flags (flags)
#define MAP_FIXED 0x10 /* interpret addr exactly */
#define MAP_ANONYMOUS 0x20 /* don't use a file */
#define MAP_SHARED 0x01 /* share changes */
#define MAP_PRIVATE 0x02 /* changes are private */
#define MAP_LOCKED 0x2000 /* pages are locked */
#define MAP_POPULATE 0x008000 /* pre‑fault page tables */
#define MAP_HUGETLB 0x040000 /* use huge pages */
Key flag combinations produce four fundamental mapping types:
Private anonymous mapping ( MAP_PRIVATE | MAP_ANONYMOUS) – used by malloc for large allocations; creates a VMA without any physical memory until a page fault occurs.
Private file mapping ( MAP_PRIVATE with a valid fd) – reads are shared, writes trigger copy‑on‑write, and modifications never reach the underlying file.
Shared file mapping ( MAP_SHARED) – multiple processes map the same file pages; writes are visible to all and are eventually flushed to disk.
Shared anonymous mapping ( MAP_SHARED | MAP_ANONYMOUS) – used between parent and child after fork(); implemented via a temporary file in tmpfs so that all participants share the same physical page.
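The four combinations can be summarized as a small classifier over the flags argument. This is only an illustrative sketch: the enum, classify_mapping and kind_name are hypothetical names, and MAP_SHARED_VALIDATE is deliberately not handled.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/mman.h>

enum map_kind { PRIVATE_ANON, PRIVATE_FILE, SHARED_FILE, SHARED_ANON, INVALID_KIND };

/* Derive the mapping type from the flags passed to mmap. */
enum map_kind classify_mapping(int flags)
{
    int anon = (flags & MAP_ANONYMOUS) != 0;

    switch (flags & MAP_TYPE) {          /* MAP_TYPE masks the sharing type */
    case MAP_PRIVATE:
        return anon ? PRIVATE_ANON : PRIVATE_FILE;
    case MAP_SHARED:
        return anon ? SHARED_ANON : SHARED_FILE;
    default:
        return INVALID_KIND;             /* neither, or an unhandled type */
    }
}

/* Human-readable name for a mapping type. */
const char *kind_name(enum map_kind k)
{
    static const char *const names[] = {
        "private anonymous", "private file", "shared file",
        "shared anonymous", "invalid"
    };
    assert((unsigned)k <= INVALID_KIND);
    return names[k];
}
```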
2. Private Anonymous Mapping
When a process first accesses an address inside a VMA that has no physical page behind it yet, the MMU raises a page fault. The kernel allocates a zero-filled physical page, installs it in the page table, and resumes execution. The page is not backed by any file.
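Demand paging can be observed from user space with mincore, which reports whether a page is currently resident. The sketch below is an approximation under the assumption that mincore is available and reports per-page residency; page_resident and touch_anon_page are hypothetical helper names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <unistd.h>

/* Returns 1 if the page at p is resident in RAM, 0 if not, -1 on error. */
int page_resident(void *p)
{
    unsigned char vec = 0;
    long ps = sysconf(_SC_PAGESIZE);
    assert(ps > 0);
    if (mincore(p, (size_t)ps, &vec) != 0)
        return -1;
    return vec & 1;
}

/* Map one anonymous page and record residency before and after the first
 * write. Returns the value of an untouched byte (0 for a zero-filled page),
 * or -1 if the mapping could not be created. */
int touch_anon_page(int *before, int *after)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    *before = page_resident(p);   /* usually 0: only the VMA exists so far */
    p[0] = 42;                    /* first write triggers the page fault */
    *after = page_resident(p);

    int untouched = p[1];
    munmap(p, len);
    return untouched;
}
```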
#include <unistd.h>
int execve(const char* filename, char* const argv[], char* const envp[]);
During execve, the kernel discards the old address space, creates new VMAs for the ELF .text and .data sections using private file mappings, and allocates fresh anonymous VMAs for BSS, heap, and stack.
3. Private File Mapping
Mapping a file with MAP_PRIVATE creates a VMA linked to struct file and struct inode. The kernel first looks for the file’s page in the page cache; if absent, it allocates a page, reads the file block, and fills the page. Subsequent reads hit the cache.
struct ext4_inode {
__le16 i_mode; /* file mode */
__le32 i_blocks_lo; /* block count */
__le32 i_block[EXT4_N_BLOCKS]; /* block pointers */
};
On a write, the kernel performs copy-on-write: it allocates a new private page, copies the cached data, updates the PTE to point to the new page, and marks it writable. The modification stays private and is never written back to the file.
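The copy-on-write behavior can be checked with a temporary file: write through a MAP_PRIVATE mapping, then read the file back with pread. This is a minimal sketch that assumes /tmp is writable; the function name and the file name template are made up for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Write through a MAP_PRIVATE mapping, then re-read the file.
 * Returns 1 if the file kept its original bytes while the mapping shows
 * the new byte, 0 if not, -1 on setup error. */
int private_write_stays_private(void)
{
    char path[] = "/tmp/mmap_private_XXXXXX";   /* hypothetical temp name */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                                /* file lives until close */
    if (write(fd, "hello", 5) != 5) {
        close(fd);
        return -1;
    }

    char *p = mmap(NULL, 5, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }
    assert(memcmp(p, "hello", 5) == 0);  /* reads come from the page cache */
    p[0] = 'J';                          /* write fault: page is copied */

    char buf[5] = {0};
    ssize_t n = pread(fd, buf, 5, 0);
    int ok = (n == 5 && memcmp(buf, "hello", 5) == 0 && p[0] == 'J');

    munmap(p, 5);
    close(fd);
    return ok;
}
```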
4. Shared File Mapping
With MAP_SHARED, the VMA's vm_flags include the shared flag (VM_SHARED). All processes map the same page-cache pages, so reads and writes are visible to every participant. Dirty pages are eventually flushed to disk by the kernel's write-back threads.
#define MAP_SHARED 0x01 /* Share changes */
Kernel parameters under /proc/sys/vm (e.g., dirty_writeback_centisecs, dirty_ratio) control when dirty pages are written back.
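For contrast with the private case, a write through a MAP_SHARED mapping lands in the page cache and is visible when the file is read back. The sketch assumes /tmp is writable; msync with MS_SYNC is added here to force write-back rather than wait for the kernel, and the function name is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Write through a MAP_SHARED mapping, then re-read the file.
 * Returns 1 if the file shows the new byte, 0 if not, -1 on setup error. */
int shared_write_reaches_file(void)
{
    char path[] = "/tmp/mmap_shared_XXXXXX";    /* hypothetical temp name */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (write(fd, "hello", 5) != 5) {
        close(fd);
        return -1;
    }

    char *p = mmap(NULL, 5, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }
    assert(p[0] == 'h');
    p[0] = 'J';                          /* dirties the page-cache page */
    int synced = msync(p, 5, MS_SYNC);   /* flush now instead of waiting */

    char buf[5] = {0};
    ssize_t n = pread(fd, buf, 5, 0);
    int ok = (synced == 0 && n == 5 && memcmp(buf, "Jello", 5) == 0);

    munmap(p, 5);
    close(fd);
    return ok;
}
```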
5. Shared Anonymous Mapping
The kernel implements this with an anonymous file in its internal shmem/tmpfs layer (the file is labeled dev/zero inside the kernel; no user-visible mount point is involved). The VMA's vm_file points to this file, allowing multiple processes (typically a parent and its children after fork()) to share the same physical pages. The mapping is created with MAP_SHARED | MAP_ANONYMOUS and fd = -1.
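A minimal sketch of parent-child sharing: the mapping is created before fork(), the child stores a value, and the parent reads it after the child exits. The function name child_to_parent is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* The child stores value in a shared anonymous page; the parent reads it.
 * Returns the value seen by the parent, or -1 on error. */
int child_to_parent(int value)
{
    int *slot = mmap(NULL, sizeof *slot, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (slot == MAP_FAILED)
        return -1;
    assert(*slot == 0);              /* fresh pages are zero-filled */

    pid_t pid = fork();
    if (pid < 0) {
        munmap(slot, sizeof *slot);
        return -1;
    }
    if (pid == 0) {                  /* child: same physical page */
        *slot = value;
        _exit(0);
    }

    int seen = -1;
    if (waitpid(pid, NULL, 0) == pid)
        seen = *slot;                /* parent observes the child's write */
    munmap(slot, sizeof *slot);
    return seen;
}
```

With MAP_PRIVATE in place of MAP_SHARED, the child's write would go to its own copy of the page and the parent would still read 0.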
6. Additional Flag Values
Beyond the core flags, useful options include:
MAP_LOCKED: locks the pages in RAM so they cannot be swapped out.
MAP_POPULATE: pre-faults the page tables at mmap time, so later accesses do not take page faults.
MAP_HUGETLB: requests standard hugetlb pages. It is typically combined with MAP_ANONYMOUS; file-backed huge pages go through a hugetlbfs mount instead.
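The effect of MAP_POPULATE can be approximated by counting resident pages right after mmap returns. This sketch assumes mincore reports per-page residency; resident_after_mmap is a hypothetical helper, and the 1024-page cap is an arbitrary limit chosen for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map npages anonymous pages with the given extra flags and count how many
 * are resident immediately afterwards. Returns -1 on error. */
long resident_after_mmap(size_t npages, int extra_flags)
{
    long ps = sysconf(_SC_PAGESIZE);
    assert(ps > 0 && npages > 0 && npages <= 1024);
    size_t len = npages * (size_t)ps;
    unsigned char vec[1024];

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | extra_flags, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    long count = -1;
    if (mincore(p, len, vec) == 0) {
        count = 0;
        for (size_t i = 0; i < npages; i++)
            count += vec[i] & 1;     /* bit 0 set means the page is resident */
    }
    munmap(p, len);
    return count;
}
```

Calling it once with MAP_POPULATE and once with 0 shows the difference: the populated mapping is expected to have its pages resident already, while the plain mapping normally has none until they are touched. MAP_POPULATE is best-effort, so the count is not guaranteed.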
7. Huge Page Mappings
Two kinds of huge pages exist:
Standard hugetlb pages – allocated from a pre‑reserved pool defined at boot via kernel command‑line parameters ( hugepagesz, hugepages, default_hugepagesz) or adjusted at runtime via /proc/sys/vm/nr_hugepages and nr_overcommit_hugepages. They are never swapped.
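Transparent Huge Pages (THP) – allocated dynamically by the kernel, either at page-fault time or later when the kernel thread khugepaged collapses small pages into a huge page. Through /sys/kernel/mm/transparent_hugepage/enabled, THP can be enabled globally ( always), disabled ( never), or restricted to regions that opt in via madvise(..., MADV_HUGEPAGE). THP pages may be swapped; on x86-64 a THP is 2 MiB.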
To use hugetlb pages directly with mmap:
void* addr = mmap(NULL, length, PROT_READ|PROT_WRITE,
                  MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB,
                  -1, 0);
For file-backed huge pages, mount a hugetlbfs filesystem (e.g., mount -t hugetlbfs none /mnt/huge) and mmap files under that mount point; the kernel automatically uses huge pages.
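An anonymous MAP_HUGETLB request fails when the hugetlb pool cannot satisfy it, so callers often fall back to regular pages. The following is a sketch of that pattern under the assumption of a 2 MiB default huge page size; alloc_prefer_huge and HUGE_2M are hypothetical names.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#define HUGE_2M (2UL * 1024 * 1024)     /* assumed default huge page size */

/* Try the hugetlb pool first and fall back to regular pages on failure.
 * *used_huge is set to 1 when hugetlb pages back the mapping, else 0.
 * The length is rounded up to a 2 MiB multiple. Returns NULL on failure. */
void *alloc_prefer_huge(size_t len, int *used_huge)
{
    assert(len > 0 && used_huge != NULL);
    len = (len + HUGE_2M - 1) & ~(HUGE_2M - 1);

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    *used_huge = (p != MAP_FAILED);

    if (p == MAP_FAILED)                /* pool empty or not configured */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}
```

The caller has to pass the same rounded-up length to munmap, since a hugetlb mapping can only be unmapped in multiples of the huge page size.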
8. Summary
The article covered five major aspects of mmap:
Private anonymous mapping – for heap, BSS, and stack allocation.
Private file mapping – read‑shared, copy‑on‑write, no disk write‑back.
Shared file mapping – true sharing, dirty pages flushed to disk.
Shared anonymous mapping – parent‑child communication via tmpfs anonymous files.
Huge‑page mappings – both standard hugetlb pages and transparent huge pages, their configuration, and usage patterns.
Future articles will dive into the actual kernel source implementing mmap.
Bin's Tech Cabin
Original articles dissecting source code and sharing personal tech insights. A modest space for serious discussion, free from noise and bureaucracy.
