Understanding x86-64 Memory Paging Mechanism and Its Implementation
This article explains the role, operation, and different modes of x86‑64 memory paging, describes how page directories and tables are set up in assembly, shows the relevant control‑register bits, and provides practical code examples for enabling and managing paging in a protected‑mode kernel.
In the x86‑64 architecture, the memory paging mechanism acts as a bridge between virtual and physical addresses by dividing linear (virtual) addresses into fixed‑size pages and mapping each page to a physical address.
Paging is available only in protected mode (CR0.PE = 1). Once paging is enabled, programs operate on linear addresses rather than physical addresses, much as a player in a virtual-reality game sees only the virtual world.
Paging is controlled by the PG bit (bit 31) of the CR0 register. When CR0.PG = 0, paging is disabled and linear addresses equal physical addresses; when CR0.PG = 1, paging is enabled and address translation occurs.
Benefits of memory paging include:
Virtualization – each process gets an independent address space, improving security and isolation.
Memory sharing – multiple processes can share the same physical page, saving RAM.
Lazy loading – pages are loaded only when accessed, reducing initialization time and memory usage.
Memory protection – pages can be marked read‑only or non‑executable to protect code and data.
1. Role of Memory Paging
Memory paging splits physical memory into fixed‑size pages (usually 4 KB) and maps virtual memory to these pages. Its main functions are virtual memory management, memory protection, memory sharing, flexible memory management (allocation, reclamation, page replacement, compression), and reduction of external fragmentation.
1.1 First‑Level Page Table
Before paging, a linear address is obtained by adding a segment base (from a selector) to an offset inside the segment. After paging is enabled, the linear address no longer equals the physical address; the CPU must look up the physical address in the page tables.
Example: assume the segment base is 0 and the offset is 0x1234.
1.2 Second‑Level Page Table
Although the single-level table is useful for illustration, 32-bit x86 systems use a two-level scheme because a single-level table has several drawbacks:
It can hold up to 1 Mi entries (4 MiB total size).
All entries must be pre‑created, consuming a large amount of memory.
Each process needs its own page tables, which quickly becomes space‑intensive.
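Therefore, a page directory is added on top: each page-directory entry points to one second-level page table, and page tables need to exist only for the address ranges that are actually in use.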
Each process has its own page tables, allowing the same virtual address in different processes to map to different physical addresses, achieving isolation and reducing fragmentation.
1.3 Page‑Table and Directory Entry Flags
P (Present) – 1 if the page is in physical memory.
RW (Read/Write) – 1 if writable.
US (User/Supervisor) – 1 if accessible from all privilege levels.
PWT (Page‑Level Write‑Through) – usually 0.
PCD (Page‑Level Cache Disable) – usually 0.
A (Accessed) – set by the CPU when the page is accessed.
D (Dirty) – set when the page is written.
PAT (Page Attribute Table) – usually 0.
G (Global) – 1 if the translation is global; with CR4.PGE = 1, a global entry is not flushed from the TLB when CR3 is reloaded.
AVL (Available) – reserved for OS use.
The page‑directory base address is stored in CR3 (also called the Page Directory Base Register).
2. Enabling and Disabling Paging
Enabling paging involves three steps:
Prepare the page directory and page tables.
Load the physical address of the page directory into CR3 (the directory is 4 KB aligned, so only the high 20 bits of the address are significant).
Set the PG bit (bit 31) of CR0.
Paging works only in protected mode (CR0.PE = 1). The PG bit of CR0 determines whether paging is active.
; os/src/boot/loader.s
; 32‑bit protected‑mode entry point
[bits 32]
p_mode_start:
mov ax, SELECTOR_DATA
mov ds, ax
mov es, ax
mov ss, ax
mov esp, LOADER_STACK_TOP
mov ax, SELECTOR_VIDEO
mov gs, ax
; write a test string to video memory
mov byte [gs:320], 'M'
mov byte [gs:322], 'A'
mov byte [gs:324], 'I'
mov byte [gs:326], 'N'
call setup_page ; create page directory & tables
sgdt [gdt_ptr] ; save original GDT
mov ebx, [gdt_ptr + 2]
or dword [ebx + 0x18 + 4], 0xc0000000 ; adjust video segment base
add dword [gdt_ptr + 2], 0xc0000000 ; map GDT to high kernel address
add esp, 0xc0000000 ; map stack pointer
mov eax, PAGE_DIR_TABLE_POS
mov cr3, eax ; load page directory address
mov eax, cr0
or eax, 0x80000000 ; set CR0.PG (enable paging)
mov cr0, eax
lgdt [gdt_ptr] ; reload GDT with new base
; write "Virtual" to video memory to show mapping succeeded
mov byte [gs:320], 'V'
mov byte [gs:322], 'i'
mov byte [gs:324], 'r'
mov byte [gs:326], 't'
mov byte [gs:328], 'u'
mov byte [gs:330], 'a'
mov byte [gs:332], 'l'
jmp $ ; halt
setup_page:
mov ecx, 4096
mov esi, 0
.clear_page_dir:
mov byte [PAGE_DIR_TABLE_POS + esi], 0
inc esi
loop .clear_page_dir
; create first page directory entry (maps first 4 MiB)
mov eax, PAGE_DIR_TABLE_POS
add eax, 0x1000
mov ebx, eax
or eax, PG_US_U | PG_RW_W | PG_P
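mov [PAGE_DIR_TABLE_POS + 0x0], eax ; PDE 0: low 4 MiB
mov [PAGE_DIR_TABLE_POS + 0xc00], eax ; PDE 768 (0xc00 / 4): 0xC0000000 and up
sub eax, 0x1000 ; eax = page directory address | flags
mov [PAGE_DIR_TABLE_POS + 4092], eax ; PDE 1023 points back at the page directory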
; create first page table (maps first 1 MiB)
mov ecx, 256
mov esi, 0
mov edx, PG_US_U | PG_RW_W | PG_P
.create_pte:
mov [ebx+esi*4], edx
add edx, 4096
inc esi
loop .create_pte
; create remaining kernel page‑directory entries
mov eax, PAGE_DIR_TABLE_POS
add eax, 0x2000
or eax, PG_US_U | PG_RW_W | PG_P
mov ebx, PAGE_DIR_TABLE_POS
mov ecx, 254
mov esi, 769
.create_kernel_pde:
mov [ebx+esi*4], eax
inc esi
add eax, 0x1000
loop .create_kernel_pde
ret
Macro definitions used in boot.inc:
PAGE_DIR_TABLE_POS equ 0x100000
PG_P equ 1b
PG_RW_R equ 00b
PG_RW_W equ 10b
PG_US_S equ 000b
PG_US_U equ 100b
The resulting page-directory and page-table layout can be visualized as follows:
Two directory entries point to the first page table: entry 0 (low addresses) and entry 768 (addresses starting with 0xC0000000). This mapping lets the kernel reside below the 1 MiB physical limit while being accessed via the high virtual address range.
3. Four Paging Modes Supported by Intel‑64
32‑bit paging: CR4.PAE = 0
PAE paging: CR4.PAE = 1, IA32_EFER.LME = 0
4‑level paging: CR4.PAE = 1, IA32_EFER.LME = 1, CR4.LA57 = 0
5‑level paging: CR4.PAE = 1, IA32_EFER.LME = 1, CR4.LA57 = 1
The current mode is determined by the combination of CR4.PAE, CR4.LA57, and IA32_EFER.LME.
3.1 32‑Bit Paging
In 32‑bit mode, CR3 holds the physical address of the page directory. Linear address translation follows the chain CR3 → PDE → PTE → physical page. Each page is 4 KB, and the linear address is split into a 10‑bit page‑directory index, a 10‑bit page‑table index, and a 12‑bit offset.
3.2 PAE Paging
Physical Address Extension (PAE) expands the physical address width to up to 52 bits, allowing more than 4 GiB of RAM. Page-table entries become 64 bits wide, and an extra level, the page-directory-pointer table (PDPT), is introduced. The translation chain is CR3 → PDPTE → PDE → PTE → physical page.
3.3 4‑Level Paging
4-level paging adds a top-level PML4 table. The translation chain becomes CR3 → PML4E → PDPTE → PDE → PTE → physical page. It supports 48-bit virtual addresses and up to 52-bit physical addresses, and is used by modern 64-bit operating systems.
3.4 5‑Level Paging
5‑level paging (enabled when CR4.PAE = 1, IA32_EFER.LME = 1, CR4.LA57 = 1) extends the virtual address space to 57 bits, adding a fifth level (PML5). It is intended for workloads requiring extremely large address spaces, such as large‑scale cloud or big‑data platforms.
4. Hierarchical Paging Structure
All paging modes use hierarchical tables, and in most cases each table occupies one 4 KB page. In 32-bit paging each entry is 4 bytes (1024 entries per table). In PAE, 4-level, and 5-level paging each entry is 8 bytes (512 entries per table); the exception is the PAE page-directory-pointer table, which holds only four entries. The linear address is divided into an index part (used to walk the tables) and an offset part (within the final physical page).
4.1 Translation Process
The CPU starts with the address stored in CR3, extracts the highest index bits to locate an entry in the top‑level table, then proceeds down the hierarchy until a page‑frame address is found. If an entry’s Present bit is 0 or a reserved bit is set, a page‑fault exception is raised.
4.2 Reserved Bits and Exceptions
Bits 51:MAXPHYADDR are reserved.
PS bits in PML5E/PML4E are reserved.
If the processor does not support 1‑GiB pages, the PS bit in PDPTE is reserved.
If PDPTE.PS = 1, bits 29:13 are reserved.
If PDE.PS = 1, bits 20:13 are reserved.
If IA32_EFER.NXE = 0, the XD (bit 63) flag is reserved.
5. Practical Use and Case Study
On a Linux system the cpuid utility can be used to query the CPU’s supported address widths. Example output shows a maximum of 39 physical address bits and 48 virtual address bits, indicating support for 4‑level paging.
physical address extensions = true
maximum physical address bits = 0x27 (39)
maximum linear (virtual) address bits = 0x30 (48)
Linux divides the virtual address space as follows: user space occupies 0x0000000000000000‑0x00007FFFFFFFFFFF (the lower 47-bit half), while kernel space occupies 0xFFFF800000000000‑0xFFFFFFFFFFFFFFFF (the upper 47-bit half). The unused "hole" between these ranges consists of non-canonical addresses: with 48-bit virtual addresses, bits 63:47 must all be equal, so no valid address can fall there.
By printing the page‑table hierarchy from the kernel (e.g., in arch/x86/mm/init.c ) one can observe the actual physical addresses used for kernel mappings, verify that the page‑directory entry for the kernel points to the correct page tables, and confirm that the NX bit (bit 63) is set for non‑executable pages.
#include <stdio.h>
#define __AC(X,Y) (X##Y)
#define _AC(X,Y) __AC(X,Y)
#define __PAGE_OFFSET_BASE_L4 _AC(0xffff888000000000, UL)
#define __PAGE_OFFSET __PAGE_OFFSET_BASE_L4
#define PAGE_OFFSET ((unsigned long)__PAGE_OFFSET)
#define __START_KERNEL_map _AC(0xffffffff80000000, UL)
#define __va(x) ((void *)((unsigned long)(x)+PAGE_OFFSET))
static inline unsigned long __phys_addr_nodebug(unsigned long x)
{
unsigned long y = x - __START_KERNEL_map;
x = y + ((x > y) ? 0 : (__START_KERNEL_map - PAGE_OFFSET));
return x;
}
#define __phys_addr(x) __phys_addr_nodebug(x)
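#define __pa(x) __phys_addr((unsigned long)(x))
/* __pa_symbol is not defined elsewhere in this listing; this stand-in assumes phys_base == 0 (no kernel relocation) */
#define __pa_symbol(x) ((unsigned long)(x) - __START_KERNEL_map)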
int main() {
printf("address: 0x%lx\n", __pa(0xffff88800220a000UL));
printf("address: 0x%lx\n", __pa(0xffffffff8220a000UL));
printf("address: 0x%lx\n", __pa_symbol(0xffffffff8220a000UL));
}
Running the program prints the same physical address (0x220a000) for both virtual addresses, demonstrating the kernel's linear-to-physical mapping.
address: 0x220a000
address: 0x220a000
address: 0x220a000
Additional kernel code can walk the page tables and print each level's entries, confirming that the final physical pages are correctly mapped and that the NX flag (bit 63) is set for non-executable pages.
Deepin Linux
Research areas: Windows & Linux platforms, C/C++ backend development, embedded systems and Linux kernel, etc.