Why Is the First memset So Slow? Exploring Page Faults, TLB, and Huge Pages
The article explains why the initial memset on a newly‑allocated 1 GB buffer is much slower than subsequent calls, detailing how page‑fault handling, TLB misses, and the MMU’s multi‑level page tables cause overhead, and demonstrates optimizations such as using huge pages, MAP_POPULATE, and pre‑mapping to eliminate the slowdown.
Benchmark and Observation
A program allocates a 1 GB buffer with malloc and calls memset three times, measuring each call. The first call takes ~0.66 s, while the second and third take ~0.16 s.
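The benchmark described above can be sketched roughly as follows. This is a minimal illustration rather than the article's original program: the names now_seconds and time_memset_passes are hypothetical, and the inline-asm barrier assumes GCC or Clang.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Current reading of the monotonic clock, in seconds. */
static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/*
 * Allocate `size` bytes with malloc, memset the buffer `passes` times,
 * and store the duration of each pass in seconds[0..passes-1].
 * Returns 0 on success, -1 if the allocation fails.
 * (Hypothetical helper, not taken from the original benchmark.)
 */
int time_memset_passes(size_t size, int passes, double *seconds) {
    assert(size > 0 && passes > 0 && seconds != NULL);
    unsigned char *buf = malloc(size);
    if (buf == NULL)
        return -1;
    for (int i = 0; i < passes; i++) {
        double start = now_seconds();
        memset(buf, i + 1, size);
        /* Compiler barrier (GCC/Clang): keeps the memset from being elided. */
        __asm__ volatile("" : : "r"(buf) : "memory");
        seconds[i] = now_seconds() - start;
    }
    free(buf);
    return 0;
}
```

Calling time_memset_passes((size_t)1 << 30, 3, t) with a three-element array t corresponds to the 1 GB setup; the absolute timings depend on the machine and kernel configuration.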
Why the First memset Is Slow
When malloc reserves virtual memory, the kernel does not allocate physical pages immediately. The first write to each 4 KB page triggers a page‑fault; the kernel allocates a physical page, updates the page tables, and resumes the process. For a 1 GB region this means roughly 262 144 page‑faults on the first pass. Subsequent memset calls touch pages that are already backed, so no faults occur and the operation is much faster.
Measuring Page Faults
Measured with perf (for example, perf stat -e page-faults), a single memset generates about 262 199 page‑faults; the extra ~55 come from code, data and stack pages. Running all three memset calls yields the same total, confirming that only the first pass incurs the fault overhead.
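The same effect can also be observed from inside the process. The sketch below is an alternative to perf, not the article's method: it reads the minor-fault counter with getrusage before and after each memset. The names minor_faults and count_memset_faults are hypothetical, and the barrier assumes GCC or Clang.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

/* Minor page faults (served without disk I/O) taken by this process so far. */
static long minor_faults(void) {
    struct rusage ru;
    int rc = getrusage(RUSAGE_SELF, &ru);
    assert(rc == 0);
    (void)rc;
    return ru.ru_minflt;
}

/*
 * memset a fresh malloc buffer twice and report how many minor faults
 * each pass triggered. Returns 0 on success, -1 on allocation failure.
 * (Hypothetical helper for illustration.)
 */
int count_memset_faults(size_t size, long *first_pass, long *second_pass) {
    assert(first_pass != NULL && second_pass != NULL);
    unsigned char *buf = malloc(size);
    if (buf == NULL)
        return -1;

    long before = minor_faults();
    memset(buf, 1, size);
    __asm__ volatile("" : : "r"(buf) : "memory"); /* keep the memset */
    long middle = minor_faults();
    memset(buf, 2, size);
    __asm__ volatile("" : : "r"(buf) : "memory");
    long after = minor_faults();

    free(buf);
    *first_pass = middle - before;
    *second_pass = after - middle;
    return 0;
}
```

On a typical Linux system the first pass should account for nearly all of the faults and the second for close to none, though the exact counts vary with the page size in use and with whether transparent huge pages are enabled.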
Virtual‑Address Translation Basics
On x86‑64 a 48‑bit virtual address is split into five fields: four 9‑bit indices (PML4, PDPT, PDT, PT) and a 12‑bit page offset. The Memory Management Unit (MMU) walks a four‑level page‑table hierarchy to translate the virtual address to a physical page. Because each walk may require several memory accesses, the CPU caches recent translations in the Translation Lookaside Buffer (TLB). A TLB hit avoids the full walk; a miss forces the MMU to traverse the tables, adding latency. The TLB is small (typically tens of entries in the first level and on the order of a thousand in the second level on current x86 cores), so entries are evicted when many distinct pages are accessed.
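To make the field layout concrete, here is a small sketch that splits an address into the four table indices and the page offset for 4 KB pages. The struct va_fields and the function split_va are hypothetical names, and the sketch only accepts lower-half addresses that fit in 48 bits.

```c
#include <assert.h>
#include <stdint.h>

/* The five fields of a 48-bit x86-64 virtual address (4 KB pages). */
struct va_fields {
    unsigned pml4;   /* bits 47..39: index into the PML4 table  */
    unsigned pdpt;   /* bits 38..30: index into the PDPT        */
    unsigned pdt;    /* bits 29..21: index into the PDT         */
    unsigned pt;     /* bits 20..12: index into the page table  */
    unsigned offset; /* bits 11..0 : byte offset within a page  */
};

/* Split a virtual address into its table indices and page offset. */
struct va_fields split_va(uint64_t va) {
    /* This sketch handles only addresses that fit in 48 bits. */
    assert((va >> 48) == 0);
    struct va_fields f;
    f.pml4   = (unsigned)((va >> 39) & 0x1FF); /* 9 bits */
    f.pdpt   = (unsigned)((va >> 30) & 0x1FF); /* 9 bits */
    f.pdt    = (unsigned)((va >> 21) & 0x1FF); /* 9 bits */
    f.pt     = (unsigned)((va >> 12) & 0x1FF); /* 9 bits */
    f.offset = (unsigned)(va & 0xFFF);         /* 12 bits */
    return f;
}
```

Each 9-bit index selects one of 512 entries in its table, which is why a page-table walk touches up to four tables before reaching the data page.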
Huge Pages as an Optimization
Using larger pages (2 MB or 1 GB) reduces the number of pages, page‑table entries, and TLB pressure. For a 4 GB region:
4 KB pages → 1 048 576 entries
2 MB pages → 2 048 entries
1 GB pages → 4 entries
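The entry counts above are plain division of the region size by the page size. A tiny sketch of that arithmetic follows; the function name pages_needed and the size macros are illustrative.

```c
#include <assert.h>
#include <stdint.h>

#define KIB UINT64_C(1024)
#define MIB (KIB * 1024)
#define GIB (MIB * 1024)

/* Number of pages (rounded up) needed to map `region` bytes. */
uint64_t pages_needed(uint64_t region, uint64_t page_size) {
    assert(page_size > 0);
    return (region + page_size - 1) / page_size;
}
```

For example, pages_needed(4 * GIB, 2 * MIB) evaluates to 2048, and pages_needed(1 * GIB, 4 * KIB) gives the 262 144 pages behind the fault count of the 1 GB benchmark.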
Linux supports two kinds of huge pages:
Standard huge pages – must be pre‑allocated via /proc/sys/vm/nr_hugepages.
Transparent Huge Pages (THP) – 2 MB pages that the kernel allocates on demand without pre‑reservation; unlike standard huge pages, they can be swapped out.
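For the THP route, a program can hint that a region should be backed by huge pages using madvise with MADV_HUGEPAGE. The sketch below is an illustration, not part of the article's benchmark: alloc_thp_hint is a hypothetical name, the hint is best-effort, and it has no effect when /sys/kernel/mm/transparent_hugepage/enabled is set to never.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>

#define HUGE_2MB ((size_t)2 << 20)

/*
 * Allocate a 2 MB-aligned buffer (size rounded up to a multiple of 2 MB)
 * and ask the kernel to back it with transparent huge pages.
 * *advised is set to 1 if madvise accepted the hint, 0 otherwise.
 * Returns NULL on allocation failure. Release with free().
 */
void *alloc_thp_hint(size_t size, int *advised) {
    assert(advised != NULL);
    size_t rounded = (size + HUGE_2MB - 1) / HUGE_2MB * HUGE_2MB;
    void *buf = NULL;
    if (posix_memalign(&buf, HUGE_2MB, rounded) != 0)
        return NULL;
    /* Only a hint: the kernel may still use 4 KB pages. */
    *advised = (madvise(buf, rounded, MADV_HUGEPAGE) == 0);
    return buf;
}
```

Whether huge pages were actually used can be checked afterwards through the AnonHugePages field in /proc/meminfo.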
Experiment with 2 MB Huge Pages
# Disable THP
echo never > /sys/kernel/mm/transparent_hugepage/enabled
# Verify huge‑page size
cat /proc/meminfo | grep Hugepagesize
# Reserve 1024 huge pages (2 GB total)
echo 1024 > /proc/sys/vm/nr_hugepages
Recompile the benchmark to use mmap with MAP_HUGETLB (or ensure the allocation lands on a pre‑reserved huge page). Results:
First memset time drops from ~0.66 s to ~0.24 s.
Page‑fault count falls from ~262 199 to ~568.
dTLB‑load‑misses drop from ~2.57 M to ~27 k.
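The allocation change behind these numbers might look like the sketch below. The function name map_buffer is hypothetical, and the fallback to ordinary pages when no huge pages are reserved is an added assumption so that the code still works on an unconfigured machine.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map `size` bytes of anonymous memory, preferring pre-reserved huge pages.
 * `size` should be a multiple of the huge-page size (2 MB here).
 * *used_huge is set to 1 if the MAP_HUGETLB mapping succeeded, or to 0 if
 * the function fell back to regular pages (the fallback is an assumption
 * of this sketch). Returns NULL if both attempts fail.
 * Release the buffer with munmap(addr, size).
 */
void *map_buffer(size_t size, int *used_huge) {
    assert(used_huge != NULL);
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;
    void *addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      flags | MAP_HUGETLB, -1, 0);
    *used_huge = (addr != MAP_FAILED);
    if (addr == MAP_FAILED) {
        /* No huge pages available: fall back to the default page size. */
        addr = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
    }
    return addr == MAP_FAILED ? NULL : addr;
}
```

Replacing the malloc call in the benchmark with map_buffer((size_t)1 << 30, &used_huge) and checking used_huge shows whether the run really exercised huge pages.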
Pre‑populating Pages with MAP_POPULATE
Even with huge pages a few faults remain. Adding MAP_POPULATE forces the kernel to fault‑in all pages during mmap, eliminating runtime faults.
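/* Map and pre-fault the whole region on huge pages in one call. */
void *addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE | MAP_HUGETLB,
                  -1, 0);
if (addr == MAP_FAILED)  /* mmap reports failure with MAP_FAILED, not NULL */
    perror("mmap");
With MAP_POPULATE the three memset calls run at nearly identical speed and page‑faults drop to the baseline ~55 (only code/data/stack pages).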
Additional Practical Optimizations
Improve locality – sequential accesses are faster than random accesses.
Prefetch or warm‑up memory before entering a critical path.
Disable swap for latency‑sensitive workloads.
Consider NUMA effects and bind threads to the memory node they use most.
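The warm-up idea from the list above can be sketched as a loop that writes to one byte in every page, so the faults happen before the critical path instead of inside it. The function name warm_up is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Touch every page of buf[0..size) with a write so that the kernel backs
 * the whole range now. Buffer contents are left unchanged.
 * Returns the number of page-sized strides touched.
 */
size_t warm_up(volatile unsigned char *buf, size_t size) {
    assert(buf != NULL || size == 0);
    long ps = sysconf(_SC_PAGESIZE);
    size_t page = ps > 0 ? (size_t)ps : 4096; /* fall back to 4 KB */
    size_t touched = 0;
    for (size_t off = 0; off < size; off += page) {
        buf[off] = buf[off]; /* volatile read and write-back forces a write fault */
        touched++;
    }
    if (size > 0)
        buf[size - 1] = buf[size - 1]; /* covers a trailing page of an unaligned buffer */
    return touched;
}
```

Because the access is a write, each page gets its own physical frame rather than being mapped to the shared zero page, which a read-only touch would not guarantee.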
Conclusion
The large performance gap of the first memset is caused by lazy page allocation and TLB misses when the kernel maps a huge number of 4 KB pages. Switching to 2 MB huge pages dramatically reduces both page‑fault count and TLB pressure. Adding MAP_POPULATE moves the remaining faults into the mmap call itself, so all three memset calls run at nearly the same speed.
Liangxu Linux
Liangxu, a self‑taught IT professional now working as a Linux development engineer at a Fortune 500 multinational, shares extensive Linux knowledge: fundamentals, applications, tools, plus Git, databases, Raspberry Pi, etc.
