
Why Is the First memset So Slow? Exploring Page Faults, TLB, and Huge Pages

The article explains why the initial memset on a newly‑allocated 1 GB buffer is much slower than subsequent calls, detailing how page‑fault handling, TLB misses, and the MMU’s multi‑level page tables cause overhead, and demonstrates optimizations such as using huge pages, MAP_POPULATE, and pre‑mapping to eliminate the slowdown.

Liangxu Linux

Feb 5, 2024


Benchmark and Observation

A program allocates a 1 GB buffer with malloc and calls memset three times, measuring each call. The first call takes ~0.66 s, while the second and third take ~0.16 s.
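The benchmark described above can be sketched roughly as follows. This is a minimal reconstruction, not the article's original program: the function names `now_seconds` and `time_memsets` are hypothetical, and the timing uses `clock_gettime` with `CLOCK_MONOTONIC` as an assumed choice of clock.

```c
#define _POSIX_C_SOURCE 200809L
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Wall-clock seconds read from a monotonic clock. */
static double now_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Allocate `size` bytes with malloc, fill the buffer `rounds` times and
 * store the duration of each fill in out[0..rounds-1].
 * Returns 0 on success, -1 if the allocation fails. */
int time_memsets(size_t size, int rounds, double *out) {
    /* Calling memset through a volatile pointer keeps the compiler from
     * optimizing the fills away. */
    void *(*volatile fill)(void *, int, size_t) = memset;
    char *buf = malloc(size);
    if (buf == NULL)
        return -1;
    for (int i = 0; i < rounds; i++) {
        double start = now_seconds();
        fill(buf, i + 1, size);
        out[i] = now_seconds() - start;
    }
    free(buf);
    return 0;
}
```

Calling `time_memsets((size_t)1 << 30, 3, t)` with a `double t[3]` array and printing the three values reproduces the shape of the experiment. The absolute numbers depend on the machine.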

Why the First memset Is Slow

When malloc reserves virtual memory, the kernel does not allocate physical pages immediately. The first write to each 4 KB page triggers a page fault: the kernel allocates a zeroed physical page, updates the page tables, and resumes the process. For a 1 GB region this means roughly 262 144 page faults (1 GB / 4 KB) on the first pass. Subsequent memset calls touch pages that are already backed, so no faults occur and the operation is much faster.

Measuring Page Faults

Using perf, a run with a single memset generates about 262 199 page faults (the extra ~55 come from code, data, and stack pages). Running all three memset calls yields the same total, confirming that only the first pass incurs the fault overhead.
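The same observation can be made from inside a program. The sketch below is an assumption-laden alternative to perf, not the article's method: it reads the fault counters that `getrusage` reports for the process before and after each fill. The names `faults_so_far` and `count_memset_faults` are hypothetical, and the buffer comes from an anonymous `mmap` so that it is guaranteed to be fresh.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>

/* Page faults (minor + major) charged to this process so far, or -1. */
static long faults_so_far(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_minflt + ru.ru_majflt;
}

/* Map `size` bytes of anonymous memory, fill it twice, and report how many
 * page faults each pass took. Returns 0 on success, -1 on failure. */
int count_memset_faults(size_t size, long *first_pass, long *second_pass) {
    void *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    long before = faults_so_far();
    memset(buf, 1, size);            /* first touch: pages are faulted in */
    long middle = faults_so_far();
    memset(buf, 2, size);            /* pages are already backed */
    long after = faults_so_far();

    munmap(buf, size);
    if (before < 0 || middle < 0 || after < 0)
        return -1;
    *first_pass = middle - before;
    *second_pass = after - middle;
    return 0;
}
```

With 4 KB pages, the first pass would be expected to report about one fault per page and the second pass close to none. If Transparent Huge Pages are enabled, the first-pass count can be much lower.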

Virtual‑Address Translation Basics

On x86‑64 with 4 KB pages, a 48‑bit virtual address is split into five fields: four 9‑bit indices (PML4, PDPT, PDT, and PT) and a 12‑bit page offset. The Memory Management Unit (MMU) walks a four‑level page‑table hierarchy to translate the virtual address to a physical address. Because each walk may require up to four memory accesses, the CPU caches recent translations in the Translation Lookaside Buffer (TLB). A TLB hit avoids the walk; a miss forces the MMU to traverse the tables, adding latency. The TLB is small (its first level typically holds only a few dozen entries), so entries are evicted quickly when many distinct pages are accessed.
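The field split can be made concrete with a few shifts and masks. This is a minimal sketch for the 4 KB page, four-level case only; the struct `va_fields` and the function `split_va` are hypothetical names introduced for illustration.

```c
#include <stdint.h>

/* Indices used by a four-level x86-64 page walk with 4 KB pages. */
struct va_fields {
    unsigned pml4;   /* bits 47..39: index into the PML4 table */
    unsigned pdpt;   /* bits 38..30: index into the PDPT       */
    unsigned pdt;    /* bits 29..21: index into the PDT        */
    unsigned pt;     /* bits 20..12: index into the page table */
    unsigned offset; /* bits 11..0 : byte offset in the page   */
};

/* Split a 48-bit virtual address into its five fields. */
struct va_fields split_va(uint64_t va) {
    struct va_fields f;
    f.pml4   = (unsigned)((va >> 39) & 0x1ff);
    f.pdpt   = (unsigned)((va >> 30) & 0x1ff);
    f.pdt    = (unsigned)((va >> 21) & 0x1ff);
    f.pt     = (unsigned)((va >> 12) & 0x1ff);
    f.offset = (unsigned)(va & 0xfff);
    return f;
}
```

Each 9-bit index selects one of 512 entries in its table, and the 12-bit offset addresses a byte within the 4 KB page. With 2 MB pages the walk stops one level earlier and the low 21 bits become the offset.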

Huge Pages as an Optimization

Using larger pages (2 MB or 1 GB) reduces the number of pages, page‑table entries, and TLB pressure. For a 4 GB region:

4 KB pages → 1 048 576 entries

2 MB pages → 2 048 entries

1 GB pages → 4 entries
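The entry counts above are plain division of the region size by the page size. A small helper makes the arithmetic explicit; `pages_needed` is a hypothetical name.

```c
#include <stdint.h>

/* Number of pages needed to cover `region` bytes, rounding up. */
uint64_t pages_needed(uint64_t region, uint64_t page_size) {
    return (region + page_size - 1) / page_size;
}
```

For a 4 GB region (2^32 bytes), `pages_needed` gives 1 048 576 with a 4 KB page size, 2 048 with 2 MB, and 4 with 1 GB. For the 1 GB benchmark buffer it gives 262 144 pages of 4 KB or 512 pages of 2 MB.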

Linux supports two kinds of huge pages:

Standard huge pages – must be pre‑allocated via /proc/sys/vm/nr_hugepages.

Transparent Huge Pages (THP) – allocated on demand by the kernel as 2 MB pages, with no reservation step, but they can be swapped out.

Experiment with 2 MB Huge Pages

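# Run as root. Disable THP so only explicitly reserved huge pages are used
echo never > /sys/kernel/mm/transparent_hugepage/enabled
# Verify the huge‑page size (2048 kB on a typical x86‑64 system)
grep Hugepagesize /proc/meminfo
# Reserve 1024 huge pages (2 GB total at 2 MB each)
echo 1024 > /proc/sys/vm/nr_hugepages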

Change the benchmark to allocate the buffer with mmap and the MAP_HUGETLB flag instead of malloc (or otherwise ensure the allocation is backed by the pre‑reserved huge pages), then recompile. Results:

First memset time drops from ~0.66 s to ~0.24 s.

Page‑fault count falls from ~262 199 to ~568, roughly the 512 faults needed for a 1 GB buffer at 2 MB per page plus the ~55 baseline.

dTLB‑load‑misses drop from ~2.57 M to ~27 k.
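One way the allocation change might look is sketched below. It is an illustration under assumptions, not the article's benchmark code: `map_buffer` is a hypothetical helper, and the fallback to regular pages when the huge-page pool cannot satisfy the request is an added design choice.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Try to map `size` bytes backed by pre-reserved huge pages. If the kernel
 * refuses (for example because the pool is empty), fall back to regular
 * pages. `size` should be a multiple of the huge-page size.
 * Sets *used_huge to 1 or 0; returns NULL only if both attempts fail. */
void *map_buffer(size_t size, int *used_huge) {
    int flags = MAP_PRIVATE | MAP_ANONYMOUS;
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   flags | MAP_HUGETLB, -1, 0);
    *used_huge = (p != MAP_FAILED);
    if (p == MAP_FAILED)
        p = mmap(NULL, size, PROT_READ | PROT_WRITE, flags, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

Reporting `used_huge` lets the caller tell whether a measurement really ran on huge pages, which matters when comparing fault counts between runs.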

Pre‑populating Pages with MAP_POPULATE

Even with huge pages, one fault per 2 MB page remains on the first pass. Adding MAP_POPULATE makes the kernel fault in all pages during the mmap call itself, so none occur at run time.

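/* Needs <sys/mman.h> and <stdio.h>; mmap returns MAP_FAILED, not NULL, on error. */
void *addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE | MAP_HUGETLB,
                  -1, 0);
if (addr == MAP_FAILED)
    perror("mmap");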

With MAP_POPULATE the three memset calls run at nearly identical speed and page‑faults drop to the baseline ~55 (only code/data/stack pages).
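To check that the pages really are present before the first memset, one option is to ask the kernel with `mincore` right after the mapping is created. The sketch below is an assumption, not part of the original experiment: `resident_after_mmap` is a hypothetical helper, and it uses regular pages (no MAP_HUGETLB) to keep the example small.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map `size` bytes with `extra_flags` (for example 0 or MAP_POPULATE) and
 * return how many base pages are resident immediately after mmap, before
 * the program touches the memory. Returns -1 on error. */
long resident_after_mmap(size_t size, int extra_flags) {
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;
    size_t npages = (size + (size_t)page - 1) / (size_t)page;
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | extra_flags, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    unsigned char *vec = malloc(npages);
    long resident = -1;
    if (vec != NULL && mincore(p, size, vec) == 0) {
        resident = 0;
        for (size_t i = 0; i < npages; i++)
            resident += vec[i] & 1;   /* bit 0 set: page is in RAM */
    }
    free(vec);
    munmap(p, size);
    return resident;
}
```

On a typical Linux system one would expect a count near zero for a lazily mapped region and the full page count with MAP_POPULATE. Population is best effort, so the result should be treated as a measurement rather than a guarantee.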

Additional Practical Optimizations

Improve locality – sequential accesses are faster than random accesses.

Prefetch or warm‑up memory before entering a critical path.

Disable swap for latency‑sensitive workloads.

Consider NUMA effects and bind threads to the memory node they use most.
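The warm-up advice can be applied to any buffer, however it was allocated, by touching every page once before the critical path starts. A minimal sketch follows; `prefault` is a hypothetical helper, and the page size is taken from `sysconf` with 4096 as an assumed fallback.

```c
#include <stddef.h>
#include <unistd.h>

/* Touch one byte in every page-sized stride of buf so that page faults
 * happen now rather than inside a latency-critical path. Each byte is
 * rewritten with its own value, so existing contents are preserved.
 * Returns the number of strides touched. */
size_t prefault(void *buf, size_t size) {
    long ps = sysconf(_SC_PAGESIZE);
    size_t page = ps > 0 ? (size_t)ps : 4096;
    volatile unsigned char *p = buf;
    size_t touched = 0;
    for (size_t off = 0; off < size; off += page) {
        p[off] = p[off];   /* volatile read + write forces a real access */
        touched++;
    }
    if (size > 0)
        p[size - 1] = p[size - 1];   /* covers a trailing page if buf is unaligned */
    return touched;
}
```

A write is used rather than a read because a read of untouched anonymous memory may not be enough to give the page its own physical backing, in which case the first later write would still fault.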

Conclusion

The large performance gap of the first memset is caused by lazy page allocation: every first touch of a 4 KB page takes a page fault, and the large number of small pages adds TLB misses. Switching to 2 MB huge pages dramatically reduces both page‑fault count and TLB pressure. Adding MAP_POPULATE moves the remaining fault handling into the mmap call, making all three memset calls equally fast.

