
Optimizing Service Performance with Linux Transparent Huge Pages in Large‑Scale Cloud Clusters

By configuring Linux Transparent Huge Pages and tuning kernel parameters, large‑scale cloud services can cut TLB misses and page‑table overhead, achieving over ten percent CPU and latency reductions while limiting memory growth to about ten percent, as demonstrated in Baidu’s production recommendation system.

Baidu Geek Talk

This article explores the design, implementation, and practical results of using Linux Transparent Huge Pages (THP) to improve memory allocation efficiency, service performance, and cost efficiency in large‑scale cloud machine clusters.

Background

Modern services run on machines with very large memory (≥700 GB). Traditional 4 KB page management suffers from high TLB miss rates and page‑table cache pressure, especially for workloads such as recommendation systems that allocate large intermediate data and use extensive local dictionaries and caches.
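
To put that in perspective, mapping 700 GB with 4 KB pages requires on the order of 180 million page‑table entries, while 2 MB pages cut that by a factor of 512 to roughly 350 thousand, far fewer translations competing for a TLB that holds only a few thousand entries.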

Transparent Huge Pages (THP) let the kernel back anonymous memory with 2 MB pages automatically (1 GB pages are only available through static HugePages), reducing page‑table entries and TLB pressure. However, THP introduces challenges such as increased memory consumption, fragmentation, and occasional latency spikes.

Huge Page Types

Two major huge‑page mechanisms are discussed:

Standard HugePages (hugetlb): static allocation of 2 MB or 1 GB pages; low runtime overhead but requires pre‑allocation.

Transparent HugePages (THP): dynamic allocation managed by the kernel; can fall back to normal pages when allocation fails.

Advantages of Standard HugePages

No swapping of huge pages.

Reduces TLB cache pressure.

Decreases page‑table load.

Shortens page‑table walks by one level for huge‑page mappings.

Improves overall memory performance.

Disadvantages of Standard HugePages

Requires careful sizing to avoid memory waste.

Static configuration does not adapt to hardware changes.

THP Advantages

Kernel dynamically manages allocation and can split/merge pages as needed.

Transparent to applications; widely tested in the open‑source community.

THP Disadvantages

Runtime allocation can add CPU overhead.

Synchronous allocation may cause occasional latency spikes.

Configuration Commands

Standard HugePages status:

grep Huge /proc/meminfo
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB

Set the number of static huge pages:

sysctl -w vm.nr_hugepages=20
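
Once reserved, these pages can also be mapped directly by an application through MAP_HUGETLB (or hugetlbfs). The following is a minimal sketch, not taken from the article, shown only to contrast with the THP path used later:

#include <sys/mman.h>   // mmap, munmap, MAP_HUGETLB
#include <cstdio>       // perror

int main() {
    // Length must be a multiple of the huge page size (2 MB here); the call
    // fails with ENOMEM unless enough pages were reserved via vm.nr_hugepages.
    size_t len = 4 * (2UL << 20);  // four 2 MB pages
    void* p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) { perror("mmap(MAP_HUGETLB)"); return 1; }
    // ... use the memory ...
    munmap(p, len);
    return 0;
}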

Enable THP globally or per‑process:

echo always > /sys/kernel/mm/transparent_hugepage/enabled    # THP for all anonymous memory
echo madvise > /sys/kernel/mm/transparent_hugepage/enabled   # THP only for regions marked with MADV_HUGEPAGE
echo never > /sys/kernel/mm/transparent_hugepage/enabled     # THP disabled

Control THP defragmentation mode:

echo always > /sys/kernel/mm/transparent_hugepage/defrag         # stall and compact memory on every THP fault
echo defer > /sys/kernel/mm/transparent_hugepage/defrag          # defer compaction to kswapd/kcompactd
echo defer+madvise > /sys/kernel/mm/transparent_hugepage/defrag  # stall only for MADV_HUGEPAGE regions, defer otherwise
echo madvise > /sys/kernel/mm/transparent_hugepage/defrag        # stall only for MADV_HUGEPAGE regions
echo never > /sys/kernel/mm/transparent_hugepage/defrag          # never stall for compaction

khugepaged parameters (the background daemon that collapses small pages into huge pages), tuned for a better CPU/latency trade‑off:

echo '100' > /sys/kernel/mm/transparent_hugepage/khugepaged/alloc_sleep_millisecs   # wait only 100 ms after a failed huge-page allocation before retrying
echo '511' > /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none           # allow up to 511 empty PTEs when collapsing a 2 MB range
echo '2048' > /sys/kernel/mm/transparent_hugepage/khugepaged/pages_to_scan          # scan 2048 base pages per wakeup
echo '100' > /sys/kernel/mm/transparent_hugepage/khugepaged/scan_sleep_millisecs    # wake khugepaged every 100 ms

Application‑Level Usage

For custom huge‑page allocation (e.g., C++ services), the following pattern is used:

#include <sys/mman.h>  // mmap, madvise, MADV_HUGEPAGE
#include <iostream>    // std::cout

size_t mem_size = num * (2UL << 20); // num 2 MB pages, so the size is 2 MB aligned
void* ptr = mmap(nullptr, mem_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
if (ptr == MAP_FAILED) { std::cout << "mmap fail"; return -1; }
int ret = madvise(ptr, mem_size, MADV_HUGEPAGE); // hint the kernel to back this range with THP
if (ret != 0) { std::cout << "fail ret:" << ret; return -1; }
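
The MADV_HUGEPAGE hint only takes effect once the pages are actually faulted in (or later collapsed by khugepaged). One way to verify that THP is really in use is to read the AnonHugePages counters from /proc/self/smaps; the helper below (anon_huge_kb is a hypothetical name, not from the article) sums them for the whole process:

#include <fstream>
#include <string>

// Sum AnonHugePages (in kB) across all mappings of the current process.
// A non-zero, growing value after the memory has been written confirms that
// the kernel is backing it with transparent huge pages.
long anon_huge_kb() {
    std::ifstream smaps("/proc/self/smaps");
    std::string line;
    long total_kb = 0;
    while (std::getline(smaps, line)) {
        if (line.rfind("AnonHugePages:", 0) == 0)   // line looks like "AnonHugePages:   2048 kB"
            total_kb += std::stol(line.substr(14));
    }
    return total_kb;
}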

Memory pools are introduced to keep large allocations alive for the lifetime of the service, reducing fragmentation and improving cache locality.
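
The article does not publish its pool implementation; the following is a minimal sketch of what such a huge‑page‑backed arena could look like (HugePageArena is a hypothetical name), assuming a single up‑front mmap plus MADV_HUGEPAGE and bump‑pointer allocation:

#include <sys/mman.h>
#include <cstddef>
#include <new>

// Hypothetical huge-page-backed arena: one large 2 MB-multiple region is
// mapped once, hinted with MADV_HUGEPAGE, and kept until the service shuts
// down; allocations are bump-pointer and never freed individually, which
// avoids fragmenting the huge-page region.
class HugePageArena {
public:
    explicit HugePageArena(size_t huge_pages) : size_(huge_pages * (2UL << 20)) {
        void* p = mmap(nullptr, size_, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) throw std::bad_alloc();
        madvise(p, size_, MADV_HUGEPAGE);   // best-effort THP hint
        base_ = static_cast<char*>(p);
    }
    ~HugePageArena() { munmap(base_, size_); }

    // Bump-pointer allocation; returns nullptr when the arena is exhausted.
    void* allocate(size_t bytes, size_t align = 64) {
        size_t aligned = (offset_ + align - 1) & ~(align - 1);
        if (aligned + bytes > size_) return nullptr;
        offset_ = aligned + bytes;
        return base_ + aligned;
    }

private:
    size_t size_;
    char*  base_ = nullptr;
    size_t offset_ = 0;
};

Long-lived structures such as local dictionaries and caches are then carved out of the arena instead of going through the general-purpose allocator.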

Performance Evaluation

In a recommendation‑system service, THP reduced dtlb_load_misses.walk_active and dtlb_store_misses.walk_active by more than 66 % and achieved:

CPU usage improvement: ~11.3 %

Average latency reduction: ~11.2 %

These gains were observed across a large‑scale production cluster, confirming that THP can deliver double‑digit performance improvements while keeping memory growth within ~10 %.

Conclusion

The study demonstrates that transparent huge pages, combined with kernel tuning and memory‑pool optimizations, provide a practical and effective way to boost service performance in cloud environments. The solution has been fully rolled out in Baidu’s core recommendation system, delivering >10 % CPU and latency improvements. Ongoing work continues to refine C++ service optimization practices and invites further community collaboration.
