Unlock Linux Performance: Master Memory Watermarks and OOM Killer
This article explains how Linux memory watermarks, kswapd, direct reclaim, and the OOM Killer interact, provides detailed code examples, shows real‑world case studies, and offers practical tuning steps—including kernel parameters, cgroup limits, and monitoring tools—to prevent system stalls and crashes.
Linux Memory Watermarks
Linux divides physical memory into zones (e.g., ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, and, on 32-bit systems, ZONE_HIGHMEM). Each zone has three watermarks: min (below which allocations enter direct reclaim), low (below which kswapd is woken), and high (at which kswapd goes back to sleep). The watermarks determine whether an allocation takes the fast path or incurs reclamation overhead.
Allocation Path
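If free pages are above the low watermark, allocation takes the fast path. Once free pages fall below low, the allocator wakes kswapd, which reclaims in the background until free pages climb back to the high watermark; allocations made while free pages sit between low and high still succeed without waiting. Below min, direct reclaim runs synchronously in the allocating thread.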
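The decision above can be sketched as a small classifier. This is an illustrative userspace model, not kernel code: the `ZoneWatermarks` struct, the `AllocPath` enum, and `classify_allocation` are hypothetical names, and the model assumes the fast path is checked against the low watermark, as in mainline kernels. It also ignores details such as lowmem reserves and high-order allocations.

```cpp
#include <cassert>

// Hypothetical container for one zone's watermarks, in pages
// (the same unit /proc/zoneinfo uses).
struct ZoneWatermarks {
    unsigned long min;
    unsigned long low;
    unsigned long high;
};

enum class AllocPath { Fast, WakeKswapd, DirectReclaim };

// Classify what an allocation would experience at a given free-page count.
AllocPath classify_allocation(unsigned long free_pages, const ZoneWatermarks& wm) {
    assert(wm.min <= wm.low && wm.low <= wm.high);  // sanity check on the inputs
    if (free_pages > wm.low)
        return AllocPath::Fast;           // enough headroom, no reclaim involved
    if (free_pages > wm.min)
        return AllocPath::WakeKswapd;     // background reclaim is started
    return AllocPath::DirectReclaim;      // the caller has to reclaim synchronously
}
```

Feeding it the min, low, and high values printed by /proc/zoneinfo gives a quick way to reason about how close a zone is to synchronous reclaim.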
kswapd and Direct Reclaim
kswapd scans LRU lists and reclaims three page types: file cache (clean pages are dropped, dirty pages are written back), anonymous pages (swapped out), and slab cache (kernel objects freed via shrinkers). If free memory drops below the min watermark, the kernel performs direct reclaim, blocking the allocating thread until enough pages are freed.
Three‑Level Control Mechanism
Watermark Check in Allocation
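/* Simplified sketch of the allocator fast path. The real function takes more
 * arguments and performs the watermark check in get_page_from_freelist(). */
struct page *__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
                                    struct zonelist *zonelist, nodemask_t *nodemask)
{
    struct page *page;
    struct zone *zone;

    /* Walk the candidate zones in order of preference. */
    for_each_zone_zonelist(zone, zonelist, gfp_mask) {
        /* Fast path: take a page only if the zone stays above its low watermark. */
        if (zone_watermark_ok(zone, order, low_wmark_pages(zone), 0, gfp_mask)) {
            page = rmqueue(zone, order);
            if (page)
                return page;
        }
    }
    return NULL; /* caller falls back to the slow path: wake kswapd, then direct reclaim */
}
kswapd Workflow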
kswapd is awakened when free memory falls below the low watermark.
It calls balance_pgdat to iterate over zones.
For each zone, shrink_zone (replaced by the node-wide shrink_node since kernel 4.8) scans LRU lists and reclaims pages.
kswapd stops when the high watermark is reached.
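The wake and sleep conditions in the list above form a hysteresis loop, which the following toy model tries to capture. `KswapdModel` and `observe` are hypothetical names for illustration only; the real kswapd also considers allocation order, multiple zones, and node balancing.

```cpp
// Toy model of kswapd's wake/sleep hysteresis for a single zone.
// All counts are in pages.
struct KswapdModel {
    unsigned long low;
    unsigned long high;
    bool running;

    KswapdModel(unsigned long low_wm, unsigned long high_wm)
        : low(low_wm), high(high_wm), running(false) {}

    // Update the state from the current free-page count.
    // Returns true while kswapd would be reclaiming.
    bool observe(unsigned long free_pages) {
        if (!running && free_pages < low)
            running = true;    // woken once free memory drops below low
        else if (running && free_pages >= high)
            running = false;   // goes back to sleep once high is reached
        return running;
    }
};
```

The gap between low and high is what keeps kswapd from oscillating: once woken, it keeps reclaiming through the whole low-to-high range rather than stopping as soon as free memory is back above low.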
Direct Reclaim Path
When free memory is still below the min watermark after kswapd has been woken, the slow path calls __alloc_pages_direct_reclaim, which runs try_to_free_pages and ultimately shrink_node (shrink_zone in kernels before 4.8) synchronously, blocking the allocating thread. If reclamation still cannot free enough pages, the OOM Killer is invoked.
Role of min_free_kbytes
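The kernel parameter min_free_kbytes sets the amount of memory (in KB) that the kernel tries to keep free system-wide. Each zone’s min watermark is its proportional share of this value, and the low and high watermarks are computed from min. Increasing min_free_kbytes therefore raises all three watermarks, so kswapd wakes earlier and allocations are less likely to fall into direct reclaim. Setting it too high leaves less memory for applications and can itself lead to OOM kills.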
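As a rough illustration of how the three watermarks follow from min_free_kbytes, the sketch below approximates the calculation in the kernel's __setup_per_zone_wmarks(). It assumes a single zone that holds all managed memory, a 4 KiB page size, and the default vm.watermark_scale_factor of 10; `Watermarks` and `derive_watermarks` are hypothetical names, and real kernels split the min value across zones and apply further adjustments.

```cpp
#include <algorithm>
#include <cstdint>

struct Watermarks {
    std::uint64_t min;
    std::uint64_t low;
    std::uint64_t high;
};  // all values in pages

// Approximate per-zone watermarks for one zone containing all managed memory.
Watermarks derive_watermarks(std::uint64_t min_free_kbytes,
                             std::uint64_t managed_pages,
                             std::uint64_t watermark_scale_factor = 10,
                             std::uint64_t page_size_kb = 4) {
    const std::uint64_t min_pages = min_free_kbytes / page_size_kb;
    // Distance between watermarks: a quarter of min, or a fraction of
    // managed memory (scale factor is in units of 1/10000), whichever is larger.
    const std::uint64_t gap = std::max<std::uint64_t>(
        min_pages / 4, managed_pages * watermark_scale_factor / 10000);
    return {min_pages, min_pages + gap, min_pages + 2 * gap};
}
```

For a 16 GiB machine (4194304 pages) with min_free_kbytes set to 335544, this model yields min = 83886, low = 104857, and high = 125828 pages.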
OOM Killer – The Final Safeguard
How OOM Killer Works
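When memory is exhausted and reclamation cannot free enough pages, the kernel calls out_of_memory(). It computes a badness score for each candidate process from its memory footprint (resident pages, swap entries, and page tables), adjusted by oom_score_adj, then kills the highest-scoring process to free memory. The resulting score is visible in /proc/[PID]/oom_score.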
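The scoring idea can be approximated in a few lines. This is a loose userspace model of the oom_badness() heuristic in recent kernels, not the kernel's exact code: `oom_badness_sketch` is a hypothetical name, all sizes are assumed to be in pages, and kernel-side details such as unkillable tasks or memcg constraints are left out.

```cpp
#include <cstdint>
#include <limits>

// Rough model of the OOM badness score.
// Sizes are in pages; oom_score_adj is expected to be in [-1000, 1000].
std::int64_t oom_badness_sketch(std::int64_t rss_pages,
                                std::int64_t swap_pages,
                                std::int64_t pgtable_pages,
                                int oom_score_adj,
                                std::int64_t total_pages) {
    // -1000 exempts the process from selection altogether.
    if (oom_score_adj == -1000)
        return std::numeric_limits<std::int64_t>::min();

    // Baseline: how much memory killing this process could free.
    std::int64_t points = rss_pages + swap_pages + pgtable_pages;

    // Each oom_score_adj unit shifts the score by 0.1 % of total memory.
    points += static_cast<std::int64_t>(oom_score_adj) * (total_pages / 1000);
    return points;
}
```

In this model, a process using about a quarter of RAM with oom_score_adj = 0 outranks a small daemon, while a positive adjustment can push a modest process to the top of the list.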
Influencing OOM Decisions
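Adjust /proc/[PID]/oom_score_adj (range -1000 to 1000): -1000 exempts a critical process from the OOM Killer, while positive values make a process more likely to be killed.
The kernel parameter /proc/sys/vm/overcommit_memory controls overcommit behavior: 0 (heuristic, the default), 1 (always allow), or 2 (strict accounting). /proc/sys/vm/panic_on_oom decides whether the kernel panics (non-zero) or invokes the OOM Killer (0, the default).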
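A process can also adjust its own score, or that of a child, programmatically. The sketch below shows one way to do so under stated assumptions: `oom_score_adj_path` and `write_oom_score_adj` are hypothetical helper names, and the writer takes any output stream so that the range check can be exercised without touching procfs.

```cpp
#include <fstream>
#include <sstream>
#include <stdexcept>
#include <string>

// Builds the procfs path for a given PID.
std::string oom_score_adj_path(long pid) {
    std::ostringstream path;
    path << "/proc/" << pid << "/oom_score_adj";
    return path.str();
}

// Validates the range, then writes the value to the stream.
// Returns false if the write fails (for example, when lowering the value
// without sufficient privileges).
bool write_oom_score_adj(std::ostream& out, int value) {
    if (value < -1000 || value > 1000)
        throw std::out_of_range("oom_score_adj must be in [-1000, 1000]");
    out << value << '\n';
    out.flush();
    return static_cast<bool>(out);
}
```

Typical use would be to open `std::ofstream f(oom_score_adj_path(pid));` and call `write_oom_score_adj(f, -1000);` for a process that must survive memory pressure.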
Practical Tuning
Inspect Watermarks
cat /proc/zoneinfo | grep -E "Node|min|low|high"
Adjust vm.min_free_kbytes
For a 16 GiB server, set vm.min_free_kbytes=335544 (≈328 MiB, ~2 % of RAM).
Temporary change: sysctl -w vm.min_free_kbytes=335544
Permanent change: add vm.min_free_kbytes = 335544 to /etc/sysctl.conf and run sysctl -p.
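The 335544 figure is simply 2 % of 16 GiB expressed in KiB. A minimal sketch of that arithmetic, with `min_free_kbytes_for` as a hypothetical helper name:

```cpp
#include <cstdint>

// Returns a vm.min_free_kbytes value that covers `percent` of RAM.
std::uint64_t min_free_kbytes_for(std::uint64_t ram_gib, unsigned percent) {
    const std::uint64_t ram_kib = ram_gib * 1024 * 1024;  // GiB -> KiB
    return ram_kib * percent / 100;                       // integer division rounds down
}
```

With 16 GiB and 2 %, the function returns 335544. The right percentage depends on the workload, so treat the 2 % figure as a starting point rather than a rule.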
Verify the Effect
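cat /proc/zoneinfo | grep -E "Node|min|low|high"
Higher min, low, and high values confirm that the new setting took effect. Higher free page counts and fewer direct‑reclaim events over time (for example, slower growth of the allocstall and pgscan_direct counters in /proc/vmstat) indicate successful tuning.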
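To track direct-reclaim activity before and after tuning, the relevant counters can be pulled out of /proc/vmstat. The sketch below assumes the counter names used by recent kernels (per-zone `allocstall_*` entries plus `pgscan_direct` and `pgscan_kswapd`); older kernels expose slightly different names. `ReclaimCounters` and `read_reclaim_counters` are hypothetical names.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

struct ReclaimCounters {
    std::uint64_t allocstall = 0;     // sum of all allocstall* entries
    std::uint64_t pgscan_direct = 0;  // pages scanned by direct reclaim
    std::uint64_t pgscan_kswapd = 0;  // pages scanned by kswapd
};

// Parses "name value" lines in the format of /proc/vmstat.
ReclaimCounters read_reclaim_counters(std::istream& vmstat) {
    ReclaimCounters counters;
    std::string line;
    while (std::getline(vmstat, line)) {
        std::istringstream fields(line);
        std::string name;
        std::uint64_t value = 0;
        if (!(fields >> name >> value))
            continue;  // skip malformed lines
        if (name.compare(0, 10, "allocstall") == 0)
            counters.allocstall += value;
        else if (name == "pgscan_direct")
            counters.pgscan_direct = value;
        else if (name == "pgscan_kswapd")
            counters.pgscan_kswapd = value;
    }
    return counters;
}
```

The counters are cumulative since boot, so the useful signal is the difference between two samples taken some time apart, for example by passing an `std::ifstream` opened on /proc/vmstat at each sampling point.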
cgroup Memory Limits
Make sure the cgroup filesystem is mounted (usually auto‑mounted at /sys/fs/cgroup or /sys/fs/cgroup/unified).
Create a memory cgroup, e.g., mkdir /sys/fs/cgroup/memory/my_group (cgroup v1) or mkdir /sys/fs/cgroup/my_group (cgroup v2).
Set the limit in memory.limit_in_bytes on cgroup v1 or memory.max on cgroup v2 (e.g., echo 536870912 > memory.limit_in_bytes for 512 MiB).
Add processes by writing their PIDs to cgroup.procs (on cgroup v1, writing a thread ID to tasks also works).
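The steps above can also be driven from a program. The following sketch assumes the cgroup directory already exists and that the caller has permission to write to it; `CgroupVersion`, `memory_limit_file`, and `set_memory_limit` are hypothetical names chosen for illustration.

```cpp
#include <fstream>
#include <string>

enum class CgroupVersion { V1, V2 };

// Picks the limit file that matches the cgroup version.
std::string memory_limit_file(const std::string& cgroup_dir, CgroupVersion version) {
    return cgroup_dir + (version == CgroupVersion::V1 ? "/memory.limit_in_bytes"
                                                      : "/memory.max");
}

// Writes the limit in bytes. Returns false if the file cannot be opened or written.
bool set_memory_limit(const std::string& cgroup_dir, CgroupVersion version,
                      unsigned long long bytes) {
    std::ofstream out(memory_limit_file(cgroup_dir, version));
    out << bytes << '\n';
    out.flush();
    return static_cast<bool>(out);
}
```

For example, `set_memory_limit("/sys/fs/cgroup/memory/my_group", CgroupVersion::V1, 536870912ULL)` would apply the 512 MiB limit from the steps above.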
Kernel Parameters for Memory Management
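vm.watermark_scale_factor – sets the distance between the min, low, and high watermarks in units of 0.01 % of memory (default 10, i.e., 0.1 %); larger values make kswapd start earlier and reclaim more.
vm.swappiness – controls the tendency to swap anonymous pages rather than drop file cache (0‑100; up to 200 since Linux 5.8).
vm.min_free_kbytes – sets the global minimum free memory.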
Monitoring Tools
dstat -m – real‑time memory statistics.
vmstat 1 – per‑second memory, swap, and CPU information.
top – interactive view; sort by memory with Shift+M.
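When a service needs the same information programmatically, /proc/meminfo is the usual source. The sketch below computes MemAvailable as a percentage of MemTotal; `available_percent` is a hypothetical helper name, and the function reads from any stream so it can be checked against sample input.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Returns MemAvailable as a percentage of MemTotal,
// or -1.0 if either field is missing from the input.
double available_percent(std::istream& meminfo) {
    double total_kb = -1.0;
    double available_kb = -1.0;
    std::string line;
    while (std::getline(meminfo, line)) {
        std::istringstream fields(line);
        std::string key;
        double value_kb = 0.0;
        if (!(fields >> key >> value_kb))
            continue;  // skip lines that are not "Key: value ..."
        if (key == "MemTotal:")
            total_kb = value_kb;
        else if (key == "MemAvailable:")
            available_kb = value_kb;
    }
    if (total_kb <= 0.0 || available_kb < 0.0)
        return -1.0;
    return 100.0 * available_kb / total_kb;
}
```

A monitoring loop could open /proc/meminfo with `std::ifstream` once per interval and raise an alert when the returned percentage drops below a chosen threshold.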
Example: High‑Memory Allocation Triggering OOM
#include <cstring>   // std::memset
#include <iostream>
#include <new>       // std::bad_alloc
#include <vector>
#include <unistd.h>  // sleep

void processOrderData(int orderCount) {
    const std::size_t blockSize = 10 * 1024 * 1024;  // 10 MiB per order
    std::vector<char*> memoryBlocks;                 // never freed, on purpose
    try {
        for (int i = 0; i < orderCount; ++i) {
            char* block = new char[blockSize];
            memoryBlocks.push_back(block);
            // Touch every page so the kernel has to back it with physical memory.
            std::memset(block, '0', blockSize);
            if (i % 1000 == 0)
                std::cout << "Processed " << i << " orders" << std::endl;
        }
    } catch (const std::bad_alloc& e) {
        // With default overcommit settings the OOM Killer usually acts before this is reached.
        std::cerr << "Allocation failed: " << e.what() << std::endl;
    }
    while (true) sleep(1);  // keep process alive
}

int main() {
    std::cout << "Start" << std::endl;
    processOrderData(100000);  // ~1 TiB in total
    return 0;
}
This program illustrates how unchecked memory allocation can exhaust system memory, trigger direct reclaim, and eventually invoke the OOM killer.
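For contrast, here is one possible bounded variant of the same loop. It is only a sketch of the idea of enforcing an application-level budget: `process_with_budget` and its parameters are hypothetical, and a production service would more likely combine such a cap with a cgroup memory limit.

```cpp
#include <cstddef>
#include <cstring>
#include <memory>
#include <new>
#include <utility>
#include <vector>

// Allocates and touches blocks until the next one would exceed `budget_bytes`.
// Returns the number of blocks that were held; all memory is released on return.
std::size_t process_with_budget(std::size_t order_count,
                                std::size_t block_bytes,
                                std::size_t budget_bytes) {
    std::vector<std::unique_ptr<char[]>> blocks;
    std::size_t used = 0;
    for (std::size_t i = 0; i < order_count; ++i) {
        if (used + block_bytes > budget_bytes)
            break;  // stop before the budget is exceeded
        std::unique_ptr<char[]> block(new (std::nothrow) char[block_bytes]);
        if (!block)
            break;  // allocation refused; stop gracefully
        std::memset(block.get(), '0', block_bytes);  // touch the pages
        blocks.push_back(std::move(block));
        used += block_bytes;
    }
    return blocks.size();
}
```

Because the blocks are owned by `std::unique_ptr`, they are freed when the function returns, so the process never grows past the chosen budget.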
Deepin Linux
Research areas: Windows & Linux platforms, C/C++ backend development, embedded systems and Linux kernel, etc.
