How Linux’s OOM Killer Works: Inside the Kernel’s Memory‑Reclaim Mechanism
This article explains the Linux OOM Killer’s trigger and selection process, walks through the key kernel functions and heuristics that choose a victim process, and clarifies why out‑of‑memory conditions occur under different overcommit settings.
When the system cannot allocate enough memory, the Linux kernel invokes the OOM Killer (Out‑Of‑Memory Killer). The OOM Killer frees memory by terminating a selected process so that the rest of the system can keep running.
1. Trigger Process
Memory allocation starts with alloc_page(), which eventually calls __alloc_pages(). When the fast path cannot satisfy a request, __alloc_pages_slowpath() tries direct compaction and direct reclaim. If reclaim makes no progress, it calls __alloc_pages_may_oom() to enter the OOM path:
```c
/* If we failed to make any progress reclaiming, then we are
 * running out of options and have to consider going OOM */
if (!did_some_progress) {
    if (oom_gfp_allowed(gfp_mask)) {
        if (oom_killer_disabled)
            goto nopage;
        /* Coredumps can quickly deplete all memory reserves */
        if ((current->flags & PF_DUMPCORE) &&
            !(gfp_mask & __GFP_NOFAIL))
            goto nopage;
        page = __alloc_pages_may_oom(gfp_mask, order,
                zonelist, high_zoneidx,
                nodemask, preferred_zone,
                classzone_idx, migratetype);
        ...
    }
}
```

If oom_killer_disabled is set, the kernel skips the OOM path entirely and the allocation fails (goto nopage). The same happens when the allocating task is writing a core dump (PF_DUMPCORE), unless the request carries __GFP_NOFAIL.
2. Victim Selection (Linux 3.18)
Once the OOM path is entered, __alloc_pages_may_oom() calls out_of_memory(), which picks a victim with select_bad_process():
```c
p = select_bad_process(&points, totalpages, mpol_mask, force_kill);
```
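select_bad_process() walks every thread in the system. For each thread, oom_scan_process_thread() decides among four outcomes:

- skip the thread (OOM_SCAN_CONTINUE);
- pick it immediately (OOM_SCAN_SELECT);
- abort the whole scan (OOM_SCAN_ABORT), for example because an earlier victim is still exiting;
- score it normally (OOM_SCAN_OK).

Scored threads are ranked with oom_badness(), and the highest score wins: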
```c
static struct task_struct *select_bad_process(unsigned int *ppoints,
        unsigned long totalpages, const nodemask_t *nodemask,
        bool force_kill)
{
    struct task_struct *g, *p;
    struct task_struct *chosen = NULL;
    unsigned long chosen_points = 0;

    rcu_read_lock();
    for_each_process_thread(g, p) {
        unsigned int points;

        switch (oom_scan_process_thread(p, totalpages, nodemask,
                                        force_kill)) {
        case OOM_SCAN_SELECT:
            chosen = p;
            chosen_points = ULONG_MAX;
            /* fall through */
        case OOM_SCAN_CONTINUE:
            continue;
        case OOM_SCAN_ABORT:
            rcu_read_unlock();
            return (struct task_struct *)(-1UL);
        case OOM_SCAN_OK:
            break;
        };
        points = oom_badness(p, NULL, nodemask, totalpages);
        if (!points || points < chosen_points)
            continue;
        /* Prefer thread group leaders for display purposes */
        if (points == chosen_points && thread_group_leader(chosen))
            continue;

        chosen = p;
        chosen_points = points;
    }
    if (chosen)
        get_task_struct(chosen);
    rcu_read_unlock();

    *ppoints = chosen_points * 1000 / totalpages;
    return chosen;
}
```

The last line rescales the winning score to the 0-1000 range, which is the same scale that /proc/<pid>/oom_score reports.

The helper oom_badness() computes the raw score. The baseline is the task's resident pages (RSS), plus the pages used by its page tables, plus its swapped-out pages. That baseline is reduced by 3% for tasks holding CAP_SYS_ADMIN and then shifted by oom_score_adj, scaled to totalpages / 1000:
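```c
unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
                          const nodemask_t *nodemask, unsigned long totalpages)
{
    long points;
    long adj;

    if (oom_unkillable_task(p, memcg, nodemask))
        return 0;

    p = find_lock_task_mm(p);
    if (!p)
        return 0;

    adj = (long)p->signal->oom_score_adj;
    if (adj == OOM_SCORE_ADJ_MIN) {
        task_unlock(p);
        return 0;
    }

    /* Baseline: RSS + page-table pages + swap entries */
    points = get_mm_rss(p->mm) + atomic_long_read(&p->mm->nr_ptes) +
             get_mm_counter(p->mm, MM_SWAPENTS);
    task_unlock(p);

    /* Root processes get a 3% bonus */
    if (has_capability_noaudit(p, CAP_SYS_ADMIN))
        points -= (points * 3) / 100;

    /* Normalize to oom_score_adj units */
    adj *= totalpages / 1000;
    points += adj;

    /* Never return 0 for an eligible task */
    return points > 0 ? points : 1;
}
```

out_of_memory() then acts on what select_bad_process() returned:

- If no eligible task was found at all, out_of_memory() panics.
- If it returns (struct task_struct *)-1UL (the OOM_SCAN_ABORT case), no new victim is killed. Some task is already exiting or was already chosen, and it is expected to free memory soon.
- Otherwise, out_of_memory() calls oom_kill_process(). Before killing the chosen task, oom_kill_process() looks at its children. If any child has its own address space and is killable, the child with the highest badness score is killed in the parent's place. This loses as little work as possible while still freeing memory.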
3. Why Does Out‑Of‑Memory Happen?
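Physical pages are allocated lazily. malloc() or mmap() only reserves virtual address space; the kernel backs each page with RAM the first time it is touched, in the page-fault handler. Overcommit lets processes reserve more virtual memory than RAM plus swap can actually back, on the assumption that much of it will never be used. If too much of that reservation is touched at once and reclaim cannot free enough, the kernel runs out of real pages and triggers the OOM Killer.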
The kernel’s vm.overcommit_memory setting controls this behavior:
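0 – Heuristic overcommit (the default); allocations that obviously exceed available memory are refused, everything else is allowed.
1 – Always overcommit; no checks are made.
2 – Strict accounting (no overcommit): the total committed address space (Committed_AS in /proc/meminfo) may not exceed CommitLimit, which is roughly swap plus RAM × vm.overcommit_ratio / 100 (the ratio defaults to 50; vm.overcommit_kbytes replaces it if set). Allocations beyond the limit fail up front, for example malloc() returns NULL, rather than succeeding and leading to an OOM kill later.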
4. Summary
Physical memory is allocated on demand, and overcommit lets processes reserve more than can be backed. As a result, the kernel can run out of RAM, and the OOM Killer is then invoked. It kills the process with the highest badness score, which after oom_score_adj is applied is typically the largest memory consumer, so that a single kill frees as much memory as possible.
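As the comment above out_of_memory() in mm/oom_kill.c puts it: "If we run out of memory, we have the choice between either killing a random task (bad), letting the system crash (worse) OR try to be smart about which process to kill. Note that we don't have to be perfect here, we just have to be good."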
MaGe Linux Operations
