
How to Detect and Prevent OOM and CPU Throttling in Kubernetes Pods

This article explains why Kubernetes pods encounter out‑of‑memory errors and CPU throttling, how limits and requests influence resource allocation, and provides practical monitoring techniques using Prometheus and cAdvisor to proactively identify and mitigate these issues before they impact performance or cause pod eviction.

MaGe Linux Operations

Introduction

When using Kubernetes, out‑of‑memory (OOM) errors and CPU throttling are major challenges for cloud applications.

Why does this happen?

CPU and memory requirements in cloud apps are increasingly important because they directly affect cloud costs.

By configuring limits and requests, you control how much memory and CPU each pod is allocated, which helps prevent resource starvation and keeps costs in check.

If a node runs short of resources, pods on it may be evicted. If a process uses more memory than it is allowed, it is terminated by the out-of-memory (OOM) killer.

If CPU usage exceeds the set limit, the process will be throttled.

But how can you proactively monitor how close a Kubernetes pod is to OOM or CPU throttling?

Kubernetes OOM

Each container in a pod needs memory to run.

Kubernetes limits are set per container in the pod or Deployment definition.

All modern Unix systems have a way to kill a process when memory must be reclaimed. Kubernetes reports such a kill with the reason OOMKilled and exit code 137 (128 + 9, meaning the process received SIGKILL).

    State:          Running
      Started:      Thu, 10 Oct 2019 11:14:13 +0200
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Thu, 10 Oct 2019 11:04:03 +0200
      Finished:     Thu, 10 Oct 2019 11:14:11 +0200

This exit code indicates the process used more memory than allowed and had to be terminated.
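As a small illustration of where 137 comes from, the sketch below decodes an exit code using the common convention that a process killed by a signal exits with 128 plus the signal number. The function name describe_exit_code is made up for this example, and the signal names assume a Linux host.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Explain an exit code using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            # Look up the symbolic name of the signal on this platform.
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


# 137 - 128 = 9, which is SIGKILL on Linux.
print(describe_exit_code(137))
```

The exit code alone only tells you the process received SIGKILL; it is the OOMKilled reason in the pod status that identifies memory as the cause.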

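Linux computes an oom_score for each process, and Kubernetes sets the oom_score_adj value according to the pod's quality-of-service class. When memory must be reclaimed, the OOM Killer reviews the candidate processes and kills the one with the highest score; a container that exceeds its own memory limit is killed within its cgroup.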
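To make the link between quality of service and oom_score_adj concrete, here is a minimal sketch that follows the formula given in the Kubernetes documentation on node out-of-memory behavior. Treat the constants as an assumption: the exact clamping in the kubelet may differ slightly between versions, and the function name is hypothetical.

```python
def oom_score_adj(qos_class: str,
                  memory_request_bytes: int = 0,
                  node_memory_bytes: int = 1) -> int:
    """Approximate the oom_score_adj assigned to a container.

    Assumption: values follow the formula in the Kubernetes docs,
    not a guaranteed API contract.
    """
    if qos_class == "Guaranteed":
        return -997  # very unlikely to be picked by the OOM Killer
    if qos_class == "BestEffort":
        return 1000  # first candidate to be killed
    # Burstable: the larger the memory request relative to the node,
    # the lower (safer) the adjustment.
    adj = 1000 - (1000 * memory_request_bytes) // node_memory_bytes
    return min(max(2, adj), 999)


GIB = 1024 ** 3
# A Burstable container requesting 4 GiB on a 16 GiB node.
print(oom_score_adj("Burstable", 4 * GIB, 16 * GIB))
```

A higher value makes the process a more likely victim, which is why BestEffort pods tend to be killed before Burstable ones, and Guaranteed pods last.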

In Kubernetes a process can hit any of the following limits:

Kubernetes Limit set on the container.

Kubernetes ResourceQuota set on the namespace.

The actual memory size of the node.

Memory Overcommit

Limits can be higher than requests, so the sum of limits may exceed node capacity. This overcommit is common; if all containers use more memory than requested, the node can run out of memory, causing some pods to be killed.
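The arithmetic behind overcommit can be sketched as follows. The container sizes and node capacity are made-up numbers for illustration, and commit_ratio is a hypothetical helper, not part of any Kubernetes tooling.

```python
GIB = 1024 ** 3


def commit_ratio(per_container_bytes, node_allocatable_bytes):
    """Return the sum of the given values as a fraction of node memory."""
    return sum(per_container_bytes) / node_allocatable_bytes


# Three containers on one node with 8 GiB allocatable (illustrative values).
requests = [2 * GIB, 2 * GIB, 3 * GIB]
limits = [4 * GIB, 4 * GIB, 6 * GIB]
node_allocatable = 8 * GIB

request_ratio = commit_ratio(requests, node_allocatable)
limit_ratio = commit_ratio(limits, node_allocatable)

print(f"requests use {request_ratio:.0%} of the node")
print(f"limits add up to {limit_ratio:.0%} of the node")
if limit_ratio > 1:
    print("memory is overcommitted: not every container can reach its limit")
```

Here the requests fit on the node, so the pods can be scheduled, but the limits add up to more than the node has. If every container grows toward its limit at once, something has to be killed.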

Monitoring Kubernetes OOM

If you run the Prometheus node exporter, it exposes a metric called node_vmstat_oom_kill. Tracking when OOM kills occur is important, but you may want to know before they happen.

Instead you can check how close a process is to its Kubernetes limits:

(sum by (namespace,pod,container)
 (container_memory_working_set_bytes{container!=""})
 / sum by (namespace,pod,container)
 (kube_pod_container_resource_limits{resource="memory"})) > 0.8

Kubernetes CPU Throttling

CPU throttling is a behavior where a process is slowed down when it reaches its CPU limit: it keeps running, but it has to wait before it gets more CPU time.

Similar to memory, these limits may be:

Kubernetes Limit set on the container.

Kubernetes ResourceQuota set on the namespace.

The actual CPU size of the node.

Think of a highway with traffic:

CPU is the road.

Vehicles represent processes, each with different sizes.

Multiple lanes represent multiple cores.

A request is a dedicated lane, like a bike lane. Throttling appears as a traffic jam: all processes run but slower.

Kubernetes CPU Shares

CPU in Kubernetes is handled via shares. Each CPU core is divided into 1024 shares, and the Linux kernel cgroups allocate shares among running processes.

If the CPU can handle all current processes, no action is needed. If processes together demand more than 100% of the available CPU, shares come into play. Like any Linux system, Kubernetes nodes rely on the kernel's Completely Fair Scheduler (CFS), so processes with more shares receive more CPU time.
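A rough sketch of how shares translate into CPU time under contention is shown below. It assumes the cgroup v1 convention of 1024 shares per core with a floor of 2 (cgroup v2 uses a differently scaled weight), and both function names are hypothetical.

```python
from typing import Dict


def milli_cpu_to_shares(milli_cpu: int) -> int:
    """Convert a CPU request in millicores to cgroup v1 CPU shares."""
    shares = milli_cpu * 1024 // 1000
    return max(shares, 2)  # assumed minimum share value


def cpu_split_under_contention(requests_milli: Dict[str, int],
                               cores: float) -> Dict[str, float]:
    """Divide a saturated CPU among containers in proportion to shares."""
    shares = {name: milli_cpu_to_shares(m)
              for name, m in requests_milli.items()}
    total = sum(shares.values())
    return {name: cores * s / total for name, s in shares.items()}


# Two containers competing for a single fully busy core.
split = cpu_split_under_contention({"api": 500, "worker": 1500}, cores=1.0)
print(split)
```

Shares only matter when the CPU is saturated. While there is idle capacity, a container can use more than its proportional slice, up to its limit.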

Unlike memory, Kubernetes does not kill a pod because of throttling.

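You can view throttling counters in the cgroup's cpu.stat file, for example /sys/fs/cgroup/cpu/cpu.stat on cgroup v1 (on cgroup v2 the file is /sys/fs/cgroup/cpu.stat inside the container). The nr_periods and nr_throttled fields show how many scheduling periods have elapsed and in how many of them the cgroup was throttled.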
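As a minimal sketch of reading that file, the snippet below parses the key/value lines of cpu.stat and derives the share of periods that were throttled. The sample text is invented for illustration; on a real system you would read the file contents instead.

```python
def parse_cpu_stat(text: str) -> dict:
    """Parse 'key value' lines from a cgroup cpu.stat file into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            stats[parts[0]] = int(parts[1])
    return stats


# Made-up sample in the cgroup v1 layout.
sample = """nr_periods 1200
nr_throttled 300
throttled_time 45000000000
"""

stats = parse_cpu_stat(sample)
throttled_fraction = stats["nr_throttled"] / stats["nr_periods"]
print(f"throttled in {throttled_fraction:.0%} of periods")
```

These counters are cumulative since the cgroup was created, so a single reading shows the lifetime average rather than what is happening right now.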

CPU Overcommit

As seen in the limits and requests article, setting limits and requests is important for controlling resource consumption. However, beware of letting total requests exceed the node's actual CPU capacity, because each request is an amount of CPU the container is supposed to be guaranteed.

Monitoring Kubernetes CPU Throttling

You can check how close a process is to its Kubernetes limits:

(sum by (namespace,pod,container)(rate(container_cpu_usage_seconds_total{container!=""}[5m]))
 / sum by (namespace,pod,container)(kube_pod_container_resource_limits{resource="cpu"})) > 0.8

To track cluster-wide throttling, cAdvisor provides container_cpu_cfs_throttled_periods_total and container_cpu_cfs_periods_total. Dividing the first by the second gives the percentage of CFS periods in which a container was throttled.
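The calculation itself is simple. The sketch below works on two samples of each counter taken at the start and end of a window, which is roughly what a rate-based query does. The function name and sample values are hypothetical, and counter resets are ignored for brevity.

```python
def throttled_percent(throttled_start: int, throttled_end: int,
                      periods_start: int, periods_end: int) -> float:
    """Percentage of CFS periods throttled between two counter samples."""
    periods = periods_end - periods_start
    if periods <= 0:
        # No periods elapsed (or a counter reset): nothing to report.
        return 0.0
    return 100.0 * (throttled_end - throttled_start) / periods


# Illustrative samples: 60 of the 200 periods in the window were throttled.
pct = throttled_percent(throttled_start=100, throttled_end=160,
                        periods_start=1000, periods_end=1200)
print(f"{pct:.1f}% of periods throttled")
```

Working on deltas over a recent window, rather than on the raw cumulative totals, shows current throttling instead of an average over the container's lifetime.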

Best Practices

Pay Attention to Limits and Requests

Limits set the maximum resources a pod can use on a node, but they must be chosen carefully; a limit that is too low means the process will be throttled (CPU) or killed (memory).

Be Ready for Eviction

Setting very low requests may look like a way to guarantee your process a minimum of CPU or memory. But the kubelet evicts pods whose usage exceeds their requests first, so a very low request marks the pod as one of the first to be killed.

If you need to protect specific pods from preemption, assign a higher-priority PriorityClass to your critical workloads.
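The ordering described above can be sketched as a sort. This is a simplified model of the ranking the Kubernetes documentation describes for node-pressure eviction (usage above requests first, then lower priority, then how far usage exceeds the request); the PodUsage class and the numbers are invented for illustration, and the real kubelet considers more than this.

```python
from dataclasses import dataclass


@dataclass
class PodUsage:
    name: str
    priority: int
    memory_request: int  # MiB
    memory_usage: int    # MiB


def eviction_order(pods):
    """Return pods sorted from first to last eviction candidate."""
    def key(p):
        exceeds = p.memory_usage > p.memory_request
        over = p.memory_usage - p.memory_request
        # False sorts before True, so pods exceeding requests come first;
        # then lower priority; then the largest overshoot.
        return (not exceeds, p.priority, -over)
    return sorted(pods, key=key)


pods = [
    PodUsage("well-sized", priority=0, memory_request=512, memory_usage=300),
    PodUsage("critical", priority=1000, memory_request=200, memory_usage=500),
    PodUsage("low-request", priority=0, memory_request=100, memory_usage=400),
]
order = [p.name for p in eviction_order(pods)]
print(order)
```

In this model the pod with the unrealistically low request goes first, and a higher priority only helps relative to other pods that also exceed their requests. Sizing requests close to real usage is the more reliable protection.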

Throttling Is a Silent Enemy

By setting unrealistic limits or overcommitting, you may not realize your processes are being throttled and performance suffers. Actively monitor CPU usage and understand the actual limits in your containers and namespaces.

Link: https://sysdig.com/blog/troubleshoot-kubernetes-oom/

