Choosing Between CPU‑Bound and Memory‑Bound Nodes for Kubernetes Scheduling
The article explains how to decide between a CPU‑saturated but memory‑free node and a memory‑saturated but CPU‑free node when scheduling a new Kubernetes pod, arguing that the node with ample memory (CPU‑busy) is generally preferable and describing the underlying kube‑scheduler considerations.
Hello everyone, I am Fei! A few days ago I saw an interesting question that I shared on my Moments, and today I’m posting it again on my public account.
The question is: there are two servers, A and B. Server A's CPU is almost fully loaded but it has plenty of free memory, while server B's CPU is mostly idle but its memory is almost exhausted. A new Kubernetes task needs to be scheduled. Which server should be chosen? This is a classic Kubernetes scheduling scenario.
Some people’s first thought is to evaluate whether the new task is CPU‑intensive or I/O‑intensive, then decide where to schedule it. That reasoning isn’t wrong, but it misses the key point.
The key is to consider what problems might arise when the task is scheduled on a particular machine.
1. Scheduling to the CPU‑busy server A
If we schedule the task on server A, whose CPU load is already high, what happens? CPU time is shared: the kernel scheduler gives each runnable process a time slice, so adding one more process simply makes everyone run a little slower. In other words, CPU is a compressible resource that can be overcommitted.
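A quick way to see why CPU counts as "compressible": once runnable processes outnumber cores, each one just receives a smaller slice of the time-shared CPU. A minimal sketch (the function name is mine, not a real API):

```python
def per_process_share(num_processes, cpu_cores=1.0):
    """CPU is time-shared: when runnable processes outnumber cores,
    each simply gets a smaller slice, but all keep making progress."""
    return min(cpu_cores / num_processes, 1.0)

# Going from 4 to 5 CPU-hungry processes on a single core: every
# process slows from 25% to 20% of a core. Degraded, but nothing dies.
print(per_process_share(4), per_process_share(5))  # 0.25 0.2
```

Contrast this with memory: there is no analogous way for five processes to each "run a bit slower" on too little RAM, which is the asymmetry the rest of the article builds on.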
Some readers reported that their cloud VMs crash when CPU spikes to 100% because the cloud provider shuts down the host to protect it. This rarely happens in on‑premise IDC data centers, so we ignore that case here.
2. Scheduling to the memory‑busy server B
If we schedule the task on server B, which is low on memory, what could happen? You may have encountered an OOM‑kill situation where the operating system kills a running process because the physical memory is insufficient.
The OS does not always kill the process that consumes the most memory; it also weighs each process's configurable oom_score_adj value. On servers running mixed workloads, online services are often given a lower kill priority than offline batch jobs, protecting the stability of the online services.
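The kernel's victim selection (mm/oom_kill.c) can be sketched in simplified form: the base score is the process's memory footprint in pages, shifted by oom_score_adj as a fraction of total system memory. This is a rough sketch of the heuristic, not the exact kernel code:

```python
def oom_badness(rss_pages, oom_score_adj, total_pages):
    """Simplified sketch of the OOM killer's victim-selection score:
    base score is memory footprint in pages, shifted by oom_score_adj
    (range -1000..+1000) as a fraction of total system memory."""
    if oom_score_adj == -1000:
        return 0  # process is exempt from OOM killing
    points = rss_pages + oom_score_adj * total_pages // 1000
    return max(points, 1)

total = 4 * 1024 * 1024  # 16 GiB of 4 KiB pages
# An online service using 2 GiB, protected with oom_score_adj = -500
online = oom_badness(512 * 1024, -500, total)
# An offline batch job using only 1 GiB, deprioritized with +500
offline = oom_badness(256 * 1024, 500, total)
print(online < offline)  # True: the smaller offline job is killed first
```

Note how the adjustment dominates: the offline job uses half the memory of the online service, yet scores far higher and becomes the victim.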
Assuming all services are online, any process killed by Linux’s OOM killer has a huge impact: the service must be rescheduled, and stability and correct responses may be affected.
Some might say Linux can swap memory to disk, but swapping is usually disabled on production servers: disk is orders of magnitude slower than RAM, and swapping would cause a drastic performance drop. Kubernetes itself reflects this: by default, the kubelet refuses to start on a node with swap enabled.
Conclusion
Therefore, when scheduling a new task, you should prefer server A because it has abundant free memory, making OOM‑kill unlikely. Although its CPU is busy, the services can still run.
In practice, after the Kubernetes API server receives a Pod creation request, the scheduler (kube‑scheduler) selects the best available node from the cluster to run the Pod.
Of course, real-world scheduling is more complex. Besides the default kube-scheduler behavior, Pods can steer their placement with nodeName, nodeSelector, node affinity, pod affinity/anti-affinity, and so on.
Even the default scheduler evaluates individual and aggregate resource requests, hardware/software/policy constraints, affinity/anti‑affinity rules, data locality, inter‑pod interference, and other factors to score nodes and pick the highest‑scoring one.
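The resource-scoring part can be sketched. The real scheduler's NodeResourcesFit plugin with the LeastAllocated strategy scores each resource by remaining capacity and combines the scores by weight; the sketch below is my simplification with made-up numbers, not the actual plugin code. Interestingly, with equal weights the two servers from the question score almost the same, which is exactly why the memory-is-incompressible argument matters: weighting memory higher makes server A the clear winner.

```python
def least_allocated_score(capacity, requested, weights):
    """Simplified least-allocated scoring: each resource scores
    (capacity - requested) / capacity * 100, combined by weight."""
    total = sum(
        (capacity[r] - requested[r]) / capacity[r] * 100 * w
        for r, w in weights.items()
    )
    return total / sum(weights.values())

cap = {"cpu": 4000, "memory": 16384}    # millicores, MiB
req_a = {"cpu": 3800, "memory": 2048}   # server A: CPU nearly full
req_b = {"cpu": 400, "memory": 15360}   # server B: memory nearly full

equal = {"cpu": 1, "memory": 1}
mem_heavy = {"cpu": 1, "memory": 2}

# Equal weights: A ~46, B ~48 -- a near tie, B even slightly ahead.
# Memory weighted 2x: A ~60, B ~34 -- A wins decisively.
for name, req in (("A", req_a), ("B", req_b)):
    print(name,
          round(least_allocated_score(cap, req, equal), 2),
          round(least_allocated_score(cap, req, mem_heavy), 2))
```

The weights are the knob: once you encode "running out of memory is catastrophic, running out of CPU is merely slow," the scoring agrees with the intuition argued above.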
Finally, thank you for your likes and views.
Refining Core Development Skills
Fei has over 10 years of development experience at Tencent and Sogou. Through this account, he shares his deep insights on performance.
