How to Detect and Resolve Kernel Memory & CPU Latency in Kubernetes Clusters
In cloud-native Kubernetes environments, resource over-commit and mixed deployment can cause kernel-level memory-reclaim and CPU-scheduling delays that surface as application jitter. This article explains how to visualize, diagnose, and remediate those delays using SysOM kernel-level container monitoring and related metrics.
Background
In cloud‑native scenarios, many clusters adopt resource over‑commit and mixed deployment to maximize utilization. While this improves efficiency, it also raises contention between the host and containerized applications, leading to kernel‑level delays such as CPU latency and Memory Reclaim Latency that propagate to the application layer, causing response‑time jitter or even service disruption.
Memory Reclaim Latency
When a process requests memory and free memory in the system or container drops to the low watermark, the kernel wakes kswapd to reclaim memory asynchronously. If free memory falls below the min watermark, the allocating process itself enters direct reclaim and direct compaction, which can block it for a noticeable period.
Direct reclaim: the process blocks while the kernel synchronously reclaims memory because of severe memory pressure.
Direct compaction: the process blocks while the kernel compacts fragmented memory into a contiguous region.
Both actions increase CPU usage and can cause long‑lasting latency spikes, leading to jitter in latency‑sensitive workloads.
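To see how close a node is to these reclaim paths, the kernel's per-zone watermarks can be read directly. Below is a minimal sketch, assuming a Linux host with the usual /proc/zoneinfo layout, that flags zones whose free pages have fallen to the low watermark (kswapd reclaiming) or the min watermark (direct reclaim likely).

```python
#!/usr/bin/env python3
"""Minimal sketch: compare each zone's free pages with its kernel watermarks
to see how close the node is to kswapd wakeup (low) and direct reclaim (min).
Assumes the common /proc/zoneinfo field layout ("pages free N", "min N", ...)."""

import re

def read_zone_watermarks(path="/proc/zoneinfo"):
    zones, current = [], None
    with open(path) as f:
        for line in f:
            m = re.match(r"Node (\d+), zone\s+(\S+)", line)
            if m:
                current = {"node": int(m.group(1)), "zone": m.group(2)}
                zones.append(current)
                continue
            if current is None:
                continue
            m = re.match(r"\s+(?:pages )?(free|min|low|high)\s+(\d+)", line)
            if m:
                current[m.group(1)] = int(m.group(2))
    return zones

if __name__ == "__main__":
    for z in read_zone_watermarks():
        if "free" not in z or "low" not in z:
            continue  # skip zones without full stats (e.g. empty DMA zones)
        status = "OK"
        if z["free"] <= z.get("min", 0):
            status = "below min watermark: direct reclaim likely"
        elif z["free"] <= z["low"]:
            status = "below low watermark: kswapd reclaiming"
        print(f'node{z["node"]}/{z["zone"]}: free={z["free"]} '
              f'min={z.get("min")} low={z["low"]} high={z.get("high")} -> {status}')
```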
Typical Delay Scenarios
CASE 1: Container memory limit reached → direct reclaim/compaction blocks the container process.
CASE 2: Host memory shortage → node memory below min watermark triggers direct reclaim for containers.
CASE 3: Long run‑queue wait time → processes stay in the ready queue too long before being scheduled.
CASE 4: Prolonged interrupt handling → heavy interrupt storms keep the CPU occupied, preventing timely scheduling.
CASE 5: Kernel path holding spin locks → long‑running kernel paths block soft‑IRQ processing, causing network jitter.
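As a quick check for the CASE 1 and CASE 2 conditions above, the sketch below compares a container cgroup's memory usage against its limit and the node's MemAvailable against MemTotal. It assumes a cgroup v1 layout under /sys/fs/cgroup/memory; the cgroup path is a placeholder that depends on your container runtime and pod, and the 90%/10% thresholds are only illustrative.

```python
#!/usr/bin/env python3
"""Minimal sketch: flag likely CASE 1 (container near its memory limit) and
CASE 2 (node short of free memory) conditions. cgroup v1 layout assumed."""

CGROUP = "/sys/fs/cgroup/memory/kubepods.slice"   # placeholder: point at the pod/container cgroup

def read_int(path):
    with open(path) as f:
        return int(f.read().strip())

def meminfo_kib(key):
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(key + ":"):
                return int(line.split()[1])       # values reported in KiB
    raise KeyError(key)

if __name__ == "__main__":
    usage = read_int(f"{CGROUP}/memory.usage_in_bytes")
    limit = read_int(f"{CGROUP}/memory.limit_in_bytes")
    if limit < (1 << 60):                          # "no limit" shows up as a huge sentinel
        pct = 100.0 * usage / limit
        print(f"cgroup usage: {usage}/{limit} bytes ({pct:.1f}% of limit)")
        if pct > 90:                               # illustrative threshold
            print("-> near the limit: in-cgroup direct reclaim/compaction is likely (CASE 1)")
    total = meminfo_kib("MemTotal")
    avail = meminfo_kib("MemAvailable")
    node_pct = 100.0 * avail / total
    print(f"node MemAvailable: {avail}/{total} KiB ({node_pct:.1f}%)")
    if node_pct < 10:                              # illustrative threshold
        print("-> node memory is tight: host-level direct reclaim can stall containers (CASE 2)")
```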
Identifying System Delays with SysOM
The ACK team collaborated with the OS team to launch SysOM (System Observer Monitoring), a kernel-level container monitoring feature available on Alibaba Cloud. The SysOM dashboards provide visibility into both node-level and pod-level metrics.
In the Pod Memory Monitor view, watch Memory Global Direct Reclaim Latency, Memory Direct Reclaim Latency, and Memory Compact Latency to see how long pods are blocked by direct reclaim or compaction.
In the System Memory node view, the Memory Others chart shows the page-scan count (pgscan_direct) during direct reclaim; a non-zero value indicates direct-reclaim activity on the node.
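The same pgscan_direct signal is also available outside the console via /proc/vmstat. The sketch below samples the direct-reclaim and direct-compaction counters over a short window; a positive delta means the node performed direct reclaim or compaction during that interval.

```python
#!/usr/bin/env python3
"""Minimal sketch: sample direct-reclaim/compaction counters from /proc/vmstat
and report their growth over a short window. A non-zero delta corresponds to
the pgscan_direct activity shown in the Memory Others chart."""

import time

KEYS = ("pgscan_direct", "pgsteal_direct", "compact_stall")   # compact_stall counts direct-compaction stalls

def sample():
    counters = {}
    with open("/proc/vmstat") as f:
        for line in f:
            name, value = line.split()
            if name in KEYS or name.startswith("pgscan_direct"):
                counters[name] = int(value)
    return counters

if __name__ == "__main__":
    before = sample()
    time.sleep(10)                  # sampling window in seconds
    after = sample()
    for name in sorted(after):
        delta = after[name] - before.get(name, 0)
        flag = "  <-- direct reclaim/compaction during the window" if delta > 0 else ""
        print(f"{name}: +{delta} in 10s{flag}")
```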
Metric Details
Memory Direct Reclaim Latency reports the incremental count of direct-reclaim events grouped by latency range (e.g., memDrcm_lat_1to10ms; the memDrcm_glb_* series such as memDrcm_glb_lat_10to100ms cover global, node-triggered reclaim), triggered when container memory usage hits its limit or node free memory drops below the min watermark.
Memory Compact Latency reflects the incremental count of compaction events caused by excessive node memory fragmentation.
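If these counters are scraped into Prometheus, their per-bucket growth can be queried and alerted on. The sketch below is illustrative only: the Prometheus address, the exact exposed metric name, and the pod label are assumptions based on the bucket names above, so check the exporter's /metrics output for the real names before relying on it.

```python
#!/usr/bin/env python3
"""Minimal sketch: query a Prometheus server for recent growth of one
direct-reclaim latency bucket. The URL, metric name, and `pod` label are
assumptions; adapt to what the SysOM exporter actually exposes."""

import json
import urllib.parse
import urllib.request

PROM_URL = "http://prometheus.example:9090"                         # assumed Prometheus address
# increase() assumes a cumulative counter; drop it if the exporter already
# reports per-interval increments.
QUERY = 'sum by (pod) (increase(memDrcm_glb_lat_10to100ms[5m]))'    # assumed metric name

def prom_query(expr):
    url = f"{PROM_URL}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)["data"]["result"]

if __name__ == "__main__":
    for series in prom_query(QUERY):
        pod = series["metric"].get("pod", "<node>")
        count = float(series["value"][1])
        if count > 0:
            print(f"{pod}: {count:.0f} global direct-reclaim events in the 10-100 ms bucket over 5m")
```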
Resolving Memory‑Related Delays
Use the Node/Pod Memory Panorama feature to break down memory consumption (Pod Cache, InactiveFile, InactiveAnon, Dirty Memory) and locate memory “black holes”.
Enable Koordinator QoS fine‑grained scheduling to adjust memory watermarks and trigger earlier asynchronous reclamation, reducing the impact of direct reclaim.
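For a single suspect pod, a similar breakdown can be pulled straight from the cgroup filesystem. The sketch below reads memory.stat (cgroup v1 assumed) and prints the cache, inactive-file, inactive-anon, and dirty components mentioned above; pass the pod's cgroup path as the first argument.

```python
#!/usr/bin/env python3
"""Minimal sketch: a Memory Panorama style breakdown for one cgroup, read from
memory.stat. cgroup v1 layout assumed; pass the pod/container cgroup path as
argv[1] (defaults to the root memory cgroup just so the script runs)."""

import sys

WANTED = ("total_cache", "total_inactive_file", "total_inactive_anon",
          "total_dirty", "total_rss")

if __name__ == "__main__":
    cgroup = sys.argv[1] if len(sys.argv) > 1 else "/sys/fs/cgroup/memory"
    stats = {}
    with open(f"{cgroup}/memory.stat") as f:
        for line in f:
            name, value = line.split()
            stats[name] = int(value)
    for key in WANTED:
        print(f"{key:22s} {stats.get(key, 0) / (1 << 20):10.1f} MiB")
    # A large total_inactive_file share is mostly reclaimable page cache;
    # a large total_dirty share means writeback pressure during reclaim.
```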
CPU Scheduling Delay
CPU delay is the interval from a task becoming runnable to being selected by the scheduler. Prolonged CPU delay can cause network‑level latency (e.g., delayed packet processing).
Monitor the System CPU and Schedule dashboard for metrics such as WaitOnRunq Delay (average time processes spend in the run‑queue) and Sched Delay Count (distribution of intervals with no scheduling activity). Spikes above 50 ms indicate serious scheduling jitter.
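To cross-check the WaitOnRunq Delay metric for one suspect process, the kernel exposes per-task run-queue wait time in /proc/&lt;pid&gt;/schedstat (on-CPU time, run-queue wait time, and timeslice count, all in the default schedstats configuration of common distributions). A minimal sketch:

```python
#!/usr/bin/env python3
"""Minimal sketch: measure how long one process sat in the run queue over a
short window using /proc/<pid>/schedstat (on-CPU ns, run-queue wait ns,
timeslice count)."""

import sys
import time

def schedstat(pid):
    with open(f"/proc/{pid}/schedstat") as f:
        on_cpu_ns, wait_ns, slices = (int(x) for x in f.read().split())
    return on_cpu_ns, wait_ns, slices

if __name__ == "__main__":
    pid = int(sys.argv[1]) if len(sys.argv) > 1 else 1   # default PID 1, purely for illustration
    window = 5.0
    _, wait0, slices0 = schedstat(pid)
    time.sleep(window)
    _, wait1, slices1 = schedstat(pid)
    wait_ms = (wait1 - wait0) / 1e6
    runs = max(slices1 - slices0, 1)
    print(f"pid {pid}: waited {wait_ms:.2f} ms on the run queue over {window:.0f}s "
          f"(~{wait_ms / runs:.3f} ms per scheduling, {runs} timeslices)")
```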
Case Study: CPU Delay Causing Network Jitter
A financial‑industry customer observed frequent Redis connection failures on two ACK nodes. Investigation revealed kernel packet‑receive latency > 500 ms, leading to Redis client disconnects.
Examining the Sched Delay Count chart showed many spikes above 1 ms, indicating prolonged intervals without scheduling activity during which ksoftirqd could not run.
OS console diagnostics displayed both scheduling jitter and cgroup leak anomalies.
Further analysis linked the issue to a memory cgroup leak: a cronjob periodically read log files, and after each run its cgroup was deleted while the page cache it had created kept the memory cgroup alive as a zombie.
Resolution steps:
Temporary: drop caches to free the page cache so that zombie cgroups can be released.
Permanent: enable Alinux’s zombie‑cgroup reclamation feature (see reference [5]).
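A quick way to confirm a zombie-cgroup buildup, and to apply the temporary mitigation, is sketched below. It compares the kernel's memory-cgroup count from /proc/cgroups with the cgroups still visible in the filesystem (cgroup v1 layout assumed); the optional --drop-caches flag performs the page-cache drop described above and requires root.

```python
#!/usr/bin/env python3
"""Minimal sketch: estimate the number of zombie (offline) memory cgroups by
comparing the kernel's memory-cgroup count with the cgroup directories still
visible under /sys/fs/cgroup/memory. cgroup v1 layout assumed."""

import os
import sys

def kernel_memcg_count():
    with open("/proc/cgroups") as f:
        for line in f:
            if line.startswith("memory"):
                # columns: subsys_name  hierarchy  num_cgroups  enabled
                return int(line.split()[2])
    raise RuntimeError("memory cgroup controller not found")

def visible_memcg_count(root="/sys/fs/cgroup/memory"):
    count = 0
    for _dirpath, dirnames, _filenames in os.walk(root):
        count += len(dirnames)
    return count + 1   # include the root cgroup itself

if __name__ == "__main__":
    kernel_count = kernel_memcg_count()
    visible = visible_memcg_count()
    print(f"memory cgroups known to kernel: {kernel_count}, visible in cgroupfs: {visible}")
    print(f"estimated zombie memory cgroups: {max(kernel_count - visible, 0)}")
    if "--drop-caches" in sys.argv:
        # Temporary mitigation from the case study: free clean page cache so
        # zombie memcgs that only pin cache pages can be released. Needs root.
        with open("/proc/sys/vm/drop_caches", "w") as f:
            f.write("3\n")
        print("dropped caches")
```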
Further Diagnosis Tools
For deeper root‑cause analysis of scheduling jitter, use the Scheduling Jitter Diagnosis feature in the Alibaba Cloud OS console.
References
[1] SysOM kernel-level container monitoring: https://help.aliyun.com/zh/ack/ack-managed-and-ack-dedicated/user-guide/sysom-kernel-level-container-monitoring
[2] Memory Panorama analysis: https://help.aliyun.com/zh/alinux/user-guide/memory-panorama-analysis-function-instructions
[3] Container memory QoS: https://help.aliyun.com/zh/ack/ack-managed-and-ack-dedicated/user-guide/memory-qos-for-containers
[4] Scheduling jitter diagnosis: https://help.aliyun.com/zh/alinux/user-guide/scheduling-jitter-diagnosis
[5] Alinux resource isolation guide: https://openanolis.cn/sig/Cloud-Kernel/doc/659601505054416682