How to Diagnose and Fix Memory & CPU Latency Issues in Cloud‑Native Kubernetes Clusters
This article explains why resource over‑commit in cloud‑native Kubernetes clusters leads to memory and CPU latency, shows how to visualize kernel delays with the ack‑sysom‑monitor exporter, outlines common latency scenarios, and provides step‑by‑step troubleshooting and remediation guidance.
Background
In cloud‑native environments, resource over‑commit and mixed deployment improve utilization but increase competition between host and containerized applications, leading to latency issues.
Memory Allocation Latency
Kernel‑level delays such as CPU scheduling latency and memory reclaim latency propagate to the application layer, causing response‑time jitter and business instability, especially for latency‑sensitive services.
Observability Challenge
Without sufficient observability data, engineers struggle to correlate application jitter with system‑level delays. This article demonstrates using the ack‑sysom‑monitor exporter in Kubernetes to visualize and pinpoint kernel latency.
Direct Memory Reclaim and Compact
When a process requests memory and free memory in the system or container falls below the low watermark, the kernel may perform direct memory reclaim or direct memory compaction in the context of the allocating process (as opposed to asynchronous background reclaim done by kswapd). Both are synchronous, so the process blocks until they complete, which can cause long delays.
Direct Memory Reclaim: The process blocks while the kernel synchronously reclaims memory.
Direct Memory Compaction: The process blocks while the kernel compacts fragmented memory.
Both actions increase CPU usage and can cause noticeable latency spikes.
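The stall counters in /proc/vmstat make these events visible: each increment of an allocstall_* counter is one direct‑reclaim stall, and compact_stall counts direct‑compaction stalls. A minimal sketch using hypothetical sample values (on a live Linux host, read /proc/vmstat directly):

```shell
# Hypothetical /proc/vmstat excerpt; on a live host, replace
# vmstat_sample with: cat /proc/vmstat
vmstat_sample() {
  cat <<'EOF'
allocstall_normal 1284
allocstall_movable 96
compact_stall 311
compact_fail 12
EOF
}

# Each allocstall_* increment is one direct-reclaim stall.
direct_reclaim_stalls=$(vmstat_sample | awk '/^allocstall/ {sum += $2} END {print sum}')
# compact_stall counts direct-compaction stalls.
compact_stalls=$(vmstat_sample | awk '$1 == "compact_stall" {print $2}')

echo "direct reclaim stalls: ${direct_reclaim_stalls}"
echo "direct compaction stalls: ${compact_stalls}"
```

Sampling these counters at intervals and diffing them shows whether allocation stalls coincide with application latency spikes.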
Typical Scenarios
Case 1: Container memory limit reached, triggering direct reclaim and compaction.
Case 2: Host memory low, causing containers to experience direct reclaim.
Case 3: Long ready‑queue wait times delay task scheduling.
Case 4: Prolonged interrupt handling blocks the CPU.
Case 5: A kernel path holds a spin lock, delaying soft‑irq processing.
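Run‑queue wait (Case 3) can be measured per process from /proc/&lt;pid&gt;/schedstat, whose three fields are time on CPU (ns), time waiting on the run queue (ns), and the number of timeslices. A sketch with sample values (the numbers are assumptions; read the real file for a live process):

```shell
# Sample content of /proc/<pid>/schedstat; on a live host, use e.g.:
#   schedstat_sample=$(cat /proc/self/schedstat)
schedstat_sample="1200000000 340000000 5123"

# Split the three fields: on-CPU ns, run-queue wait ns, timeslice count.
set -- $schedstat_sample
on_cpu_ns=$1; runq_wait_ns=$2; slices=$3

# Average run-queue wait per timeslice, in microseconds.
avg_wait_us=$(( runq_wait_ns / slices / 1000 ))
echo "total runq wait: ${runq_wait_ns} ns, avg per slice: ${avg_wait_us} us"
```

A persistently high average wait per timeslice means runnable tasks are queuing behind each other, matching the WaitOnRunq Delay metric on the dashboards.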
Identifying the Issue
Use the SysOM container system monitoring dashboards (Pod Memory Monitor, System Memory) to view metrics such as Memory Global Direct Reclaim Latency, Memory Direct Reclaim Latency, Memory Compact Latency, WaitOnRunq Delay, and Sched Delay Count.
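As a sketch of how such dashboard metrics can be screened programmatically, the snippet below filters a Prometheus‑style scrape for high latency values. The metric names and the 100 ms threshold are illustrative assumptions, not the exporter's actual series names:

```shell
# Hypothetical scrape of the ack-sysom-monitor exporter; the metric names
# below are assumptions modeled on the dashboard panel names.
scrape_sample() {
  cat <<'EOF'
sysom_memory_direct_reclaim_latency_ms{pod="redis-0"} 45
sysom_memory_compact_latency_ms{pod="redis-0"} 120
sysom_sched_delay_count{pod="redis-0"} 7
EOF
}

# Count latency series above a 100 ms threshold (threshold is illustrative).
high_count=$(scrape_sample | awk '/latency_ms/ && $NF > 100 {n++} END {print n+0}')
echo "series above threshold: ${high_count}"

# Print the offending series for inspection.
scrape_sample | awk '/latency_ms/ && $NF > 100 {print "HIGH:", $0}'
```

The same filter logic maps directly onto a Prometheus alert rule once the real series names are known.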
Resolution
Optimize memory usage, enable zombie cgroup reclamation, adjust memory watermarks with Koordinator QoS, and use the Alibaba Cloud OS console for detailed diagnosis.
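One common watermark adjustment is raising vm.min_free_kbytes, which gives background reclaim (kswapd) more headroom so allocations are less likely to fall into direct reclaim. The sketch below uses a roughly 1%‑of‑RAM heuristic, which is an assumption to tune per workload, not a universal recommendation:

```shell
# Sample MemTotal in kB; on a live host:
#   total_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
total_kb=16384000

# Heuristic (assumption): reserve about 1% of RAM as the minimum free pool.
suggested_kb=$(( total_kb / 100 ))
echo "suggested vm.min_free_kbytes: ${suggested_kb}"

# To apply on a live host (requires root; raises all zone watermarks):
#   sysctl -w vm.min_free_kbytes=${suggested_kb}
# Temporary mitigation only -- drop clean page/slab caches immediately:
#   echo 3 > /proc/sys/vm/drop_caches
```

Raising the watermark trades some usable memory for smoother reclaim behavior, so validate the value under realistic load before rolling it out fleet‑wide.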
Case Study
A financial client experienced Redis connection failures caused by kernel packet‑receive delay. By correlating Sched Delay Count spikes in the SysOM dashboards with OS console diagnostics, the root cause was identified as a memory cgroup leak: a cron job that read log files left zombie cgroups behind.
Temporary mitigation was to drop caches; the permanent fix was to enable Alinux zombie cgroup reclamation.
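A quick way to check for such a leak is the per‑controller cgroup count in /proc/cgroups: a memory cgroup count that keeps growing while the pod count stays flat suggests zombie cgroups are accumulating. A sketch with sample content (read the real file on a live host):

```shell
# Hypothetical /proc/cgroups content; on a live host, replace
# cgroups_sample with: cat /proc/cgroups
cgroups_sample() {
  cat <<'EOF'
#subsys_name hierarchy num_cgroups enabled
cpu 2 143 1
memory 4 18972 1
EOF
}

# num_cgroups for the memory controller; tens of thousands on a node
# running a few hundred containers is a strong hint of zombie cgroups.
memcg_count=$(cgroups_sample | awk '$1 == "memory" {print $3}')
echo "memory cgroups: ${memcg_count}"
```

Tracking this count over time, alongside pod churn, separates a genuine leak from normal container turnover.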
Alibaba Cloud Observability
Driving continuous progress in observability technology!