How to Diagnose and Fix Memory & CPU Latency Issues in Cloud‑Native Kubernetes Clusters
This article explains why resource over‑commit in cloud‑native Kubernetes clusters leads to memory and CPU latency, shows how to visualize kernel delays with the ack‑sysom‑monitor exporter, outlines common latency scenarios, and provides step‑by‑step troubleshooting and remediation guidance.
Background
In cloud‑native environments, resource over‑commit and mixed deployment improve utilization but increase competition between host and containerized applications, leading to latency issues.
Memory Allocation Latency
Kernel‑level delays such as CPU scheduling latency and memory reclaim latency propagate to the application layer, causing response‑time jitter and business instability, especially for latency‑sensitive services.
Observability Challenge
Without sufficient observability data, engineers struggle to correlate application jitter with system‑level delays. This article demonstrates using the ack‑sysom‑monitor exporter in Kubernetes to visualize and pinpoint kernel latency.
Direct Memory Reclaim and Compact
When a process requests memory and free memory in the system or container falls below the low watermark, the kernel may perform direct memory reclaim or direct memory compaction in the context of the allocating process (as opposed to asynchronous background reclaim done by kswapd). Both are synchronous, so the process blocks until they complete, which can cause long delays.
Direct Memory Reclaim: The process blocks while the kernel synchronously reclaims memory.
Direct Memory Compaction: The process blocks while the kernel compacts fragmented memory.
Both actions increase CPU usage and can cause noticeable latency spikes.
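The stall counters in /proc/vmstat make these events visible: each increment of an allocstall_* counter is one direct‑reclaim stall, and compact_stall counts direct‑compaction stalls. A minimal sketch using hypothetical sample values (on a live Linux host, read /proc/vmstat directly):

```shell
# Hypothetical /proc/vmstat excerpt; on a live host, replace
# vmstat_sample with: cat /proc/vmstat
vmstat_sample() {
  cat <<'EOF'
allocstall_normal 1284
allocstall_movable 96
compact_stall 311
compact_fail 12
EOF
}

# Each allocstall_* increment is one direct-reclaim stall.
direct_reclaim_stalls=$(vmstat_sample | awk '/^allocstall/ {sum += $2} END {print sum}')
# compact_stall counts direct-compaction stalls.
compact_stalls=$(vmstat_sample | awk '$1 == "compact_stall" {print $2}')

echo "direct reclaim stalls: ${direct_reclaim_stalls}"
echo "direct compaction stalls: ${compact_stalls}"
```

Sampling these counters at intervals and diffing them shows whether allocation stalls coincide with application latency spikes.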
Typical Scenarios
Case 1: Container memory limit reached, triggering direct reclaim and compaction.
Case 2: Host memory low, causing containers to experience direct reclaim.
Case 3: Long ready‑queue wait times delay task scheduling.
Case 4: Prolonged interrupt handling blocks the CPU.
Case 5: A kernel path holds a spin lock, delaying soft‑irq processing.
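Run‑queue wait (Case 3) can be measured per process from /proc/&lt;pid&gt;/schedstat, whose three fields are time on CPU (ns), time waiting on the run queue (ns), and the number of timeslices. A sketch with sample values (the numbers are assumptions; read the real file for a live process):

```shell
# Sample content of /proc/<pid>/schedstat; on a live host, use e.g.:
#   schedstat_sample=$(cat /proc/self/schedstat)
schedstat_sample="1200000000 340000000 5123"

# Split the three fields: on-CPU ns, run-queue wait ns, timeslice count.
set -- $schedstat_sample
on_cpu_ns=$1; runq_wait_ns=$2; slices=$3

# Average run-queue wait per timeslice, in microseconds.
avg_wait_us=$(( runq_wait_ns / slices / 1000 ))
echo "total runq wait: ${runq_wait_ns} ns, avg per slice: ${avg_wait_us} us"
```

A persistently high average wait per timeslice means runnable tasks are queuing behind each other, matching the WaitOnRunq Delay metric on the dashboards.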
Identifying the Issue
Use the SysOM container system monitoring dashboards (Pod Memory Monitor, System Memory) to view metrics such as Memory Global Direct Reclaim Latency, Memory Direct Reclaim Latency, Memory Compact Latency, WaitOnRunq Delay, and Sched Delay Count.
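As a sketch of how such dashboard metrics can be screened programmatically, the snippet below filters a Prometheus‑style scrape for high latency values. The metric names and the 100 ms threshold are illustrative assumptions, not the exporter's actual series names:

```shell
# Hypothetical scrape of the ack-sysom-monitor exporter; the metric names
# below are assumptions modeled on the dashboard panel names.
scrape_sample() {
  cat <<'EOF'
sysom_memory_direct_reclaim_latency_ms{pod="redis-0"} 45
sysom_memory_compact_latency_ms{pod="redis-0"} 120
sysom_sched_delay_count{pod="redis-0"} 7
EOF
}

# Count latency series above a 100 ms threshold (threshold is illustrative).
high_count=$(scrape_sample | awk '/latency_ms/ && $NF > 100 {n++} END {print n+0}')
echo "series above threshold: ${high_count}"

# Print the offending series for inspection.
scrape_sample | awk '/latency_ms/ && $NF > 100 {print "HIGH:", $0}'
```

The same filter logic maps directly onto a Prometheus alert rule once the real series names are known.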
Resolution
Optimize memory usage, enable zombie cgroup reclamation, adjust memory watermarks with Koordinator QoS, and use the Alibaba Cloud OS console for detailed diagnosis.
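One common watermark adjustment is raising vm.min_free_kbytes, which gives background reclaim (kswapd) more headroom so allocations are less likely to fall into direct reclaim. The sketch below uses a roughly 1%‑of‑RAM heuristic, which is an assumption to tune per workload, not a universal recommendation:

```shell
# Sample MemTotal in kB; on a live host:
#   total_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
total_kb=16384000

# Heuristic (assumption): reserve about 1% of RAM as the minimum free pool.
suggested_kb=$(( total_kb / 100 ))
echo "suggested vm.min_free_kbytes: ${suggested_kb}"

# To apply on a live host (requires root; raises all zone watermarks):
#   sysctl -w vm.min_free_kbytes=${suggested_kb}
# Temporary mitigation only -- drop clean page/slab caches immediately:
#   echo 3 > /proc/sys/vm/drop_caches
```

Raising the watermark trades some usable memory for smoother reclaim behavior, so validate the value under realistic load before rolling it out fleet‑wide.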
Case Study
A financial client experienced Redis connection failures caused by kernel packet‑receive delay. By correlating Sched Delay Count spikes in the SysOM dashboards with OS console diagnostics, the root cause was identified as a memory cgroup leak: a cron job that read log files left zombie cgroups behind.
Temporary mitigation was to drop caches; the permanent fix was to enable Alinux zombie cgroup reclamation.
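A quick way to check for such a leak is the per‑controller cgroup count in /proc/cgroups: a memory cgroup count that keeps growing while the pod count stays flat suggests zombie cgroups are accumulating. A sketch with sample content (read the real file on a live host):

```shell
# Hypothetical /proc/cgroups content; on a live host, replace
# cgroups_sample with: cat /proc/cgroups
cgroups_sample() {
  cat <<'EOF'
#subsys_name hierarchy num_cgroups enabled
cpu 2 143 1
memory 4 18972 1
EOF
}

# num_cgroups for the memory controller; tens of thousands on a node
# running a few hundred containers is a strong hint of zombie cgroups.
memcg_count=$(cgroups_sample | awk '$1 == "memory" {print $3}')
echo "memory cgroups: ${memcg_count}"
```

Tracking this count over time, alongside pod churn, separates a genuine leak from normal container turnover.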
Alibaba Cloud Observability
Driving continuous progress in observability technology!