Uncovering the Root Causes of ACK Cluster Network Latency: kubelet, softirq, and cgroup Insights
A detailed post‑mortem explains how excessive cgroup files, kubelet's sys‑CPU usage, soft‑interrupt scheduling delays, and a buggy page‑free routine caused intermittent hundreds‑of‑milliseconds network latency in an Alibaba Cloud ACK cluster, and how targeted CPU binding and kernel patches resolved the issue.
A Kubernetes cluster on Alibaba Cloud (ACK) exhibited intermittent network latency of several hundred milliseconds during inter‑container RPC calls, with occasional spikes up to 2 seconds. Packet captures showed normal packets and retransmissions arriving within a 400 ms window, suggesting delayed processing in the kernel’s soft‑interrupt (ksoftirqd) path.
Initial Diagnosis
CPU profiling identified unusually high sys time for the kubelet process. Further inspection with net‑exporter and net_softirq revealed that kubelet repeatedly opened cgroup files – more than 100 000 opens in a 10‑second interval – reading paths such as:
/sys/fs/cgroup/cpuset/kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod.../docker-.../uid_0/pid_673/cpuset.cpus
Compared with a healthy cluster (≈3 k cgroup entries), the problematic cluster contained roughly 182 k entries. The excess hierarchy originated from Android‑based containers that create additional uid / pid layers.
Impact of kubelet
Profiling showed that kubelet’s massive syscalls while reading these cgroup files monopolized CPU in kernel mode, starving ksoftirqd and delaying packet‑receive processing.
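One way to see which CPUs handle packet-receive soft interrupts is to read the NET_RX row of /proc/softirqs, which lists one counter per CPU. The parser below is an illustrative sketch only; the article does not say the team used this file, and parse_softirq_line is a hypothetical helper.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Parse one row of /proc/softirqs, e.g. "      NET_RX:   10   20   30".
 * If the row is labelled `name`, store up to `max_cpus` per-CPU counters in
 * `counts` and return how many were stored. Return -1 for any other row. */
int parse_softirq_line(const char *line, const char *name,
                       unsigned long long *counts, int max_cpus)
{
    assert(line != NULL && name != NULL && counts != NULL);

    while (isspace((unsigned char)*line))
        line++; /* rows are right-aligned, so skip the leading padding */

    size_t name_len = strlen(name);
    if (strncmp(line, name, name_len) != 0 || line[name_len] != ':')
        return -1;

    const char *p = line + name_len + 1;
    int n = 0;
    while (n < max_cpus) {
        char *end;
        unsigned long long value = strtoull(p, &end, 10);
        if (end == p)
            break; /* no further numbers on this row */
        counts[n++] = value;
        p = end;
    }
    return n;
}
```

Sampling the row twice and subtracting the counters shows which CPUs are actually processing NET_RX work during the interval.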
First Mitigation
Binding kubelet to CPUs that are isolated from the network‑receive interrupt (net_rx) CPUs reduced the frequency of >100 ms delays, but occasional spikes persisted.
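The binding step amounts to computing the set of CPUs that do not service net_rx interrupts and restricting kubelet to that set. The sketch below only builds the CPU list string in the range format accepted by tools such as taskset; the helper names and the example CPU numbers are hypothetical, and how the team actually applied the binding is not stated in the article.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Return true if `cpu` appears in the `excluded` array. */
static bool is_excluded(int cpu, const int *excluded, int n_excluded)
{
    for (int i = 0; i < n_excluded; i++)
        if (excluded[i] == cpu)
            return true;
    return false;
}

/* Write a CPU list such as "1-3,5-7" covering CPUs 0..n_cpus-1 minus the
 * excluded ones (assumed here to be the CPUs handling net_rx interrupts).
 * Returns the number of CPUs in the list, or -1 if `out` is too small. */
int build_cpu_list(int n_cpus, const int *excluded, int n_excluded,
                   char *out, size_t out_len)
{
    assert(out != NULL && out_len > 0);

    out[0] = '\0';
    size_t used = 0;
    int kept = 0;

    for (int cpu = 0; cpu < n_cpus; ) {
        if (is_excluded(cpu, excluded, n_excluded)) {
            cpu++;
            continue;
        }
        /* Extend the current run of allowed CPUs as far as possible. */
        int start = cpu;
        while (cpu < n_cpus && !is_excluded(cpu, excluded, n_excluded))
            cpu++;
        int end = cpu - 1;
        kept += end - start + 1;

        const char *sep = used ? "," : "";
        int n = (start == end)
            ? snprintf(out + used, out_len - used, "%s%d", sep, start)
            : snprintf(out + used, out_len - used, "%s%d-%d", sep, start, end);
        if (n < 0 || (size_t)n >= out_len - used)
            return -1;
        used += (size_t)n;
    }
    return kept;
}
```

For an 8-CPU machine with net_rx on CPUs 0 and 4, the function yields "1-3,5-7". That string could then be applied to the kubelet process with taskset -cp, or the equivalent mask passed to sched_setaffinity(2).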
Deeper Kernel Investigation
Using the sysAK nosched tool, the team observed that ksoftirqd itself sometimes ran for extended periods, independent of kubelet. The latency exhibited two regular patterns:
Execution was confined to a single CPU core (initially CPU 0, later CPU 24 after isolation).
Occurrences repeated roughly every 3 hours 10 minutes, regardless of traffic load.
Applying sysAK irqoff uncovered a user‑space process (icr_encoder) that held zone->lock for a long time during page‑free operations. The process invoked pagetypeinfo_showfree_print (exposed via /proc/pagetypeinfo), which traverses all pages while holding the zone lock.
static void __free_pages_ok(struct page *page, unsigned int order) {
    unsigned long flags;
    int migratetype;
    unsigned long pfn = page_to_pfn(page);

    if (!free_pages_prepare(page, order, true))
        return;

    migratetype = get_pfnblock_migratetype(page, pfn);
    /* Interrupts stay disabled on this CPU until the page is freed. */
    local_irq_save(flags);
    __count_vm_events(PGFREE, 1 << order);
    free_one_page(page_zone(page), page, pfn, order, migratetype);
    local_irq_restore(flags);
}

static void free_one_page(struct zone *zone, struct page *page,
                          unsigned long pfn, unsigned int order, int migratetype) {
    /* Spins here while any other holder of zone->lock is still running. */
    spin_lock(&zone->lock);
    if (unlikely(has_isolate_pageblock(zone) ||
                 is_migrate_isolate(migratetype))) {
        migratetype = get_pfnblock_migratetype(page, pfn);
    }
    __free_one_page(page, pfn, zone, order, migratetype, true);
    spin_unlock(&zone->lock);
}

When the number of pages in a zone exceeds ~100 k, the lock is held long enough to block other kernel activities, including ksoftirqd, leading to the observed latency.
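A cheap way to gauge how much free-list state a zone carries is /proc/buddyinfo, which reports the number of free blocks per order for each zone. The sketch below parses one line of that file and totals the blocks and pages. It is an illustration under assumptions: the struct and function names are hypothetical, and the article does not state that the team used this file.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Free-memory summary for one zone, as derived from /proc/buddyinfo. */
struct zone_free {
    int node;                  /* NUMA node number */
    char name[32];             /* zone name, e.g. "Normal" */
    unsigned long long blocks; /* free blocks summed over all orders */
    unsigned long long pages;  /* free pages: blocks weighted by 2^order */
};

/* Parse a line such as "Node 0, zone   Normal   10 5 2 1".
 * Column i after the zone name is the count of free blocks of order i.
 * Returns the number of orders parsed, or -1 if the line is malformed. */
int parse_buddyinfo_line(const char *line, struct zone_free *out)
{
    assert(line != NULL && out != NULL);

    int consumed = 0;
    if (sscanf(line, "Node %d, zone %31s%n",
               &out->node, out->name, &consumed) < 2)
        return -1;

    const char *p = line + consumed;
    out->blocks = 0;
    out->pages = 0;

    int order = 0;
    for (; order < 32; order++) {
        char *end;
        unsigned long long n = strtoull(p, &end, 10);
        if (end == p)
            break; /* no more per-order columns */
        out->blocks += n;
        out->pages += n << order;
        p = end;
    }
    return order;
}
```

For the sample line in the comment, the result is 18 free blocks and 36 free pages. Tracking these totals over time would show whether latency spikes line up with zones that hold a very large number of free blocks.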
Additional Contributing Factors
Older ipvs versions contained a buggy estimation_timer that caused sporadic stalls. The issue was fixed upstream (commit https://github.com/alibaba/cloud-kernel/commit/265287e4c2d2ca3eedd0d3c7c91f575225afd70f).
NUMA‑related page‑migration delays also contributed to occasional stalls.
Final Resolutions
Bind kubelet to CPUs that are separate from the CPUs handling net_rx interrupts, ensuring that kubelet’s heavy syscalls do not compete with ksoftirqd.
Deploy a periodic cron job to defragment memory, reducing the number of pages that need to be traversed during pagetypeinfo_showfree_print.
Upgrade ipvs to a version without the estimation_timer bug and tune NUMA page‑migration settings.
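The article does not say which mechanism the cron job uses. One possibility, shown here purely as an assumption, is the kernel's /proc/sys/vm/compact_memory trigger, which requests compaction of all zones when "1" is written to it (root privileges and a kernel built with compaction support are required). The helper names below are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Write `value` to the file at `path`. Returns 0 on success, -1 on failure. */
int write_proc_value(const char *path, const char *value)
{
    assert(path != NULL && value != NULL);

    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;

    int ok = fputs(value, f) >= 0;
    if (fclose(f) != 0)
        ok = 0; /* the write may only be reported as failed at close time */
    return ok ? 0 : -1;
}

/* Assumed mechanism: ask the kernel to compact memory in all zones. */
int trigger_compaction(void)
{
    return write_proc_value("/proc/sys/vm/compact_memory", "1");
}
```

A cron entry could run a small program that calls trigger_compaction() at a quiet time of day. Compaction itself consumes CPU, so the interval would need tuning against the workload.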
After applying these changes, latency events became rare and the cluster met delivery‑grade performance targets. The case demonstrates how cgroup proliferation from specialized containers, excessive kubelet syscalls, and page‑free lock contention can combine to produce severe network latency in cloud‑native environments.
Alibaba Cloud Native
We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.