Case Study: Intermittent Container Timeout Issues – Analysis and Resolution
This article presents a detailed case study of intermittent container timeouts in a Kubernetes environment, examining kernel upgrades, NUMA configuration, CPU affinity binding, kubelet behavior, cAdvisor overhead, and hardware faults, and outlines the investigative steps and fixes applied.
The authors, technical experts from Ctrip's system R&D department, describe a series of intermittent timeout incidents affecting container workloads after a kernel upgrade to 4.14.67. Initial perf analysis showed increased max delay on the last four CPU cores of certain hosts, suggesting factors beyond the kernel.
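The perf workflow behind that observation can be sketched as follows; the 10-second capture window and the sort key are illustrative assumptions, not the team's exact invocation:

```shell
# Measure per-task scheduling delay with perf (requires the perf tool and
# usually root; guarded so the script is a no-op where perf is absent).
if command -v perf >/dev/null 2>&1; then
  perf sched record -- sleep 10    # capture 10s of scheduler events
  perf sched latency --sort max    # per-task latency table, sorted by max delay
else
  echo "perf not installed; skipping" >&2
fi
perf_checked=yes
```

The "Maximum delay" column of `perf sched latency` is what pointed at the last four cores of the affected hosts.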
NUMA and CPU Affinity Binding – The team discovered that the affected hosts had a different NUMA node layout, causing the Kubernetes processes to be bound to the last four cores, which crossed NUMA boundaries and introduced high scheduling delays. Removing the CPU binding eliminated the delay.
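A quick way to spot this kind of mismatch is to compare the host's NUMA layout against a process's allowed CPU set, both readable from sysfs and procfs. PID 1 below is a placeholder; on a real host you would inspect the kubelet or container process instead:

```shell
# Show which CPUs belong to each NUMA node (Linux sysfs).
echo "NUMA node cpulists:"
cat /sys/devices/system/node/node*/cpulist 2>/dev/null || echo "  (no NUMA info exposed)"

# Which CPUs is the process allowed to run on? A binding that straddles
# node boundaries is the misconfiguration described above.
pid=1   # placeholder PID; use the kubelet/container PID in practice
affinity=$(awk '/Cpus_allowed_list/ {print $2}' "/proc/$pid/status")
echo "PID $pid Cpus_allowed_list: $affinity"

# To clear a narrow binding, widen the mask, e.g.:
#   taskset -pc 0-63 <pid>    (taskset is part of util-linux)
```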
New Issue – Months later, similar timeouts reappeared without a clear pattern. Perf and turbostat indicated irregular scheduling delays and TSC frequency jumps. Systematic testing isolated the problem to kubelet behavior, which logged long housekeeping operations (>100 ms) due to frequent metric collection.
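Those long housekeeping operations surface in the kubelet log, so a simple scan narrows things down. The log path below is an assumption; on systemd-managed hosts the equivalent is `journalctl -u kubelet | grep -i housekeeping`:

```shell
# Scan a kubelet log for slow-housekeeping warnings (path is an assumption).
logfile="${KUBELET_LOG:-/var/log/kubelet.log}"
if [ -r "$logfile" ]; then
  grep -i 'housekeeping' "$logfile" | tail -n 20
else
  echo "no kubelet log at $logfile; set KUBELET_LOG to override" >&2
fi
```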
Investigation of the kubelet code revealed that cAdvisor's metric collection was the culprit; related GitHub issues confirmed cAdvisor's high CPU consumption. Clearing caches (echo 2 > /proc/sys/vm/drop_caches) temporarily reduced the symptoms, but the proper fix required lowering the metric collection frequency or isolating kubelet from user workloads.
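The two mitigations might look like the sketch below. The drop_caches command comes from the article; the 30s housekeeping interval is an illustrative value, not the team's setting:

```shell
# Temporary mitigation (root required): drop dentry/inode caches so
# cAdvisor's stat collection has less state to walk. Masks the symptom only.
if [ -w /proc/sys/vm/drop_caches ]; then
  sync
  echo 2 > /proc/sys/vm/drop_caches
else
  echo "need root to write /proc/sys/vm/drop_caches" >&2
fi

# Longer-term: collect metrics less often. kubelet exposes
# --housekeeping-interval (default 10s); 30s here is illustrative:
#   kubelet --housekeeping-interval=30s ...
mitigation=attempted
```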
Hardware Fault – Some timeout cases showed no scheduling delay but exhibited TSC instability. A lightweight TSC monitoring service was deployed, logging significant TSC jumps. Correlating these logs with incidents identified a batch of hosts from a specific vendor suffering hardware faults, which were resolved by a BIOS upgrade.
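A monitor in the same spirit can be sketched in a few lines. This is an assumption-laden proxy: the real service read the TSC directly, while this sketch samples nanosecond wall-clock time with date(1) and flags deltas far from the expected interval:

```shell
# Minimal clock-jump monitor sketch (proxy for the TSC monitor; hypothetical
# thresholds). Samples every ~100 ms and logs deltas that deviate wildly.
prev=$(date +%s%N)
jumps=0
for _ in 1 2 3; do
  sleep 0.1
  now=$(date +%s%N)
  delta_ms=$(( (now - prev) / 1000000 ))
  # expect ~100 ms between samples; a large deviation hints at a clock anomaly
  if [ "$delta_ms" -gt 500 ]; then
    jumps=$((jumps + 1))
    echo "clock jump: ${delta_ms} ms between samples"
  fi
  prev=$now
done
echo "samples done, jumps=$jumps"
```

Correlating such logs with incident timestamps is what isolated the faulty hardware batch.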
Conclusion – The two-part series documents the end-to-end troubleshooting of intermittent container timeouts, emphasizing hypothesis-driven investigation, NUMA awareness, restraint in metric collection, and hardware validation. The lessons reinforce the value of diligent, systematic analysis in complex cloud-native operations.
Ctrip Technology
Official Ctrip Technology account, sharing and discussing growth.