Case Study: Intermittent Container Timeout Issues – Analysis and Resolution
This article presents a detailed case study of intermittent container timeouts in a Kubernetes environment, examining kernel upgrades, NUMA configuration, CPU affinity binding, kubelet behavior, cAdvisor overhead, and hardware faults, and outlines the investigative steps and fixes applied.
The authors, technical experts from Ctrip's system R&D department, describe a series of intermittent timeout incidents affecting container workloads after a kernel upgrade to 4.14.67. Initial perf analysis showed increased max delay on the last four CPU cores of certain hosts, suggesting factors beyond the kernel.
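The perf workflow behind that observation can be sketched as follows; the 10-second capture window and the sort key are illustrative assumptions, not the team's exact invocation:

```shell
# Measure per-task scheduling delay with perf (requires the perf tool and
# usually root; guarded so the script is a no-op where perf is absent).
if command -v perf >/dev/null 2>&1; then
  perf sched record -- sleep 10    # capture 10s of scheduler events
  perf sched latency --sort max    # per-task latency table, sorted by max delay
else
  echo "perf not installed; skipping" >&2
fi
perf_checked=yes
```

The "Maximum delay" column of `perf sched latency` is what pointed at the last four cores of the affected hosts.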
NUMA and CPU Affinity Binding – The team discovered that the affected hosts had a different NUMA node layout, causing the Kubernetes processes to be bound to the last four cores, which crossed NUMA boundaries and introduced high scheduling delays. Removing the CPU binding eliminated the delay.
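A quick way to spot this kind of mismatch is to compare the host's NUMA layout against a process's allowed CPU set, both readable from sysfs and procfs. PID 1 below is a placeholder; on a real host you would inspect the kubelet or container process instead:

```shell
# Show which CPUs belong to each NUMA node (Linux sysfs).
echo "NUMA node cpulists:"
cat /sys/devices/system/node/node*/cpulist 2>/dev/null || echo "  (no NUMA info exposed)"

# Which CPUs is the process allowed to run on? A binding that straddles
# node boundaries is the misconfiguration described above.
pid=1   # placeholder PID; use the kubelet/container PID in practice
affinity=$(awk '/Cpus_allowed_list/ {print $2}' "/proc/$pid/status")
echo "PID $pid Cpus_allowed_list: $affinity"

# To clear a narrow binding, widen the mask, e.g.:
#   taskset -pc 0-63 <pid>    (taskset is part of util-linux)
```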
New Issue – Months later, similar timeouts reappeared without a clear pattern. Perf and turbostat indicated irregular scheduling delays and TSC frequency jumps. Systematic testing isolated the problem to kubelet behavior, which logged long housekeeping operations (>100 ms) due to frequent metric collection.
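Those long housekeeping operations surface in the kubelet log, so a simple scan narrows things down. The log path below is an assumption; on systemd-managed hosts the equivalent is `journalctl -u kubelet | grep -i housekeeping`:

```shell
# Scan a kubelet log for slow-housekeeping warnings (path is an assumption).
logfile="${KUBELET_LOG:-/var/log/kubelet.log}"
if [ -r "$logfile" ]; then
  grep -i 'housekeeping' "$logfile" | tail -n 20
else
  echo "no kubelet log at $logfile; set KUBELET_LOG to override" >&2
fi
```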
Investigation of the kubelet code revealed that cAdvisor's metric collection was the culprit; related GitHub issues confirmed cAdvisor's high CPU consumption. Clearing caches (echo 2 > /proc/sys/vm/drop_caches) temporarily reduced the symptoms, but the proper fix required lowering the metric collection frequency or isolating kubelet from user workloads.
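The two mitigations might look like the sketch below. The drop_caches command comes from the article; the 30s housekeeping interval is an illustrative value, not the team's setting:

```shell
# Temporary mitigation (root required): drop dentry/inode caches so
# cAdvisor's stat collection has less state to walk. Masks the symptom only.
if [ -w /proc/sys/vm/drop_caches ]; then
  sync
  echo 2 > /proc/sys/vm/drop_caches
else
  echo "need root to write /proc/sys/vm/drop_caches" >&2
fi

# Longer-term: collect metrics less often. kubelet exposes
# --housekeeping-interval (default 10s); 30s here is illustrative:
#   kubelet --housekeeping-interval=30s ...
mitigation=attempted
```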
Hardware Fault – Some timeout cases showed no scheduling delay but exhibited TSC instability. A lightweight TSC monitoring service was deployed, logging significant TSC jumps. Correlating these logs with incidents identified a batch of hosts from a specific vendor suffering hardware faults, which were resolved by a BIOS upgrade.
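A monitor in the same spirit can be sketched in a few lines. This is an assumption-laden proxy: the real service read the TSC directly, while this sketch samples nanosecond wall-clock time with date(1) and flags deltas far from the expected interval:

```shell
# Minimal clock-jump monitor sketch (proxy for the TSC monitor; hypothetical
# thresholds). Samples every ~100 ms and logs deltas that deviate wildly.
prev=$(date +%s%N)
jumps=0
for _ in 1 2 3; do
  sleep 0.1
  now=$(date +%s%N)
  delta_ms=$(( (now - prev) / 1000000 ))
  # expect ~100 ms between samples; a large deviation hints at a clock anomaly
  if [ "$delta_ms" -gt 500 ]; then
    jumps=$((jumps + 1))
    echo "clock jump: ${delta_ms} ms between samples"
  fi
  prev=$now
done
echo "samples done, jumps=$jumps"
```

Correlating such logs with incident timestamps is what isolated the faulty hardware batch.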
Conclusion – The two-part series documents the end-to-end troubleshooting of intermittent container timeouts, emphasizing hypothesis-driven investigation, NUMA awareness, restraint in metric collection, and hardware validation. The lessons reinforce the value of diligent, systematic analysis in complex cloud-native operations.
Ctrip Technology
Official Ctrip Technology account, sharing and discussing growth.