Why Kubelet Fails to Reach the API Server: LB Issues and HTTP/2 Pitfalls
This article analyzes a Kubernetes cluster outage caused by a load‑balancer failure that prevented kubelet from connecting to the API server, explores the underlying HTTP/2 behavior, and presents debugging steps and code‑level fixes to restore reliable communication.
Background
Kubernetes uses a master‑slave architecture where the master node runs the API server, the central component handling all requests and persisting state to etcd. High availability is typically achieved by deploying multiple API server instances behind a load balancer (LB). If the LB fails, all nodes may become NotReady, triggering massive pod eviction.
Fault Occurrence
During an incident, many nodes reported NotReady. The kubelet logs showed repeated errors such as:
E0415 17:03:11.351872 16624 kubelet_node_status.go:374] Error updating node status, will retry: error getting node "k8s-slave88": Get https://10.13.10.12:6443/api/v1/nodes/k8s-slave88?resourceVersion=0&timeout=5s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
The IP 10.13.10.12 is the LB address, indicating kubelet could not reach the API server despite successful telnet tests.
Diagnosis
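A tcpdump capture showed that kubelet kept sending packets toward the API server but received no ACKs. Restarting kubelet temporarily resolved the issue, which pointed to a stale existing connection rather than a reachability problem: a restart, like a telnet test, opens a fresh connection.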
The root cause was traced to the LB: when a new LB instance took over traffic, it dropped connections it could not map, causing kubelet to hang.
Hard Fix
Investigation of client-go showed that the default transport uses HTTP/2. Setting the environment variable DISABLE_HTTP2 to any non-empty value forces HTTP/1.1, which avoids the hang because HTTP/1.1 reuses idle connections or creates new ones on failure.
HTTP/2 keeps a single connection per host; if that connection stalls, the client may wait minutes before the OS closes it. HTTP/1.1, by contrast, opens new connections when needed, allowing quicker recovery.
Implementation Details
Key code snippets:
// SetTransportDefaults applies defaults and optionally enables HTTP/2
func SetTransportDefaults(t *http.Transport) *http.Transport {
	t = SetOldTransportDefaults(t)
	// Any non-empty DISABLE_HTTP2 value skips the HTTP/2 setup.
	if s := os.Getenv("DISABLE_HTTP2"); len(s) > 0 {
		klog.Infof("HTTP2 has been explicitly disabled")
	} else {
		if err := http2.ConfigureTransport(t); err != nil {
			klog.Warningf("Transport failed http2 configuration: %v", err)
		}
	}
	return t
}
Because the standard transport does not send HTTP/2 PING frames, stale connections are not detected promptly. A community PR added Ping‑frame support, which was later back‑ported to Kubernetes 1.14.
Turning Point
Testing on Kubernetes v1.10.11 showed the issue disappeared, indicating it was a regression in v1.10.2 that was later fixed. The final fix involved restoring the closeAllConns function to forcefully close all connections on error, which was merged upstream.
References
https://github.com/kubernetes/kubernetes/issues/41916
https://github.com/kubernetes/kubernetes/issues/48638
https://github.com/kubernetes-incubator/kube-aws/issues/598
https://github.com/kubernetes/client-go/issues/374
https://github.com/kubernetes/kubernetes/pull/63492
https://github.com/kubernetes/kubernetes/pull/71174
MaGe Linux Operations
