Why Does a Kubernetes Pod IP Disappear? The Hidden Second Sandbox Bug
UK8S's custom CNI plugin integrates with VPC networking to give containers native cloud network performance. A bug caused kubelet to create a second sandbox container for some Pods, which left the CNI DEL call without a NETNS parameter and leaked VPC IPs. This article walks through the investigation, the root-cause analysis, and the patch that fixed the issue.
Deeply Integrated VPC Network Solution
UK8S, UCloud’s Kubernetes container cloud product, fully supports the native Kubernetes API and provides a one‑stop cloud‑based Kubernetes service. The team developed a custom CNI (Container Network Interface) plugin that tightly integrates with UCloud’s VPC, giving container workloads the same network performance as cloud VMs (up to 10 Gb/s, 1 Mpps) and bridging containers with both public and physical clouds.
Advantages of the Solution
No overlay network, high performance: tests on 50 nodes show that packets-per-second throughput is only 3-5% lower than VM-to-VM traffic, and performance does not degrade as the cluster scales.
Pods can directly reach public‑cloud and physical‑cloud resources, eliminating the extra network hop required by Flannel host‑gw mode.
CNI Workflow
When a Pod is created, the CNI plugin requests a VPC IP, configures a VethPair, and sets up policy routing. Deleting a Pod triggers the CNI DEL operation to release the VPC IP. The creation and deletion processes are illustrated below.
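The ADD and DEL steps described above can be sketched roughly as follows. This is a minimal illustration, not the UK8S plugin: `FakeVpc`, `cni_add`, and `cni_del` are hypothetical names, the VPC API is replaced by an in-memory pool, and the veth pair and policy-routing steps are only recorded as strings rather than executed.

```python
from dataclasses import dataclass, field


@dataclass
class FakeVpc:
    """In-memory stand-in for the VPC API (an assumption for illustration)."""
    free_ips: list
    allocated: dict = field(default_factory=dict)  # container ID -> IP

    def allocate(self, container_id):
        ip = self.free_ips.pop(0)
        self.allocated[container_id] = ip
        return ip

    def release(self, container_id):
        ip = self.allocated.pop(container_id, None)
        if ip is not None:
            self.free_ips.append(ip)
        return ip


def cni_add(vpc, container_id, netns):
    """ADD: request a VPC IP, then (conceptually) set up the veth pair and routes."""
    ip = vpc.allocate(container_id)
    steps = [
        f"create veth pair into {netns}",
        f"assign {ip} to the container-side interface",
        f"add policy route for {ip}",
    ]
    return ip, steps


def cni_del(vpc, container_id):
    """DEL: release the VPC IP held by this container, if any."""
    return vpc.release(container_id)


vpc = FakeVpc(free_ips=["10.0.0.2", "10.0.0.3"])
ip, steps = cni_add(vpc, "sandbox-1", "/proc/1234/ns/net")
released = cni_del(vpc, "sandbox-1")
```

In this sketch every ADD is balanced by a DEL, so the pool returns to its original size. The leak discussed later in the article corresponds to a DEL that cannot find the IP it should release.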
Pod IP Disappearance Issue Investigation and Resolution
To stress‑test the CNI plugin, a CronJob was deployed that runs a Job every minute, creating 1 440 Pods per day. The CronJob definition is:
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  name: hello
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: hello
            image: busybox
            args:
            - /bin/sh
            - -c
            - date; echo Hello from the Kubernetes cluster
          restartPolicy: OnFailure
Each Pod creation triggers a VPC IP request, and each Pod deletion should release that IP. Ideally the numbers of requests and releases should both be 1 440 per day. However, measurements showed >2 500 IP requests and >1 800 releases, indicating leaked IPs.
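The arithmetic behind that observation can be written out as a small check. The article only gives lower bounds (more than 2 500 requests, more than 1 800 releases), so the round figures below are an assumption used for illustration, and `leak_report` is a hypothetical helper.

```python
EXPECTED_PER_DAY = 24 * 60  # one Job per minute gives 1440 Pods per day


def leak_report(requests, releases, expected=EXPECTED_PER_DAY):
    """Compare observed IP requests and releases with the expected daily count."""
    return {
        "expected": expected,
        "extra_requests": requests - expected,  # requests beyond one per Pod
        "unreleased": requests - releases,      # IPs requested but never released
    }


# Round figures taken from the reported lower bounds, not exact measurements.
report = leak_report(requests=2500, releases=1800)
```

With these figures there are about 1 060 more requests than Pods and roughly 700 IPs per day that are never released. The first number hints at extra sandboxes being created, the second at DEL calls that fail to release.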
Log analysis revealed that during the CNI DEL step the plugin often received an empty NETNS environment variable, preventing it from locating the IP to release. The root cause was that kubelet sometimes created a second “sandbox” (infra) container for a Pod. When the first sandbox was killed, kubelet’s sync loop incorrectly decided the Pod needed a new sandbox, even though the Job’s main container had already completed.
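To make the failure mode concrete, here is a minimal sketch of a DEL handler that depends on NETNS. The environment variable names (`CNI_COMMAND`, `CNI_CONTAINERID`, `CNI_NETNS`, `CNI_IFNAME`) come from the CNI specification. The lookup table keyed by network namespace path and the function names are assumptions, since the article does not show how the original plugin located the IP.

```python
def parse_cni_env(env):
    """Pick out the CNI runtime parameters passed as environment variables."""
    return {
        "command": env.get("CNI_COMMAND", ""),
        "container_id": env.get("CNI_CONTAINERID", ""),
        "netns": env.get("CNI_NETNS", ""),
        "ifname": env.get("CNI_IFNAME", ""),
    }


def ip_to_release(params, ip_by_netns):
    """Hypothetical DEL lookup that relies on NETNS to find the IP."""
    if params["command"] != "DEL":
        raise ValueError("expected a DEL invocation")
    if not params["netns"]:
        # Nothing to look up: the IP stays allocated, which is the leak.
        return None
    return ip_by_netns.get(params["netns"])


ip_by_netns = {"/proc/4242/ns/net": "10.0.0.5"}
normal = parse_cni_env({
    "CNI_COMMAND": "DEL",
    "CNI_CONTAINERID": "sandbox-1",
    "CNI_NETNS": "/proc/4242/ns/net",
    "CNI_IFNAME": "eth0",
})
broken = parse_cni_env({
    "CNI_COMMAND": "DEL",
    "CNI_CONTAINERID": "sandbox-2",
    "CNI_NETNS": "",
    "CNI_IFNAME": "eth0",
})
```

Here `ip_to_release(normal, ip_by_netns)` finds the address, while `ip_to_release(broken, ip_by_netns)` returns `None` because the only lookup key is empty.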
Further digging into kubelet (v1.13.1) source showed:
The syncLoop watches PLEG events and invokes SyncPod. SyncPod checks that exactly one sandbox container is present and ready; if not, it rebuilds the sandbox.
When the first sandbox entered the NOT READY state, kubelet rebuilt it without checking whether the Pod had already reached the Completed phase, which led to the creation of the unexpected second sandbox.
This second sandbox never received a proper network namespace because the housekeeping routine had already removed its cgroup, so the subsequent CNI DEL call received an empty NETNS, causing the IP leak.
The fix modifies SyncPod to skip sandbox recreation for Pods in the Completed phase. After recompiling and deploying the patched kubelet, the IP leak disappeared.
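The idea of the fix can be sketched as a before-and-after decision function. kubelet is written in Go and its real logic is more involved. The `PodState` type, its fields, and both function names below are hypothetical and only mirror the behavior described in the article.

```python
from dataclasses import dataclass


@dataclass
class PodState:
    """Simplified view of what the sync step knows about a Pod (hypothetical)."""
    sandbox_ready: bool
    completed: bool  # the Pod's containers have already finished their work


def needs_new_sandbox_unpatched(pod):
    """Behavior described for v1.13.1: any non-ready sandbox is rebuilt."""
    return not pod.sandbox_ready


def needs_new_sandbox_patched(pod):
    """Patched behavior: a completed Pod never gets a replacement sandbox."""
    if pod.completed:
        return False
    return not pod.sandbox_ready


finished_job_pod = PodState(sandbox_ready=False, completed=True)
running_pod = PodState(sandbox_ready=False, completed=False)
```

For `finished_job_pod` the unpatched check asks for a new sandbox and the patched check does not. A still-running Pod whose sandbox died is rebuilt in both versions, so normal recovery is unaffected.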
Additionally, the CNI plugin was enhanced to persist mapping information (Pod name, sandbox ID, namespace, VPC IP) after a successful ADD, allowing the DEL operation to locate and release the correct IP without relying on NETNS.
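One possible shape for such a mapping store is sketched below. The article does not say where or how the plugin persists its records, so the one-JSON-file-per-sandbox layout, the field names, and the helpers `save_mapping` and `pop_mapping` are all assumptions.

```python
import json
import os
import tempfile


def save_mapping(state_dir, pod_name, sandbox_id, namespace, vpc_ip):
    """After a successful ADD, persist the record keyed by sandbox ID."""
    record = {
        "pod_name": pod_name,
        "sandbox_id": sandbox_id,
        "namespace": namespace,
        "vpc_ip": vpc_ip,
    }
    path = os.path.join(state_dir, sandbox_id + ".json")
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(record, f)
    os.replace(tmp_path, path)  # rename so readers never see a partial file
    return path


def pop_mapping(state_dir, sandbox_id):
    """On DEL, load and remove the record; return None if there is none."""
    path = os.path.join(state_dir, sandbox_id + ".json")
    try:
        with open(path) as f:
            record = json.load(f)
    except FileNotFoundError:
        return None
    os.remove(path)
    return record


state_dir = tempfile.mkdtemp()
save_mapping(state_dir, "hello-123", "sandbox-abc", "default", "10.0.0.7")
record = pop_mapping(state_dir, "sandbox-abc")
```

Keying the record by sandbox ID means DEL only needs the container ID it is always given, so an empty NETNS no longer blocks the release. A repeated DEL for the same sandbox finds no record and can return without error.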
Conclusion
The investigation demonstrates how deep knowledge of Kubernetes internals—especially kubelet’s sync loop and sandbox handling—can uncover subtle bugs that affect production cloud‑native services. The UK8S team has submitted the kubelet patch upstream and plans further features such as autoscaling, GPU support, hybrid cloud, and Service Mesh.
UCloud Tech
UCloud is a leading neutral cloud provider in China, developing its own IaaS, PaaS, AI service platform, and big data exchange platform, and delivering comprehensive industry solutions for public, private, hybrid, and dedicated clouds.
