Kubernetes Networking Unpacked: How a Service Timeout Reveals iptables‑CNI Collaboration
A real‑world Service timeout in a high‑traffic e‑commerce cluster exposed a saturated conntrack table, prompting a step‑by‑step dissection of Pods, Services, iptables, conntrack, CNI plugins, DNS and NetworkPolicy, and culminating in concrete production‑grade remediation tactics.
1. Incident Trigger – Service Timeout
During a flash‑sale load test (~30 k QPS) the order-service calling user-service began to see RPC timeouts of 1‑5 seconds. CPU, memory, GC and DB pools were normal, but some nodes showed severe latency while others were fine. Initial guesses (thread‑pool exhaustion, small Dubbo/Feign pools, CoreDNS jitter, Istio sidecar overhead) proved wrong.
Running conntrack -S and inspecting /proc/sys/net/netfilter/nf_conntrack_count revealed a continuously growing insert_failed counter, a rising drop counter, and nf_conntrack_count approaching nf_conntrack_max. The conclusion: the node’s connection‑tracking table was full, preventing new NAT state creation and causing random timeouts and retry storms.
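A minimal node-side check makes the saturation visible; the commands below assume root on the affected node with the conntrack CLI installed:
# usage ratio of the connection-tracking table
awk -v c="$(cat /proc/sys/net/netfilter/nf_conntrack_count)" \
    -v m="$(cat /proc/sys/net/netfilter/nf_conntrack_max)" \
    'BEGIN { printf "conntrack usage: %d / %d (%.1f%%)\n", c, m, 100*c/m }'
# per-CPU counters; steadily growing insert_failed and drop confirm table pressure
conntrack -S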
2. Building the Full‑Picture – Who Does What in K8s Networking
The common simplification “Pod network is CNI, Service is kube‑proxy” hides many moving parts. In production the responsibilities are:
Linux Network Namespace : isolates each Pod’s network stack; it explains why every Pod can have its own independent eth0.
veth pair : bridges the Pod namespace and the host namespace; it answers how packets leave the container.
Bridge / routing table : performs host‑side forwarding and routing; it decides whether traffic stays on‑node or goes cross‑node.
CNI plugin : allocates Pod IPs and injects routes, tunnels or policies (e.g., Calico, Flannel, Cilium) to make Pods reachable.
kube‑proxy : installs iptables/IPVS/eBPF rules that translate a Service’s virtual IP to backend Pod IPs (ClusterIP/NodePort).
netfilter/iptables : executes NAT and filtering at kernel hooks (PREROUTING, INPUT, FORWARD, OUTPUT, POSTROUTING).
conntrack : records each connection’s NAT mapping so that return packets can be reverse‑translated; when the table fills, timeouts appear.
CoreDNS : resolves Service names to ClusterIP; latency here can make the first request appear slow.
NetworkPolicy : L3/L4 access control that can silently reject traffic.
In short, K8s networking is a data‑path composed of multiple layers, not a single component.
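To make these layers tangible, here is one way to trace a Pod’s eth0 back to the host-side veth peer that the CNI created. This is a sketch for a containerd node with crictl and nsenter available; <container-id> is a placeholder.
# 1. Get the container's host PID
PID=$(crictl inspect --output go-template --template '{{.info.pid}}' <container-id>)
# 2. Inside the Pod's network namespace, eth0 reports its peer's host ifindex as "eth0@ifN"
nsenter -t "$PID" -n ip -o link show eth0
# 3. Match that index against host interfaces to find the veth created for this Pod
ip -o link | grep '^N:'   # replace N with the index from step 2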
3. Netfilter, iptables and conntrack – How They Work Together
Linux provides five netfilter hook points: PREROUTING, INPUT, FORWARD, OUTPUT, POSTROUTING. iptables is a user‑space tool that writes rules into the netfilter tables. The two tables most relevant to K8s are:
nat : performs DNAT/SNAT for Service translation and outbound masquerading.
filter : implements access control, often driven by NetworkPolicy.
When a Service request arrives, iptables rewrites the destination from the ClusterIP to the selected Pod IP (DNAT). However, a single address rewrite is insufficient; the kernel must remember the mapping. conntrack stores this state so that the reply packet can be reverse‑NATed. If conntrack cannot allocate a new entry, the packet is dropped, manifesting as random timeouts.
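For intuition, the NAT state for one Service call can be dumped directly; the ClusterIP below matches the worked example in the next section:
# entries whose original destination was the ClusterIP; the reply tuple shows
# the backend Pod IP that DNAT selected for each connection
conntrack -L -d 10.96.32.15 2>/dev/null | head -n 5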
4. End‑to‑End Service Call Flow
Using a typical request from pod‑a to Service backend‑svc (ClusterIP 10.96.32.15:80), the steps are:
DNS lookup : pod‑a queries CoreDNS, which returns the ClusterIP.
Packet leaves the Pod : the packet travels from eth0 through the veth pair into the host’s bridge or routing stack.
iptables DNAT (PREROUTING/OUTPUT): kube‑proxy rules match the ClusterIP and rewrite the destination to the backend Pod IP (e.g., 10.244.2.18:8080).
Routing decision : if the backend Pod is on the same node, the packet is forwarded via the bridge/veth; otherwise the CNI‑provided route (BGP, VXLAN, IP‑in‑IP) sends it to the remote node.
Backend Pod reply : conntrack uses the stored NAT entry to reverse‑translate the source address, so the client still sees backend‑svc as the endpoint.
The diagram in the original article (omitted here) visualizes these hops.
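The same path can be confirmed on the node by walking kube-proxy’s NAT chains for that ClusterIP; the KUBE-SVC/KUBE-SEP hashes below are placeholders, since every cluster generates its own:
# the per-Service chain that matches the ClusterIP
iptables -t nat -L KUBE-SERVICES -n | grep 10.96.32.15
# that KUBE-SVC-XXXX chain fans out to one KUBE-SEP-XXXX chain per endpoint
iptables -t nat -L KUBE-SVC-XXXXXXXXXXXXXXXX -n
# each KUBE-SEP chain holds the actual DNAT target (backend Pod IP:port)
iptables -t nat -L KUBE-SEP-XXXXXXXXXXXXXXXX -n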
5. What Actually Drives CNI
Many assume iptables creates Pod‑to‑Pod connectivity, but the reality is that CNI sets up the underlying network fabric (IP allocation, bridge or overlay, routing). kube‑proxy only adds the Service abstraction on top.
CNI responsibilities during Pod creation:
Create a network‑namespace‑side interface.
Create the matching veth pair.
Allocate a Pod IP.
Attach the Pod to a bridge or routing device.
Inject routes, neighbor entries, tunnels or policy rules on the host.
Typical CNI implementations differ in data‑plane:
Flannel (VXLAN) : simple, easy to use, but adds encapsulation overhead and limited observability.
Calico (BGP / VXLAN / IPIP) : flexible, strong NetworkPolicy support, slightly higher configuration complexity.
Cilium (eBPF) : high performance, rich observability, can replace kube‑proxy, but steeper learning curve.
Because the data‑plane varies, the same Service can exhibit vastly different latency (e.g., 3 ms vs 40 ms) across clusters.
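A quick way to see which data-plane a node is actually running; the device names are the common defaults for each plugin and may differ in a given install:
ls /etc/cni/net.d/                          # installed CNI configuration(s)
ip -d link show flannel.1 2>/dev/null       # Flannel VXLAN device
ip -d link show vxlan.calico 2>/dev/null    # Calico in VXLAN mode
ip -d link show tunl0 2>/dev/null           # Calico in IPIP mode
ip link show cilium_host 2>/dev/null        # Cilium
ip link | grep -o 'mtu [0-9]*' | sort | uniq -c   # spot MTU mismatches across devices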
6. Public‑Internet Access Path
When a Pod contacts an external address, the flow differs:
Pod sends traffic to the external IP.
Traffic bypasses the Service DNAT chain.
Packet follows the host’s default route to the outbound NIC.
Source address is still the Pod IP.
POSTROUTING applies SNAT/MASQUERADE so the packet appears from the node IP.
Typical SNAT rule:
-A POSTROUTING -s 10.244.0.0/16 ! -d 10.244.0.0/16 -j MASQUERADE
Consequences: large outbound traffic can also exhaust conntrack, and external‑service timeouts may trace back to node‑level networking.
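To confirm outbound SNAT on a node and gauge how much conntrack state it consumes (the Pod CIDR is the one from the rule above):
# masquerade rules installed for the Pod CIDR
iptables -t nat -S POSTROUTING | grep -i masq
# rough count of tracked connections originating from Pod IPs
conntrack -L 2>/dev/null | grep -c 'src=10\.244\.'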
7. Common Root Causes of Timeouts
conntrack table saturation : high insert_failed rate, count > 70 % of max, leads to random connection failures.
iptables rule explosion : many Services/Endpoints generate long rule chains; matching becomes slower and CPU‑intensive.
DNS jitter : insufficient CoreDNS replicas or missing NodeLocal DNSCache cause slow first‑lookups.
CNI MTU mismatch or encapsulation overhead : VXLAN/IPIP reduces effective MTU, causing fragmentation or loss.
NetworkPolicy or Service‑Mesh sidecars : add extra filtering or redirection layers, increasing latency and resource usage.
8. Production‑Grade Troubleshooting Methodology
The key is to start from the node, not the application log.
8.1 Layer 1 – DNS / Connection / Forwarding
Run inside the failing Pod:
time getent hosts backend-svc.default.svc.cluster.local
time curl -s -o /dev/null -w '%{time_namelookup} %{time_connect} %{time_starttransfer} %{time_total}\n' http://backend-svc
Interpret the timings to decide whether the bottleneck is name lookup, connection establishment, or data transfer.
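If the name-lookup portion dominates, check how many queries one lookup actually triggers; the default ndots:5 makes short names walk every search domain before the absolute name is tried. Run this from a node or a debug container with dig installed; the DNS Service IP 10.96.0.10 is a common default and should be replaced with the cluster’s own:
cat /etc/resolv.conf                        # note the search list and ndots option
# time one fully-qualified query straight against the cluster DNS
dig +stats backend-svc.default.svc.cluster.local @10.96.0.10 | grep 'Query time'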
8.2 Layer 2 – Node Rules & Connection State
iptables -t nat -L KUBE-SERVICES -n -v
iptables -t nat -L KUBE-NODEPORTS -n -v
conntrack -S
ss -s
Check for abnormal hit counts, near‑full conntrack usage, and socket state accumulation (TIME_WAIT, SYN‑SENT).
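To gauge rule-explosion risk, relate the raw rule count to Service and endpoint counts; these are rough heuristics, not hard thresholds:
iptables-save | wc -l                         # total rules on the node
iptables-save -t nat | grep -c ':KUBE-SVC-'   # roughly one chain per Service
iptables-save -t nat | grep -c ':KUBE-SEP-'   # roughly one chain per endpoint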
8.3 Layer 3 – Packet Path Verification
tcpdump -i any host 10.96.32.15 or host 10.244.2.18
Look for SYN packets, SYN‑ACK replies, retransmissions, and verify that DNAT addresses match expectations.
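One way to watch the DNAT rewrite happen is to capture on both sides of it; the interface names assume a bridge-based CNI such as Flannel (cni0) and should be adjusted for the actual data-plane:
# bridge side: traffic is still addressed to the ClusterIP
tcpdump -nni cni0 host 10.96.32.15 -c 20
# node egress side: the same flow now targets the backend Pod IP
tcpdump -nni eth0 host 10.244.2.18 -c 20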
8.4 Layer 4 – Routing, MTU, CNI
ip route
ip addr
ip link
Missing routes, broken tunnel devices, or MTU mismatches often surface only at this layer.
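A do-not-fragment ping sized for the suspected overlay exposes MTU problems quickly; 1472 bytes of payload plus 28 bytes of ICMP/IP headers equals a 1500-byte frame, and VXLAN encapsulation takes roughly 50 more bytes:
ping -M do -s 1472 -c 3 10.244.2.18   # fails if any hop's MTU is below 1500
ping -M do -s 1422 -c 3 10.244.2.18   # retry with VXLAN headroom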
9. Engineering‑Level Optimizations
9.1 Choosing the Right kube‑proxy Mode
iptables : simple, compatible, suitable for small‑to‑medium clusters.
IPVS : kernel‑level load balancer, efficient look‑ups, better for medium‑to‑large clusters.
eBPF (Cilium) : lowest latency, strong observability, ideal for large, latency‑critical workloads.
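Before choosing, confirm which mode a node is actually running; kube-proxy answers on its metrics port (10249 by default) if that endpoint is exposed:
curl -s http://127.0.0.1:10249/proxyMode ; echo   # prints iptables, ipvs, ...
ipvsadm -Ln | head -n 20                          # a populated virtual-server table implies IPVS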
9.2 Reducing Unnecessary NAT
Use Headless Services for stateful workloads (Kafka, etcd, Zookeeper) or direct Pod‑to‑Pod communication to bypass Service NAT and lower conntrack pressure.
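The difference is visible in DNS: a ClusterIP Service resolves to one virtual IP, while a Headless Service (clusterIP: None) resolves straight to Pod IPs, so no DNAT and no conntrack entry are created. The Service name below is illustrative:
kubectl get svc kafka-headless -o jsonpath='{.spec.clusterIP}{"\n"}'   # prints None
getent hosts kafka-headless.default.svc.cluster.local                  # returns Pod IPs directly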
9.3 Controlling Short‑Connection Storms
Enable connection pools and keep‑alive in Java/RPC clients.
Prefer HTTP/2 or connection reuse for HTTP services.
Cache DNS results locally.
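A quick node-level signal that keep-alive is not working: a large, fast-growing TIME_WAIT count usually means a new connection is opened per request.
ss -tan state time-wait | wc -l   # sockets burned by short-lived connections
ss -s                             # overall socket summary for comparison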
9.4 Node Role Isolation
Separate Ingress nodes from core business nodes, and schedule heavy outbound workloads on dedicated nodes to avoid hotspot pressure on conntrack.
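One common way to enforce this separation is with node labels and taints; the node and key names below are illustrative:
kubectl label nodes ingress-node-1 node-role.kubernetes.io/ingress=true
kubectl taint nodes ingress-node-1 dedicated=ingress:NoSchedule
# the ingress controller then adds a matching nodeSelector and toleration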
9.5 DNS Local Caching
Scale CoreDNS horizontally and enable NodeLocal DNSCache to cut CoreDNS QPS by ~60 % in the case study.
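If the standard NodeLocal DNSCache manifest is used, every node runs a node-local-dns Pod answering on a link-local address (169.254.20.10 by default), which the kubelet points Pods at; a quick verification:
kubectl -n kube-system get ds node-local-dns
# from a node: the local cache should answer without leaving the machine
dig +short kubernetes.default.svc.cluster.local @169.254.20.10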
10. Concrete Production Configurations
10.1 Switch kube‑proxy to IPVS
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-proxy
  namespace: kube-system
data:
  config.conf: |
    apiVersion: kubeproxy.config.k8s.io/v1alpha1
    kind: KubeProxyConfiguration
    mode: "ipvs"
    ipvs:
      scheduler: "rr"
    conntrack:
      maxPerCore: 32768
      min: 131072
10.2 Node sysctl tuning template
net.netfilter.nf_conntrack_max = 524288
net.ipv4.ip_local_port_range = 10240 65535
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 262144
10.3 One‑click evidence‑collection script
#!/usr/bin/env bash
set -euo pipefail
echo "==== basic ===="
date
hostname
uname -r
echo "==== conntrack ===="
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
conntrack -S || true
echo "==== socket summary ===="
ss -s
echo "==== iptables lines ===="
iptables-save | wc -l
echo "==== kube service nat rules ===="
iptables -t nat -L KUBE-SERVICES -n -v | head -n 80
echo "==== route ===="
ip route
echo "==== interface mtu ===="
ip link
11. Full‑Scale Case Study – Fixing the Service Timeout
Root‑cause snapshot:
Hotspot order-service generated massive short connections.
All calls to user-service went through a ClusterIP, incurring NAT.
CoreDNS lacked local caching, causing high query volume.
kube‑proxy ran in iptables mode with a large rule set.
Ingress and business Pods shared the same nodes.
Remediation actions:
Raise nf_conntrack_max to 524288.
Switch kube‑proxy from iptables to IPVS.
Enable Java connection pools and keep‑alive.
Deploy NodeLocal DNSCache.
Isolate Ingress nodes from core business nodes.
Use Headless Services for Kafka, Nacos, etc., to avoid unnecessary NAT.
Results:
Service P99 latency dropped from 1.2 s to 38 ms.
The insert_failed counter fell to zero.
CoreDNS QPS reduced by ~60 %.
Node‑level network jitter converged dramatically.
The key lesson: production optimization is a systematic reduction of every unnecessary layer in the request path.
12. Design Guidance for Architects
Keep iptables when: cluster is small, Service count is limited, load is balanced, and stability outweighs performance.
Prefer IPVS when: Service/Endpoint count grows, rule refresh latency rises, or high‑peak connection delays appear.
Evaluate eBPF/Cilium when: ultra‑low tail latency, deep observability, or reduction of iptables complexity is required, especially with Service‑Mesh sidecars.
Upgrade only when the existing data‑plane becomes a bottleneck for business growth.
13. Practical Troubleshooting Checklist
Measure DNS resolution time.
Check conntrack_count / conntrack_max.
Inspect conntrack -S for insert_failed and drop.
Verify kube‑proxy mode (iptables vs IPVS vs eBPF).
Count iptables rules with iptables-save | wc -l.
Identify hotspot nodes handling disproportionate ingress/egress.
Validate cross‑node routes, tunnel devices, and MTU settings.
Detect extra paths introduced by Service‑Mesh or NetworkPolicy.
Confirm whether the application generates short‑connection storms.
Completing these steps narrows most K8s networking issues to a small, actionable scope.
14. Final Takeaway – Build a "Link Awareness" Mindset
Understanding that Pod ↔ CNI ↔ kube‑proxy ↔ conntrack ↔ DNS ↔ Node routing ↔ NAT forms a layered chain lets teams move from symptom‑driven debugging to systematic, measurable remediation.
Ray's Galactic Tech
Practice together, never alone. We cover programming languages, development tools, learning methods, and pitfall notes. We simplify complex topics, guiding you from beginner to advanced. Weekly practical content—let's grow together!