Kubernetes Networking Unpacked: How a Service Timeout Reveals iptables‑CNI Collaboration
A real‑world Service timeout in a high‑traffic e‑commerce cluster exposed a saturated conntrack table, prompting a step‑by‑step dissection of Pods, Services, iptables, conntrack, CNI plugins, DNS and NetworkPolicy, and culminating in concrete production‑grade remediation tactics.
1. Incident Trigger – Service Timeout
During a flash‑sale load test (~30 k QPS) the order-service calling user-service began to see RPC timeouts of 1‑5 seconds. CPU, memory, GC and DB pools were normal, but some nodes showed severe latency while others were fine. Initial guesses (thread‑pool exhaustion, small Dubbo/Feign pools, CoreDNS jitter, Istio sidecar overhead) proved wrong.
Running conntrack -S and inspecting /proc/sys/net/netfilter/nf_conntrack_count revealed a continuously growing insert_failed counter, a rising drop counter, and nf_conntrack_count approaching nf_conntrack_max. The conclusion: the node’s connection‑tracking table was full, preventing new NAT state creation and causing random timeouts and retry storms.
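A minimal node-side check makes the saturation visible; the commands below assume root on the affected node with the conntrack CLI installed:
# usage ratio of the connection-tracking table
awk -v c="$(cat /proc/sys/net/netfilter/nf_conntrack_count)" \
    -v m="$(cat /proc/sys/net/netfilter/nf_conntrack_max)" \
    'BEGIN { printf "conntrack usage: %d / %d (%.1f%%)\n", c, m, 100*c/m }'
# per-CPU counters; steadily growing insert_failed and drop confirm table pressure
conntrack -S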
2. Building the Full‑Picture – Who Does What in K8s Networking
The common simplification “Pod network is CNI, Service is kube‑proxy” hides many moving parts. In production the responsibilities are:
Linux Network Namespace : isolates each Pod’s network stack; it explains why every Pod can have its own independent eth0.
veth pair : bridges the Pod namespace and the host namespace; it answers how packets leave the container.
Bridge / routing table : performs host‑side forwarding and routing; it decides whether traffic stays on‑node or goes cross‑node.
CNI plugin : allocates Pod IPs and injects routes, tunnels or policies (e.g., Calico, Flannel, Cilium) to make Pods reachable.
kube‑proxy : installs iptables/IPVS/eBPF rules that translate a Service’s virtual IP to backend Pod IPs (ClusterIP/NodePort).
netfilter/iptables : executes NAT and filtering at kernel hooks (PREROUTING, INPUT, FORWARD, OUTPUT, POSTROUTING).
conntrack : records each connection’s NAT mapping so that return packets can be reverse‑translated; when the table fills, timeouts appear.
CoreDNS : resolves Service names to ClusterIP; latency here can make the first request appear slow.
NetworkPolicy : L3/L4 access control that can silently reject traffic.
In short, K8s networking is a data‑path composed of multiple layers, not a single component.
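To make these layers tangible, here is one way to trace a Pod’s eth0 back to the host-side veth peer that the CNI created. This is a sketch for a containerd node with crictl and nsenter available; <container-id> is a placeholder.
# 1. Get the container's host PID
PID=$(crictl inspect --output go-template --template '{{.info.pid}}' <container-id>)
# 2. Inside the Pod's network namespace, eth0 reports its peer's host ifindex as "eth0@ifN"
nsenter -t "$PID" -n ip -o link show eth0
# 3. Match that index against host interfaces to find the veth created for this Pod
ip -o link | grep '^N:'   # replace N with the index from step 2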
3. Netfilter, iptables and conntrack – How They Work Together
Linux provides five netfilter hook points: PREROUTING, INPUT, FORWARD, OUTPUT, POSTROUTING. iptables is a user‑space tool that writes rules into the netfilter tables. The two tables most relevant to K8s are:
nat : performs DNAT/SNAT for Service translation and outbound masquerading.
filter : implements access control, often driven by NetworkPolicy.
When a Service request arrives, iptables rewrites the destination from the ClusterIP to the selected Pod IP (DNAT). However, a single address rewrite is insufficient; the kernel must remember the mapping. conntrack stores this state so that the reply packet can be reverse‑NATed. If conntrack cannot allocate a new entry, the packet is dropped, manifesting as random timeouts.
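For intuition, the NAT state for one Service call can be dumped directly; the ClusterIP below matches the worked example in the next section:
# entries whose original destination was the ClusterIP; the reply tuple shows
# the backend Pod IP that DNAT selected for each connection
conntrack -L -d 10.96.32.15 2>/dev/null | head -n 5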
4. End‑to‑End Service Call Flow
Using a typical request from pod‑a to Service backend‑svc (ClusterIP 10.96.32.15:80), the steps are:
DNS lookup : pod‑a queries CoreDNS, which returns the ClusterIP.
Packet leaves the Pod : the packet travels from eth0 through the veth pair into the host’s bridge or routing stack.
iptables DNAT (PREROUTING/OUTPUT): kube‑proxy rules match the ClusterIP and rewrite the destination to the backend Pod IP (e.g., 10.244.2.18:8080).
Routing decision : if the backend Pod is on the same node, the packet is forwarded via the bridge/veth; otherwise the CNI‑provided route (BGP, VXLAN, IP‑in‑IP) sends it to the remote node.
Backend Pod reply : conntrack uses the stored NAT entry to reverse‑translate the source address, so the client still sees backend‑svc as the endpoint.
The diagram in the original article (omitted here) visualizes these hops.
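The same path can be confirmed on the node by walking kube-proxy’s NAT chains for that ClusterIP; the KUBE-SVC/KUBE-SEP hashes below are placeholders, since every cluster generates its own:
# the per-Service chain that matches the ClusterIP
iptables -t nat -L KUBE-SERVICES -n | grep 10.96.32.15
# that KUBE-SVC-XXXX chain fans out to one KUBE-SEP-XXXX chain per endpoint
iptables -t nat -L KUBE-SVC-XXXXXXXXXXXXXXXX -n
# each KUBE-SEP chain holds the actual DNAT target (backend Pod IP:port)
iptables -t nat -L KUBE-SEP-XXXXXXXXXXXXXXXX -n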
5. What Actually Drives CNI
Many assume iptables creates Pod‑to‑Pod connectivity, but the reality is that CNI sets up the underlying network fabric (IP allocation, bridge or overlay, routing). kube‑proxy only adds the Service abstraction on top.
CNI responsibilities during Pod creation:
Create a network‑namespace‑side interface.
Create the matching veth pair.
Allocate a Pod IP.
Attach the Pod to a bridge or routing device.
Inject routes, neighbor entries, tunnels or policy rules on the host.
Typical CNI implementations differ in data‑plane:
Flannel (VXLAN) : simple, easy to use, but adds encapsulation overhead and limited observability.
Calico (BGP / VXLAN / IPIP) : flexible, strong NetworkPolicy support, slightly higher configuration complexity.
Cilium (eBPF) : high performance, rich observability, can replace kube‑proxy, but steeper learning curve.
Because the data‑plane varies, the same Service can exhibit vastly different latency (e.g., 3 ms vs 40 ms) across clusters.
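A quick way to see which data-plane a node is actually running; the device names are the common defaults for each plugin and may differ in a given install:
ls /etc/cni/net.d/                          # installed CNI configuration(s)
ip -d link show flannel.1 2>/dev/null       # Flannel VXLAN device
ip -d link show vxlan.calico 2>/dev/null    # Calico in VXLAN mode
ip -d link show tunl0 2>/dev/null           # Calico in IPIP mode
ip link show cilium_host 2>/dev/null        # Cilium
ip link | grep -o 'mtu [0-9]*' | sort | uniq -c   # spot MTU mismatches across devices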
6. Public‑Internet Access Path
When a Pod contacts an external address, the flow differs:
Pod sends traffic to the external IP.
Traffic bypasses the Service DNAT chain.
Packet follows the host’s default route to the outbound NIC.
Source address is still the Pod IP.
POSTROUTING applies SNAT/MASQUERADE so the packet appears from the node IP.
Typical SNAT rule:
-A POSTROUTING -s 10.244.0.0/16 ! -d 10.244.0.0/16 -j MASQUERADE
Consequences: large outbound traffic can also exhaust conntrack, and external‑service timeouts may trace back to node‑level networking.
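To confirm outbound SNAT on a node and gauge how much conntrack state it consumes (the Pod CIDR is the one from the rule above):
# masquerade rules installed for the Pod CIDR
iptables -t nat -S POSTROUTING | grep -i masq
# rough count of tracked connections originating from Pod IPs
conntrack -L 2>/dev/null | grep -c 'src=10\.244\.'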
7. Common Root Causes of Timeouts
conntrack table saturation : high insert_failed rate, count > 70 % of max, leads to random connection failures.
iptables rule explosion : many Services/Endpoints generate long rule chains; matching becomes slower and CPU‑intensive.
DNS jitter : insufficient CoreDNS replicas or missing NodeLocal DNSCache cause slow first‑lookups.
CNI MTU mismatch or encapsulation overhead : VXLAN/IPIP reduces effective MTU, causing fragmentation or loss.
NetworkPolicy or Service‑Mesh sidecars : add extra filtering or redirection layers, increasing latency and resource usage.
8. Production‑Grade Troubleshooting Methodology
The key is to start from the node, not the application log.
8.1 Layer 1 – DNS / Connection / Forwarding
Run inside the failing Pod:
time getent hosts backend-svc.default.svc.cluster.local
time curl -s -o /dev/null -w '%{time_namelookup} %{time_connect} %{time_starttransfer} %{time_total}\n' http://backend-svc
Interpret the timings to decide whether the bottleneck is name lookup, connection establishment, or data transfer.
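If the name-lookup portion dominates, check how many queries one lookup actually triggers; the default ndots:5 makes short names walk every search domain before the absolute name is tried. Run this from a node or a debug container with dig installed; the DNS Service IP 10.96.0.10 is a common default and should be replaced with the cluster’s own:
cat /etc/resolv.conf                        # note the search list and ndots option
# time one fully-qualified query straight against the cluster DNS
dig +stats backend-svc.default.svc.cluster.local @10.96.0.10 | grep 'Query time'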
8.2 Layer 2 – Node Rules & Connection State
iptables -t nat -L KUBE-SERVICES -n -v
iptables -t nat -L KUBE-NODEPORTS -n -v
conntrack -S
ss -s
Check for abnormal hit counts, near‑full conntrack usage, and socket state accumulation (TIME_WAIT, SYN‑SENT).
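To gauge rule-explosion risk, relate the raw rule count to Service and endpoint counts; these are rough heuristics, not hard thresholds:
iptables-save | wc -l                         # total rules on the node
iptables-save -t nat | grep -c ':KUBE-SVC-'   # roughly one chain per Service
iptables-save -t nat | grep -c ':KUBE-SEP-'   # roughly one chain per endpoint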
8.3 Layer 3 – Packet Path Verification
tcpdump -i any host 10.96.32.15 or host 10.244.2.18
Look for SYN packets, SYN‑ACK replies, retransmissions, and verify that DNAT addresses match expectations.
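One way to watch the DNAT rewrite happen is to capture on both sides of it; the interface names assume a bridge-based CNI such as Flannel (cni0) and should be adjusted for the actual data-plane:
# bridge side: traffic is still addressed to the ClusterIP
tcpdump -nni cni0 host 10.96.32.15 -c 20
# node egress side: the same flow now targets the backend Pod IP
tcpdump -nni eth0 host 10.244.2.18 -c 20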
8.4 Layer 4 – Routing, MTU, CNI
ip route
ip addr
ip link
Missing routes, broken tunnel devices, or MTU mismatches often surface only at this layer.
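A do-not-fragment ping sized for the suspected overlay exposes MTU problems quickly; 1472 bytes of payload plus 28 bytes of ICMP/IP headers equals a 1500-byte frame, and VXLAN encapsulation takes roughly 50 more bytes:
ping -M do -s 1472 -c 3 10.244.2.18   # fails if any hop's MTU is below 1500
ping -M do -s 1422 -c 3 10.244.2.18   # retry with VXLAN headroom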
9. Engineering‑Level Optimizations
9.1 Choosing the Right kube‑proxy Mode
iptables : simple, compatible, suitable for small‑to‑medium clusters.
IPVS : kernel‑level load balancer, efficient look‑ups, better for medium‑to‑large clusters.
eBPF (Cilium) : lowest latency, strong observability, ideal for large, latency‑critical workloads.
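Before choosing, confirm which mode a node is actually running; kube-proxy answers on its metrics port (10249 by default) if that endpoint is exposed:
curl -s http://127.0.0.1:10249/proxyMode ; echo   # prints iptables, ipvs, ...
ipvsadm -Ln | head -n 20                          # a populated virtual-server table implies IPVS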
9.2 Reducing Unnecessary NAT
Use Headless Services for stateful workloads (Kafka, etcd, Zookeeper) or direct Pod‑to‑Pod communication to bypass Service NAT and lower conntrack pressure.
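The difference is visible in DNS: a ClusterIP Service resolves to one virtual IP, while a Headless Service (clusterIP: None) resolves straight to Pod IPs, so no DNAT and no conntrack entry are created. The Service name below is illustrative:
kubectl get svc kafka-headless -o jsonpath='{.spec.clusterIP}{"\n"}'   # prints None
getent hosts kafka-headless.default.svc.cluster.local                  # returns Pod IPs directly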
9.3 Controlling Short‑Connection Storms
Enable connection pools and keep‑alive in Java/RPC clients.
Prefer HTTP/2 or connection reuse for HTTP services.
Cache DNS results locally.
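A quick node-level signal that keep-alive is not working: a large, fast-growing TIME_WAIT count usually means a new connection is opened per request.
ss -tan state time-wait | wc -l   # sockets burned by short-lived connections
ss -s                             # overall socket summary for comparison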
9.4 Node Role Isolation
Separate Ingress nodes from core business nodes, and schedule heavy outbound workloads on dedicated nodes to avoid hotspot pressure on conntrack.
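One common way to enforce this separation is with node labels and taints; the node and key names below are illustrative:
kubectl label nodes ingress-node-1 node-role.kubernetes.io/ingress=true
kubectl taint nodes ingress-node-1 dedicated=ingress:NoSchedule
# the ingress controller then adds a matching nodeSelector and toleration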
9.5 DNS Local Caching
Scale CoreDNS horizontally and enable NodeLocal DNSCache to cut CoreDNS QPS by ~60 % in the case study.
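If the standard NodeLocal DNSCache manifest is used, every node runs a node-local-dns Pod answering on a link-local address (169.254.20.10 by default), which the kubelet points Pods at; a quick verification:
kubectl -n kube-system get ds node-local-dns
# from a node: the local cache should answer without leaving the machine
dig +short kubernetes.default.svc.cluster.local @169.254.20.10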
10. Concrete Production Configurations
10.1 Switch kube‑proxy to IPVS
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-proxy
  namespace: kube-system
data:
  config.conf: |
    apiVersion: kubeproxy.config.k8s.io/v1alpha1
    kind: KubeProxyConfiguration
    mode: "ipvs"
    ipvs:
      scheduler: "rr"
    conntrack:
      maxPerCore: 32768
      min: 131072
10.2 Node sysctl tuning template
net.netfilter.nf_conntrack_max = 524288
net.ipv4.ip_local_port_range = 10240 65535
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 15
net.core.somaxconn = 65535
net.ipv4.tcp_max_syn_backlog = 262144
10.3 One‑click evidence‑collection script
#!/usr/bin/env bash
set -euo pipefail
echo "==== basic ===="
date
hostname
uname -r
echo "==== conntrack ===="
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max
conntrack -S || true
echo "==== socket summary ===="
ss -s
echo "==== iptables lines ===="
iptables-save | wc -l
echo "==== kube service nat rules ===="
iptables -t nat -L KUBE-SERVICES -n -v | head -n 80
echo "==== route ===="
ip route
echo "==== interface mtu ===="
ip link
11. Full‑Scale Case Study – Fixing the Service Timeout
Root‑cause snapshot:
Hotspot order-service generated massive short connections.
All calls to user-service went through a ClusterIP, incurring NAT.
CoreDNS lacked local caching, causing high query volume.
kube‑proxy ran in iptables mode with a large rule set.
Ingress and business Pods shared the same nodes.
Remediation actions:
Raise nf_conntrack_max to 524288.
Switch kube‑proxy from iptables to IPVS.
Enable Java connection pools and keep‑alive.
Deploy NodeLocal DNSCache.
Isolate Ingress nodes from core business nodes.
Use Headless Services for Kafka, Nacos, etc., to avoid unnecessary NAT.
Results:
Service P99 latency dropped from 1.2 s to 38 ms.
The insert_failed counter fell to zero.
CoreDNS QPS reduced by ~60 %.
Node‑level network jitter converged dramatically.
The key lesson: production optimization is a systematic reduction of every unnecessary layer in the request path.
12. Design Guidance for Architects
Keep iptables when: cluster is small, Service count is limited, load is balanced, and stability outweighs performance.
Prefer IPVS when: Service/Endpoint count grows, rule refresh latency rises, or high‑peak connection delays appear.
Evaluate eBPF/Cilium when: ultra‑low tail latency, deep observability, or reduction of iptables complexity is required, especially with Service‑Mesh sidecars.
Upgrade only when the existing data‑plane becomes a bottleneck for business growth.
13. Practical Troubleshooting Checklist
Measure DNS resolution time.
Check conntrack_count / conntrack_max.
Inspect conntrack -S for insert_failed and drop.
Verify kube‑proxy mode (iptables vs IPVS vs eBPF).
Count iptables rules with iptables-save | wc -l.
Identify hotspot nodes handling disproportionate ingress/egress.
Validate cross‑node routes, tunnel devices, and MTU settings.
Detect extra paths introduced by Service‑Mesh or NetworkPolicy.
Confirm whether the application generates short‑connection storms.
Completing these steps narrows most K8s networking issues to a small, actionable scope.
14. Final Takeaway – Build a "Link Awareness" Mindset
Understanding that Pod ↔ CNI ↔ kube‑proxy ↔ conntrack ↔ DNS ↔ Node routing ↔ NAT forms a layered chain lets teams move from symptom‑driven debugging to systematic, measurable remediation.
Ray's Galactic Tech
Practice together, never alone. We cover programming languages, development tools, learning methods, and pitfall notes. We simplify complex topics, guiding you from beginner to advanced. Weekly practical content—let's grow together!