Mastering Kubernetes Container Networking: Architecture, CNI, and Real‑World Solutions
This comprehensive guide explains Kubernetes' layered architecture, core components, and container networking fundamentals, covering IP‑per‑Pod, CNI plugins, Flannel and Calico implementations, DNS, Service types, Ingress, and public‑cloud networking strategies for multi‑cluster deployments.
Kubernetes Container Networking – A Comprehensive Guide
Kubernetes originated from Google's internal Borg system and provides a full‑stack container orchestration platform with multi‑layer security, multi‑tenant support, service discovery, built‑in load balancing, fault detection, self‑healing, rolling updates, auto‑scaling, and fine‑grained resource quotas.
K8s Layered Architecture
Kubernetes follows a Linux‑like layered design. The main layers from bottom to top are:
Core Layer : Provides the API and plugin‑based execution environment.
Application Layer : Handles deployment of stateless, stateful, batch, and cluster‑wide workloads and routing (service discovery, DNS, Service Mesh).
Management Layer : System metrics, automation (auto‑scaling, dynamic provisioning), policy management (RBAC, Quota, PSP, NetworkPolicy), and parts of the Service Mesh.
Interface Layer : kubectl CLI, client SDKs, and cluster federation.
Ecosystem : A large set of extensions and CNCF‑hosted projects.
Overall Architecture
Kubernetes inherits concepts such as Pod, Service, Label, and a single IP per Pod from Borg. Its core components are:
etcd – stores the entire cluster state.
apiserver – the sole entry point for resource operations, providing authentication, authorization, admission control, and API discovery.
controller‑manager – maintains cluster state (fault detection, auto‑scaling, rolling updates).
scheduler – assigns Pods to nodes based on scheduling policies.
kubelet – manages container lifecycles on each node and handles CSI (volumes) and CNI (networking).
container runtime – runs containers and implements the CRI.
kube‑proxy – implements Service discovery and load balancing inside the cluster.
Recommended add‑ons include CoreDNS, Ingress controllers, Prometheus, Dashboard, and Federation.
In the overall architecture, the Master runs the control‑plane components (etcd, apiserver, controller‑manager, scheduler), while every worker Node runs kubelet, kube‑proxy, and a container runtime.
Core Concepts
Node : Provides compute resources for Pods; each Node runs a kubelet to manage its containers.
Pod : The smallest deployable unit; it can host multiple containers that share a network namespace and can share storage volumes. Pods are categorized as long‑running, batch, daemon, or stateful, corresponding to Deployments, Jobs, DaemonSets, and StatefulSets.
Service : Exposes a stable virtual IP for a set of Pods, enabling load‑balanced access.
Labels & Selectors : Key‑value pairs used to identify and select Pods.
Namespace : Virtual isolation within a cluster; default and kube‑system are created automatically.
Secret : Stores sensitive data such as passwords and keys.
Annotation : Arbitrary non‑identifying metadata attached to objects.
K8s Container Network
Kubernetes assigns each Pod a unique IP (IP‑per‑Pod). All containers in a Pod share the same network namespace, allowing direct communication without NAT.
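A small illustration of that sharing (the Pod name and image names are arbitrary examples): because both containers below occupy the same network namespace, the second can reach the first over localhost:

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: shared-netns-demo
spec:
  containers:
  - name: web
    image: nginx            # listens on port 80 inside the Pod
  - name: probe
    image: busybox
    command: ["sh", "-c", "sleep 5 && wget -qO- http://localhost:80 && sleep 3600"]
EOF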
Network communication is divided into:
Intra‑Pod container communication
Pod‑to‑Pod communication (same Node, different Node)
Pod‑to‑Service communication
External‑to‑Service communication
Container Network Foundations
Pods use Linux cgroups for resource limits and namespaces for isolation. Each network namespace has its own stack (interfaces, routing table, iptables). Implementations create virtual interfaces (veth pairs) or use MACVLAN/IPVLAN.
Virtual Bridge
Creates a veth pair; one end lives in the Pod as eth0, the other attaches to a Linux bridge or OVS bridge. This method may add performance overhead.
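The same mechanism can be reproduced by hand with the ip tool, which makes the model easy to inspect; the namespace, interface, bridge, and address names below are illustrative only:

# a network namespace stands in for the Pod
ip netns add pod1
# create a veth pair and push one end into the namespace as eth0
ip link add veth-pod1 type veth peer name veth-tmp
ip link set veth-tmp netns pod1
ip netns exec pod1 ip link set veth-tmp name eth0
ip netns exec pod1 ip addr add 10.244.1.10/24 dev eth0
ip netns exec pod1 ip link set eth0 up
# attach the host end to a Linux bridge and bring everything up
ip link add cni0 type bridge
ip link set veth-pod1 master cni0
ip link set veth-pod1 up
ip link set cni0 up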
Multiplexing
MACVLAN assigns a unique MAC to each virtual interface; IPVLAN shares a MAC but uses multiple IPs. Both are supported in the Linux kernel (MACVLAN since 3.9, IPVLAN since 3.19).
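For comparison, a MACVLAN or IPVLAN sub‑interface hangs directly off the physical NIC with no bridge in the data path; eth0 below is assumed to be the node's physical interface:

# MACVLAN: every sub-interface gets its own MAC address
ip link add link eth0 name macvlan0 type macvlan mode bridge
# IPVLAN: sub-interfaces reuse eth0's MAC but carry their own IPs
ip link add link eth0 name ipvlan0 type ipvlan mode l2
# either interface can then be moved into a Pod's network namespace
ip link set macvlan0 netns pod1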
Hardware Switching
SR‑IOV splits a physical NIC into multiple virtual functions, each with its own PCI device, VLAN, and QoS. It offers near‑bare‑metal performance but is rarely used in public clouds.
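On NICs that support it, the virtual functions are typically carved out through sysfs; the interface name and VF count below are examples:

# create 4 virtual functions on the physical NIC eth0 (requires SR-IOV capable hardware)
echo 4 > /sys/class/net/eth0/device/sriov_numvfs
# the VFs now show up as 'vf 0' ... 'vf 3' on the physical function
ip link show eth0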
In practice, most deployments use the virtual‑bridge model unless performance or security policies dictate MACVLAN/IPVLAN.
Container Network Interface (CNI)
CNI, originated by CoreOS, defines the standard for Kubernetes network plugins. When a container is created, the runtime first creates a network namespace, then invokes the CNI plugin to configure networking.
CNI Plugin : Provides two functions – AddNetwork(netConfig, rtConf) to configure the network and DelNetwork(netConfig, rtConf) to clean it up.
IPAM Plugin : Allocates IP addresses, gateways, and DNS (e.g., host‑local or DHCP).
The kubelet watches the apiserver for new Pods, then runs the CNI binary specified in the CNI configuration to set up the Pod's network.
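As an illustration, a minimal network configuration for the reference bridge and host‑local plugins might look like the following (the file name and subnet are examples); the runtime reads such files from /etc/cni/net.d and executes the matching plugin binaries, usually installed under /opt/cni/bin:

cat <<'EOF' > /etc/cni/net.d/10-mynet.conf
{
  "cniVersion": "0.3.1",
  "name": "mynet",
  "type": "bridge",
  "bridge": "cni0",
  "isGateway": true,
  "ipMasq": true,
  "ipam": {
    "type": "host-local",
    "subnet": "10.244.1.0/24",
    "routes": [ { "dst": "0.0.0.0/0" } ]
  }
}
EOF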
Pod‑to‑Pod Communication
Two main approaches exist:
Tunnel solutions : Weave (UDP encapsulation), Open vSwitch (VXLAN/GRE), Flannel (VXLAN/UDP), Rancher (IPsec).
Routing solutions : Calico (BGP‑based routing) and MACVLAN (layer‑2 isolation).
Below are details of the most common implementations.
Flannel
Flannel is one of the most widely used CNI plugins. It leases each node a subnet, and Pods on that node receive IPs from it; the data plane can be VXLAN, UDP (both overlays), or host‑gw (plain routing).
Key components:
docker0 / cni0 : Linux bridge on each node; every Pod's eth0 is one end of a veth pair whose other end attaches to this bridge (named cni0 when Flannel runs as a CNI plugin, docker0 in older Docker‑managed setups).
flannel.1 : Overlay device handling VXLAN encapsulation.
flanneld : Agent that obtains a subnet from etcd and watches the cluster for changes.
Data flow when a Pod on Node A talks to a Pod on Node B:
Pod sends traffic to docker0.
docker0 forwards it to the flannel.1 tunnel device.
flannel.1 queries flanneld for the remote tunnel endpoint and encapsulates the packet.
The encapsulated packet traverses the physical network to Node B.
Node B’s flannel.1 decapsulates and forwards to its docker0.
docker0 delivers the packet to the destination Pod.
Flannel defaults to VXLAN; on older kernels it falls back to the UDP backend, which encapsulates packets in userspace through a TUN device and is noticeably slower. The host‑gw backend avoids encapsulation entirely but requires all nodes to be directly reachable at layer 2.
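The backend is selected in Flannel's net-conf.json, normally stored in the kube-flannel ConfigMap (or in etcd for older deployments); the network below is the conventional example CIDR, and host-gw or udp can be substituted for the Type field:

{
  "Network": "10.244.0.0/16",
  "Backend": {
    "Type": "vxlan"
  }
}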
Calico
Calico provides network policy and, in its native BGP mode, relies on neither tunnels nor NAT: all Pod traffic is routed at layer 3 using host routes distributed between nodes.
Key components:
Felix : Runs on each host, managing interfaces, routes, ARP, ACLs, and synchronizing state.
etcd : Stores network metadata (shared with Kubernetes).
BGP Client (BIRD) : Listens to routes injected by Felix and advertises them via BGP.
BGP Route Reflector : Reduces the number of BGP sessions in large clusters.
Calico supports two networking modes:
IPIP : Encapsulates traffic in a tunnel; suitable for cross‑subnet Pod communication.
BGP : Uses native routing; offers higher performance but requires BGP‑compatible network infrastructure.
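As a sketch of how the mode is chosen (resource layout per Calico's v3 API; the pool name and CIDR are examples), the IPPool resource carries an ipipMode field: Always tunnels all Pod traffic, CrossSubnet only tunnels between nodes in different subnets, and Never leaves pure BGP routing:

calicoctl apply -f - <<'EOF'
apiVersion: projectcalico.org/v3
kind: IPPool
metadata:
  name: default-ipv4-ippool
spec:
  cidr: 10.244.0.0/16
  ipipMode: CrossSubnet
  natOutgoing: true
EOF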
DNS
Kubernetes runs a cluster DNS Deployment in the kube‑system namespace; CoreDNS is now the default, replacing the older kube‑dns. A kube‑dns Pod consists of three containers:
kubedns : Watches Services and Endpoints and serves DNS through an embedded SkyDNS on port 10053, with metrics on 10055.
dnsmasq‑nanny : Runs dnsmasq as a cache in front of kubedns and restarts it when the configuration changes.
sidecar : Provides health checks and DNS metrics on port 10054.
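Whichever implementation serves it, the naming scheme is the same: a Service named myservice in namespace default (hypothetical names) gets the record myservice.default.svc.cluster.local pointing at its ClusterIP, which can be checked from any Pod:

# run a throwaway busybox Pod and resolve the Service name (names are examples)
kubectl run dns-test --rm -it --image=busybox --restart=Never -- \
  nslookup myservice.default.svc.cluster.local
# short forms also resolve thanks to the search domains in the Pod's /etc/resolv.conf:
#   myservice           (same namespace)
#   myservice.default   (different namespace)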
Service
A Service abstracts a set of Pods, giving them a stable virtual IP and DNS name. kube‑proxy on each Node implements the Service VIP using one of three proxy modes.
Userspace Proxy Mode
kube‑proxy watches the apiserver, opens a random local port for each Service, and forwards traffic from the Service ClusterIP to that port, which then forwards to a backend Pod based on session affinity.
Iptables Proxy Mode
When a Service is created, kube‑proxy installs iptables rules: rules matching the Service's <ClusterIP, Port> jump to a per‑Service chain that picks a backend at random, and per‑endpoint rules then DNAT the traffic to the chosen Pod, as in the example below. This mode is fast because it runs entirely in the kernel, but if the chosen Pod does not respond the connection is not retried against another backend, so readiness probes are needed to keep failed Pods out of the rules.
-A KUBE-SVC-XXXXX -m comment --comment "default/myservice:" -m statistic --mode random --probability 0.3333 -j KUBE-SEP-YYYYY
-A KUBE-SEP-YYYYY -p tcp -m comment --comment "default/myservice:" -m tcp -j DNAT --to-destination 10.244.1.7:9376
IPVS Proxy Mode
IPVS uses a hash table for faster rule matching and supports multiple load‑balancing algorithms (rr, lc, dh, sh, sed, nq). It requires the ipvs kernel module on each Node.
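A minimal sketch of switching to IPVS mode (module names and the ConfigMap layout vary slightly across kernel and Kubernetes versions; the commands assume a kubeadm‑style cluster):

# load the IPVS kernel modules on every node
for m in ip_vs ip_vs_rr ip_vs_wrr ip_vs_sh nf_conntrack; do modprobe "$m"; done
lsmod | grep ip_vs
# set mode: "ipvs" in the kube-proxy ConfigMap (equivalent to --proxy-mode=ipvs),
# then restart the kube-proxy Pods so they pick up the change
kubectl -n kube-system edit configmap kube-proxy
kubectl -n kube-system delete pod -l k8s-app=kube-proxy
# inspect the virtual servers kube-proxy programmed
ipvsadm -ln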
Ingress
Ingress exposes Services outside the cluster, providing HTTP routing, SSL termination, and load balancing. An Ingress controller (e.g., Nginx or Traefik) watches Ingress resources and configures the underlying load balancer.
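A minimal Ingress sketch (the host, class, and the myservice backend are hypothetical) that such a controller would turn into routing rules:

kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myservice-ingress
spec:
  ingressClassName: nginx
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myservice
            port:
              number: 80
EOF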
External Access to Services
Service types:
ClusterIP : Internal virtual IP only.
NodePort : Exposes the Service on a static port on each Node.
LoadBalancer : Provisions a cloud provider LB and maps it to a NodePort.
ExternalName : Maps to an external DNS CNAME.
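Of these, NodePort is the simplest way to reach a workload from outside the cluster; a minimal sketch (name, selector, and ports are examples):

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: myservice
spec:
  type: NodePort
  selector:
    app: myapp
  ports:
  - port: 80          # ClusterIP port inside the cluster
    targetPort: 9376  # container port on the backend Pods
    nodePort: 30080   # port opened on every Node (default range 30000-32767)
EOF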
Public‑Cloud Container Network Solutions
Two common patterns:
VPC Routing : Give each Node a Pod subnet and program the VPC/instance route table so that subnet points at the Node (conceptually like Flannel's host‑gw backend).
ENI (Elastic Network Interface) : Allocate an ENI or secondary IP per Pod, giving Pods VPC‑routable addresses.
ENI‑based designs must consider quota limits (max ENIs per VM, max secondary IPs per ENI) and typically use a "multiple ENIs with multiple secondary IPs" strategy.
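The resulting Pod density per node is bounded by those quotas; with hypothetical limits of 8 ENIs per VM and 20 private IPs per ENI (reserving each ENI's primary IP for the node itself), the ceiling works out as:

# hypothetical quota values - substitute your cloud provider's actual limits
ENIS_PER_VM=8
IPS_PER_ENI=20
echo $(( ENIS_PER_VM * (IPS_PER_ENI - 1) ))   # 152 Pod IPs available per node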
LB FastPath Evolution
To avoid the overhead of centralized LB gateways, FastPath lets the LB complete the TCP handshake and then notifies the client and the real server so that subsequent packets flow between them directly (Full‑NAT style), reducing latency and removing the LB as a bottleneck.
Baidu Cloud Native Container Network Solution
Requirements
East‑west Pod‑to‑Pod communication (Pod CIDR 11.0.0.0/8).
Pod‑to‑non‑container host communication.
Cross‑datacenter QoS (DSCP marking).
No multicast, IPv6, security groups, DHCP, or external DNS.
Service
Three IDC scenarios:
External‑to‑internal: EBGW + BFE (layer‑7).
Internal‑to‑external: BigNAT.
Internal‑to‑internal: BNS service discovery, IBGW VIP.
Kuryr Integration
Kuryr connects Kubernetes to OpenStack Neutron. It maps Pods to Neutron Ports and Services to Neutron Load Balancers.
Kuryr consists of a controller (creates Neutron resources) and a CNI driver (runs on each worker node, creates veth pairs, and binds them to Neutron ports).
Multi‑Cloud Container Network Solution
Federation can synchronize resources across clusters, but it is still immature. A practical approach is to expose Services via cloud‑provider LoadBalancers and use VPC routing or dedicated inter‑cloud links for Pod‑to‑Pod communication.
DNS can be handled by deploying a local DNS cache (dnsmasq/unbound) on each Node that forwards queries to the appropriate cluster’s kube‑dns based on domain suffixes.
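A sketch of that forwarding with dnsmasq (the cluster domains, Service IPs, and file path are illustrative; each cluster must use a distinct cluster domain for suffix‑based forwarding to work):

# /etc/dnsmasq.d/k8s-forward.conf
# send cluster A's domain to cluster A's CoreDNS/kube-dns Service IP
server=/clusterA.local/10.96.0.10
# send cluster B's domain across the inter-cloud link to its DNS Service IP
server=/clusterB.local/10.112.0.10
# everything else goes to the normal upstream resolver
server=8.8.8.8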
Overall, the combination of CNI plugins (Flannel, Calico), cloud‑native Service types, and proper routing or ENI strategies enables robust, scalable, and secure container networking across on‑premise, public‑cloud, and multi‑cloud environments.