How Qunar Scales Kubernetes Networking with Calico: Architecture & Lessons
This article details Qunar's adoption of Calico as a CNI solution, covering its Layer 3 architecture, core components, large‑scale deployment practices, IPAM behavior, pod‑to‑pod and pod‑to‑external traffic flows, issues encountered, and the overall benefits for a production Kubernetes environment.
Introduction
Calico is an open‑source Container Network Interface (CNI) project that provides L3 networking for containerized workloads. The article explains how Qunar leverages Calico to deliver network functionality for its Kubernetes clusters.
Calico Architecture
Calico implements a Layer 3 (L3) data‑center network that can serve as a CNI plugin for Kubernetes or integrate with OpenStack. It uses BGP, IPIP and related protocols to enable communication among VMs, containers and bare‑metal hosts.
Core components include:
Felix – the Calico agent running on every host, configuring routes, ACLs and ensuring container connectivity.
etcd – a distributed key/value store that holds network metadata and guarantees consistency.
BGP client (BIRD) – advertises the routes that Felix programs on the host to the rest of the Calico network.
BGP Route Reflector (RR) – centralizes route distribution, replacing a full mesh in large deployments and reducing resource consumption.
Each host runs a high‑performance vRouter in the Linux kernel; routes are propagated via BGP, allowing every node to learn the IP blocks of other nodes.
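The saving from route reflection can be shown with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes the reflectors are separate devices (for example rack switches) that are fully meshed among themselves.

```python
def full_mesh_sessions(nodes: int) -> int:
    """BGP sessions when every node peers with every other node: n(n-1)/2."""
    return nodes * (nodes - 1) // 2


def route_reflector_sessions(nodes: int, reflectors: int) -> int:
    """BGP sessions when each node peers only with the reflectors.

    Assumption: the reflectors are not counted as nodes and are
    fully meshed with each other.
    """
    return nodes * reflectors + full_mesh_sessions(reflectors)


for n in (10, 100, 1000):
    print(n, full_mesh_sessions(n), route_reflector_sessions(n, reflectors=2))
```

The full-mesh count grows quadratically with the number of nodes, while the reflector count grows linearly, which is why the full mesh becomes impractical in large clusters.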
Calico in Qunar
Qunar required direct IP reachability for services such as Nginx upstreams, which Kubernetes alone could not guarantee. After evaluating Flannel, Cilium and Calico, the team selected Calico in 2017. Today Calico provides networking for more than 4,000 pods in Qunar's ESAAS dedicated clusters and in production business clusters.
Key reasons for the choice:
Pure L3 design without an overlay, saving CPU cycles and avoiding ARP storms.
Both pod IPs and service IPs are routable without NAT, enabling direct IP communication.
Scalable to large‑scale deployments.
Qunar runs Calico as a DaemonSet on every Kubernetes node. To support massive scale, the RR mode is used: rack switches act as BGP peers, each node is configured with the same AS number, and routes to local pod IP blocks are announced to the switches, which then redistribute them across the AS.
Calico IPAM
Calico divides the global IP pool into blocks. Each node receives a block and allocates pod IPs from it. Nodes learn each other's blocks via BGP and install the routes locally, ensuring every node knows the IP ranges owned by its peers.
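A minimal sketch of the block split, using Python's standard `ipaddress` module. The pool CIDR and node names are hypothetical; the /26 block size matches Calico's documented IPv4 default, but a cluster may be configured differently. Blocks are handed out in order here for simplicity, which is an assumption rather than Calico's actual selection logic.

```python
import ipaddress
from typing import Dict, List


def split_pool(pool_cidr: str, block_prefix: int = 26) -> List[ipaddress.IPv4Network]:
    """Split an IP pool into equally sized blocks."""
    pool = ipaddress.ip_network(pool_cidr)
    return list(pool.subnets(new_prefix=block_prefix))


def assign_blocks(nodes: List[str],
                  blocks: List[ipaddress.IPv4Network]) -> Dict[str, ipaddress.IPv4Network]:
    """Give each node one block, in order (simplified for illustration)."""
    return dict(zip(nodes, blocks))


# Hypothetical pool: a /22 split into /26 blocks yields 2**(26-22) = 16 blocks.
blocks = split_pool("10.244.0.0/22")
affinity = assign_blocks(["node-a", "node-b", "node-c"], blocks)
for node, block in affinity.items():
    print(node, block, block.num_addresses)
```

Each block in this example is what a node would announce over BGP, so peers only need one route per block rather than one per pod.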
Pod‑to‑Pod Communication
When a pod on one node contacts a pod on another node, the packet traverses the host's veth pair, hits the host routing table (populated by BGP), and is forwarded to the destination node where the target pod's veth interface receives it. The article includes a diagram of this flow.
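The forwarding decision on the host is a longest-prefix match against the routing table. The sketch below imitates that lookup in Python; the routes, interface name and addresses are hypothetical and do not come from Qunar's environment.

```python
import ipaddress
from typing import List, Optional, Tuple

Route = Tuple[str, str]  # (destination prefix, next hop or device)

# Hypothetical routing table on node-a.
NODE_A_ROUTES: List[Route] = [
    ("10.244.0.5/32", "dev cali-pod1"),      # local pod, reached via its veth
    ("10.244.0.64/26", "via 192.168.1.11"),  # node-b's block, learned via BGP
    ("0.0.0.0/0", "via 192.168.1.1"),        # default route to the rack switch
]


def lookup(dst: str, routes: List[Route]) -> Optional[str]:
    """Return the next hop of the most specific route that contains dst."""
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, next_hop in routes:
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, next_hop)
    return best[1] if best else None


print(lookup("10.244.0.70", NODE_A_ROUTES))  # pod on node-b
print(lookup("10.244.0.5", NODE_A_ROUTES))   # pod on this node
```

A packet for a remote pod matches the block route learned over BGP and is sent to the owning node, where a /32 route to the pod's veth completes delivery.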
Pod‑to‑External Communication
External traffic leaves the Kubernetes node via the bonded 10 Gbps NICs, reaches the rack switches (iBGP), then the core switches (eBGP), and finally the broader data‑center network. In the original article, traceroute output shows each hop from the container host to an external service such as GitLab.
Issues Encountered
Calico IPAM assigns IPs using the following logic:
1. If a node already has a bound IP block, allocate from that block.
2. If no IP is available in the bound block, bind an unbound block from the IP pool to the node and allocate an IP from it.
3. If step 2 fails, search all IP blocks for an unused IP.
This logic can lead to "IP borrowing" when a new node joins after all blocks have been allocated: the node receives an IP from a block that is already bound to another node. Calico creates a BGP blackhole route for that block, so the borrowed IP is unreachable. Prior to Calico 3.14 the workaround was to monitor IP block usage and add new pools before the blocks ran out; from 3.14 onward the strict IP affinity option can be enabled to disable IP borrowing.
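The allocation steps and the effect of strict affinity can be modeled in a few lines. This is a simplified sketch of the logic as described above, not Calico's implementation: the `BlockIpam` class, its method names and the tiny example pool are all hypothetical.

```python
import ipaddress
from typing import Dict, List, Optional, Set, Tuple


class BlockIpam:
    """Toy model of block-based IP allocation with optional borrowing."""

    def __init__(self, pool_cidr: str, block_prefix: int = 26,
                 strict_affinity: bool = False) -> None:
        pool = ipaddress.ip_network(pool_cidr)
        self.free_blocks = list(pool.subnets(new_prefix=block_prefix))
        self.affinity: Dict[str, List[ipaddress.IPv4Network]] = {}
        self.used: Set[ipaddress.IPv4Address] = set()
        self.strict_affinity = strict_affinity

    def _take(self, block: ipaddress.IPv4Network) -> Optional[ipaddress.IPv4Address]:
        # Iterating a network yields every address in it, in order.
        for addr in block:
            if addr not in self.used:
                self.used.add(addr)
                return addr
        return None

    def allocate(self, node: str) -> Optional[Tuple[ipaddress.IPv4Address, str]]:
        """Return (address, block owner), or None if nothing can be allocated."""
        # Step 1: use a block already bound to this node.
        for block in self.affinity.get(node, []):
            addr = self._take(block)
            if addr is not None:
                return addr, node
        # Step 2: bind an unbound block from the pool to this node.
        if self.free_blocks:
            block = self.free_blocks.pop(0)
            self.affinity.setdefault(node, []).append(block)
            return self._take(block), node
        # Step 3: borrow from another node's block, unless strict affinity is on.
        if self.strict_affinity:
            return None
        for owner, blocks in self.affinity.items():
            for block in blocks:
                addr = self._take(block)
                if addr is not None:
                    return addr, owner
        return None


# Hypothetical pool with only two 4-address blocks, so a third node must borrow.
ipam = BlockIpam("10.0.0.0/29", block_prefix=30)
for node in ("node-a", "node-b", "node-c"):
    print(node, ipam.allocate(node))
```

When the returned owner differs from the requesting node, the address was borrowed; that is the case the blackhole route makes unreachable. With `strict_affinity=True` the same request returns `None` instead, so the failure surfaces at allocation time rather than as a silent connectivity problem.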
Conclusion
Calico provides a simple, efficient, and stable L3 networking solution suitable for large‑scale production environments. Optional components such as Typha further extend its capacity for very large container platforms, and ongoing configuration tuning can address evolving business needs.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
