
Evolution of Ctrip’s Cloud Network Architecture: From VLAN to SDN and Cloud‑Native Solutions

This article details Ctrip’s multi‑generation cloud networking solutions, tracing the progression from early VLAN‑based Layer‑2 designs through SDN‑enabled large‑scale networks to container‑centric and cloud‑native architectures, highlighting hardware topologies, software integrations, and operational lessons for large‑scale data‑center environments.

Ctrip Technology

Author Bio: Zhao Yanan, senior architect at Ctrip Cloud Platform, joined Ctrip Cloud Computing in 2016 and has worked on OpenStack, SDN, container networking (Mesos, K8S), container image storage, and distributed storage. Zhao now leads the Ctrip Cloud Network & Storage Team, which focuses on network and distributed storage R&D.

This article introduces the successive generations of network solutions that Ctrip has built for its private and public clouds, as a reference for peers who design and maintain networks of similar scale.

1. Introduction to Ctrip Cloud Platform

The Ctrip Cloud team was founded around 2013. It initially built a private cloud on OpenStack, later developed its own bare‑metal system integrated with OpenStack, and in recent years has deployed Mesos and K8S platforms and integrated public clouds.

All cloud services are unified under CDOS (Ctrip Data Center Operation System), a hybrid‑cloud platform that manages compute, network, and storage resources across private and public clouds.

Fig 1. Ctrip Data Center Operation System (CDOS)

In the private cloud, resources include VMs, bare‑metal hosts, and containers; in the public cloud, Ctrip integrates with AWS, Tencent Cloud, UCloud, etc., exposing VMs and containers via the CDOS API.

Network Evolution Timeline

Fig 2. Timeline of the Network Architecture Evolution

The early OpenStack deployment used a simple VLAN Layer‑2 network with a traditional three‑tier hardware network. As scale grew, a Spine‑Leaf hardware architecture and a large‑scale SDN‑based Layer‑2 network were introduced in 2016. In 2017, container platforms (Mesos, K8S) were added, and by 2019 the team began exploring cloud‑native solutions.

2. VLAN‑Based Layer‑2 Network

The first generation started in 2013, when the OpenStack private cloud began providing VMs and bare‑metal hosts.

2.1 Requirements

High performance (low latency, high throughput).

Layer‑2 isolation to avoid broadcast storms.

Routable IPs for instances (no tunnelling).

Security can be relaxed in favor of performance, compensated by external firewalls.

2.2 Solution: OpenStack Provider Network Model

The team selected the OpenStack provider network model, where the host’s internal Layer‑2 switch can be OVS, Linux Bridge, or vendor‑specific, and the external network uses hardware switches for Layer‑2 and routers for Layer‑3 without overlay encapsulation.

Fig 3. OpenStack Provider Network Model

Key characteristics:

Gateway resides on hardware switches.

Instance IPs are routable, no tunnelling required.

Better performance than pure‑software solutions.

The deployment uses VLAN for Layer‑2 isolation and OVS (through ML2) as the L2 agent. It runs no L3 agent, no DHCP agent, and no floating IPs, and it omits security groups for performance.

2.3 Hardware Network Topology

Fig 4. Physical Network Topology in the Datacenter

Features:

Each server has two NICs connected to two top‑of‑rack switches for high availability.

Access and aggregation layers use Layer‑2 switching; core layer uses Layer‑3 routing.

OpenStack gateways are configured on core routers.

Firewalls are directly attached to core routers.

2.4 Host‑Internal Network Topology

Fig 5. Designed Virtual Network Topology within a Compute Node

Key points:

Two OVS bridges (br‑int and br‑bond) are linked; the two physical NICs are bonded via OVS.

Management IP is assigned to br‑bond.

All instance ports attach to br‑int.

Cross‑subnet communication traverses br‑int → br‑bond → physical NIC → switch → router → back, totaling 18 hops (vs. 24 hops in the legacy OpenStack model that inserts a Linux bridge for security‑group processing).

2.5 Summary of First‑Generation Network

Advantages

Removed unnecessary OpenStack components (L3 agent, DHCP agent, Neutron metadata agent), reducing operational complexity.

Simplified host‑internal topology, shortening forwarding paths and reducing latency.

Hardware‑based gateway improves performance.

Routable instance IPs simplify monitoring and tracing.

Disadvantages

Security groups were removed, weakening host‑level firewalling (partially compensated by external firewalls).

Network provisioning still required manual configuration on core switches, posing operational risk.

3. SDN‑Based Large‑Scale Layer‑2 Network

The first‑generation design became insufficient as scale grew and micro‑service adoption increased.

3.1 New Challenges

Three‑tier hardware architecture limited scalability and created a performance bottleneck at the core.

Broadcast storms inside large VLANs, and the 4096‑ID limit of the 12‑bit VLAN ID field.

1 Gbps NICs on hosts became a bottleneck.

Multi‑tenant and VPC requirements after acquisitions.

Need for automated network configuration and reduced operational risk.

3.2 Solution: OpenStack + SDN

A hybrid software‑hardware solution was designed, evolving from a simple Layer‑2 network to a large‑scale SDN‑enabled architecture.

Hardware Topology

Fig 7. Spine‑Leaf Topology in the New Datacenter

Spine‑Leaf connects every leaf switch to every spine switch (a full mesh between the two layers). This gives shorter forwarding paths (at most three switch hops between any two servers), better horizontal scalability, and higher fault tolerance.
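The three‑hop property can be checked with a small graph model. The following is a minimal sketch, not Ctrip's tooling: the switch and server names and the fabric size are made up for illustration, and the only assumption is that every leaf connects to every spine while leaves never connect to each other.

```python
from collections import deque
from itertools import product


def build_spine_leaf(num_spines, num_leaves, servers_per_leaf):
    """Return an adjacency map for a spine-leaf fabric with attached servers."""
    graph = {}

    def link(a, b):
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    spines = [f"spine{i}" for i in range(num_spines)]
    leaves = [f"leaf{i}" for i in range(num_leaves)]
    # Every leaf connects to every spine; leaves never connect to each other.
    for spine, leaf in product(spines, leaves):
        link(spine, leaf)
    # Each server hangs off exactly one leaf in this simplified model.
    for index, leaf in enumerate(leaves):
        for s in range(servers_per_leaf):
            link(leaf, f"server{index}-{s}")
    return graph


def switch_hops(graph, src, dst):
    """Count the switches on a shortest path between two distinct servers (BFS)."""
    queue = deque([(src, 0)])
    seen = {src}
    while queue:
        node, edges = queue.popleft()
        if node == dst:
            return edges - 1  # number of links minus one = switches traversed
        for neighbor in graph[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, edges + 1))
    raise ValueError("no path between the two nodes")


fabric = build_spine_leaf(num_spines=2, num_leaves=4, servers_per_leaf=2)
print(switch_hops(fabric, "server0-0", "server3-1"))  # different leaves
print(switch_hops(fabric, "server0-0", "server0-1"))  # same leaf
```

Servers under different leaves always traverse leaf, spine, leaf, regardless of how many leaves are added, which is why the fabric scales horizontally without lengthening paths.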

Host NICs were upgraded to 10 Gbps/25 Gbps.

SDN Control and Data Plane

Data plane uses VxLAN; control plane uses MP‑BGP EVPN to synchronize state across devices. Gateways are distributed; each leaf acts as a gateway.
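To make the scale difference between VLAN and VxLAN concrete, the sketch below compares the two identifier spaces and packs the 8‑byte VxLAN header as laid out in RFC 7348. The helper names `vxlan_header` and `parse_vni` are illustrative and are not taken from the article.

```python
import struct

VLAN_ID_BITS = 12    # IEEE 802.1Q VLAN ID field
VXLAN_VNI_BITS = 24  # RFC 7348 VXLAN Network Identifier (VNI) field


def vxlan_header(vni):
    """Pack the 8-byte VXLAN header described in RFC 7348."""
    if not 0 <= vni < 2 ** VXLAN_VNI_BITS:
        raise ValueError("VNI must fit in 24 bits")
    flags = 0x08  # the "I" flag marks the VNI field as valid
    # Word 1: flags (8 bits) + reserved (24 bits).
    # Word 2: VNI (24 bits) + reserved (8 bits).
    return struct.pack("!II", flags << 24, vni << 8)


def parse_vni(header):
    """Extract the VNI from a packed VXLAN header."""
    _, second_word = struct.unpack("!II", header)
    return second_word >> 8


print(2 ** VLAN_ID_BITS)    # 4096 possible VLAN IDs
print(2 ** VXLAN_VNI_BITS)  # 16777216 possible VNIs
print(parse_vni(vxlan_header(5000)))  # 5000
```

The 24‑bit VNI is what removes the VLAN ceiling listed among the challenges in 3.1, at the cost of encapsulating each frame inside UDP.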

SDN Components and Implementation

The team developed a proprietary SDN controller called Ctrip Network Controller (CNC). It is a centralized controller that manages all spine and leaf nodes and integrates with OpenStack Neutron via a plugin.

Neutron extensions:

Added ML2 and L3 plugins to integrate with CNC.

Redesigned port state machine to model both underlay and overlay.

New APIs for CNC interaction.

Monitoring panel (Fig 8) visualizes port states, indicating whether issues lie in underlay or overlay.

Fig 8. Monitoring Panel for Neutron Port States

3.3 Software + Hardware Network Topology

Fig 9. HW + SW Topology of the Designed SDN Solution

Key points:

Leaf devices form the boundary; underlay (VLAN) runs below leaf, overlay (VxLAN) runs above.

Underlay is controlled by Neutron, OVS, and the Neutron OVS agent; overlay is controlled by CNC.

Neutron and CNC communicate via the custom plugin.

3.4 Instance Creation Network Flow

Fig 10. Flow of Spawning an Instance

The flow runs as follows: Nova requests a port from Neutron, Neutron invokes CNC via the post‑commit hook, and CNC programs the appropriate leaf switches. The port reaches ACTIVE_ACTIVE status once both underlay and overlay are configured.
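A two‑part port status of this kind can be modeled as below. This is a hypothetical sketch and not the actual Neutron extension: only the ACTIVE_ACTIVE name comes from the article, while the DOWN value, the underlay‑first ordering of the two halves, and the `Port` and `diagnose` names are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Port:
    """A port whose status combines an underlay half and an overlay half."""
    name: str
    underlay: str = "DOWN"  # assumed value; set once OVS on the host is wired up
    overlay: str = "DOWN"   # assumed value; set once the leaf switch is programmed

    @property
    def status(self) -> str:
        # Assumed ordering: underlay first, overlay second.
        return f"{self.underlay}_{self.overlay}"


def diagnose(port: Port) -> str:
    """Point an operator at the layer that still needs attention."""
    if port.status == "ACTIVE_ACTIVE":
        return "port is fully configured"
    if port.underlay != "ACTIVE":
        return "check the underlay (host OVS / VLAN side)"
    return "check the overlay (leaf / VxLAN side)"


port = Port("vm-1-port")
port.underlay = "ACTIVE"      # the OVS agent has finished on the host
print(port.status, "->", diagnose(port))
port.overlay = "ACTIVE"       # the controller has programmed the leaf
print(port.status, "->", diagnose(port))
```

Keeping the two halves separate is what lets a monitoring view say whether a stuck port is a host‑side or a fabric‑side problem.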

3.5 Summary

Hardware: Migration from three‑tier to Spine‑Leaf reduces latency, improves fault tolerance, and enables distributed gateways.

Software: CNC provides dynamic network configuration; integration with Neutron supports both VMs and bare‑metal hosts.

Multi‑Tenant Support: Hard multi‑tenancy is now available.

4. Container and Hybrid‑Cloud Networking

Starting in 2017, Ctrip introduced container platforms (K8S, Mesos) on both private and public clouds, migrating workloads from VMs and bare‑metal to containers.

4.1 Private‑Cloud K8S Network Solution

4.1.1 Requirements

High‑performance, highly concurrent network APIs.

Fast creation and deletion of container networks.

Compatibility with existing systems; a container's IP must remain stable during pod migration.

4.1.2 Solution: Extend Existing SDN for Mesos/K8S

Neutron and CNC were extended to manage container networks. A custom Neutron CNI plugin was developed for K8S, reusing the existing OVS, CNC, and Neutron infrastructure.

Added label‑based network selection to decouple external systems from OpenStack specifics.

Implemented batch IP allocation, asynchronous APIs, and database optimizations.

Back‑ported features such as graceful OVS agent restart.
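The label‑based selection and batch allocation listed above might look roughly like the following. It is an in‑memory sketch only: the `Network` class, the label keys, and the CIDRs are invented for illustration and do not reflect Ctrip's Neutron API.

```python
import ipaddress


class Network:
    """A network identified by labels rather than by an OpenStack network ID."""

    def __init__(self, name, cidr, labels):
        self.name = name
        self.labels = labels
        self._free = list(ipaddress.ip_network(cidr).hosts())

    def allocate_batch(self, count):
        """Hand out `count` addresses in a single call instead of one per request."""
        if count > len(self._free):
            raise RuntimeError(f"network {self.name} has too few free addresses")
        batch, self._free = self._free[:count], self._free[count:]
        return [str(ip) for ip in batch]


def select_network(networks, wanted_labels):
    """Return the first network whose labels include every requested label."""
    for network in networks:
        if all(network.labels.get(k) == v for k, v in wanted_labels.items()):
            return network
    raise LookupError(f"no network matches {wanted_labels}")


networks = [
    Network("net-dev", "10.0.0.0/28", {"env": "dev"}),
    Network("net-prod", "10.0.1.0/28", {"env": "prod", "az": "a"}),
]
chosen = select_network(networks, {"env": "prod"})
print(chosen.name, chosen.allocate_batch(3))
```

The caller only states intent (for example an environment label), so an external scheduler never needs to know OpenStack network IDs, and one batched call replaces several round trips to the IPAM.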

4.1.3 Container Drift (IP Preservation)

When a pod drifts to another host, the CNI plugin uses the pod’s label‑derived port name to retrieve the original port from Neutron, preserving the IP. The new host updates the port’s host_id, CNC removes the old leaf configuration and programs the new leaf.

Fig 11. Pod drifting with the same IP within a K8S cluster
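The drift flow can be sketched with two in‑memory stand‑ins, one for the port database and one for the controller. Everything here is hypothetical (class names, the host‑to‑leaf mapping, the port name format); the sketch only mirrors the sequence described above: look up the port by name, keep its IP, update host_id, and move the leaf configuration.

```python
class PortStore:
    """In-memory stand-in for the port database (hypothetical)."""

    def __init__(self, free_ips):
        self._free = list(free_ips)
        self._ports = {}

    def get_or_create(self, port_name, host_id):
        port = self._ports.get(port_name)
        if port is None:
            # First scheduling of this pod: allocate a fresh IP.
            port = {"name": port_name, "ip": self._free.pop(0), "host_id": host_id}
            self._ports[port_name] = port
        return port


class Controller:
    """Stand-in for the SDN controller: tracks which leaf serves each port."""

    def __init__(self, host_to_leaf):
        self.host_to_leaf = host_to_leaf
        self.programmed = {}

    def program(self, port):
        # Overwriting the entry models removing the old leaf configuration
        # and programming the leaf above the new host.
        self.programmed[port["name"]] = self.host_to_leaf[port["host_id"]]


def cni_add(store, controller, port_name, host_id):
    """What a CNI ADD call does in this sketch when a pod lands on a host."""
    port = store.get_or_create(port_name, host_id)
    port["host_id"] = host_id  # the new host claims the existing port
    controller.program(port)
    return port["ip"]


store = PortStore(["10.0.0.11", "10.0.0.12"])
controller = Controller({"host-a": "leaf-1", "host-b": "leaf-2"})
before = cni_add(store, controller, "pod-web-0", "host-a")
after = cni_add(store, controller, "pod-web-0", "host-b")  # the pod drifts
print(before, after, controller.programmed["pod-web-0"])
```

Because the port is keyed by a name derived from the pod rather than by the host, the second call finds the existing record and the IP survives the move.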

4.1.4 Summary

Container networking was integrated into the existing SDN stack without changing the underlying infrastructure.

A unified IPAM now serves VMs, bare‑metal hosts, and containers.

The deployment spans 4 availability zones, with more than 500 nodes per zone, more than 20,000 instances (mostly containers), and up to 500+ pods per node.

4.2 Public‑Cloud K8S

For overseas deployments, Ctrip purchases VMs or bare‑metal from public‑cloud providers (e.g., AWS) and runs self‑managed K8S clusters. CDOS APIs abstract provider differences, enabling a hybrid‑cloud model.

4.2.2 AWS K8S Network Solution

EC2 instances serve as K8S nodes; a custom CNI plugin dynamically attaches/detaches ENIs to containers, inspired by Lyft and Netflix implementations.

A global IPAM within the VPC manages network resources, invoking AWS APIs for allocation and release. The CNI also supports floating IP attachment and maintains IP stability during pod migration.

Fig 13. K8S network solution on public cloud vendor (AWS)

4.2.3 Global VPC Topology

Fig 14. VPCs distributed over the globe

Private‑cloud VPCs exist in Shanghai and Nantong; public‑cloud VPCs are deployed in Seoul, Moscow, Frankfurt, California, Hong Kong, Melbourne, etc. Non‑overlapping IP ranges enable routed connectivity via dedicated links.
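Routed connectivity between VPCs depends on their address ranges never overlapping, which is easy to verify mechanically. The check below is a small sketch using Python's standard `ipaddress` module; the VPC names and CIDRs are invented placeholders, not Ctrip's real address plan.

```python
import ipaddress
from itertools import combinations


def find_overlaps(vpcs):
    """Return every pair of VPC names whose CIDR ranges overlap."""
    networks = {name: ipaddress.ip_network(cidr) for name, cidr in vpcs.items()}
    return [
        (a, b)
        for (a, net_a), (b, net_b) in combinations(networks.items(), 2)
        if net_a.overlaps(net_b)
    ]


# Placeholder address plan (assumed values for illustration only).
vpcs = {
    "shanghai": "10.0.0.0/14",
    "frankfurt": "10.8.0.0/16",
    "seoul": "10.9.0.0/16",
}
print(find_overlaps(vpcs))  # []

vpcs["new-region"] = "10.8.128.0/17"  # collides with the frankfurt range
print(find_overlaps(vpcs))  # [('frankfurt', 'new-region')]
```

Running such a check before a new VPC is provisioned catches collisions before any route is advertised over the dedicated links.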

5. Cloud‑Native Solution Exploration

While the current architecture meets near‑term needs, emerging challenges include centralized IPAM bottlenecks, lack of local IPAM for containers, increased fault scope due to IP drift, switch table pressure from high container density, and growing Layer‑3‑7 firewall requirements.

5.1 Cilium Overview

Cilium leverages eBPF/XDP for high‑performance networking and security, requiring Linux kernel 4.8+ (preferably 4.14+). It provides L4‑L7 security policies, O(1) policy enforcement, dual‑stack support, and can run on top of Flannel.

Fig 15. Cilium
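The O(1) policy enforcement claim is easiest to see by contrasting an ordered rule scan with a hash lookup. The sketch below is a deliberate simplification in plain Python: it is not how iptables or Cilium is implemented, and the rule tuples are invented, but it shows why lookup cost stays flat when rules are stored in a keyed map instead of a list.

```python
def linear_match(rules, src, dst, port):
    """Ordered scan: cost grows with the number of rules checked."""
    checked = 0
    for rule in rules:
        checked += 1
        if rule == (src, dst, port):
            return True, checked
    return False, checked


def hashed_match(rule_set, src, dst, port):
    """Keyed lookup: a single hash probe regardless of rule count."""
    return (src, dst, port) in rule_set


# 1000 invented allow rules: each app may reach the database on port 5432.
rules = [(f"app{i}", "db", 5432) for i in range(1000)]
rule_set = set(rules)

print(linear_match(rules, "app999", "db", 5432))     # (True, 1000)
print(hashed_match(rule_set, "app999", "db", 5432))  # True
```

With a thousand rules the scan touches every entry to reach the last one, while the keyed lookup does the same work it would do for ten rules.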

5.2 Host Networking (Within a Host)

Cilium creates a cilium_host <---> cilium_net veth pair; the first IP of the managed CIDR becomes the gateway on cilium_host. For each container, the CNI plugin creates a veth pair, assigns IP, and installs BPF rules.

Fig 16. Cilium host‑networking
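The gateway rule described above (first IP of the node's managed CIDR) reduces to simple address arithmetic. A minimal sketch follows, assuming a made‑up per‑node CIDR of 10.1.5.0/24; the function names are illustrative and are not part of Cilium.

```python
import ipaddress


def node_gateway(pod_cidr):
    """Return the first usable address of the CIDR, used here as the gateway."""
    network = ipaddress.ip_network(pod_cidr)
    return str(network.network_address + 1)


def container_addresses(pod_cidr, count):
    """Return the next `count` addresses after the gateway for containers."""
    network = ipaddress.ip_network(pod_cidr)
    first = network.network_address + 2
    return [str(first + offset) for offset in range(count)]


print(node_gateway("10.1.5.0/24"))            # 10.1.5.1
print(container_addresses("10.1.5.0/24", 2))  # ['10.1.5.2', '10.1.5.3']
```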

5.3 Multi‑Host Networking

Two common approaches are supported:

VxLAN tunnel (creates cilium_vxlan device as VTEP).

BGP direct routing (requires BGP agents on each node).

5.4 Pros & Cons

Pros

Native L4‑L7 security policies via BPF.

O(1) policy enforcement, far faster than iptables.

High‑performance data plane (veth, IPVLAN).

Dual‑stack IPv4/IPv6 support.

Can run on top of Flannel for gradual migration.

Active open‑source community and commercial backing.

Cons

Requires recent Linux kernel (≥4.8, preferably ≥4.14).

Relatively new; limited large‑scale production case studies.

Steeper learning curve; BPF programming demands solid C and kernel knowledge, increasing development and operational cost.

References

OpenStack Doc: Networking Concepts

Cisco Data Center Spine‑and‑Leaf Architecture: Design Overview

ovs‑vswitchd: Fix high cpu utilization when acquire idle lock fails

openvswitch port mirroring only mirrors egress traffic

Lyft CNI plugin

Netflix: run container at scale

Cilium Project

Cilium Cheat Sheet

Cilium Code Walk Through: CNI Create Network

Amazon EKS - Managed Kubernetes Service

Cilium: API Aware Networking & Network Security for Microservices using BPF & XDP
