Design and Practices of Tongcheng‑Elong Container Platform: Cloud‑Native Architecture, Scheduling, and Resource Optimization
This article details Tongcheng‑Elong's journey from bare metal to a Kubernetes‑based cloud‑native platform. It describes the platform's architecture; the challenges of isolation, scheduling, resource utilization, and internal adoption; and the engineering solutions implemented to improve efficiency and reliability, including custom scheduling, CPU binding, fixed IPs, and over‑commit strategies.
The authors, senior architects and engineers from Tongcheng‑Elong, introduce their transition from physical machines and virtual machines to a Kubernetes‑driven cloud‑native container platform, aiming to increase resource utilization and provide a standardized platform for business teams.
The platform’s architecture consists of five application categories, a front‑end layer, an API aggregation layer, and a cluster‑management layer that abstracts diverse resources (Docker, VM, bare‑metal, big‑data workloads) behind unified service APIs.
Key operational problems encountered include container isolation (CPU, NUMA, IP binding), scheduling conflicts, low resource utilization, and difficulties in promoting the platform to internal teams.
The team implemented the following solutions: extending the Kubernetes Topology Manager and kubelet to support explicit CPU core and NUMA binding; adding fixed‑IP support through CNI plugin hooks; building a custom scheduling framework (Wangler‑Schedule) whose pre‑filter, scoring, and binding interfaces handle heterogeneous resources and can simulate scheduling outcomes; collecting detailed scheduling logs to evaluate the impact of algorithm changes; and over‑committing resources through request/limit compression and node‑level over‑sell factors.
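To make the pre‑filter, scoring, and binding stages and the two over‑commit knobs more concrete, here is a minimal Python sketch. It is not Wangler‑Schedule's actual code or API: the `Node` and `Pod` classes, the `oversell_factor` and `request_ratio` fields, and the least‑allocated scoring rule are all assumptions chosen for illustration. In particular, it assumes "request/limit compression" means setting a pod's request to a fraction of its limit, and that a node‑level over‑sell factor multiplies the CPU the scheduler treats as available.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    cpu_capacity: float           # physical cores on the node
    oversell_factor: float = 1.0  # hypothetical node-level over-sell factor
    cpu_requested: float = 0.0    # sum of CPU requests already bound here

    @property
    def schedulable_cpu(self) -> float:
        # Over-sell: the scheduler sees more CPU than physically exists.
        return self.cpu_capacity * self.oversell_factor

    @property
    def free_cpu(self) -> float:
        return self.schedulable_cpu - self.cpu_requested


@dataclass
class Pod:
    name: str
    cpu_limit: float
    request_ratio: float = 1.0    # hypothetical request/limit compression ratio
    node: Optional[str] = None

    @property
    def cpu_request(self) -> float:
        # Compression: the request is a fraction of the limit.
        return self.cpu_limit * self.request_ratio


def pre_filter(pod: Pod, nodes: List[Node]) -> List[Node]:
    """Drop nodes that cannot fit the pod's (compressed) CPU request."""
    return [n for n in nodes if n.free_cpu >= pod.cpu_request]


def score(pod: Pod, node: Node) -> float:
    """Least-allocated rule (an assumption): prefer the node that would
    have the largest share of schedulable CPU left after placement."""
    return (node.free_cpu - pod.cpu_request) / node.schedulable_cpu


def bind(pod: Pod, node: Node) -> None:
    """Record the placement on both the node and the pod."""
    node.cpu_requested += pod.cpu_request
    pod.node = node.name


def schedule(pod: Pod, nodes: List[Node]) -> Optional[str]:
    """Run pre-filter, scoring, and binding; return the chosen node name,
    or None when no node can fit the pod."""
    candidates = pre_filter(pod, nodes)
    if not candidates:
        return None
    best = max(candidates, key=lambda n: score(pod, n))
    bind(pod, best)
    return best.name
```

With two 8‑core nodes that each already carry 6 cores of requests, one of them with an over‑sell factor of 1.5, a pod with a 4‑core limit and a 0.5 request ratio (a 2‑core request) fits on both nodes but scores higher on the over‑sold one. Running `schedule` against a deep copy of the node list is one simple way to simulate an outcome without committing it.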
Additional practices cover cloud‑native application migration (e.g., Flink Job/Session clusters), mixed deployment of online and offline workloads to smooth diurnal load spikes, and a set of promotion guidelines to gain business and operations trust.
The article concludes by summarizing the lessons learned, emphasizing the importance of user‑centric, low‑friction migration paths, and inviting readers interested in containers, Kubernetes, and resource scheduling to contact the team.
Tongcheng Travel Technology Center
Pursue excellence, start again with Tongcheng! More technical insights to help you along your journey and make development enjoyable.