
Design and Practices of Tongcheng‑Elong Container Platform: Cloud‑Native Architecture, Scheduling, and Resource Optimization

This article details Tongcheng‑Elong's journey from bare‑metal to a Kubernetes‑based cloud‑native platform, describing its architecture, the challenges of isolation, scheduling, resource utilization and promotion, and the engineering solutions—including custom scheduling, CPU binding, IP fixation, and over‑commit strategies—implemented to improve efficiency and reliability.

Tongcheng Travel Technology Center

The authors, senior architects and engineers at Tongcheng‑Elong, describe their transition from physical and virtual machines to a Kubernetes‑driven cloud‑native container platform, with the goals of raising resource utilization and giving business teams a standardized platform.

The platform’s architecture consists of five application categories, a front‑end layer, an API aggregation layer, and a cluster‑management layer that abstracts diverse resources (Docker, VM, bare‑metal, big‑data workloads) behind unified service APIs.

Key operational problems encountered include container isolation (CPU, NUMA, IP binding), scheduling conflicts, low resource utilization, and difficulties in promoting the platform to internal teams.

Solutions implemented include:

- extending the Kubernetes Topology Manager and Kubelet to support explicit CPU‑core and NUMA binding;
- adding fixed‑IP capability through CNI plugin hooks;
- building a custom scheduling framework (Wangler‑Schedule) with pre‑filter, scoring, and binding interfaces that handles heterogeneous resources and can simulate scheduling outcomes;
- collecting detailed scheduling logs to evaluate the impact of algorithm changes;
- applying resource over‑commit through request/limit compression and node‑level over‑sell factors.
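To make the pre‑filter → score → bind pipeline concrete, here is a minimal, self‑contained sketch of that control flow. All names and the least‑allocated scoring rule are illustrative assumptions for exposition; this is not the actual Wangler‑Schedule code.

```python
# Toy filter -> score -> bind scheduling pipeline, loosely mirroring the
# extension points described in the article. Illustrative only.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    cpu_free: float        # allocatable CPU cores remaining
    mem_free: float        # allocatable memory (GiB) remaining
    bound: list = field(default_factory=list)

@dataclass
class Pod:
    name: str
    cpu_req: float
    mem_req: float

def pre_filter(pod: Pod, nodes: list) -> list:
    """Drop nodes that cannot fit the pod's resource requests."""
    return [n for n in nodes
            if n.cpu_free >= pod.cpu_req and n.mem_free >= pod.mem_req]

def score(pod: Pod, node: Node) -> float:
    """Least-allocated style score: prefer nodes with more headroom left."""
    cpu_left = (node.cpu_free - pod.cpu_req) / node.cpu_free
    mem_left = (node.mem_free - pod.mem_req) / node.mem_free
    return (cpu_left + mem_left) / 2

def bind(pod: Pod, node: Node) -> None:
    """Commit the placement by debiting the node's free resources."""
    node.cpu_free -= pod.cpu_req
    node.mem_free -= pod.mem_req
    node.bound.append(pod.name)

def schedule(pod: Pod, nodes: list):
    feasible = pre_filter(pod, nodes)
    if not feasible:
        return None            # unschedulable; a real framework would queue and retry
    best = max(feasible, key=lambda n: score(pod, n))
    bind(pod, best)
    return best.name

nodes = [Node("node-a", cpu_free=4, mem_free=16),
         Node("node-b", cpu_free=16, mem_free=64)]
print(schedule(Pod("web-1", cpu_req=2, mem_req=4), nodes))  # -> node-b (more headroom)
```

The same three-stage shape is what makes simulation cheap: running `pre_filter` and `score` against a copy of the cluster state predicts a placement without calling `bind` on real nodes.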

Additional practices cover cloud‑native application migration (e.g., Flink Job/Session clusters), mixed deployment of online and offline workloads to smooth diurnal load spikes, and a set of promotion guidelines to gain business and operations trust.
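The interplay of over‑sell factors and online/offline mixing can be sketched as simple arithmetic: offline batch work draws on oversold capacity only while measured online utilization stays below a safety threshold. The threshold and over‑sell factor below are assumed example values, not the platform's published settings.

```python
# Illustrative node-level over-commit plus online/offline mixing.
# Offline jobs fill the night-time trough; a utilization guard squeezes
# them out during the daytime online peak. Values are assumptions.
def offline_quota(capacity_cores: float,
                  oversell_factor: float,
                  online_used_cores: float,
                  safety_threshold: float = 0.7) -> float:
    """CPU cores offline work may use right now (0 when online load is high)."""
    if online_used_cores / capacity_cores >= safety_threshold:
        return 0.0                      # protect latency-sensitive online traffic
    oversold = capacity_cores * oversell_factor
    return max(0.0, oversold - online_used_cores)

# Daytime peak: online traffic is high, so offline work gets nothing.
print(offline_quota(capacity_cores=32, oversell_factor=1.5, online_used_cores=24))  # -> 0.0
# Night-time trough: oversold headroom absorbs batch jobs.
print(offline_quota(capacity_cores=32, oversell_factor=1.5, online_used_cores=8))   # -> 40.0
```

The over‑sell factor lets a 32‑core node advertise 48 schedulable cores; the guard is what keeps that bet safe when diurnal online load returns.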

The article concludes by summarizing the lessons learned, emphasizing the importance of user‑centric, low‑friction migration paths, and inviting readers interested in containers, Kubernetes, and resource scheduling to contact the team.

Tags: cloud-native, performance optimization, kubernetes, infrastructure, resource scheduling, container platform