Migrating Massive Big‑Data Services to Kubernetes: Lessons from Tongcheng‑eLong
This article details how Tongcheng‑eLong moved hundreds of storage and compute services from Docker‑Host deployments to a Kubernetes‑based platform. It covers network integration, IP management, service synchronization, storage strategies, operator development, monitoring, and logging, along with the challenges the team encountered and their future plans.
Background and Motivation
The team behind Tongcheng‑eLong's data‑center backbone handles massive amounts of data every day. In 2018 they migrated all services to Docker Host mode, but script‑based container management left resource pools fragmented, kept rolling updates and failover manual, and provided no unified view of cluster resources. To address these issues they began moving services to Kubernetes in 2019, starting with storage services such as Elasticsearch, TiKV, Kudu, and Kafka, and compute services such as Hive and Spark SQL.
Connecting Kubernetes to the Outside World
The cluster uses an OVS virtual switch to bridge the physical network and the container network, creating a single L2 domain so external machines can reach pods. Pods receive stable IPs via an "IP Local" scheme that records the mapping of Namespace+PodName to IP in a dedicated etcd, retaining the IP for three days. For special cases they can assign fixed IPs, though this requires controller support for custom objects or StatefulSets.
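The "IP Local" idea can be illustrated with a minimal sketch. It assumes an in-memory dictionary in place of the dedicated etcd, and the class and method names (`IpAllocator`, `acquire`, `release`) are hypothetical; only the Namespace+PodName key and the three-day retention come from the description above.

```python
import ipaddress
import time

RETENTION_SECONDS = 3 * 24 * 3600  # three-day retention, as stated in the article


class IpAllocator:
    """In-memory stand-in for the dedicated etcd store (hypothetical API)."""

    def __init__(self, cidr, clock=time.time):
        self._free = [str(ip) for ip in ipaddress.ip_network(cidr).hosts()]
        self._leases = {}  # "namespace/pod" -> {"ip": ..., "released_at": ...}
        self._clock = clock

    def acquire(self, namespace, pod_name):
        """Return the IP already bound to this pod name, or allocate a new one."""
        self._expire()
        key = f"{namespace}/{pod_name}"
        lease = self._leases.get(key)
        if lease is None:
            if not self._free:
                raise RuntimeError("IP pool exhausted")
            lease = {"ip": self._free.pop(0), "released_at": None}
            self._leases[key] = lease
        lease["released_at"] = None  # the pod is running again
        return lease["ip"]

    def release(self, namespace, pod_name):
        """Mark the pod as gone; its IP stays reserved during the retention window."""
        lease = self._leases.get(f"{namespace}/{pod_name}")
        if lease is not None:
            lease["released_at"] = self._clock()

    def _expire(self):
        """Return IPs to the pool once their retention window has passed."""
        now = self._clock()
        for key, lease in list(self._leases.items()):
            released = lease["released_at"]
            if released is not None and now - released > RETENTION_SECONDS:
                self._free.append(lease["ip"])
                del self._leases[key]
```

A pod that is recreated under the same name within three days gets its old address back; after that the address returns to the pool.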
Network cleanup is handled by periodically removing stale veth pairs created by the Contiv netplugin.
Service exposure is synchronized with the internal TVS (four‑layer load balancer) using the Kubernetes Java client to watch Service and Endpoints events. Because leader switches can replay old events, the synchronization logic is made idempotent and checks for Service existence before deleting TVS VIPs.
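One way to make such a synchronization idempotent is to ignore what the event says and reconcile against the current cluster state instead. The sketch below is an illustration in Python rather than the team's Java code; the dictionaries standing in for the API server and for TVS, and the function names, are assumptions.

```python
def reconcile(service_name, cluster_services, tvs_vips):
    """Bring the TVS view in line with the cluster for one Service.

    cluster_services: dict of Service name -> list of backend "ip:port" strings,
                      read from the current cluster state, not from the event.
    tvs_vips:         dict of Service name -> backends registered in TVS.
    Returns the action taken so the caller can log it.
    """
    desired = cluster_services.get(service_name)
    if desired is None:
        # The Service really no longer exists: remove the VIP if still present.
        if service_name in tvs_vips:
            del tvs_vips[service_name]
            return "deleted"
        return "noop"
    desired = sorted(desired)
    if tvs_vips.get(service_name) == desired:
        return "noop"  # already in sync, so replayed events change nothing
    action = "updated" if service_name in tvs_vips else "created"
    tvs_vips[service_name] = desired
    return action


def handle_event(event_type, service_name, cluster_services, tvs_vips):
    """The event type is deliberately ignored; only current state is trusted."""
    return reconcile(service_name, cluster_services, tvs_vips)
```

With this shape, a stale DELETED event replayed after a leader switch is harmless: the VIP is only removed when the Service is actually absent from the cluster.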
To make in‑cluster DNS names resolvable across the company, the team adjusted the cluster's domain suffix, restarted CoreDNS, and synchronized the resulting records to the corporate DNS service.
Storage and Compute on Kubernetes
Compute workloads (Hive, Spark SQL, TensorFlow Server) are deployed in separate namespaces with node‑label‑based resource pools. Hive is managed via a custom Operator built on top of an advanced StatefulSet controller (AdvanceStatefulSet) that allows explicit pod ordinal shutdown.
For storage, the team adopts a hybrid Local PV + Ceph approach: performance‑critical services (Elasticsearch, TiKV, Kafka) use Local PV on SSDs, while less critical components (Jupyter) use Ceph RBD images. Initially they created Local PV in bulk, later moving to on‑demand PV creation to better control disk usage.
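As a rough illustration of on‑demand creation, the sketch below builds a Local PV manifest for a single disk at the moment a service asks for it. The field layout follows the standard Kubernetes PersistentVolume schema; the function name, the `local-ssd` storage class, and the example node and path are assumptions, not values from the team's setup.

```python
def local_pv_manifest(name, node, path, size_gi, storage_class="local-ssd"):
    """Build a PersistentVolume manifest for one local disk on one node.

    The storage class name and the arguments used below are illustrative.
    """
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": name},
        "spec": {
            "capacity": {"storage": f"{size_gi}Gi"},
            "accessModes": ["ReadWriteOnce"],
            "persistentVolumeReclaimPolicy": "Retain",
            "storageClassName": storage_class,
            "volumeMode": "Filesystem",
            "local": {"path": path},
            # Local PVs must be pinned to the node that owns the disk.
            "nodeAffinity": {
                "required": {
                    "nodeSelectorTerms": [
                        {
                            "matchExpressions": [
                                {
                                    "key": "kubernetes.io/hostname",
                                    "operator": "In",
                                    "values": [node],
                                }
                            ]
                        }
                    ]
                }
            },
        },
    }


# Hypothetical usage: one 500 Gi SSD on node-01, created only when needed.
pv = local_pv_manifest("es-data-0-pv", "node-01", "/mnt/ssd0", 500)
```

Generating the manifest per request, rather than pre-creating a PV for every disk, keeps unused disks out of the scheduler's view until a service actually needs them.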
Elasticsearch is deployed via a StatefulSet; because standard StatefulSet cannot delete arbitrary pods, they built an "AdvanceStatefulSet" that accepts a pod ordinal to bring down a specific pod. Example configuration:
offlineStrategy:
  podOrdinals:
    - 0
They considered the official TiDB Operator but chose not to adopt it due to its beta status and missing CRD support for their custom StatefulSet.
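The effect of the pod‑ordinal offline strategy in the AdvanceStatefulSet configuration can be sketched as a small planning function. The article does not describe the controller's internals, so this assumes the replica count stays unchanged and the listed ordinals are simply taken offline; `plan_pods` is a hypothetical name.

```python
def plan_pods(name, replicas, offline_ordinals=()):
    """Split a StatefulSet-style pod list into running and offline pods.

    Unlike a plain scale-down, which only removes the highest ordinals,
    any ordinal in range can be taken offline here.
    """
    offline = set(offline_ordinals)
    invalid = sorted(o for o in offline if not 0 <= o < replicas)
    if invalid:
        raise ValueError(f"ordinals out of range: {invalid}")
    running = [f"{name}-{i}" for i in range(replicas) if i not in offline]
    stopped = [f"{name}-{i}" for i in sorted(offline)]
    return running, stopped


# Mirrors the configuration above: take ordinal 0 offline in a 3-replica set.
running, stopped = plan_pods("es-data", 3, [0])
```

Here `running` holds `es-data-1` and `es-data-2` while `stopped` holds `es-data-0`, which a standard StatefulSet could not express without first removing the higher ordinals.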
When deploying Kudu they encountered DNS resolution issues because external machines could not resolve internal pod hostnames; they temporarily fell back to host‑network mode with careful port management.
Other services (Kafka, TiKV, PostgreSQL, Zookeeper) were migrated straightforwardly because they already ran on Docker.
Multi‑disk Local PV proved problematic for Hadoop‑style workloads that need many disks; the scheduler could not place each PV on a distinct physical disk. To work around this they experimented with HostPath, but HostPath lacks resource management and binding to node attributes. Their current compromise combines AdvanceStatefulSet, nodeSelector, and Affinity, though it is not ideal for large‑scale deployments.
Monitoring, Logging, and Operational Practices
Each Kubernetes cluster runs its own Prometheus + Grafana stack, with Thanos providing a global query layer and long‑term storage in Ceph. Monitoring dashboards are isolated per cluster.
Logging is implemented via a sidecar container injected into each pod, forwarding logs to Kafka, then to Flink and finally to Elasticsearch, where logs are retained for seven days. Alerting rules are defined on top of this pipeline.
Summary and Recommendations
The team concludes that deploying both storage and compute workloads on Kubernetes is feasible when external connectivity, DNS integration, and storage provisioning are properly handled. For performance‑critical services they recommend a host‑mode + Local PV strategy, while Helm templates can provide quick automation when mature Operators are unavailable.
They are developing a custom Operator framework to encapsulate common functionality across services, contributing to the TiDB Operator community, and leveraging Service‑Pod DNS support for static network and disk allocation.
Future work includes exploring DPDK for accelerated container networking, refining mixed‑deployment storage schemes (LVM + HostPath), and separating operational clusters from user clusters for security and usability.
Q&A
Q1: How to handle data disks when a pod migrates?
A: As long as the PV and PVC are not deleted, the pod will not move; use StatefulSet volume templates if needed.
Q2: Is running Zeppelin‑like tools on Docker stable?
A: Yes; the team now uses JupyterHub with Spark via Livy, providing stable, resource‑controlled environments.
Q3: How long is monitoring data retained?
A: Container metrics are kept for 12 days in Thanos (Ceph), while physical‑machine metrics are stored in Xiaomi’s Falcon system.
Q4: Does a pod’s Local PV migrate if the pod is rescheduled?
A: No; the pod will not move unless the PV/PVC are manually deleted, after which data must be copied to a new PV.
Q5: How does Thanos store data and how does query speed compare to native Prometheus?
A: Historical data is archived in Ceph; Prometheus still stores recent data locally. Query performance depends on data volume but is acceptable with horizontal scaling.
Q6: How is a pod’s fixed IP implemented?
A: Using Contiv’s netplugin CNI, the pod’s Namespace + name is mapped to a unique IP stored in a dedicated etcd; the custom AdvanceStatefulSet ensures stable pod names for IP assignment.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
