How Ctrip International Flights Scaled with Cloud‑Native Practices and Cut Costs
This article shares Ctrip International Flights' cloud‑native journey: infrastructure as code, automated CI/CD pipelines, centralized logging, and monitoring with Prometheus, Grafana, and Thanos. It also covers how elastic scaling, spot instances, and network optimizations reduced operational costs while maintaining high availability.
Background
Ctrip International Ticketing aggregates flight data from suppliers worldwide and operates services in regions such as the US, Germany, and Singapore. To avoid the cost and operational burden of building private data centers, the team adopted public‑cloud services and followed cloud‑native best practices to achieve scalability, high availability, and loose coupling.
Cloud Migration Practices
1. Infrastructure as Code (IaC)
All cloud resources are defined in a dedicated Terraform repository. Each change is version‑controlled, reviewed, and applied automatically through CI/CD pipelines. The workflow is:
Developer pushes Terraform code to the IaC repo.
CI pipeline runs terraform fmt, terraform validate, and terraform plan.
After peer review, the pipeline executes terraform apply against the target environment.
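The three-step workflow above could be expressed, for example, as a GitLab CI pipeline; the stage names, image tag, and manual-approval gate below are illustrative assumptions, not Ctrip's actual configuration:

```yaml
# Illustrative CI pipeline for the IaC repo (stage and job names
# are assumptions). Apply is gated behind a manual approval,
# standing in for the peer-review step.
stages:
  - validate
  - plan
  - apply

validate:
  stage: validate
  image: hashicorp/terraform:1.5
  script:
    - terraform init -backend=false
    - terraform fmt -check -recursive
    - terraform validate

plan:
  stage: plan
  image: hashicorp/terraform:1.5
  script:
    - terraform init
    - terraform plan -out=tfplan
  artifacts:
    paths: [tfplan]

apply:
  stage: apply
  image: hashicorp/terraform:1.5
  script:
    - terraform init
    - terraform apply tfplan
  when: manual   # released only after review/approval
```

Saving the plan as an artifact and applying exactly that file guarantees the reviewed plan is what actually runs.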
Managed Kubernetes services (e.g., Amazon EKS, Azure AKS, GKE) are used so that a production‑grade cluster can be provisioned in minutes, eliminating the need for manual control‑plane management.
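Declaring a managed cluster is itself only a few lines of Terraform. The sketch below uses the AWS EKS resource; the cluster name, IAM role, subnet variable, and version are placeholders, not production values:

```hcl
# Minimal EKS control-plane definition (illustrative placeholders).
# The cloud provider manages the control plane; only worker nodes
# and networking remain the team's responsibility.
resource "aws_eks_cluster" "ticketing" {
  name     = "ticketing-prod"
  role_arn = aws_iam_role.eks.arn
  version  = "1.27"

  vpc_config {
    subnet_ids = var.private_subnet_ids
  }
}
```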
2. Centralized Logging
A DaemonSet runs a log‑collector (e.g., Fluent Bit) on every node. The collector reads container stdout/stderr streams, enriches them with metadata, and forwards the logs to a managed Elasticsearch cluster. Applications do not need to embed logging libraries; they simply write to standard output.
# Example DaemonSet manifest (simplified)
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: fluent-bit
spec:
  selector:
    matchLabels:
      name: fluent-bit
  template:
    metadata:
      labels:
        name: fluent-bit
    spec:
      containers:
      - name: fluent-bit
        image: fluent/fluent-bit:1.9
        args: ["-c", "/fluent-bit/etc/fluent-bit.conf"]
        volumeMounts:
        - name: varlog
          mountPath: /var/log
      volumes:
      - name: varlog
        hostPath:
          path: /var/log

Logs are visualized in Kibana, and retention policies are managed at the Elasticsearch level.
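The DaemonSet mounts /var/log but still needs a pipeline definition. A minimal fluent-bit.conf that tails container logs, enriches them with Kubernetes metadata, and ships them to Elasticsearch might look like this; the Elasticsearch host and index name are assumptions:

```ini
# Illustrative Fluent Bit pipeline: tail container logs, attach
# Kubernetes metadata, forward to Elasticsearch.
# Host and Index values are placeholders.
[INPUT]
    Name    tail
    Path    /var/log/containers/*.log
    Parser  docker
    Tag     kube.*

[FILTER]
    Name    kubernetes
    Match   kube.*

[OUTPUT]
    Name    es
    Match   *
    Host    elasticsearch.example.internal
    Port    9200
    Index   flights-logs
```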
3. Monitoring and Alerting
The monitoring stack consists of Prometheus, the Prometheus Operator, Thanos, and Grafana.
Prometheus Operator deploys a Prometheus instance per Kubernetes namespace via the Prometheus custom resource (defined by a CRD). This isolates workloads and avoids a single point of failure.
Thanos Sidecar runs alongside each Prometheus, uploading raw blocks to an S3 bucket every two hours.
Thanos Query aggregates data from all sidecars and the object store, providing a global query endpoint.
Thanos Compact downsamples and compresses older blocks to reduce storage cost.
Grafana connects only to the Thanos Query endpoint, giving a unified view of metrics across all services.
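For the sidecar to upload blocks, Thanos is pointed at the bucket through a small object-store config file; the bucket name, endpoint, and region below are illustrative placeholders:

```yaml
# objstore.yml, passed to the Thanos sidecar via
# --objstore.config-file (values are placeholders).
type: S3
config:
  bucket: thanos-metrics-blocks
  endpoint: s3.us-east-1.amazonaws.com
  region: us-east-1
```

The same file is reused by Thanos Store, Compact, and any other component that reads from the bucket.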
Example Prometheus Operator configuration (simplified):
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus-us-east
  namespace: us-east
spec:
  serviceAccountName: prometheus
  serviceMonitorSelector:
    matchLabels:
      team: ticketing
  resources:
    requests:
      memory: 2Gi
      cpu: 500m
  storage:
    volumeClaimTemplate:
      spec:
        storageClassName: gp2
        resources:
          requests:
            storage: 50Gi

Grafana dashboards query the Thanos endpoint, eliminating the need to configure multiple data sources.
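Because Thanos Query deduplicates and merges series from every region, one Grafana panel can aggregate globally with a single PromQL expression; the metric and label names here are assumptions for illustration:

```promql
# Per-region request rate across all clusters, served by one
# Thanos Query data source (metric/label names are illustrative).
sum by (region) (rate(http_requests_total{job="ticketing-api"}[5m]))
```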
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.