
Ctrip’s Practice of Using AWS Spot Instances for Cost Reduction and High Availability

This article details Ctrip’s large‑scale use of AWS Spot instances on Kubernetes, explaining the cost benefits, the challenges of spot interruptions, and the architectural and operational strategies—including multi‑AZ deployment, scheduling policies, autoscaling group design, and observability—that enable a 50% reduction in container costs while maintaining system stability and reliability.

Ctrip Technology

Ctrip’s Cloud Container & Service team leverages AWS Spot (bid) instances, which are offered at roughly 30% of on‑demand pricing, to significantly cut cloud costs. While Spot instances can be reclaimed at any time, careful design can mitigate stability risks.
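As a rough illustration of the economics, a back-of-the-envelope blended-cost calculation (the ~30% Spot price ratio comes from the article; the on-demand rate and fleet mix below are illustrative assumptions):

```python
# Rough blended-cost estimate for a mixed Spot/on-demand fleet.
# Only the ~30% Spot price ratio is from the article; the on-demand
# hourly rate and the 70% Spot fraction are assumed for illustration.

ON_DEMAND_HOURLY = 0.10   # assumed on-demand $/hour per node
SPOT_RATIO = 0.30         # Spot priced at roughly 30% of on-demand

def blended_hourly_cost(total_nodes: int, spot_fraction: float) -> float:
    """Hourly fleet cost when spot_fraction of nodes run on Spot."""
    spot_nodes = total_nodes * spot_fraction
    od_nodes = total_nodes - spot_nodes
    return od_nodes * ON_DEMAND_HOURLY + spot_nodes * ON_DEMAND_HOURLY * SPOT_RATIO

# With ~70% of nodes on Spot, cost falls to roughly half of all-on-demand:
full_od = blended_hourly_cost(100, 0.0)  # ~10.0 $/hour
mixed = blended_hourly_cost(100, 0.7)    # ~5.1 $/hour
```

Under these assumed numbers, a fleet that runs about 70% of its nodes on Spot lands near the 50% cost reduction the article reports.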

The team observed that Spot instances are best suited for fault‑tolerant, stateless workloads; critical components such as Kubernetes core services (Scheduler, HPA, Cluster Autoscaler, Metrics Server) remain on on‑demand nodes to avoid disruption.

Spot interruption warnings are captured via two mechanisms: CloudWatch Events emitting {"action":"terminate","time":"2021-10-10T10:10:00Z"}, and the instance metadata service at http://169.254.169.254/latest/meta-data/spot/instance-action. The preferred method uses CloudWatch Events to trigger a Lambda that, after authenticating via Parameter Store, calls the Kubernetes API to drain the affected node and migrate its pods.
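A minimal sketch of handling the interruption notice payload above; the function name and the two-minute-grace framing are illustrative, but the {"action": ..., "time": ...} shape matches the notice shown:

```python
import json
from datetime import datetime, timezone

def seconds_until_termination(notice_json: str, now: datetime) -> float:
    """Parse a Spot instance-action notice (as served by the metadata
    endpoint or carried in the CloudWatch event) and return the remaining
    grace period in seconds, during which the node should be drained."""
    notice = json.loads(notice_json)
    deadline = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)
    return (deadline - now).total_seconds()

notice = '{"action":"terminate","time":"2021-10-10T10:10:00Z"}'
now = datetime(2021, 10, 10, 10, 8, 0, tzinfo=timezone.utc)
remaining = seconds_until_termination(notice, now)  # 120.0 seconds left
```

In the Lambda-based flow, a check like this would gate the drain: if the remaining window is short, the handler prioritizes cordoning the node and evicting pods over any slower cleanup.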

To ensure high availability, Ctrip designs a multi‑AZ architecture that spreads workloads across different Spot capacity pools, avoiding single‑zone failures and reducing the impact of simultaneous Spot reclamations. The deployment also mixes Spot and on‑demand instances for low‑traffic services to further lower risk.

Pod scheduling employs topologySpreadConstraints to enforce zone‑aware distribution, and an extended policy template allows dynamic adjustment of scheduling rules based on application needs. Example constraints:

    topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          GroupidKey: "%s"

NodeGroups and Autoscaling Groups are defined per zone, instance type, and Spot/on‑demand flag, enabling precise scaling actions. When a zone experiences failures, Ctrip temporarily disables zone‑spread constraints, removes the faulty zone’s NodeGroup from the autoscaler, and may suspend Spot scaling to prevent further instability.
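The per-(zone, instance type, lifecycle) NodeGroup layout and the faulty-zone exclusion step can be sketched as follows; all names and the naming scheme are illustrative assumptions, not Ctrip's actual identifiers:

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class NodeGroup:
    """One autoscaling group per (zone, instance type, Spot/on-demand)."""
    zone: str
    instance_type: str
    lifecycle: str  # "spot" or "on-demand"

    @property
    def name(self) -> str:
        return f"{self.zone}-{self.instance_type}-{self.lifecycle}"

def build_node_groups(zones, instance_types):
    # Enumerate every (zone, type, lifecycle) combination as its own group,
    # so scaling actions can target exactly one capacity pool.
    return [NodeGroup(z, t, lc)
            for z, t, lc in product(zones, instance_types, ("spot", "on-demand"))]

def exclude_zone(groups, faulty_zone):
    """Drop a failing zone's NodeGroups from the autoscaler's candidates."""
    return [g for g in groups if g.zone != faulty_zone]

groups = build_node_groups(["us-east-1a", "us-east-1b"], ["m5.xlarge"])
healthy = exclude_zone(groups, "us-east-1a")  # only us-east-1b groups remain
```

Keeping each capacity pool as a separate group is what makes the recovery actions in the text precise: removing one zone's groups, or suspending only the Spot-lifecycle groups, never touches the rest of the fleet.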

Long‑term governance includes collecting Spot interruption events, analyzing interruption frequency by zone and instance type, and using this data to adjust Spot ratios, plan capacity, and refine policies. Observability is achieved through CloudWatch EventBridge monitoring, Lambda processing, and dashboards that visualize interruption trends and system health.
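The interruption-frequency analysis reduces to counting reclamation events by zone and instance type; a minimal sketch, with the event records below invented for illustration:

```python
from collections import Counter

def interruption_frequency(events):
    """Count Spot interruption events per (zone, instance_type) pool.

    events: iterable of dicts with 'zone' and 'instance_type' keys,
    e.g. as collected from CloudWatch/EventBridge interruption notices.
    """
    return Counter((e["zone"], e["instance_type"]) for e in events)

events = [
    {"zone": "us-east-1a", "instance_type": "m5.xlarge"},
    {"zone": "us-east-1a", "instance_type": "m5.xlarge"},
    {"zone": "us-east-1b", "instance_type": "c5.2xlarge"},
]
freq = interruption_frequency(events)
# The most-interrupted pool is a candidate for a lower Spot ratio or
# for shifting capacity to a calmer zone/instance-type pool.
```

Feeding these counts into a dashboard gives exactly the interruption-trend view the governance process relies on when adjusting Spot ratios.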

Overall, Ctrip’s Spot instance strategy has halved container costs while maintaining reliability, and ongoing automation and data‑driven governance continue to balance cost savings with system stability.

cloud computing, high availability, Kubernetes, autoscaling, cost optimization, AWS Spot, Spot Instance Management
Written by

Ctrip Technology

Official Ctrip Technology account, sharing and discussing growth.
