How to Master Spot Instances for Cost‑Effective Cloud Scaling

This article explains what Spot (preemptible) instances are, compares them with on‑demand and reserved instances, details AWS Spot pricing and signals, and provides practical strategies—including node‑group design, Kubernetes scheduling, health checks, and rollback plans—to reliably reduce cloud costs while maintaining application availability.

Huolala Tech

Dec 1, 2022

1. Introduction

The internet industry is shifting from rapid, uncontrolled growth to refined, cost-focused operations, making cloud cost reduction a priority. Spot (preemptible) instances offer a way to spend less on the same workload, or to run more workload for the same spend.

2. Spot Instance Overview

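Spot instances are low-priced, preemptible compute instances (EC2 on AWS, ECS on Alibaba Cloud). The provider sells its idle capacity at a discount, and the user sets the maximum price they are willing to pay. If on-demand demand rises, or the Spot price climbs above that maximum, the instance can be reclaimed.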

Priority order for AWS capacity: Reserved > On‑Demand > Spot.

Reserved: pay upfront for guaranteed capacity.

On‑Demand: pay when needed.

Spot: bid for unused capacity, may be reclaimed at any time.

Spot instances have four key characteristics:

Very cheap (10‑30% of on‑demand price).

Not suitable for all workloads.

Availability is not guaranteed; instances may be reclaimed.

Reclamation can happen at any time, requiring proper handling.

2.1 Spot History

Different cloud providers use various names for Spot instances (AWS EC2 Spot, Google Preemptible VMs, Alibaba Cloud Spot, Azure Low‑Priority VMs, etc.). AWS Spot has been available for over ten years, with pricing refreshed hourly.

3. How to Use Spot Effectively

3.1 Spot vs. On‑Demand Comparison

| Metric | Spot | On-Demand |
|---|---|---|
| Cost | Cheap | Stable |
| Launch | Specify type; may start immediately if capacity exists | Default type; always starts |
| Available Capacity | May be limited | Usually available |
| Hourly Price | Varies with supply/demand | Fixed |
| Interrupt Warning | High-risk signal sent early | None |
| Interrupt Scenario | Capacity shortage, price spikes, or constraint violations | Manual termination or scaling down |

3.2 Spot Usage Practices

Mix Spot, On‑Demand, and Reserved instances.

Avoid running non‑interruptible workloads on Spot; use fault‑tolerant, flexible jobs (big data, HPC, stateless web services, CI/CD, etc.).

Break long tasks into many short, asynchronous jobs to reduce interruption impact.

Leverage Spot price fluctuations by running large Spot clusters during off‑peak hours.

Implement intelligent scheduling that supports graceful termination.

Use AWS Spot Instance Advisor to select low‑interrupt pools.

4. Huolala Overseas Spot Deployment

Huolala’s test and pre‑release environments use Spot for 90% of ECS capacity, saving 57‑63% of costs.
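
As a rough sanity check on these figures, the savings can be related to the average Spot discount. This is a back-of-the-envelope sketch, not a calculation from the original article. It assumes the savings are measured against an all-On-Demand baseline and that the remaining 10% of capacity is billed at the On-Demand price. The symbols are introduced here for illustration: $f$ is the Spot share of capacity, $r$ is the Spot price as a fraction of the On-Demand price, and $S$ is the fractional savings.

```latex
\begin{aligned}
S &= 1 - \bigl[(1 - f) + f\,r\bigr] = f\,(1 - r) \\
r &= 1 - \frac{S}{f} \\
f = 0.9,\ S = 0.57 &\;\Rightarrow\; r = 1 - \tfrac{0.57}{0.9} \approx 0.37 \\
f = 0.9,\ S = 0.63 &\;\Rightarrow\; r = 1 - \tfrac{0.63}{0.9} = 0.30
\end{aligned}
```

Under those assumptions, the reported savings correspond to paying roughly 30-37% of the On-Demand price for Spot capacity on average.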

4.1 Challenges

Resource-level: How often Spot nodes are reclaimed, and whether replacement node capacity is available.

Event-level: Reclamation signals are uncertain. The rebalance recommendation (RBR) may never arrive, and the interruption termination notice (ITN) arrives only 2 minutes before termination.

Application resilience: Applications must shut down gracefully, keep serving while replicas are lost, and start up quickly.

4.2 System Architecture

Key points:

Create dedicated Spot node groups and schedule only qualified workloads there.

Develop a custom “node-status Controller” to ensure essential DaemonSets are healthy before Pods are scheduled on Spot nodes.

Use a “Rescheduler Controller” to rebalance Pods during off-peak periods.

Collect Spot lifecycle data via EventBridge, feed it to Prometheus, and adjust node‑group configurations based on AZ‑specific reclamation rates.
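
The lifecycle collection in the last point can be sketched as an EventBridge rule. The CloudFormation fragment below is an illustration, not Huolala's actual configuration: the resource names `SpotLifecycleRule` and `SpotEventsQueue`, and the choice of an SQS queue as the target, are assumptions. The two `detail-type` values are the event names EC2 emits for the interruption notice and the rebalance recommendation.

```yaml
# Hypothetical CloudFormation fragment: capture Spot lifecycle events.
Resources:
  SpotEventsQueue:                 # assumed target; a consumer would read
    Type: AWS::SQS::Queue          # from it and export Prometheus metrics

  SpotLifecycleRule:
    Type: AWS::Events::Rule
    Properties:
      Description: Route Spot interruption and rebalance events to a queue
      EventPattern:
        source:
          - aws.ec2
        detail-type:
          - EC2 Spot Instance Interruption Warning    # about 2 minutes before reclaim
          - EC2 Instance Rebalance Recommendation     # earlier, but not guaranteed
      Targets:
        - Id: spot-events-queue
          Arn: !GetAtt SpotEventsQueue.Arn
```

A queue policy that allows `events.amazonaws.com` to send messages is also required and is omitted here. The events identify the instance by ID, so a consumer would need to look up the AZ and instance type itself before counting reclamations per AZ.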

4.3 Admission Rules

Only admit Pods that meet these criteria:

System services (e.g., CoreDNS) run on non‑Spot nodes.

Stateful services are excluded.

Replica count > 1.

Can shut down gracefully within 2 minutes.

Startup time < 2 minutes.

Expose HTTP health checks.

Support graceful scale‑down without 5XX errors.
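
A workload that satisfies these rules might look like the following sketch. The application name `demo-app`, the image, the port, the `/healthz` path, the `node-lifecycle: spot` label, and the `spot=true:NoSchedule` taint are all hypothetical. The sketch assumes the Spot node group is labeled and tainted that way, so that only Pods that opt in are scheduled there.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-app                       # hypothetical stateless service
spec:
  replicas: 2                          # rule: replica count > 1
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      affinity:
        nodeAffinity:                  # prefer Spot, but allow On-Demand fallback
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: node-lifecycle    # assumed label on Spot nodes
                operator: In
                values: ["spot"]
      tolerations:
      - key: spot                      # assumed taint on Spot nodes
        operator: Equal
        value: "true"
        effect: NoSchedule
      terminationGracePeriodSeconds: 90   # must fit inside the 2-minute notice
      containers:
      - name: demo-app
        image: example.com/demo-app:1.0   # placeholder image
        ports:
        - containerPort: 8080
        readinessProbe:                # rule: expose an HTTP health check
          httpGet:
            path: /healthz
            port: 8080
          periodSeconds: 5
```

The taint keeps system and stateful Pods off Spot nodes by default, while the toleration opts this Deployment in. A preferred (rather than required) node affinity is used so the Pods can still land on On-Demand nodes when no Spot capacity is available.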

4.4 Specific Strategies

4.4.1 Node‑Group Design

Spread each node group across multiple AZs and instance types. The Auto Scaling Group (ASG) can then launch instances from the AZs and pools with the most spare capacity, which lowers the probability of Spot reclamation.
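
One way to express this with an Auto Scaling Group is a mixed instances policy with the capacity-optimized Spot allocation strategy, assuming that is the behavior the article refers to. The CloudFormation fragment below is a sketch: the resource name, subnet IDs, launch template reference, instance types, and sizes are placeholders, not values from the original deployment.

```yaml
# Hypothetical CloudFormation fragment: multi-AZ, multi-type Spot node group.
Resources:
  SpotNodeGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      MinSize: "0"
      MaxSize: "20"
      VPCZoneIdentifier:               # one subnet per AZ (placeholder IDs)
        - subnet-aaaa1111
        - subnet-bbbb2222
        - subnet-cccc3333
      MixedInstancesPolicy:
        InstancesDistribution:
          OnDemandBaseCapacity: 0
          OnDemandPercentageAboveBaseCapacity: 0   # 100% Spot in this group
          SpotAllocationStrategy: capacity-optimized
        LaunchTemplate:
          LaunchTemplateSpecification:
            LaunchTemplateId: !Ref NodeLaunchTemplate   # assumed to be defined elsewhere
            Version: !GetAtt NodeLaunchTemplate.LatestVersionNumber
          Overrides:                   # types with the same vCPU and memory
            - InstanceType: m5.xlarge
            - InstanceType: m5a.xlarge
            - InstanceType: m4.xlarge
```

Keeping the instance types in one group the same size (here 4 vCPU and 16 GiB each) matters because Cluster Autoscaler estimates a group's capacity from a single node template.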

4.4.2 CA and Node Expansion

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    20:
    - spot-.*
    10:
    - .*

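Configure Cluster Autoscaler (CA) with the priority expander so that Spot node groups are tried first, falling back to On-Demand when Spot cannot scale. In the ConfigMap, a higher number means higher priority, so node groups whose names match spot-.* win over the catch-all .* entry. Lowering --max-node-provision-time from its 15-minute default makes CA give up sooner on a node group that fails to deliver nodes.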

kubectl -n kube-system edit deploy cluster-autoscaler
# add flags:
#   --expander=priority
#   --max-node-provision-time=5m0s

4.4.3 Over‑provisioning Pods

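apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 1
  selector:                 # required by apps/v1; must match the template labels
    matchLabels:
      run: overprovisioning
  template:
    metadata:
      labels:
        run: overprovisioning
    spec:
      # Low-priority placeholder Pod: it reserves headroom and is evicted
      # first when real workloads need the space.
      priorityClassName: overprovisioning
      containers:
      - name: reserve-resources
        image: k8s.gcr.io/pause
        resources:
          requests:
            cpu: "200m"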
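
The Deployment above references a priority class named `overprovisioning`, which must exist before the Pod can be created. A minimal sketch of that class follows. The value `-1` and the description are assumptions, chosen so that the placeholder Pods rank below workloads with the default priority of 0 and are preempted first.

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: overprovisioning
value: -1                # below the default priority of 0
globalDefault: false
description: "Placeholder Pods that reserve headroom and are preempted first."
```

When a real Pod cannot be scheduled, the scheduler preempts a placeholder and the real Pod starts right away. The placeholder then becomes pending, which prompts Cluster Autoscaler to add a node in the background.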

4.4.4 Pod Disruption Budgets (PDB)

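# policy/v1 is the stable API (Kubernetes 1.21+); policy/v1beta1 was removed in 1.25
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: appid
spec:
  minAvailable: 4        # keep this below the replica count, or node drains will block
  selector:
    matchLabels:
      app: appid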

4.4.5 Init Containers for Dependency Checks

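apiVersion: apps/v1
kind: Deployment
metadata:
  name: ${app}            # template placeholder; selector and app containers omitted
spec:
  template:
    spec:
      initContainers:
      - name: check-consul
        image: xxxx/busybox:1   # the image must provide curl
        env:
        - name: HOST_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        # Block until the node-local Consul agent reports a leader.
        # $(HOST_IP) is expanded by Kubernetes; $code is a plain shell variable.
        command:
        - /bin/sh
        - -c
        - |
          code=1
          while [ "$code" -ne 0 ]; do
            sleep 1
            curl http://$(HOST_IP):8500/v1/status/leader 2>/dev/null | grep -E ".+"
            code=$?
            echo "return code is $code"
          done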

4.4.6 PreStop and Startup Probes

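# PreStop hook: delay shutdown by 15 seconds, typically so the Pod can be
# removed from service endpoints before the process receives SIGTERM
lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 15"]

# Startup probe for Java services
# ${httpStartupProbePort} and ${httpStartupProbeUrl} are template placeholders.
# The probe passes on any 2xx/3xx response. The curl call plus the 6-second
# sleep must finish within timeoutSeconds, or the attempt counts as a failure.
startupProbe:
  failureThreshold: 30
  exec:
    command: ["sh", "-c", "code=`curl -m 10 -o /dev/null -s -w %{http_code} 127.0.0.1:${httpStartupProbePort}/${httpStartupProbeUrl}`; if [ $code -ge 200 -a $code -lt 400 ]; then sleep 6; else exit 1; fi"]
  initialDelaySeconds: 60
  periodSeconds: 10
  timeoutSeconds: 10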

5. Rollback Strategies

When Spot nodes encounter large-scale issues, fall back to On-Demand (OD) nodes:

Single‑application rollback via CI/CD trigger.

For a specific instance type with high reclamation, create a new node group without that type, mark the old group unschedulable, and drain it.

Full Spot rollback: expand OD node group, shrink Spot group, and migrate workloads.

6. Conclusion

Spot instances lower costs at the expense of stability risk. By combining proper node‑group design, Kubernetes scheduling policies, health‑check mechanisms, and graceful termination strategies, organizations can achieve cost savings while maintaining high availability.
