How to Master Spot Instances for Cost‑Effective Cloud Scaling
This article explains what Spot (preemptible) instances are, compares them with on‑demand and reserved instances, details AWS Spot pricing and signals, and provides practical strategies—including node‑group design, Kubernetes scheduling, health checks, and rollback plans—to reliably reduce cloud costs while maintaining application availability.
1. Introduction
The internet industry is shifting from rapid, uncontrolled growth to refined, cost-focused operations, making cloud cost reduction a priority. Spot (preemptible) instances offer a way to run the same workload for less money, or more workload for the same budget.
2. Spot Instance Overview
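Spot instances are low-priced, preemptible compute instances (EC2 on AWS, ECS on Alibaba Cloud). The provider sells idle capacity at a discount. Historically users bid a price; AWS no longer requires a bid and instead lets users optionally set a maximum price. If on-demand demand rises, or the Spot price exceeds the user's maximum, the Spot instance can be reclaimed.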
Priority order for AWS capacity: Reserved > On‑Demand > Spot.
Reserved: pay upfront for guaranteed capacity.
On‑Demand: pay when needed.
Spot: bid for unused capacity, may be reclaimed at any time.
Spot instances have four key characteristics:
Very cheap (10‑30% of on‑demand price).
Not suitable for all workloads.
Availability is not guaranteed; instances may be reclaimed.
Reclamation can happen at any time, requiring proper handling.
2.1 Spot History
Different cloud providers use various names for Spot instances (AWS EC2 Spot, Google Preemptible VMs, Alibaba Cloud Spot, Azure Low‑Priority VMs, etc.). AWS Spot has been available for over ten years, with pricing refreshed hourly.
3. How to Use Spot Effectively
3.1 Spot vs. On‑Demand Comparison
Metric | Spot | On-Demand
Cost | Cheap | Stable
Launch | Specify type; may start immediately if capacity exists | Default type; always starts
Available Capacity | May be limited | Usually available
Hourly Price | Varies with supply/demand | Fixed
Interrupt Warning | High-risk signal sent early | None
Interrupt Scenario | Capacity shortage, price spikes, or constraint violations | Manual termination or scaling down
3.2 Spot Usage Practices
Mix Spot, On‑Demand, and Reserved instances.
Avoid running non‑interruptible workloads on Spot; use fault‑tolerant, flexible jobs (big data, HPC, stateless web services, CI/CD, etc.).
Break long tasks into many short, asynchronous jobs to reduce interruption impact.
Leverage Spot price fluctuations by running large Spot clusters during off‑peak hours.
Implement intelligent scheduling that supports graceful termination.
Use AWS Spot Instance Advisor to select low‑interrupt pools.
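The advice to break long tasks into short jobs can be sketched as a checkpointed loop. This is a minimal illustration rather than a production framework: the function names, the checkpoint file layout, the chunk size, and the `should_stop` hook are all assumptions introduced here, not part of the original article.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the index of the next unprocessed item (0 if no checkpoint exists)."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next_index"]


def save_checkpoint(path, next_index):
    """Atomically persist progress so a reclaimed node loses at most one chunk."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"next_index": next_index}, f)
    os.replace(tmp, path)


def run_chunked(items, process, checkpoint_path, chunk_size=100,
                should_stop=lambda: False):
    """Process items chunk by chunk, resuming from the last checkpoint.

    should_stop is a hypothetical hook; wire it to an interruption signal.
    Returns True when every item is done, False if stopped early.
    """
    start = load_checkpoint(checkpoint_path)
    for i in range(start, len(items), chunk_size):
        if should_stop():
            return False
        for item in items[i:i + chunk_size]:
            process(item)
        save_checkpoint(checkpoint_path, min(i + chunk_size, len(items)))
    return True
```

In practice the checkpoint would need to live on storage that survives the node (object storage or a database, for example), since a local file disappears with the reclaimed instance.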
4. Huolala Overseas Spot Deployment
Huolala’s test and pre‑release environments use Spot for 90% of ECS capacity, saving 57‑63% of costs.
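As a rough sanity check, the reported savings can be turned into an implied Spot price. The sketch below assumes the savings are measured against running the same capacity entirely on-demand, and that the remaining 10% of capacity is billed at the on-demand rate; the article does not state either, so both are assumptions.

```python
def blended_cost_ratio(spot_share, spot_price_ratio):
    """Fleet cost relative to an all-on-demand fleet of the same size."""
    return spot_share * spot_price_ratio + (1.0 - spot_share) * 1.0


def implied_spot_price_ratio(spot_share, savings):
    """Invert blended_cost_ratio: which Spot/on-demand price ratio yields these savings?"""
    return (1.0 - savings - (1.0 - spot_share)) / spot_share


for savings in (0.57, 0.63):
    ratio = implied_spot_price_ratio(0.90, savings)
    print(f"savings {savings:.0%} -> Spot at about {ratio:.0%} of on-demand")
```

Under these assumptions, savings of 57% to 63% at a 90% Spot share correspond to Spot capacity priced at roughly 30% to 37% of on-demand.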
4.1 Challenges
Resource-level: how often Spot nodes are reclaimed, and whether replacement node capacity is available.
Event-level: reclamation signals are unreliable. The rebalance recommendation (RBR) may never arrive, and the interruption notice (ITN) arrives only 2 minutes before termination.
Application resilience: applications must shut down gracefully, keep serving while replicas are lost, and start up quickly.
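To make the 2-minute interruption notice (ITN) concrete: on AWS, an interrupted Spot instance exposes a small JSON document at the instance metadata path `/latest/meta-data/spot/instance-action`, containing an `action` and a UTC `time`. The sketch below only parses such a document and computes the remaining time. Fetching it (including metadata token handling) is left out, and `shutdown_budget` with its safety margin is a hypothetical helper introduced for illustration.

```python
import json
from datetime import datetime, timezone


def seconds_until_interruption(body, now):
    """Parse a Spot instance-action document and return the time left.

    body: JSON text such as {"action": "terminate", "time": "2017-09-18T08:22:00Z"}
    now:  timezone-aware datetime, injected so the function is easy to test
    Returns (action, seconds_left); seconds_left is clamped at 0.
    """
    doc = json.loads(body)
    when = datetime.strptime(doc["time"], "%Y-%m-%dT%H:%M:%SZ").replace(
        tzinfo=timezone.utc)
    return doc["action"], max(0.0, (when - now).total_seconds())


def shutdown_budget(seconds_left, safety_margin=15.0):
    """Time the application may spend draining; the margin is an assumed value."""
    return max(0.0, seconds_left - safety_margin)
```

A node-level agent could poll for this document every few seconds and, once it appears, cordon the node and start draining Pods within the computed budget.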
4.2 System Architecture
Key points:
Create dedicated Spot node groups and schedule only qualified workloads there.
Develop a custom “node‑status Controller” to ensure essential daemonsets are healthy before scheduling Pods on Spot nodes.
Use a “Rescheduler Controller” to rebalance Pods across nodes during off-peak periods.
Collect Spot lifecycle data via EventBridge, feed it to Prometheus, and adjust node‑group configurations based on AZ‑specific reclamation rates.
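The article does not show how AZ-specific reclamation rates are computed. One plausible shape, assuming each lifecycle event has already been enriched with the AZ of the affected instance, is the following. The function names and the threshold are hypothetical.

```python
from collections import Counter


def reclamation_rate_by_az(launched, interrupted):
    """Return {az: interrupted / launched} for every AZ that launched Spot nodes.

    launched / interrupted: iterables of AZ names, one entry per instance event.
    """
    launches = Counter(launched)
    interruptions = Counter(interrupted)
    return {az: interruptions[az] / n for az, n in launches.items()}


def azs_to_deprioritize(rates, threshold=0.2):
    """AZs whose rate exceeds a threshold; 0.2 is an arbitrary illustrative value."""
    return sorted(az for az, rate in rates.items() if rate > threshold)
```

The output of `azs_to_deprioritize` could then feed a manual or automated change to the node-group configuration, for example by removing subnets in the affected AZ.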
4.3 Admission Rules
Only admit Pods that meet these criteria:
System services (e.g., CoreDNS) run on non‑Spot nodes.
Stateful services are excluded.
Replica count > 1.
Can shut down gracefully within 2 minutes.
Startup time < 2 minutes.
Expose HTTP health checks.
Support graceful scale‑down without 5XX errors.
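The admission rules above can be expressed as a simple checklist function. This is a sketch under stated assumptions: `Workload` is a hypothetical descriptor whose field names are invented for illustration and do not correspond to any Kubernetes API, and how each field is measured is left to the platform.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    # Hypothetical descriptor; field names are illustrative only.
    name: str
    is_system_service: bool
    is_stateful: bool
    replicas: int
    graceful_shutdown_seconds: float
    startup_seconds: float
    has_http_health_check: bool
    drains_without_5xx: bool


def spot_admission_failures(w):
    """Return the rules the workload violates; an empty list means it may run on Spot."""
    checks = [
        (not w.is_system_service, "system services stay on non-Spot nodes"),
        (not w.is_stateful, "stateful services are excluded"),
        (w.replicas > 1, "replica count must be > 1"),
        (w.graceful_shutdown_seconds <= 120, "graceful shutdown must finish within 2 minutes"),
        (w.startup_seconds < 120, "startup must take < 2 minutes"),
        (w.has_http_health_check, "HTTP health check required"),
        (w.drains_without_5xx, "scale-down must not produce 5XX errors"),
    ]
    return [reason for ok, reason in checks if not ok]
```

Returning the list of failed rules, rather than a bare boolean, makes it easier to tell an application owner what has to change before the service can move to Spot.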
4.4 Specific Strategies
4.4.1 Node‑Group Design
Spread each node group across multiple AZs and instance types so that the Auto Scaling Group (ASG) can launch from whichever pool has the most spare capacity, which lowers the probability of Spot reclamation.
4.4.2 CA and Node Expansion
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    20:
      - spot-.*
    10:
      - .*

Configure Cluster Autoscaler (CA) to prioritize Spot node groups, falling back to On-Demand when Spot cannot scale.
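In the priority expander ConfigMap, a higher number means higher priority, so groups matching `spot-.*` are preferred and the catch-all `.*` acts as the On-Demand fallback. The selection logic can be modeled roughly as below. This is a simplified model for intuition, not the Cluster Autoscaler's actual code; in particular, the use of an unanchored regex search is an assumption.

```python
import re


def pick_node_groups(priorities, candidates):
    """Return the candidates matching the highest priority tier that has any match.

    priorities: {priority_value: [regex, ...]}; higher values win.
    candidates: names of node groups that could currently scale up.
    """
    for value in sorted(priorities, reverse=True):
        matched = [g for g in candidates
                   if any(re.search(p, g) for p in priorities[value])]
        if matched:
            return matched
    return []


priorities = {20: [r"spot-.*"], 10: [r".*"]}
print(pick_node_groups(priorities, ["spot-a", "od-a"]))
print(pick_node_groups(priorities, ["od-a"]))
```

The second call models the fallback case: once no Spot group can scale up, only the On-Demand group remains a candidate and the lower tier is used.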
kubectl -n kube-system edit deploy cluster-autoscaler
# add flags:
#   --expander=priority
#   --max-node-provision-time=5m0s

4.4.3 Over-provisioning Pods
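Low-priority placeholder Pods reserve spare capacity. When real workloads need room, the scheduler preempts the placeholders immediately, and CA adds nodes in the background to host them again.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: overprovisioning
spec:
  replicas: 1
  selector:                      # required by apps/v1; must match the template labels
    matchLabels:
      run: overprovisioning      # label name is illustrative
  template:
    metadata:
      labels:
        run: overprovisioning
    spec:
      priorityClassName: overprovisioning   # a low-priority PriorityClass, defined separately
      containers:
        - name: reserve-resources
          image: k8s.gcr.io/pause
          resources:
            requests:
              cpu: "200m"

4.4.4 Pod Disruption Budgets (PDB)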
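A PDB limits how many replicas a voluntary disruption, such as a node drain, may take down at once.

# policy/v1 is available from Kubernetes 1.21; policy/v1beta1 was removed in 1.25
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: appid
spec:
  minAvailable: 4
  selector:
    matchLabels:
      app: appid

4.4.5 Init Containers for Dependency Checks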
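apiVersion: apps/v1
kind: Deployment
metadata:
  name: ${app}
spec:
  template:
    spec:
      initContainers:
        - name: check-consul
          image: xxxx/busybox:1   # the image must provide curl
          env:
            - name: HOST_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP
          command:
            - /bin/sh
            - -c
            # Block until the node-local Consul agent answers the leader query.
            # $(HOST_IP) is expanded by Kubernetes; $code is a plain shell variable
            # (the original $(code) would have run a command named "code").
            - code=1; while [ "$code" -ne 0 ]; do sleep 1; curl http://$(HOST_IP):8500/v1/status/leader 2>/dev/null | grep -E ".+"; code=$?; echo "return code is $code"; done

4.4.6 PreStop and Startup Probes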
# PreStop (shell)
lifecycle:
  preStop:
    exec:
      command: ["sh", "-c", "sleep 15"]

# Startup probe for Java services
startupProbe:
  failureThreshold: 30
  exec:
    command: ["sh", "-c", "code=`curl -m 10 -o /dev/null -s -w %{http_code} 127.0.0.1:${httpStartupProbePort}/${httpStartupProbeUrl}`; if [ $code -ge 200 -a $code -lt 400 ]; then sleep 6; else exit 1; fi"]
  initialDelaySeconds: 60
  periodSeconds: 10
  timeoutSeconds: 10

5. Rollback Strategies
When Spot nodes encounter large-scale issues, fall back to On-Demand (OD) nodes using one of the following options:
Single‑application rollback via CI/CD trigger.
For a specific instance type with high reclamation, create a new node group without that type, mark the old group unschedulable, and drain it.
Full Spot rollback: expand OD node group, shrink Spot group, and migrate workloads.
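The second rollback option (replacing a node group to drop one instance type) can be planned with a small helper. This is a hypothetical sketch: the function, its inputs, and the choice to drain nodes of the problematic type first are assumptions made for illustration, and the actual cordon and drain steps would still be carried out with the usual cluster tooling.

```python
def plan_instance_type_rollback(group_types, bad_type, nodes):
    """Plan the replacement of a node group so that it no longer uses bad_type.

    group_types: instance types configured on the current node group
    bad_type:    the type with a high reclamation rate
    nodes:       {node_name: instance_type} for nodes in the old group
    """
    new_types = [t for t in group_types if t != bad_type]
    if not new_types:
        raise ValueError("removing the type would leave the new node group empty")
    # Assumed ordering: drain nodes of the bad type first, then the rest.
    drain_order = sorted(nodes, key=lambda n: (nodes[n] != bad_type, n))
    return {
        "new_group_types": new_types,
        "cordon": sorted(nodes),      # mark the whole old group unschedulable
        "drain_order": drain_order,
    }
```

Cordoning every node in the old group before draining keeps evicted Pods from landing back on nodes that are about to be removed.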
6. Conclusion
Spot instances lower costs at the expense of stability risk. By combining proper node‑group design, Kubernetes scheduling policies, health‑check mechanisms, and graceful termination strategies, organizations can achieve cost savings while maintaining high availability.
