How I Cut My Kubernetes Cloud Bill by 60% in 3 Months – Proven Strategies
Facing a 350,000‑yuan monthly Kubernetes bill, the author analyzed hidden cost components, implemented five optimization campaigns—including resource request tuning, autoscaling, spot instances, storage tiering, and network consolidation—and reduced monthly expenses by 60% while boosting performance, delivering a detailed, reproducible methodology.
Introduction
"Why did the cloud bill increase by 20% this month?" The CFO’s question spurred a deep dive into a Kubernetes cluster on Alibaba Cloud that was costing 350,000 CNY per month. Over three months of systematic optimization, the monthly cost dropped to 140,000 CNY—a 60% reduction—while performance improved by 30%.
Technical Background: The Hidden Truth of Kubernetes Costs
Cost Composition
Many assume cloud cost equals server cost, but it’s far more complex. For Alibaba Cloud the breakdown is roughly:
Compute (60‑70%): ECS instance fees (pay‑as‑you‑go vs. subscription), reserved instance coupons, spot instances.
Storage (15‑20%): Cloud disks (SSD, ESSD), NAS, OSS object storage.
Network (10‑15%): Public bandwidth (fixed vs. usage‑based), intra‑zone traffic, SLB load balancer fees.
Other (5‑10%): Snapshots, monitoring, image registry.
Common Causes of Resource Waste
According to CNCF surveys, average Kubernetes utilization is only 25‑35%, meaning 65‑75% of resources sit idle.
Over‑provisioned resources: Developers request excess CPU/memory, many pods lack limits, request‑limit gaps are large.
Lack of autoscaling: Fixed pod counts, HPA not enabled, Cluster Autoscaler missing.
Fragmented resources: Uneven node utilization, missing node/pod affinity, many small nodes.
Poor storage usage: PersistentVolumeClaims never deleted, oversized PVCs, high‑performance disks used for low‑value data.
Initial State
Cluster details before optimization:
Kubernetes version: 1.24
45 ECS nodes (8 vCPU, 16 GiB)
~300 Pods, 35 micro‑services
Monthly cost composition (350,000 CNY):
ECS instances: 240,000 CNY (68.5%)
Cloud disks: 45,000 CNY (12.8%)
Network bandwidth: 30,000 CNY (8.6%)
SLB load balancer: 20,000 CNY (5.7%)
Other (monitoring, logs, etc.): 15,000 CNY (4.4%)
CPU avg. utilization 28%, memory 35%, storage 42% – clear signs of waste.
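As a rough sanity check, the utilization figures above translate into idle spend as follows. This is a simplified linear model (cost is assumed to scale with utilization), not the article's own accounting:

```python
# Rough idle-spend estimate from the pre-optimization numbers above.
# All figures come from the article; the formula is a simplification.
ecs_cost = 240_000      # CNY/month, ECS instances
disk_cost = 45_000      # CNY/month, cloud disks
cpu_util = 0.28         # average CPU utilization
storage_util = 0.42     # average storage utilization

idle_compute = ecs_cost * (1 - cpu_util)
idle_storage = disk_cost * (1 - storage_util)

print(f"idle compute spend ≈ {idle_compute:,.0f} CNY/month")  # ≈ 172,800
print(f"idle storage spend ≈ {idle_storage:,.0f} CNY/month")  # ≈ 26,100
```

Roughly 170,000 CNY of compute spend per month was backing idle capacity, which is why Battles 1–3 target compute first.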
Core Content: Five Battles of Kubernetes Cost Optimization
Battle 1 – Optimize Resource Requests & Limits
Diagnosis
Collect pod usage with kubectl top pods and inspect requests/limits.
# 1. View all pod resource configs and usage
kubectl top pods --all-namespaces
# 2. Extract requests & limits
kubectl get pods --all-namespaces -o json | jq ...
# 3. List low‑usage pods
kubectl top pods --all-namespaces --sort-by='cpu' | tail -n 50
Findings
67% of Pods have no Request/Limit.
Many Pods request 2 CPU/4 GiB but actually use 200 mCPU/512 MiB.
Limits far exceed requests in many Pods (e.g., request 1 CPU, limit 4 CPU), so scheduling is based on misleading numbers.
Solution
1️⃣ Set realistic requests based on real usage and define limits.
2️⃣ Create a resource‑configuration matrix per service type.
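A right‑sizing rule along these lines can be sketched in a few lines of Python. The headroom factor, burst multiplier, and rounding units below are illustrative assumptions, not the author's exact matrix:

```python
import math

# Illustrative right-sizing rule (assumed, not the article's exact matrix):
# request = observed P95 usage + ~25% headroom, CPU limit = 4x request,
# memory limit = 2x request, rounded up to scheduler-friendly units.
def recommend(p95_cpu_millicores: float, p95_mem_mib: float,
              headroom: float = 1.25, cpu_burst: float = 4.0):
    req_cpu = math.ceil(p95_cpu_millicores * headroom / 50) * 50   # round to 50m
    req_mem = math.ceil(p95_mem_mib * headroom / 128) * 128        # round to 128Mi
    return {
        "requests": {"cpu": f"{req_cpu}m", "memory": f"{req_mem}Mi"},
        "limits": {"cpu": f"{int(req_cpu * cpu_burst)}m",
                   "memory": f"{int(req_mem * 2)}Mi"},
    }

# The article's example pod: ~200m CPU / 512Mi actually used.
print(recommend(200, 512))
```

Fed the article's observed usage (200 mCPU / 512 MiB), this yields a 250m request and 1000m limit, close to the optimized manifest below; memory comes out slightly higher because of the headroom rounding.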
# Before (no resources)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-service
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: user-service
          image: user-service:v1.0
# After (optimized)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: user-service
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: user-service
          image: user-service:v1.0
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
          startupProbe: {...}
          readinessProbe: {...}
          livenessProbe: {...}
3️⃣ Use VPA (Vertical Pod Autoscaler) for ongoing recommendations.
# Install VPA
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
# Create VPA object
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: user-service-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: user-service
  updatePolicy:
    updateMode: "Off"  # recommend only, do not auto-apply
Result: CPU avg. utilization rose from 28% to 52%, memory from 35% to 58%; 12 nodes could be shut down, saving ~64,000 CNY per month.
Battle 2 – Implement Elastic Scaling
HPA (Horizontal Pod Autoscaler)
# Install Metrics Server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# Verify
kubectl top nodes
kubectl top pods
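Once metrics are flowing, the HPA controller computes its target as desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric), clamped to the min/max bounds (per the Kubernetes documentation). A minimal sketch:

```python
import math

# Core of the HPA scaling decision (simplified; the real controller also
# applies tolerance bands and stabilization windows).
def desired_replicas(current: int, current_util: float, target_util: float,
                     min_r: int = 2, max_r: int = 10) -> int:
    desired = math.ceil(current * current_util / target_util)
    return max(min_r, min(max_r, desired))

# 3 replicas averaging 98% CPU against the 70% target -> scale out
print(desired_replicas(3, 98, 70))
```

With the manifest below (70% CPU target, 2–10 replicas), 3 replicas at 98% average CPU would scale out to 5.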
# HPA manifest (example)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: user-service-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: user-service
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
Time‑based scaling with CronHPA for predictable traffic patterns.
# CronHPA example
apiVersion: autoscaling.alibabacloud.com/v1beta1
kind: CronHorizontalPodAutoscaler
metadata:
  name: user-service-cron-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: user-service
  jobs:
    - name: scale-up-workday
      schedule: "0 8 * * 1-5"
      targetSize: 10
    - name: scale-down-workday
      schedule: "0 22 * * 1-5"
      targetSize: 3
    - name: scale-weekend
      schedule: "0 0 * * 6,0"
      targetSize: 2
Cluster Autoscaler (Node‑level scaling)
# Node pool configuration (Alibaba Cloud ACK)
NodePoolName: default-pool
InstanceType: ecs.c6.2xlarge
MinSize: 5
MaxSize: 30
AutoScaling: enabled
# Deploy Cluster Autoscaler
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  replicas: 1
  template:
    spec:
      serviceAccountName: cluster-autoscaler
      containers:
        - name: cluster-autoscaler
          image: registry.cn-hangzhou.aliyuncs.com/acs/autoscaler:v1.6.0
          command:
            - ./cluster-autoscaler
            - --cloud-provider=alicloud
            - --nodes=5:30:default-pool
            - --scale-down-enabled=true
            - --scale-down-delay-after-add=10m
            - --scale-down-unneeded-time=10m
            - --scale-down-utilization-threshold=0.5
Result: Node count fell from a fixed 45 to an average of 18, saving ~144,000 CNY per month.
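The scale‑down flags above encode a simple rule: a node whose requested‑resource utilization stays below the threshold for the unneeded‑time window becomes a removal candidate. A simplified model of that rule (the real autoscaler additionally checks pod evictability, PodDisruptionBudgets, and node groups):

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    cpu_requested: float    # sum of pod CPU requests on the node (cores)
    cpu_allocatable: float  # node allocatable CPU (cores)
    low_util_minutes: int   # how long utilization has stayed below threshold

# Mirrors --scale-down-utilization-threshold=0.5 / --scale-down-unneeded-time=10m
def scale_down_candidates(nodes, threshold=0.5, unneeded_minutes=10):
    return [n.name for n in nodes
            if n.cpu_requested / n.cpu_allocatable < threshold
            and n.low_util_minutes >= unneeded_minutes]

nodes = [Node("node-a", 1.2, 8.0, 45),  # 15% requested for 45m -> candidate
         Node("node-b", 6.0, 8.0, 45),  # 75% requested -> keep
         Node("node-c", 1.0, 8.0, 3)]   # low util, but only for 3m -> keep
print(scale_down_candidates(nodes))     # ['node-a']
```

Note that the threshold applies to *requests*, not live usage, which is another reason Battle 1's request tuning had to come first.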
Battle 3 – Use Spot (Preemptible) Instances
Spot instances cost 10‑20% of pay‑as‑you‑go rates but can be reclaimed.
Annual subscription: ~4,500 CNY/month
Pay‑as‑you‑go: ~584 CNY/month
Spot: ~73 CNY/month (1.6% of subscription price)
Suitable for stateless services, batch jobs, dev/test environments, and highly‑available workloads with multiple replicas.
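The price ratios quoted above check out against the listed figures:

```python
# Sanity check of the price comparison above (figures from the article,
# for one ecs.c6.2xlarge-class instance).
subscription = 4500   # CNY/month, annual subscription
on_demand = 584       # CNY/month, pay-as-you-go
spot = 73             # CNY/month

print(f"spot vs subscription:  {spot / subscription:.1%}")  # ~1.6%
print(f"spot vs pay-as-you-go: {spot / on_demand:.1%}")     # ~12.5%
```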
# Spot node pool definition (Alibaba Cloud)
NodePoolName: spot-pool
InstanceType: ecs.c6.2xlarge
ChargeType: Spot
MinSize: 0
MaxSize: 20
Labels:
  node-type: spot
Taints:
  - key: spot
    value: "true"
    effect: NoSchedule
Pods that can tolerate the spot taint are scheduled there, and a termination handler drains nodes before reclamation.
# DaemonSet handling spot termination
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: spot-termination-handler
  namespace: kube-system
spec:
  template:
    spec:
      nodeSelector:
        node-type: spot
      containers:
        - name: handler
          image: termination-handler:v1.0
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          # The handler watches the cloud interruption API and drains the node
Result: 40% of workload moved to spot, cutting node cost by ~30% and saving ~72,000 CNY per month.
Battle 4 – Storage Cost Optimization
Diagnosis
# List all PVCs and their usage
kubectl get pvc --all-namespaces
# List Bound PVCs (cross-reference with pod volume mounts to find unused ones)
kubectl get pvc --all-namespaces -o json | jq -r '.items[] | select(.status.phase == "Bound") | .metadata.namespace + "/" + .metadata.name'
Issues found:
50+ stale PVCs from test environments.
Oversized PVCs (e.g., 100 Gi requested, only 5 Gi used).
High‑performance ESSD used for logs.
Solutions
Delete unused PVCs.
Adopt tiered StorageClasses (ESSD‑PL3 for databases, SSD for general apps, efficiency disks for logs, NAS for shared config).
Use emptyDir for temporary files and logs, with side‑car log‑cleaner.
Archive cold data to OSS object storage via FluentBit.
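Flagging oversized PVCs from collected (requested, used) figures is easy to script. The helper and its 20% usage threshold below are hypothetical; the data would come from `kubectl get pvc` combined with filesystem usage from the nodes:

```python
# Hypothetical helper: flag PVCs using less than `usage_ratio_threshold`
# of their requested capacity. Sizes are in GiB.
def oversized(pvcs: dict, usage_ratio_threshold: float = 0.2) -> dict:
    return {name: (req, used) for name, (req, used) in pvcs.items()
            if used / req < usage_ratio_threshold}

pvcs = {
    "test/report-data": (100, 5),  # the article's example: 100Gi requested, 5Gi used
    "prod/db-data": (200, 150),
}
print(oversized(pvcs))  # flags test/report-data only
```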
# Example tiered StorageClass definitions
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: alicloud-disk-essd-pl3
provisioner: diskplugin.csi.alibabacloud.com
parameters:
  type: cloud_essd
  performanceLevel: PL3
reclaimPolicy: Retain
allowVolumeExpansion: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: alicloud-disk-ssd
provisioner: diskplugin.csi.alibabacloud.com
parameters:
  type: cloud_ssd
reclaimPolicy: Delete
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: alicloud-disk-efficiency
provisioner: diskplugin.csi.alibabacloud.com
parameters:
  type: cloud_efficiency
reclaimPolicy: Delete
Result: Disk usage dropped from 8 TB to 3.2 TB, storage cost fell from 45,000 CNY to 18,000 CNY per month (saving 27,000 CNY).
Battle 5 – Network & Load‑Balancer Optimization
Diagnosis
# List all LoadBalancer services
kubectl get svc --all-namespaces -o wide | grep LoadBalancer
# Found 15 SLB instances, each ~60 CNY/day → ~27,000 CNY/month
Solution
Replace many LoadBalancer services with a single Ingress backed by one SLB.
Switch to pay‑by‑traffic bandwidth billing and use shared bandwidth packages.
Enable topology‑aware hints to keep intra‑zone traffic local.
# Ingress example sharing one SLB
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-ingress
  annotations:
    kubernetes.io/ingress.class: nginx
spec:
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /user
            pathType: Prefix
            backend:
              service:
                name: user-service
                port:
                  number: 8080
          - path: /order
            pathType: Prefix
            backend:
              service:
                name: order-service
                port:
                  number: 8080
Result: SLB count reduced from 15 to 2, network cost cut from 30,000 CNY to 8,000 CNY per month (saving 22,000 CNY).
Cost‑Optimization Summary
Item | Before (CNY/month) | After (CNY/month) | Savings (CNY) | Savings %
--- | --- | --- | --- | ---
ECS instances | 240,000 | 96,000 | 144,000 | 60%
Cloud disks | 45,000 | 18,000 | 27,000 | 60%
Network bandwidth | 30,000 | 8,000 | 22,000 | 73%
SLB load balancer | 20,000 | 5,000 | 15,000 | 75%
Other | 15,000 | 13,000 | 2,000 | 13%
Total | 350,000 | 140,000 | 210,000 | 60%

Annual savings: 2.52 million CNY.
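The table's arithmetic is internally consistent, which is worth verifying before presenting numbers like these to a CFO:

```python
# Cross-check of the summary table above (all figures from the article).
before = {"ECS": 240_000, "disks": 45_000, "network": 30_000,
          "SLB": 20_000, "other": 15_000}
after = {"ECS": 96_000, "disks": 18_000, "network": 8_000,
         "SLB": 5_000, "other": 13_000}

total_before = sum(before.values())
total_after = sum(after.values())
savings = total_before - total_after

print(total_before, total_after, savings,
      f"{savings / total_before:.0%}", savings * 12)
# 350000 140000 210000 60% 2520000
```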
Best Practices & Pitfalls
Kubernetes Cost‑Optimization Golden Rules
Visibility first: You cannot optimize what you cannot measure.
Iterate fast: Tackle one issue at a time, validate, then proceed.
Safety first: Cost cuts must not compromise stability.
Automate: Manual tweaks are unsustainable; use VPA, HPA, CA, and CI pipelines.
Toolchain
Kubecost – cost analysis dashboard.
helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm install kubecost kubecost/cost-analyzer --namespace kubecost --create-namespace
kubectl port-forward -n kubecost svc/kubecost-cost-analyzer 9090:9090
Goldilocks – VPA recommendation UI.
helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm install goldilocks fairwinds-stable/goldilocks --namespace goldilocks --create-namespace
kubectl label namespace default goldilocks.fairwinds.com/enabled=true
Prometheus alerts for cost anomalies.
# Example alert: flag namespaces whose memory footprint (a cost proxy) exceeds 100 GiB
- alert: CostIncreaseAnomaly
  expr: sum(container_memory_working_set_bytes{container!=""}) by (namespace) / 1024 / 1024 / 1024 > 100
  for: 1h
  labels:
    severity: warning
  annotations:
    summary: "Namespace {{ $labels.namespace }} memory usage exceeds 100GB"
Common Traps
Over‑optimizing: Setting requests too low harms stability – keep 20‑30% headroom for critical services.
Blind spot usage: Never run stateful services on spot instances without replication and graceful termination handling.
Ignoring hidden costs: Snapshots, logs, monitoring, and image storage can erode savings – clean them regularly.
Lack of monitoring: Without dashboards and alerts, costs creep back up.
Optimization Focus by Cluster Size
Small (<10 nodes): Focus on request/limit tuning, delete unused resources, consolidate LoadBalancers.
Medium (10‑100 nodes): Deploy HPA & Cluster Autoscaler, mix spot instances, tier storage, establish cost monitoring.
Large (>100 nodes): Purchase reserved instance coupons, manage multiple clusters, build internal FinOps platform, foster cost‑ownership culture.
Conclusion & Outlook
Systematic Kubernetes cost optimization cut monthly spend from 350,000 CNY to 140,000 CNY (60% reduction) while boosting performance by 30%. The five‑battle framework proves that cost savings and performance gains are not contradictory.
Key Takeaways
Resource configuration is fundamental: Proper requests/limits can save >50% of compute cost.
Elastic scaling is critical: HPA + Cluster Autoscaler align resources with demand.
Spot instances are a powerful lever: When used appropriately, they slash compute cost by up to 80%.
Storage optimization is often overlooked: Tiered storage and cleanup yield large savings.
Continuous monitoring sustains gains: Without observability, optimizations fade.
FinOps Culture
Cost optimization is an ongoing practice, not a one‑off project. Promote cost awareness, transparency (chargeback), incentive mechanisms, and regular reviews to embed financial responsibility into engineering teams.
Future Trends
Maturing FinOps tools (Kubecost, CloudHealth).
AI‑driven automatic resource right‑sizing and cost prediction.
More stable spot instance offerings.
Serverless containers that charge strictly by actual usage.
Mastering Kubernetes’s resource model and applying a systematic optimization methodology remains essential for any operations or cloud‑native engineer.
Ops Community
A leading IT operations community where professionals share and grow together.