How We Cut Kubernetes Costs by 40% Without Switching Platforms
By rethinking resource requests, eliminating unused workloads, downsizing node types, fine‑tuning autoscaling, and trimming log storage, a team cut its Kubernetes bill by 40% while staying on the same cloud provider. Most cost overruns, it turns out, stem from misconfiguration rather than the platform itself.
Many assume rising Kubernetes bills are an unavoidable cost of running workloads in the cloud, but the real issue is often how the cluster is configured and used.
Step 1: Right‑size resource requests
Using Prometheus metrics and the Vertical Pod Autoscaler (VPA) in recommendation mode, the team discovered that about 70% of Pods requested 2‑3× more CPU and memory than they actually used. By lowering those requests (and the corresponding limits) to realistic values, the cluster autoscaler was able to remove several nodes overnight, saving roughly 15% of the cost.
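A minimal sketch of running the VPA in recommendation‑only mode (`updateMode: Off` records suggestions without evicting Pods); the target Deployment name `web-api` is illustrative:

```shell
# Create a VPA that only records recommendations, never evicts Pods.
# The Deployment name "web-api" is a placeholder for your workload.
kubectl apply -f - <<'EOF'
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: web-api-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  updatePolicy:
    updateMode: "Off"
EOF

# After the VPA has gathered usage data, inspect its suggested
# requests and compare them with what the Pods currently ask for:
kubectl describe vpa web-api-vpa
```

Comparing the VPA's `Target` recommendation against the Deployment's declared requests is what surfaces the 2‑3× overprovisioning described above.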
Step 2: Remove ghost workloads
Unused or forgotten workloads—such as test clusters, old batch jobs, and PR preview applications—were identified with the following command:
kubectl get pods --all-namespaces --sort-by=.metadata.creationTimestamp

After deleting these resources and adding a TTL controller for preview environments, the bill dropped an additional 10%.
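The article doesn't show the TTL mechanism itself; one lightweight approach combines Kubernetes' built‑in TTL‑after‑finished field for batch Jobs with a label‑driven cleanup for preview namespaces (the Job name, namespace label, and label value below are illustrative):

```shell
# Finished batch Jobs can be garbage-collected automatically via the
# built-in TTL controller: here, delete 24 hours after completion.
# "nightly-report" is a placeholder Job name.
kubectl patch job nightly-report --type merge \
  -p '{"spec":{"ttlSecondsAfterFinished":86400}}'

# For PR preview environments, a CI job triggered on PR close can
# delete every namespace carrying the preview label (label is
# illustrative; scope it to the PR's namespace in practice):
kubectl get ns -l env=preview \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' \
  | xargs -r -n1 kubectl delete ns
```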
Step 3: Use smaller nodes instead of large ones
Large instances (e.g., >32 vCPU) often leave half a node idle, wasting money. Switching to smaller instance types such as m6i.large and letting the autoscaler adjust capacity increased utilization and reduced costs by about 8%.
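As a sketch of that switch, assuming an EKS cluster managed with eksctl (cluster and node‑group names are illustrative; other providers expose the same idea as node‑pool definitions):

```shell
# Create a managed node group of smaller instances that the cluster
# autoscaler can grow and shrink in fine-grained increments.
eksctl create nodegroup \
  --cluster prod-cluster \
  --name m6i-large-pool \
  --node-type m6i.large \
  --nodes-min 2 --nodes-max 20 \
  --managed

# Then drain and remove the oversized node group so workloads
# reschedule onto the smaller nodes:
eksctl delete nodegroup --cluster prod-cluster --name big-node-pool
```

The autoscaler can now match capacity to demand in ~2‑vCPU steps instead of stranding half of a 32‑vCPU machine.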
Step 4: Make autoscaling actually work
Enabling the Horizontal Pod Autoscaler (HPA) alone is insufficient: the team had initially set a CPU threshold so high that it never triggered. They retuned the target based on 90th‑percentile usage and added custom metrics (e.g., request rate via Prometheus Adapter). During low‑traffic periods Pods and nodes now scaled down, saving another 5‑7%.
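A sketch of an `autoscaling/v2` HPA combining a CPU target with a custom request‑rate metric served through Prometheus Adapter; the Deployment name, metric name, and target values are illustrative, and the metric must already be exposed by the adapter:

```shell
# HPA scaling on both CPU utilization and a per-Pod request-rate
# metric ("http_requests_per_second" is a placeholder metric name
# that Prometheus Adapter would need to be configured to serve).
kubectl apply -f - <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-api
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
EOF
```

With two metrics, the HPA scales to whichever demands more replicas, so quiet periods let both signals fall and the replica count drop.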
Step 5: Trim logs, storage, and data
Audit, debug, and system logs were stored on expensive block storage. By moving these logs to S3 Glacier, shortening the retention of non‑critical logs to seven days, and stopping debug‑log generation in production, the team saved roughly 6% of the overall spend.
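One way to implement the Glacier transition and the seven‑day retention, assuming logs are shipped to an S3 bucket (the bucket name and prefixes below are hypothetical), is an S3 lifecycle configuration:

```shell
# Transition audit logs to Glacier after 7 days and expire
# non-critical debug logs entirely after 7 days.
aws s3api put-bucket-lifecycle-configuration \
  --bucket example-cluster-logs \
  --lifecycle-configuration '{
    "Rules": [
      {
        "ID": "audit-to-glacier",
        "Filter": {"Prefix": "audit/"},
        "Status": "Enabled",
        "Transitions": [{"Days": 7, "StorageClass": "GLACIER"}]
      },
      {
        "ID": "expire-debug-logs",
        "Filter": {"Prefix": "debug/"},
        "Status": "Enabled",
        "Expiration": {"Days": 7}
      }
    ]
  }'
```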
After weeks of cleanup, the Kubernetes bill fell by about 40% without changing cloud providers. The cluster became smaller, faster, easier to maintain, and the team now reviews resource usage weekly to prevent waste.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
