How Tencent Cut Kubernetes CPU Costs by 70%: A Full‑Scale Cloud‑Native Optimization Journey
This article presents a comprehensive, data‑driven case study of how Tencent’s internal Kubernetes/TKE platform reduced monthly CPU usage by up to 70% and memory usage by 50% through systematic cost data collection, VPA/HPA enhancements, custom scheduling, node‑level over‑commit, and safe node decommissioning, while maintaining zero‑incident reliability.
Background
Tencent’s internal Kubernetes/TKE platform runs millions of pods and incurs tens of millions of RMB in monthly costs. The cost‑optimization effort described here reduced CPU usage by up to 70% and memory usage by 50%. The implementation was open‑sourced as the Crane project (https://github.com/gocrane/crane).
Data Collection & Analysis
Cost‑related metrics were gathered at multiple levels:
Cost bills – total cost per product/module and trend.
Resource‑level – CVM node counts, CPU/Memory/Extended‑Resource totals, utilization, and request‑allocation ratios per region and cluster.
Pod‑level – requested vs. actual CPU/Memory, request‑usage efficiency, OOM occurrences.
HPA effectiveness – coverage, min/max replica settings, trigger history.
Business analysis – workload patterns, service types (stateless vs. stateful), and workload kinds (Deployment, StatefulSet, custom Operator).
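The pod-level and resource-level ratios listed above can be sketched as follows. This is a minimal illustration, not the platform's actual collector: the `PodSample` record, its field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PodSample:
    """Hypothetical per-pod record; field names are illustrative only."""
    name: str
    cpu_request_cores: float
    cpu_usage_cores: float


def request_usage_efficiency(pods):
    """Total CPU usage divided by total CPU request (0.0 if nothing is requested)."""
    total_request = sum(p.cpu_request_cores for p in pods)
    total_usage = sum(p.cpu_usage_cores for p in pods)
    return total_usage / total_request if total_request > 0 else 0.0


def node_allocation_rate(pods, node_allocatable_cores):
    """Fraction of a node's allocatable CPU that is claimed by pod requests."""
    if node_allocatable_cores <= 0:
        raise ValueError("node_allocatable_cores must be positive")
    return sum(p.cpu_request_cores for p in pods) / node_allocatable_cores


# Made-up sample data: two pods that request far more CPU than they use.
pods = [
    PodSample("web-0", cpu_request_cores=2.0, cpu_usage_cores=0.2),
    PodSample("web-1", cpu_request_cores=2.0, cpu_usage_cores=0.3),
]
efficiency = request_usage_efficiency(pods)
allocation = node_allocation_rate(pods, node_allocatable_cores=8.0)
```

A low efficiency combined with a moderate allocation rate is the pattern the findings describe: the node looks half full on paper while most of the requested CPU sits idle.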
Key findings:
≈80% of cost is from CVM nodes used by three major business groups.
Node CPU utilization averages 5% (peak 15%); node allocation rate ~55% with uneven load.
Pod request values far exceed actual usage; some pods OOM without auto‑scaling.
HPA coverage is low and replica settings are sub‑optimal.
Optimization Measures
Pod resource‑usage improvement: Deploy Vertical Pod Autoscaler (VPA) to align requests with real usage, extend HPA to all components, and use CronHPA for periodic workloads.
Node allocation rate improvement: Choose instance types matching the observed 1:4 CPU‑to‑Memory ratio, switch scheduler priority from LeastRequestedPriority to MostRequestedPriority, enlarge pod CIDR range, and use dynamic scheduler + Descheduler for load balancing.
Node load improvement: Apply an Admission Webhook to lower node‑level requests, enable over‑commit of extended resources for BestEffort pods, apply VPA‑driven right‑sizing, and allow burstable QoS with safe over‑commit thresholds.
Billing optimization: Select the most cost‑effective billing mode (spot, reserved, pay‑as‑you‑go) per workload and choose the best‑price instance types.
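To see why switching from LeastRequestedPriority to MostRequestedPriority raises the allocation rate, the two scoring rules can be compared side by side. The sketch below is modelled on the upstream scheduler formulas in simplified form. The 0 to 100 score scale, equal CPU/memory weights, the treatment of over-requested nodes, and the node data are assumptions of this example.

```python
MAX_NODE_SCORE = 100  # assumed score scale for this sketch


def least_requested_score(requested, capacity):
    """Higher score for emptier nodes, which spreads pods out."""
    if capacity <= 0 or requested > capacity:
        return 0
    return (capacity - requested) * MAX_NODE_SCORE // capacity


def most_requested_score(requested, capacity):
    """Higher score for fuller nodes, which packs pods together."""
    if capacity <= 0 or requested > capacity:
        return 0
    return requested * MAX_NODE_SCORE // capacity


def node_score(score_fn, node):
    """Average the per-resource scores for CPU (millicores) and memory (MiB)."""
    cpu = score_fn(node["cpu_requested"], node["cpu_capacity"])
    mem = score_fn(node["mem_requested"], node["mem_capacity"])
    return (cpu + mem) // 2


def pick_node(score_fn, nodes):
    """Return the name of the node with the highest score."""
    return max(nodes, key=lambda name: node_score(score_fn, nodes[name]))


# Hypothetical nodes; "requested" already includes the incoming pod.
nodes = {
    "node-a": {"cpu_requested": 6000, "cpu_capacity": 8000,
               "mem_requested": 24576, "mem_capacity": 32768},
    "node-b": {"cpu_requested": 2000, "cpu_capacity": 8000,
               "mem_requested": 8192, "mem_capacity": 32768},
}
packed_choice = pick_node(most_requested_score, nodes)
spread_choice = pick_node(least_requested_score, nodes)
```

With the packing rule the busier node wins, so lightly loaded nodes drain over time and can be decommissioned. The spreading rule keeps every node partially filled.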
Industry Landscape & Solution Selection
The primary levers identified were VPA and HPA. The open‑source VPA architecture consists of Metrics Server, History Storage (usually Prometheus), VPA Controller (Recommender + Updater), and VPA Admission Controller. Limitations include performance at large scale, lack of custom metrics, slow response to spikes, and weak observability.
HPA is built into Kubernetes and supports multiple metric sources (metrics.k8s.io, custom.metrics.k8s.io, external.metrics.k8s.io). A typical HPA manifest (autoscaling/v2) looks like:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: php-apache
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50
  - type: Pods
    pods:
      metric:
        name: packets-per-second
      target:
        type: AverageValue
        averageValue: 1k
  - type: Object
    object:
      metric:
        name: requests-per-second
      describedObject:
        apiVersion: networking.k8s.io/v1
        kind: Ingress
        name: main-route
      target:
        type: Value
        value: 10k
HPA drawbacks include latency in reaction, limited observability, and lack of dry‑run support. Google Autopilot combines VPA and HPA concepts and demonstrates effective vertical scaling in Borg.
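For reference, the Kubernetes documentation describes the core HPA calculation as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), with no change when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below reproduces only that formula plus min/max clamping. It omits stabilization windows, scaling policies, and missing-metric handling, and the function name is illustrative.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas, tolerance=0.1):
    """Simplified HPA replica calculation for a single metric."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: keep the current replica count.
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the bounds declared in the HPA spec.
    return max(min_replicas, min(max_replicas, desired))


# 4 replicas at 80% average CPU against a 50% target.
scaled_up = desired_replicas(4, 80, 50, min_replicas=1, max_replicas=10)
# 52% against 50% is inside the tolerance band, so nothing changes.
unchanged = desired_replicas(4, 52, 50, min_replicas=1, max_replicas=10)
# A large spike is capped by maxReplicas.
capped = desired_replicas(10, 200, 50, min_replicas=1, max_replicas=10)
```

Because the result depends only on the most recent metric reading, the controller reacts after load has already changed, which is the reaction latency noted above.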
Design & Implementation
Design goals were extensibility, observability, and stability:
Support multi‑business namespaces via a pluggable ComponentProvider and custom portrait algorithms.
Expose detailed metrics (scaling counts, queue length, OOM events, component readiness, etc.).
Enable safe staged rollout: dry‑run → gray‑release → adaptive throttling → node decommission.
Portrait Module
Consists of a Workload‑Controller that watches Deployments, StatefulSets, Jobs, and CronJobs to generate Portrait custom resources, and a Workload‑Recommender that merges real‑time metrics (metrics‑server, OOM events) with historical data (Prometheus, Elasticsearch) using algorithms such as exponential‑decay histogram, XGBoost, and SMA.
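The exponential-decay idea behind such a recommender can be illustrated with a weighted percentile in which each usage sample loses half its weight every half-life. This is a simplified sketch, not the Portrait module's algorithm: it uses raw samples instead of histogram buckets, and the function name, half-life, and data are assumptions.

```python
def decayed_percentile(samples, percentile, now, half_life_seconds):
    """Weighted percentile over (timestamp_seconds, value) samples.

    A sample's weight is 0.5 ** (age / half_life), so old spikes fade out.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < percentile <= 1:
        raise ValueError("percentile must be in (0, 1]")
    weighted = sorted(
        (value, 0.5 ** ((now - ts) / half_life_seconds)) for ts, value in samples
    )
    total = sum(weight for _, weight in weighted)
    threshold = percentile * total
    cumulative = 0.0
    for value, weight in weighted:
        cumulative += weight
        if cumulative >= threshold:
            return value
    return weighted[-1][0]


DAY = 86400
now = 3 * DAY
# One spike of 4.0 cores three days ago, then two recent samples near 1 core.
samples = [(0, 4.0), (now, 1.0), (now, 1.2)]

with_decay = decayed_percentile(samples, 0.9, now, half_life_seconds=DAY)
# An extremely long half-life makes all weights almost equal (no real decay).
without_decay = decayed_percentile(samples, 0.9, now, half_life_seconds=1e12)
```

With a one-day half-life the three-day-old spike carries a weight of 0.125 and no longer dominates the 90th percentile. Without decay the same spike would set the recommendation.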
KMetis Module
KMetis provides a unified VPA/HPA/EHPA service and node‑scale capability. Core API resources are CSetScaler (per‑namespace scaling policies) and NodeScaler (node decommission tasks). The scaling workflow:
Periodically inspect workloads against CSetScaler expectations.
Coordinate VPA first via ScalerProvider, ResourceEstimator, UpdaterProvider, and RecordProvider.
Coordinate HPA afterwards using ReplicasEstimator with conflict‑avoidance logic.
Perform root‑cause analysis on high‑load nodes before scaling.
KMetis also supports custom horizontal scalers such as Crane’s EHPA with predictive DSP algorithms.
Deployment, Release Strategy & Results
Controlled release process:
Dry‑run mode collects prediction data without mutating workloads.
Gray‑release uncovers hidden issues at scale.
Adaptive throttling limits concurrent scaling actions (e.g., max 20 simultaneous updates).
Safe node shutdown uses custom affinity tags and the NodeScaler workflow.
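The dry-run and throttling steps above can be sketched together. The example below is a hypothetical helper, not KMetis code: the `ThrottledScaler` class, its attributes, and the `fake_apply` stand-in are invented for illustration. A thread pool size caps concurrency, and the concurrency cap is set to 3 here only to keep the demo small.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ThrottledScaler:
    """Hypothetical helper: applies scaling actions under a concurrency cap."""

    def __init__(self, max_concurrent=20, dry_run=True):
        self.max_concurrent = max_concurrent
        self.dry_run = dry_run
        self.planned = []  # recorded in both modes
        self.applied = []  # filled only when dry_run is False
        self._lock = threading.Lock()

    def _handle(self, workload, apply_fn):
        with self._lock:
            self.planned.append(workload)
        if self.dry_run:
            return  # record the plan, never mutate the workload
        apply_fn(workload)
        with self._lock:
            self.applied.append(workload)

    def run(self, workloads, apply_fn):
        # The pool size bounds how many updates are in flight at once.
        with ThreadPoolExecutor(max_workers=self.max_concurrent) as pool:
            # list() waits for completion and re-raises worker exceptions.
            list(pool.map(lambda w: self._handle(w, apply_fn), workloads))


# Gauge used only to observe peak concurrency in this demo.
gauge = {"in_flight": 0, "peak": 0}
gauge_lock = threading.Lock()


def fake_apply(workload):
    """Stand-in for a real update call; sleeps briefly to simulate work."""
    with gauge_lock:
        gauge["in_flight"] += 1
        gauge["peak"] = max(gauge["peak"], gauge["in_flight"])
    time.sleep(0.01)
    with gauge_lock:
        gauge["in_flight"] -= 1


workloads = [f"deploy-{i}" for i in range(12)]

dry = ThrottledScaler(max_concurrent=3, dry_run=True)
dry.run(workloads, fake_apply)

live = ThrottledScaler(max_concurrent=3, dry_run=False)
live.run(workloads, fake_apply)
```

Keeping the plan recording identical in both modes means the dry-run output can be compared directly with what a live run would have applied.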
During dry‑run the system identified a 1:4 CPU‑to‑Memory ratio, prompting migration from 8c16g to 4c16g instances for a core service. Subsequent gray‑release and throttling enabled tens of thousands of safe scaling actions.
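The instance-type decision comes down to simple ratio matching: an 8c16g node offers 2 GiB per core, while a 1:4 workload needs 4 GiB per core. A minimal sketch, assuming a hypothetical catalogue of instance shapes and made-up observed usage:

```python
def closest_instance(observed_cpu_cores, observed_memory_gib, instance_types):
    """Pick the instance whose GiB-per-core ratio is nearest the observed one.

    instance_types is a list of (name, cpu_cores, memory_gib) tuples.
    """
    if observed_cpu_cores <= 0:
        raise ValueError("observed_cpu_cores must be positive")
    target = observed_memory_gib / observed_cpu_cores
    return min(instance_types, key=lambda t: abs(t[2] / t[1] - target))


# Hypothetical catalogue and observed aggregate usage (1 core : 4 GiB).
catalogue = [("8c16g", 8, 16), ("4c16g", 4, 16), ("4c8g", 4, 8)]
best = closest_instance(observed_cpu_cores=100, observed_memory_gib=400,
                        instance_types=catalogue)
```

On a mismatched shape one resource fills up first and the other is stranded, so matching the ratio lifts the achievable allocation rate for both.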
Node‑safe‑drain is achieved with an affinity configuration such as:
affinity:
  nodeAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - preference:
        matchExpressions:
        - key: level
          operator: In
          values:
          - small
        - key: x.y.z/component
          operator: In
          values:
          - normal
      weight: 10
Combined with MostRequestedPriority, dynamic scheduler, and Descheduler, node allocation rose from ~50% to 99% CPU and 88% Memory, while average CPU utilization increased from 5% to 21.4%.
Effectiveness
Business A – 70% CPU reduction.
Business B – 45% CPU reduction.
Business C – 50% CPU reduction.
Overall cost dropped dramatically with zero incidents throughout the rollout.
Conclusion
Stability
Raising pod density required careful handling of kernel/Docker/Kubelet bugs and Service‑LB unbinding delays. Issues were mitigated with NodeProblemDetectorPlus, graceful termination scripts, and rolling‑update strategies.
Future Direction
Further CPU utilization gains are expected by extending node‑level over‑commit techniques.
This end‑to‑end case study demonstrates a reproducible methodology for large‑scale Kubernetes cost optimization, from data collection and algorithmic portrait generation to safe, observable, and automated scaling.
Cloud Native Technology Community
The Cloud Native Technology Community, part of the CNBPA Cloud Native Technology Practice Alliance, focuses on evangelizing cutting‑edge cloud‑native technologies and practical implementations. It shares in‑depth content, case studies, and event/meetup information on containers, Kubernetes, DevOps, Service Mesh, and other cloud‑native tech, along with updates from the CNBPA alliance.
