KubeCost: Kubernetes-Based Resource Cost Analysis and Allocation System
KubeCost, developed by NetEase Cloud Music, is a low-intrusion, scalable Kubernetes cost analysis system. It allocates resource costs by peak or actual usage depending on the billing model, supports hybrid and multi-cloud pricing, aggregates CPU, memory, and GPU costs per POD, and stores billing records in ClickHouse so that costs can be attributed to business lines.
This article introduces KubeCost, a Kubernetes-based resource cost analysis tool developed by NetEase Cloud Music to address IT cost management challenges in the cloud-native era.
Background and Challenges:
Many internet companies have entered a stable development phase in which cost control is critical. IT costs typically account for about one third of total operating costs (the ratio of technology cost to human-resource cost is roughly 1:2 to 1:2.5). The adoption of Kubernetes, containers, and DevOps practices has made resource management more complex. Through containerization, oversubscription, unified scheduling, and hybrid cloud deployment, NetEase Cloud Music raised peak resource utilization above 50% and saves tens of millions annually. Challenges remain, however: because DevOps makes resources easy to obtain, resource consumption keeps growing rapidly, and the "big ledger problem" makes it difficult to allocate costs to business lines and evaluate ROI.
Key Challenges Identified:
Decentralization: Traditional centralized financial budgeting is shifting to business-oriented distributed decision-making
Dynamic Changes: Cloud environments and elastic capabilities cause costs to vary with business load
Excess Waste: Easy access to resources often leads to over-provisioning
KubeCost Features:
Multiple Billing Models: Supports annually reserved and pay-as-you-go pricing. Reserved resources are allocated by peak usage; spot resources and low-utilization periods are allocated by actual usage.
Hybrid/Multi-Cloud Support: Handles different pricing models across internal resources and public clouds (Aliyun, AWS).
Billing Model: Follows the OpenCost specification. Core principle: allocate = max(usage, request). The base billing unit is 10 minutes, aligned to wall-clock time for stability.
Supported Resource Types: CPU, Memory, GPU, and more. Costs are calculated per POD by aggregating individual resource costs (CPU, memory, etc.).
Rich Filtering and Aggregation: Supports label-based filtering and aggregation by Namespace, Cluster, and POD labels.
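The billing rules in this list can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not KubeCost's actual code: the names align_window, allocate, ResourceSample, and pod_cost, as well as the unit prices, are hypothetical. It shows how a timestamp maps to a wall-clock-aligned 10-minute window, how allocate = max(usage, request) is applied per resource, and how per-resource costs add up to a per-POD cost.

```python
from dataclasses import dataclass
from typing import List, Tuple

WINDOW_SECONDS = 600  # 10-minute base billing unit, as stated in the text


def align_window(ts: int, window: int = WINDOW_SECONDS) -> Tuple[int, int]:
    """Return the wall-clock-aligned [start, end) window containing ts (Unix seconds)."""
    start = ts - ts % window
    return start, start + window


def allocate(usage: float, request: float) -> float:
    """Billable amount of one resource in one window: max(usage, request)."""
    return max(usage, request)


@dataclass
class ResourceSample:
    item: str          # "cpu", "mem", "gpu", ...
    usage: float       # measured usage in the window
    request: float     # Kubernetes resource request
    unit_price: float  # hypothetical price per unit per window


def pod_cost(samples: List[ResourceSample]) -> float:
    """Sum the per-resource costs into one per-POD cost for a single window."""
    return sum(allocate(s.usage, s.request) * s.unit_price for s in samples)


samples = [
    ResourceSample("cpu", usage=1.5, request=2.0, unit_price=0.25),   # billed on request
    ResourceSample("mem", usage=6.0, request=4.0, unit_price=0.125),  # billed on usage
]
window = align_window(1_700_000_123)  # (1699999800, 1700000400)
cost = pod_cost(samples)              # 2.0 * 0.25 + 6.0 * 0.125 = 1.25
```

Aligning windows to wall-clock boundaries means that every collector run, including a rerun after a failure, computes the same window for the same timestamp.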
Architecture Design Principles:
Low Intrusion: Uses a sidecar-less, metrics-based collection approach
Reliability: 3+ replica deployment for ApiServer/etcd; Prometheus with dual backup; node failure has minimal impact
Scalability: Supports 100k+ PODs; uses ClickHouse for storage (~20GB/month for 120k PODs at 10min intervals)
Extensibility: Plugin-based billing logic for flexibility
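The storage figure quoted under Scalability can be sanity-checked with simple arithmetic. The sketch below assumes a 30-day month and decimal gigabytes, neither of which is stated in the source, and counts one "pod-window" per POD per 10-minute interval.

```python
PODS = 120_000
WINDOW_MINUTES = 10
DAYS_PER_MONTH = 30                # assumption: 30-day month
BYTES_PER_GB = 10**9               # assumption: decimal gigabytes
MONTHLY_BYTES = 20 * BYTES_PER_GB  # ~20GB/month from the text

windows_per_day = 24 * 60 // WINDOW_MINUTES                       # 144
pod_windows_per_month = PODS * windows_per_day * DAYS_PER_MONTH   # 518,400,000
bytes_per_pod_window = MONTHLY_BYTES / pod_windows_per_month      # about 38.6

print(f"{pod_windows_per_month:,} pod-windows per month")
print(f"~{bytes_per_pod_window:.1f} bytes per pod-window")
```

Under these assumptions the budget works out to roughly 39 bytes per POD per window. Since each window may hold several rows (one per billed item), the quoted figure most plausibly refers to compressed on-disk size.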
Data Model:
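Uses ClickHouse ReplacingMergeTree for efficient storage and cheap retries: rows that share the same sorting key are deduplicated during merges in favor of the highest create_time, so a failed billing window can simply be recomputed and inserted again: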
CREATE TABLE IF NOT EXISTS kubecost.kube_billing_infos
(
    create_time Int64 COMMENT 'record create time',
    start_time Int64 COMMENT 'billing start time',
    end_time Int64 COMMENT 'billing end time',
    item String COMMENT 'billing item, e.g. cpu, mem, gpu',
    cost Float64 COMMENT 'billing cost',
    currency String COMMENT 'billing currency',
    entity_primary_key String COMMENT 'entity primary key, cluster/namespace/pod/container',
    usage_info Map(String, Float64) COMMENT 'e.g. usage, request, allocate',
    label_info Map(String, String) COMMENT 'basic labels',
    price_info String COMMENT 'cost price info'
)
-- create_time is the version column: on merge, the newest row per sorting key is kept
ENGINE = ReplacingMergeTree(create_time)
PARTITION BY toYYYYMM(FROM_UNIXTIME(start_time))
ORDER BY (start_time, end_time, item, entity_primary_key);
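To make the retry behavior of this table concrete, the following is a pure-Python model of ReplacingMergeTree deduplication. It is a sketch for illustration only, not ClickHouse code and not KubeCost's implementation: the function names and the sample entity key are hypothetical, and rows are plain dictionaries using the column names from the table definition.

```python
from typing import Dict, List, Tuple

# Sorting key from the table definition:
# (start_time, end_time, item, entity_primary_key)
Key = Tuple[int, int, str, str]


def sorting_key(row: dict) -> Key:
    return (row["start_time"], row["end_time"], row["item"], row["entity_primary_key"])


def merge_replacing(rows: List[dict]) -> List[dict]:
    """Keep, per sorting key, the row with the highest create_time (the version column)."""
    latest: Dict[Key, dict] = {}
    for row in rows:
        key = sorting_key(row)
        # >= so that on a version tie the later-inserted row wins
        if key not in latest or row["create_time"] >= latest[key]["create_time"]:
            latest[key] = row
    return sorted(latest.values(), key=sorting_key)


entity = "cluster-a/default/web-0/app"  # hypothetical entity key
base = {"start_time": 1699999800, "end_time": 1700000400, "entity_primary_key": entity}

# First attempt wrote only the cpu row before failing.
first_try = [{**base, "item": "cpu", "cost": 0.5, "create_time": 1000}]
# The retry recomputes the whole window and inserts both rows again.
retry = [
    {**base, "item": "cpu", "cost": 0.5, "create_time": 2000},
    {**base, "item": "mem", "cost": 0.75, "create_time": 2000},
]

merged = merge_replacing(first_try + retry)  # two rows: cpu (create_time 2000) and mem
```

In ClickHouse itself this deduplication happens during background merges, so duplicates can be visible for a while after a retry; queries that need exact totals before then typically use the FINAL modifier or aggregate by the sorting key.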
Official account of NetEase Cloud Music Tech Team