Improving Cloud Cost Allocation and Resource Utilization through Catalog, Tags, and Automated Monitoring
This article describes how a tech team built a catalog‑based cost‑allocation system, leveraged cloud tags and Kubernetes labels, used Prometheus data for scaling decisions, and combined reserved, spot, and on‑demand instances to boost cloud resource utilization while keeping services stable.
Background
When a company invests deeply in cloud services, it often faces low resource utilization and struggles to automate cost sharing across technical teams. The challenge is to raise utilization, and to scale resources up or down sensibly, without impacting the business.
Cost Allocation
All major cloud providers expose a tagging system (e.g., AWS Tagging Strategies) and Kubernetes offers a label system; these can be correlated. We built an internal Catalog system that binds resources to applications, members, owners, and teams, providing a single source of truth for cost attribution.
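As a minimal sketch of this correlation, the snippet below joins billing line items (carrying a cloud tag) to owning teams through a Catalog-style mapping. The tag key `app` and all application and team names are illustrative, not the actual Catalog schema.

```python
# Sketch: attribute billing line items to teams by joining the resource's
# "app" tag to a Catalog mapping (single source of truth for ownership).
# Tag key, app names, and team names are hypothetical examples.

CATALOG = {
    "checkout-api": {"owner": "alice", "team": "payments"},
    "search-index": {"owner": "bob",   "team": "discovery"},
}

def team_for(line_item: dict) -> str:
    """Resolve a billing line item to a team via its 'app' tag."""
    app = line_item.get("tags", {}).get("app")
    entry = CATALOG.get(app)
    return entry["team"] if entry else "unallocated"

items = [
    {"cost": 120.0, "tags": {"app": "checkout-api"}},
    {"cost": 80.0,  "tags": {"app": "search-index"}},
    {"cost": 30.0,  "tags": {}},  # untagged resource -> flagged for follow-up
]

by_team: dict = {}
for it in items:
    team = team_for(it)
    by_team[team] = by_team.get(team, 0.0) + it["cost"]

print(by_team)  # {'payments': 120.0, 'discovery': 80.0, 'unallocated': 30.0}
```

Because attribution goes through the Catalog rather than being hard-coded on each resource, an ownership change is a single mapping update.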
When ownership changes, only the Catalog needs updating. For resources shared by multiple teams, we first allocate costs based on clear relationships, then distribute the remaining unassigned costs proportionally across business lines, achieving consensus.
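The proportional step can be sketched as follows; the business-line names and dollar figures are made up for illustration.

```python
# Sketch: distribute costs with no clear owner proportionally to each
# business line's directly attributed spend. All figures are illustrative.

direct = {"lineA": 700.0, "lineB": 300.0}  # clearly attributed costs
shared = 200.0                             # remaining unassigned costs

total_direct = sum(direct.values())
allocated = {
    line: round(cost + shared * cost / total_direct, 2)
    for line, cost in direct.items()
}

print(allocated)  # {'lineA': 840.0, 'lineB': 360.0}
```

lineA carries 70% of direct spend, so it absorbs 70% of the shared pool; the split is mechanical once the direct attribution is agreed.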
Public‑support teams (big data, infrastructure, middle‑platform) receive cost shares based on the overall business‑line proportion, and we calculate R&D cost ratios for each line, producing month‑over‑month and year‑over‑year reports.
Result: an automated monthly cost‑analysis report and a real‑time monitoring dashboard.
Scaling Rationalization
Using historical Prometheus metrics, we compute weekly CPU, memory, and storage IOPS utilization for all resources and distill the figures into a core weekly report.
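A sketch of the aggregation behind such a report: samples would come from Prometheus (for example via the `/api/v1/query_range` HTTP API), but here the data is hard-coded and illustrative.

```python
# Sketch: turn one week of utilization samples (already fetched from
# Prometheus, e.g. via the /api/v1/query_range HTTP API) into the two
# numbers a weekly report needs: average and peak. Data is illustrative.

def weekly_stats(samples: list[float]) -> dict:
    """Summarize a week of utilization samples in the range 0.0-1.0."""
    return {
        "avg": round(sum(samples) / len(samples), 3),
        "peak": max(samples),
    }

cpu_samples = [0.12, 0.30, 0.75, 0.40, 0.22]  # pretend: one sample per interval
print(weekly_stats(cpu_samples))  # {'avg': 0.358, 'peak': 0.75}
```

Reporting both average and peak matters: a resource with a 12% average but 75% peak is a poor downsizing candidate.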
Data‑driven scaling has not caused any production incidents.
Improving Utilization
Kubernetes already provides powerful bin‑packing capabilities; see the Bin Packing Problem for details.
Because the Horizontal Pod Autoscaler (HPA) can lag during sudden load spikes, we employ CronHPA for scheduled scaling and use extensive Prometheus data to evaluate baseline pod resources and elasticity windows, reducing HPA‑related scaling failures.
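The analysis that feeds a scheduled-scaling window can be sketched like this; the load shape, threshold, and replica counts are invented for illustration and are not our production values.

```python
# Sketch: derive a scheduled scale-up window from historical hourly load,
# the kind of analysis that feeds CronHPA. Threshold and replica counts
# are illustrative only.

# Pretend history: ~900 rps during business hours, ~150 rps overnight.
hourly_avg_rps = {h: (900 if 9 <= h < 21 else 150) for h in range(24)}

BASELINE, PEAK_REPLICAS = 4, 12
THRESHOLD = 500  # rps above which baseline capacity is insufficient

peak_hours = sorted(h for h, rps in hourly_avg_rps.items() if rps > THRESHOLD)
print(f"scale to {PEAK_REPLICAS} at hour {peak_hours[0]}, "
      f"back to {BASELINE} at hour {peak_hours[-1] + 1}")
# scale to 12 at hour 9, back to 4 at hour 21
```

Scaling up ahead of a predictable peak avoids the reactive lag of metric-driven HPA, while HPA still covers the unpredictable remainder.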
We also offer a Resource Recommendation (RR) service that suggests optimal CPU/Memory requests based on observed utilization, helping teams avoid over‑provisioning while preventing OOM‑Kill or throttling. The service is stable in test environments and will be rolled out to production.
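One simple way to frame such a recommendation, shown below as a sketch rather than the RR service's actual algorithm: take a high percentile of observed usage and add headroom. The percentile and headroom values are example choices.

```python
# Sketch of the request-recommendation idea: suggest a CPU request near a
# high percentile of observed usage plus headroom, so pods are neither
# over-provisioned nor throttled. Percentile/headroom are example values,
# not the production algorithm.

def recommend_request(usage_samples: list[float],
                      percentile: float = 0.95,
                      headroom: float = 1.2) -> float:
    """Return a suggested CPU request (cores) from observed usage."""
    ordered = sorted(usage_samples)
    idx = min(int(len(ordered) * percentile), len(ordered) - 1)
    return round(ordered[idx] * headroom, 2)

usage = [0.10, 0.12, 0.15, 0.20, 0.22, 0.25, 0.30, 0.35, 0.40, 0.80]
print(recommend_request(usage))  # 0.96
```

Using a percentile rather than the mean keeps rare spikes from being ignored, which is what guards against OOM-Kill and throttling.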
Cloud Provider Optimization
Key cloud purchasing options:
RI (Reserved Instances): commit to instance capacity for a one‑ or three‑year term in exchange for a discount.
Spot: purchase a provider's spare capacity at a deep discount, accepting that instances can be reclaimed on short notice.
OD (On‑Demand): pay‑as‑you‑go.
Savings Plan: commit to a baseline of compute spend (e.g., $/hour over a term) that applies flexibly across instance types.
These options can be combined: use RI‑covered OD as a baseline, supplement with Spot + OD for burst capacity, and gradually introduce Savings Plans to compress residual costs. Autoscaling groups mix OD and Spot, falling back to OD when Spot is unavailable.
Best practice targets ~50% RI coverage during peaks and ~80% daily, supplemented by Savings Plans and Spot for sudden demand spikes. We are building tools to balance OD and Spot automatically.
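A back-of-the-envelope model of the blended cost of such a mix is sketched below; the per-hour prices and coverage shares are illustrative, not real quotes or our targets.

```python
# Sketch: estimate blended hourly cost for a capacity plan mixing RI,
# Spot, and On-Demand instances. Prices and shares are illustrative only.

PRICES = {"ri": 0.06, "spot": 0.03, "od": 0.10}  # $/instance-hour (example)

def blended_cost(total_instances: int, ri_share: float, spot_share: float) -> float:
    """Cost per hour given RI/Spot shares; the remainder runs On-Demand."""
    od_share = 1.0 - ri_share - spot_share
    assert od_share >= 0, "shares exceed 100%"
    return total_instances * (
        ri_share * PRICES["ri"]
        + spot_share * PRICES["spot"]
        + od_share * PRICES["od"]
    )

# 100 instances: 80% RI baseline, 10% Spot burst, 10% On-Demand fallback.
print(round(blended_cost(100, 0.8, 0.1), 2))  # 6.1
```

Comparing mixes this way makes the trade-off concrete: pushing Spot share up cuts cost but raises interruption exposure, which is exactly what the OD fallback hedges.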
Conclusion
By integrating data, tooling, and cloud‑vendor purchasing strategies, we have significantly reduced waste while maintaining service stability. Ongoing work links business metrics (DAU, MAU) to cloud consumption, improving budget accuracy for infrastructure.
Liulishuo Tech Team
Help everyone become a global citizen!