Improving Cloud Cost Allocation and Resource Utilization through Catalog, Tags, and Automated Monitoring
This article describes how a tech team built a catalog‑based cost‑allocation system, leveraged cloud tags and Kubernetes labels, used Prometheus data for scaling decisions, and combined reserved, spot, and on‑demand instances to boost cloud resource utilization while keeping services stable.
Background
When a company invests deeply in cloud services, it often faces low resource utilization and struggles to automate cost sharing across technical teams. The challenge is to raise utilization, and to scale resources up or down sensibly, without impacting the business.
Cost Allocation
All major cloud providers expose a tagging system (e.g., AWS Tagging Strategies) and Kubernetes offers a label system; these can be correlated. We built an internal Catalog system that binds resources to applications, members, owners, and teams, providing a single source of truth for cost attribution.
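As a minimal sketch of this correlation, the snippet below joins billing line items (carrying a cloud tag) to owning teams through a Catalog-style mapping. The tag key `app` and all application and team names are illustrative, not the actual Catalog schema.

```python
# Sketch: attribute billing line items to teams by joining the resource's
# "app" tag to a Catalog mapping (single source of truth for ownership).
# Tag key, app names, and team names are hypothetical examples.

CATALOG = {
    "checkout-api": {"owner": "alice", "team": "payments"},
    "search-index": {"owner": "bob",   "team": "discovery"},
}

def team_for(line_item: dict) -> str:
    """Resolve a billing line item to a team via its 'app' tag."""
    app = line_item.get("tags", {}).get("app")
    entry = CATALOG.get(app)
    return entry["team"] if entry else "unallocated"

items = [
    {"cost": 120.0, "tags": {"app": "checkout-api"}},
    {"cost": 80.0,  "tags": {"app": "search-index"}},
    {"cost": 30.0,  "tags": {}},  # untagged resource -> flagged for follow-up
]

by_team: dict = {}
for it in items:
    team = team_for(it)
    by_team[team] = by_team.get(team, 0.0) + it["cost"]

print(by_team)  # {'payments': 120.0, 'discovery': 80.0, 'unallocated': 30.0}
```

Because attribution goes through the Catalog rather than being hard-coded on each resource, an ownership change is a single mapping update.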
When ownership changes, only the Catalog needs updating. For resources shared by multiple teams, we first allocate costs based on clear relationships, then distribute the remaining unassigned costs proportionally across business lines, achieving consensus.
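The proportional step can be sketched as follows; the business-line names and dollar figures are made up for illustration.

```python
# Sketch: distribute costs with no clear owner proportionally to each
# business line's directly attributed spend. All figures are illustrative.

direct = {"lineA": 700.0, "lineB": 300.0}  # clearly attributed costs
shared = 200.0                             # remaining unassigned costs

total_direct = sum(direct.values())
allocated = {
    line: round(cost + shared * cost / total_direct, 2)
    for line, cost in direct.items()
}

print(allocated)  # {'lineA': 840.0, 'lineB': 360.0}
```

lineA carries 70% of direct spend, so it absorbs 70% of the shared pool; the split is mechanical once the direct attribution is agreed.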
Public‑support teams (big data, infrastructure, middle‑platform) receive cost shares based on the overall business‑line proportion, and we calculate R&D cost ratios for each line, producing month‑over‑month and year‑over‑year reports.
Result: an automated monthly cost‑analysis report and a real‑time monitoring dashboard.
Scaling Rationalization
Using historical Prometheus metrics, we compute weekly CPU, memory, and storage IOPS utilization for all resources and distill the figures into a core weekly report.
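A sketch of the aggregation behind such a report: samples would come from Prometheus (for example via the `/api/v1/query_range` HTTP API), but here the data is hard-coded and illustrative.

```python
# Sketch: turn one week of utilization samples (already fetched from
# Prometheus, e.g. via the /api/v1/query_range HTTP API) into the two
# numbers a weekly report needs: average and peak. Data is illustrative.

def weekly_stats(samples: list[float]) -> dict:
    """Summarize a week of utilization samples in the range 0.0-1.0."""
    return {
        "avg": round(sum(samples) / len(samples), 3),
        "peak": max(samples),
    }

cpu_samples = [0.12, 0.30, 0.75, 0.40, 0.22]  # pretend: one sample per interval
print(weekly_stats(cpu_samples))  # {'avg': 0.358, 'peak': 0.75}
```

Reporting both average and peak matters: a resource with a 12% average but 75% peak is a poor downsizing candidate.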
Data‑driven scaling has not caused any production incidents.
Improving Utilization
Kubernetes already provides powerful bin‑packing capabilities; see the Bin Packing Problem for details.
Because the Horizontal Pod Autoscaler (HPA) can lag during sudden load spikes, we employ CronHPA for scheduled scaling and use extensive Prometheus data to evaluate baseline pod resources and elasticity windows, reducing HPA‑related scaling failures.
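The analysis that feeds a scheduled-scaling window can be sketched like this; the load shape, threshold, and replica counts are invented for illustration and are not our production values.

```python
# Sketch: derive a scheduled scale-up window from historical hourly load,
# the kind of analysis that feeds CronHPA. Threshold and replica counts
# are illustrative only.

# Pretend history: ~900 rps during business hours, ~150 rps overnight.
hourly_avg_rps = {h: (900 if 9 <= h < 21 else 150) for h in range(24)}

BASELINE, PEAK_REPLICAS = 4, 12
THRESHOLD = 500  # rps above which baseline capacity is insufficient

peak_hours = sorted(h for h, rps in hourly_avg_rps.items() if rps > THRESHOLD)
print(f"scale to {PEAK_REPLICAS} at hour {peak_hours[0]}, "
      f"back to {BASELINE} at hour {peak_hours[-1] + 1}")
# scale to 12 at hour 9, back to 4 at hour 21
```

Scaling up ahead of a predictable peak avoids the reactive lag of metric-driven HPA, while HPA still covers the unpredictable remainder.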
We also offer a Resource Recommendation (RR) service that suggests optimal CPU/Memory requests based on observed utilization, helping teams avoid over‑provisioning while preventing OOM‑Kill or throttling. The service is stable in test environments and will be rolled out to production.
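One simple way to frame such a recommendation, shown below as a sketch rather than the RR service's actual algorithm: take a high percentile of observed usage and add headroom. The percentile and headroom values are example choices.

```python
# Sketch of the request-recommendation idea: suggest a CPU request near a
# high percentile of observed usage plus headroom, so pods are neither
# over-provisioned nor throttled. Percentile/headroom are example values,
# not the production algorithm.

def recommend_request(usage_samples: list[float],
                      percentile: float = 0.95,
                      headroom: float = 1.2) -> float:
    """Return a suggested CPU request (cores) from observed usage."""
    ordered = sorted(usage_samples)
    idx = min(int(len(ordered) * percentile), len(ordered) - 1)
    return round(ordered[idx] * headroom, 2)

usage = [0.10, 0.12, 0.15, 0.20, 0.22, 0.25, 0.30, 0.35, 0.40, 0.80]
print(recommend_request(usage))  # 0.96
```

Using a percentile rather than the mean keeps rare spikes from being ignored, which is what guards against OOM-Kill and throttling.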
Cloud Provider Optimization
Key cloud purchasing options:
RI (Reserved Instances): commit to instance capacity for a one‑ or three‑year term in exchange for a discount.
Spot: purchase a provider's spare capacity at a deep discount, accepting that instances can be reclaimed on short notice.
OD (On‑Demand): pay‑as‑you‑go.
Savings Plan: commit to a baseline of compute spend (e.g., $/hour over a term) that applies flexibly across instance types.
These options can be combined: use RI‑covered OD as a baseline, supplement with Spot + OD for burst capacity, and gradually introduce Savings Plans to compress residual costs. Autoscaling groups mix OD and Spot, falling back to OD when Spot is unavailable.
Best practice targets ~50% RI coverage during peaks and ~80% daily, supplemented by Savings Plans and Spot for sudden demand spikes. We are building tools to balance OD and Spot automatically.
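A back-of-the-envelope model of the blended cost of such a mix is sketched below; the per-hour prices and coverage shares are illustrative, not real quotes or our targets.

```python
# Sketch: estimate blended hourly cost for a capacity plan mixing RI,
# Spot, and On-Demand instances. Prices and shares are illustrative only.

PRICES = {"ri": 0.06, "spot": 0.03, "od": 0.10}  # $/instance-hour (example)

def blended_cost(total_instances: int, ri_share: float, spot_share: float) -> float:
    """Cost per hour given RI/Spot shares; the remainder runs On-Demand."""
    od_share = 1.0 - ri_share - spot_share
    assert od_share >= 0, "shares exceed 100%"
    return total_instances * (
        ri_share * PRICES["ri"]
        + spot_share * PRICES["spot"]
        + od_share * PRICES["od"]
    )

# 100 instances: 80% RI baseline, 10% Spot burst, 10% On-Demand fallback.
print(round(blended_cost(100, 0.8, 0.1), 2))  # 6.1
```

Comparing mixes this way makes the trade-off concrete: pushing Spot share up cuts cost but raises interruption exposure, which is exactly what the OD fallback hedges.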
Conclusion
By integrating data, tooling, and cloud‑vendor purchasing strategies, we have significantly reduced waste while maintaining service stability. Ongoing work links business metrics (DAU, MAU) to cloud consumption, improving budget accuracy for infrastructure.
Liulishuo Tech Team
Help everyone become a global citizen!