How Youzan Built a Full‑Scale Data Cost Billing System: From SDK to Multi‑Dimensional Analysis
This article details Youzan's end‑to‑end construction of a unified data‑center cost billing system, covering background goals, multi‑type cost support, SDK‑based information collection, cost quantification for offline, real‑time and platform tools, full‑business coverage, multi‑dimensional analysis models, operational rollout, and future plans.
Background
In H1 2020 Youzan started a data‑center offline cost governance project and, in H2, expanded it to a global cost‑billing system that covers offline, real‑time, and platform‑tool workloads to improve cost transparency for the platform and its users.
Goals
Support multiple cost types (offline, real‑time, platform tools)
Cover all business lines and domains
Provide multi‑dimensional analysis methods
Enable operation across all channels
1. Multi‑type Cost Support
1.1 Cost Classification
Three service categories are defined:
Offline computation (e.g., Spark, Hive)
Real‑time computation (e.g., HBase, Druid, Kafka, Flink)
Platform tools (internal data‑R&D platform, asset‑governance platform, BI reporting platform)
1.2 Unified Data Collection SDK
A lightweight SDK collects hardware resources (CPU, memory, disk, NIC) and runtime metrics (task usage, client access, storage). The SDK provides a unified data model, Kafka‑based decoupling, and robust validation to avoid tight coupling and high maintenance costs.
1.3 Cost Quantification
1.3.1 Offline Cost
The offline cost calculation follows the methodology described in a previous article (reference omitted for brevity).
1.3.2 Real‑time Cost
The quantification process consists of three steps:
Collect total cost and total resource pool.
Derive resource unit prices using scarcity weighting and target utilization.
Combine SDK‑collected consumption data to compute task‑level costs.
Example for Flink:
cpu_price = total_cost * cpu_index / (total_cpu * load) memory_price = total_cost * memory_index / (total_memory * load) task_cost = cpu * cpu_price + memory * memory_price1.3.3 Platform Tool Cost
Because platform‑tool cost is relatively low, a simple pricing based on the machine model is used. All figures are anonymized, focusing on trends rather than absolute values.
2. Full‑Business Coverage
Business domains are organized into three layers to improve coverage to over 90%:
Business lines – external products.
Generic business domains – data‑warehouse middle layers.
Platform/module domains – internal platform components.
2.1 Cost Allocation
Costs are split into shared (allocation) and exclusive (dedicated) portions. Allocation ratios are derived from daily order volume. A two‑stage allocation distributes costs to business lines, business domains, and platform tools.
3. Multi‑dimensional Analysis Capabilities
A data model ties dimensions (business line, department, resource type) with metrics (incurred cost, reducible cost, reduced cost), forming a typical OLAP scenario. Spark Cube is used to generate multi‑dimensional datasets that support dynamic queries.
3.1 Interaction Design
Long‑term trend analysis.
Drill‑down analysis to task‑level cost.
Cascade analysis where selecting an anomaly updates related visualizations.
3.2 Views
The final billing view offers five perspectives (global, business line, business domain, department, individual) and supports dynamic trends, drill‑down, and cascade analysis.
4. Operational Rollout
A technical‑operations approach was used to drive adoption, including product launch videos, online/offline promotion channels, and targeted reminders (e.g., pop‑up prompts for partition lifecycle configuration). These actions increased click‑through rates and overall participation in cost‑saving actions.
5. Phase Summary
The team delivered:
A unified SDK for data collection.
A flexible data model for cost quantification across offline, real‑time, and platform‑tool workloads.
A multi‑dimensional query service supporting five analysis views and three analysis methods.
6. Future Plans
Extend coverage to all data‑center components as the ecosystem evolves.
Introduce intelligent cost anomaly detection and personalized recommendations.
Systematize cost‑operation processes and share the model with other data‑governance teams.
Deepen cost‑saving initiatives for real‑time and platform‑tool workloads.
Youzan Coder
Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
