
How Youzan Built a Full‑Scale Data Cost Billing System: From SDK to Multi‑Dimensional Analysis

This article details Youzan's end‑to‑end construction of a unified data‑center cost billing system, covering background goals, multi‑type cost support, SDK‑based information collection, cost quantification for offline, real‑time and platform tools, full‑business coverage, multi‑dimensional analysis models, operational rollout, and future plans.

Youzan Coder

Background

In H1 2020 Youzan started a data‑center offline cost governance project and, in H2, expanded it to a global cost‑billing system that covers offline, real‑time, and platform‑tool workloads to improve cost transparency for the platform and its users.

Goals

Support multiple cost types (offline, real‑time, platform tools)

Cover all business lines and domains

Provide multi‑dimensional analysis methods

Enable operation across all channels

1. Multi‑type Cost Support

1.1 Cost Classification

Three service categories are defined:

Offline computation (e.g., Spark, Hive)

Real‑time computation (e.g., HBase, Druid, Kafka, Flink)

Platform tools (internal data‑R&D platform, asset‑governance platform, BI reporting platform)

1.2 Unified Data Collection SDK

A lightweight SDK collects hardware resources (CPU, memory, disk, NIC) and runtime metrics (task usage, client access, storage). It provides a unified data model, decouples collectors from the billing pipeline through Kafka, and validates records robustly, which avoids tight coupling and keeps maintenance costs down.
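A minimal sketch of what a unified record with validation and a decoupled hand-off could look like. The article does not publish the SDK's schema or API, so every name here (`UsageRecord`, its fields, `report`, `send`) is an assumption made for illustration. The `send` callable stands in for a Kafka producer so the example runs without a broker.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable

# Assumed category names, mirroring the three service classes in section 1.1.
VALID_CATEGORIES = {"offline", "realtime", "platform_tool"}


@dataclass(frozen=True)
class UsageRecord:
    """One resource-usage sample in a unified shape (field names are illustrative)."""
    category: str      # "offline", "realtime" or "platform_tool"
    component: str     # e.g. "flink", "hbase"
    task_id: str       # identifier of the task or client being billed
    cpu_cores: float   # CPU cores consumed
    memory_gb: float   # memory consumed, in GB
    disk_gb: float     # storage consumed, in GB
    timestamp: int     # Unix epoch seconds of the sample

    def validate(self) -> None:
        """Reject malformed records before they reach the billing pipeline."""
        if self.category not in VALID_CATEGORIES:
            raise ValueError(f"unknown category: {self.category!r}")
        if not self.component or not self.task_id:
            raise ValueError("component and task_id must be non-empty")
        for name in ("cpu_cores", "memory_gb", "disk_gb"):
            if getattr(self, name) < 0:
                raise ValueError(f"{name} must be non-negative")
        if self.timestamp <= 0:
            raise ValueError("timestamp must be a positive epoch value")


def report(record: UsageRecord, send: Callable[[bytes], None]) -> None:
    """Validate, serialize to JSON, and hand the bytes to a transport.

    `send` is a placeholder for a Kafka producer call; passing it in as a
    plain callable keeps collectors independent of the billing service.
    """
    record.validate()
    send(json.dumps(asdict(record), sort_keys=True).encode("utf-8"))


# Usage: collect messages in a list instead of publishing to a real topic.
outbox = []
report(UsageRecord("realtime", "flink", "job-42", 4.0, 16.0, 0.0, 1600000000), outbox.append)
```

Validating on the producer side means a malformed record fails at the collector that created it rather than deep inside the billing job.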

1.3 Cost Quantification

1.3.1 Offline Cost

The offline cost calculation follows the methodology described in a previous article (reference omitted for brevity).

1.3.2 Real‑time Cost

The quantification process consists of three steps:

Collect total cost and total resource pool.

Derive resource unit prices using scarcity weighting and target utilization.

Combine SDK‑collected consumption data to compute task‑level costs.

Example for Flink:

cpu_price = total_cost * cpu_index / (total_cpu * load)
memory_price = total_cost * memory_index / (total_memory * load)
task_cost = cpu * cpu_price + memory * memory_price
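The formulas above can be sketched in Python as follows. The interpretation of the symbols is an assumption based on step 2: `cpu_index` and `memory_index` are read as the scarcity weights, and `load` as the target utilization, expressed as a fraction between 0 and 1. All numbers are invented for illustration and are not Youzan figures.

```python
def unit_price(total_cost: float, index: float, total_capacity: float, load: float) -> float:
    """Unit price of one resource: total_cost * index / (total_capacity * load)."""
    # Assumption: load is a utilization fraction, so it must lie in (0, 1].
    if total_capacity <= 0 or not 0 < load <= 1:
        raise ValueError("capacity must be positive and load must be in (0, 1]")
    return total_cost * index / (total_capacity * load)


def flink_task_cost(cpu: float, memory: float, cpu_price: float, memory_price: float) -> float:
    """Task cost: consumed CPU and memory, each multiplied by its unit price."""
    return cpu * cpu_price + memory * memory_price


# Illustrative inputs (hypothetical values).
total_cost = 10000.0                      # total cost of the Flink resource pool
cpu_index, memory_index = 0.6, 0.4        # assumed scarcity weights
total_cpu, total_memory = 1000.0, 4000.0  # pool size: cores and GB
load = 0.8                                # assumed target utilization

cpu_price = unit_price(total_cost, cpu_index, total_cpu, load)
memory_price = unit_price(total_cost, memory_index, total_memory, load)

# A task that uses 4 cores and 16 GB of memory.
task_cost = flink_task_cost(4, 16, cpu_price, memory_price)
```

Dividing by `total_capacity * load` rather than by the full capacity spreads the pool's cost over the capacity expected to be in use. If the weights sum to 1 and the pool runs exactly at the target utilization, the task costs add up to `total_cost`.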

1.3.3 Platform Tool Cost

Because platform‑tool cost is relatively low, it is priced simply by machine model. All figures shown are anonymized and are meant to convey trends rather than absolute values.

2. Full‑Business Coverage

Business domains are organized into three layers, which raises billing coverage to over 90%:

Business lines – external products.

Generic business domains – data‑warehouse middle layers.

Platform/module domains – internal platform components.

2.1 Cost Allocation

Costs are split into shared (allocation) and exclusive (dedicated) portions. Allocation ratios are derived from daily order volume. A two‑stage allocation distributes costs to business lines, business domains, and platform tools.
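A sketch of proportional allocation under stated assumptions. The article says ratios come from daily order volume and that allocation runs in two stages, but it does not spell out what each stage does. Here stage 1 is assumed to spread the shared cost across business lines by order volume, and stage 2 to spread each line's total across its domains using hypothetical weights. All names and numbers are invented.

```python
from typing import Dict


def split_by_weight(amount: float, weights: Dict[str, float]) -> Dict[str, float]:
    """Split `amount` across keys in proportion to their weights."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {key: amount * weight / total for key, weight in weights.items()}


# Illustrative inputs (hypothetical business lines and values).
shared_cost = 900.0                                     # cost of shared resources
exclusive_cost = {"retail": 100.0, "education": 50.0}   # dedicated resources per line
daily_orders = {"retail": 8000, "education": 2000}      # allocation basis

# Stage 1 (assumed): shared cost goes to business lines by daily order volume,
# then each line's exclusive cost is added on top.
stage1 = split_by_weight(shared_cost, daily_orders)
line_cost = {line: exclusive_cost.get(line, 0.0) + share for line, share in stage1.items()}

# Stage 2 (assumed): each line's cost is spread across its business domains.
# The weights here are placeholders; the article does not say what drives them.
domain_weights = {
    "retail": {"trade": 3, "member": 1},
    "education": {"course": 1},
}
domain_cost = {
    line: split_by_weight(cost, domain_weights[line]) for line, cost in line_cost.items()
}
```

Because each stage only redistributes what it receives, the domain-level costs sum back to the shared plus exclusive total, which makes the allocation easy to reconcile.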

3. Multi‑dimensional Analysis Capabilities

A data model ties dimensions (business line, department, resource type) to metrics (incurred cost, reducible cost, reduced cost), forming a typical OLAP scenario. Spark Cube is used to generate multi‑dimensional datasets that support dynamic queries.
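To illustrate what a cube over these dimensions contains, here is a small pure-Python stand-in. It is not the Spark job described above. It pre-aggregates one metric for every combination of dimensions, so any slice can be answered with a lookup. The row layout, the `"*"` wildcard, and the sample values are assumptions.

```python
from collections import defaultdict
from itertools import combinations

DIMENSIONS = ("business_line", "department", "resource_type")
ALL = "*"  # marks a dimension that has been rolled up


def build_cube(rows, dimensions=DIMENSIONS, metric="incurred_cost"):
    """Sum `metric` for every subset of dimensions, as a CUBE aggregation would."""
    cube = defaultdict(float)
    for row in rows:
        # Each row contributes to 2^n cells: one per subset of kept dimensions.
        for size in range(len(dimensions) + 1):
            for kept in combinations(dimensions, size):
                key = tuple(row[d] if d in kept else ALL for d in dimensions)
                cube[key] += row[metric]
    return dict(cube)


# Illustrative task-level cost rows (hypothetical values).
rows = [
    {"business_line": "retail", "department": "data", "resource_type": "cpu", "incurred_cost": 30.0},
    {"business_line": "retail", "department": "data", "resource_type": "memory", "incurred_cost": 20.0},
    {"business_line": "education", "department": "bi", "resource_type": "cpu", "incurred_cost": 10.0},
]
cube = build_cube(rows)

global_total = cube[(ALL, ALL, ALL)]      # all dimensions rolled up
retail_total = cube[("retail", ALL, ALL)]  # sliced by business line only
cpu_total = cube[(ALL, ALL, "cpu")]        # sliced by resource type only
```

Precomputing every combination trades storage for query speed, which suits dashboards where users switch dimensions freely.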

3.1 Interaction Design

Long‑term trend analysis.

Drill‑down analysis to task‑level cost.

Cascade analysis where selecting an anomaly updates related visualizations.

3.2 Views

The final billing view offers five perspectives (global, business line, business domain, department, individual) and supports dynamic trends, drill‑down, and cascade analysis.

4. Operational Rollout

A technical‑operations approach was used to drive adoption, including product launch videos, online/offline promotion channels, and targeted reminders (e.g., pop‑up prompts for partition lifecycle configuration). These actions increased click‑through rates and overall participation in cost‑saving actions.

5. Phase Summary

The team delivered:

A unified SDK for data collection.

A flexible data model for cost quantification across offline, real‑time, and platform‑tool workloads.

A multi‑dimensional query service supporting five analysis views and three analysis methods.

6. Future Plans

Extend coverage to all data‑center components as the ecosystem evolves.

Introduce intelligent cost anomaly detection and personalized recommendations.

Systematize cost‑operation processes and share the model with other data‑governance teams.

Deepen cost‑saving initiatives for real‑time and platform‑tool workloads.

