Youzan Data Governance: Quality Assurance, Cost Management, and Operational Practices

This article explains Youzan's data governance framework, covering the definition of data governance, the company's asset‑centric approach, quantitative quality scoring, cost‑based pricing formulas, billing and allocation mechanisms, continuous operational improvements, and the measurable outcomes achieved.

DataFunTalk

Mar 6, 2021

1. What is Data Governance?

Data governance refers to the systematic management of large‑scale data generated by complex business scenarios, focusing on data quality, stability, accuracy, lifecycle control, and cost reduction.

2. How Youzan Implements Data Governance

• Data Assetization: Collect and manage data, monitor quality, and conduct security audits, treating all data artifacts as assets.

• Data Quantification & Operations: Evaluate asset and security levels, calculate quality scores and costs, and provide personal dashboards where users can view their data assets.

• Value Extraction: Use data maps to discover valuable data, and enable key‑path analysis, one‑click notifications, and industry insights.

Youzan is currently at the quantification and operations stage, where data quality and cost directly affect business applications.

3. Quality Assurance System

3.1 Data Quality – In the narrow sense, data quality means content accuracy. In the broad sense, it covers accuracy, conformity, timeliness, and acceptance. A product called "Quality Score" aggregates these dimensions.

3.2 Content Quality Checks – Checks are tightly coupled with tasks: after each task finishes, predefined checks (e.g., volume fluctuation, uniqueness) and custom checks (non‑null, range, SQL) run automatically. Abnormal results trigger alerts via email, WeChat, or phone, and may block downstream tasks.
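The post-task checks described above can be sketched roughly as follows. This is a minimal illustration, not Youzan's implementation: the function names, the `blocking` flag, the 30% fluctuation threshold, and the `order_id`/`amount` columns are all assumptions made for the example, and custom SQL checks are left out.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class CheckResult:
    name: str
    passed: bool
    blocking: bool  # hypothetical flag: a failed blocking check halts downstream tasks


def volume_within_bounds(today: int, yesterday: int, max_change: float = 0.3) -> bool:
    """Predefined check: row count changed by at most max_change (relative)."""
    if yesterday == 0:
        return today == 0
    return abs(today - yesterday) / yesterday <= max_change


def is_unique(rows: List[Dict[str, Any]], column: str) -> bool:
    """Predefined check: no duplicate values in the column."""
    values = [row[column] for row in rows]
    return len(values) == len(set(values))


def is_non_null(rows: List[Dict[str, Any]], column: str) -> bool:
    """Custom check: every row has a value for the column."""
    return all(row.get(column) is not None for row in rows)


def in_range(rows: List[Dict[str, Any]], column: str, low: float, high: float) -> bool:
    """Custom check: non-null values fall inside [low, high]."""
    return all(low <= row[column] <= high
               for row in rows if row.get(column) is not None)


def run_checks(rows: List[Dict[str, Any]], yesterday_count: int) -> List[CheckResult]:
    """Run all checks after a task has produced its output rows."""
    return [
        CheckResult("volume_fluctuation",
                    volume_within_bounds(len(rows), yesterday_count), blocking=False),
        CheckResult("order_id_unique", is_unique(rows, "order_id"), blocking=True),
        CheckResult("amount_non_null", is_non_null(rows, "amount"), blocking=True),
        CheckResult("amount_range", in_range(rows, "amount", 0, 1_000_000), blocking=False),
    ]


def should_block_downstream(results: List[CheckResult]) -> bool:
    """Downstream tasks are held back only when a blocking check fails."""
    return any(r.blocking and not r.passed for r in results)
```

In a real scheduler, every failed result would also be routed to an alert channel; the split between blocking and non-blocking checks here simply mirrors the statement that abnormal results "may" block downstream tasks.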

3.3 Quality Score – The score combines conformity, acceptance, accuracy, and timeliness. Each factor has a weight, and the total score is the weighted sum of the factor scores. Weights are rescaled when needed: for example, if the weights highlighted in the orange box of the original figure sum to 50, they are normalized so that they sum to 100.
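As a rough sketch of the weighted sum with normalization, assume each factor is scored on a 0 to 100 scale and that only the factors that apply to a given table are scored. The weight values below are hypothetical; the article does not publish the real ones.

```python
from typing import Dict

# Hypothetical weights, chosen only for illustration.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "conformity": 20,
    "acceptance": 20,
    "accuracy": 30,
    "timeliness": 30,
}


def quality_score(factor_scores: Dict[str, float],
                  weights: Dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted sum of factor scores, with applicable weights rescaled to total 100%."""
    applicable = {name: w for name, w in weights.items() if name in factor_scores}
    total_weight = sum(applicable.values())
    if total_weight <= 0:
        raise ValueError("no applicable factor with a positive weight")
    # Dividing by total_weight is the normalization step: weights that sum
    # to 50 contribute exactly as if they had been doubled to sum to 100.
    return sum(factor_scores[name] * w / total_weight
               for name, w in applicable.items())
```

With these assumed weights, a table scored only on accuracy (90) and conformity (80) has applicable weights of 30 + 20 = 50. After rescaling they act as 60% and 40%, which gives 0.6 × 90 + 0.4 × 80 = 86.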

3.4 Improvement Measures

• Prevention: DDL entry restrictions, deadline warnings, static checks.

• Anomaly Detection: Alerts for task timeouts or validation failures.

• Quality Dashboard: Visualizes scores to drive awareness and optimization.

• Operational Promotion: Incentives, feedback loops, and continuous monitoring encourage users to improve quality.

4. Cost Reduction Mechanism

4.1 Cost Quantification – Assign unit prices to resources (CPU, memory, disk) based on total cost, resource weight, and load. Example formulas:

cpu_price = total_cost * cpu_weight / (total_cpu * cpu_load)

memory_price = total_cost * memory_weight / (total_memory * memory_load)

task_cost = cpu_price * used_cpu + memory_price * used_memory + disk_price * used_disk

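Resource pricing takes into account total cost, the scarcity of each resource, and a reasonable utilization level (e.g., 80% CPU load). Dividing by capacity at that target load, rather than at full capacity, spreads the total cost over the amount of each resource that is realistically used.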
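The formulas above translate directly into a few lines of Python. This is a sketch under stated assumptions: the article gives no formula for `disk_price`, so the code assumes it has the same shape as the CPU and memory prices, and every number in the usage note is made up for illustration.

```python
def unit_price(total_cost: float, weight: float,
               capacity: float, target_load: float) -> float:
    """price = total_cost * weight / (capacity * target_load)."""
    if capacity <= 0 or not 0 < target_load <= 1:
        raise ValueError("capacity must be positive and target_load in (0, 1]")
    return total_cost * weight / (capacity * target_load)


def task_cost(cpu_price: float, used_cpu: float,
              memory_price: float, used_memory: float,
              disk_price: float, used_disk: float) -> float:
    """task_cost = cpu_price*used_cpu + memory_price*used_memory + disk_price*used_disk."""
    return (cpu_price * used_cpu
            + memory_price * used_memory
            + disk_price * used_disk)


def price_table(total_cost: float) -> dict:
    """Build unit prices from hypothetical weights, capacities, and target loads."""
    return {
        "cpu": unit_price(total_cost, weight=0.5, capacity=1000, target_load=0.8),
        "memory": unit_price(total_cost, weight=0.3, capacity=4000, target_load=0.75),
        # Assumption: disk is priced with the same formula as CPU and memory.
        "disk": unit_price(total_cost, weight=0.2, capacity=10000, target_load=0.5),
    }
```

For a hypothetical total cost of 100,000, `price_table` yields unit prices of 62.5 (CPU), 10 (memory), and 4 (disk) per metered unit. A task that used 2 CPU units, 8 memory units, and 1 disk unit would then cost 2 × 62.5 + 8 × 10 + 1 × 4 = 209, up to floating-point rounding. Lowering a target load raises that resource's unit price, because the same cost is spread over fewer usable units.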

4.2 Cost Billing – Provides multi‑view analysis (personal, department, business line, global) with trend charts, drill‑down, and cost distribution breakdowns. Supports cost‑saving suggestions such as task de‑duplication, delayed start, and schedule optimization.
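One way to picture the multi-view bill is as the same per-task cost records rolled up along different keys. The sketch below assumes each record carries `owner`, `department`, and `business_line` fields; these field names are hypothetical and not taken from Youzan's billing product.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def roll_up(task_costs: Iterable[Mapping], key: str) -> Dict[str, float]:
    """Sum task costs per value of `key` (one 'view' of the bill)."""
    totals: Dict[str, float] = defaultdict(float)
    for record in task_costs:
        totals[record[key]] += record["cost"]
    return dict(totals)


def bill_views(task_costs: Iterable[Mapping]) -> Dict[str, object]:
    """Personal, department, business-line, and global views of the same records."""
    records = list(task_costs)  # materialize once so every view sees all rows
    return {
        "personal": roll_up(records, "owner"),
        "department": roll_up(records, "department"),
        "business_line": roll_up(records, "business_line"),
        "global": sum(r["cost"] for r in records),
    }
```

Drilling down then amounts to filtering the records by one view's key (for example, one department) and rolling up again by a finer key such as `owner`.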

4.3 Cost Allocation – Splits costs between exclusive (dedicated clusters) and shared (data‑warehouse middle layer) components, allocating them to business lines based on order volume or usage ratios.
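The exclusive/shared split can be sketched as a proportional allocation. In this illustration the shared cost is divided by order volume, exclusive costs are charged directly, and the business-line names and figures are invented for the example.

```python
from typing import Dict


def allocate_shared(shared_cost: float, basis: Dict[str, float]) -> Dict[str, float]:
    """Split a shared cost in proportion to a basis such as order volume or usage."""
    total = sum(basis.values())
    if total <= 0:
        raise ValueError("allocation basis must sum to a positive value")
    return {line: shared_cost * value / total for line, value in basis.items()}


def cost_per_business_line(exclusive: Dict[str, float],
                           shared_cost: float,
                           basis: Dict[str, float]) -> Dict[str, float]:
    """Exclusive costs are charged directly; shared costs are added pro rata."""
    shared = allocate_shared(shared_cost, basis)
    lines = set(exclusive) | set(shared)
    return {line: exclusive.get(line, 0.0) + shared.get(line, 0.0) for line in lines}
```

For example, with exclusive costs of 5,000 (retail) and 2,000 (beauty), a shared middle-layer cost of 9,000, and order volumes of 600, 300, and 100 for retail, beauty, and education, the shared part splits 5,400 / 2,700 / 900. The totals are then 10,400, 4,700, and 900, which together equal the 16,000 that was spent.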

4.4 Continuous Operations – Reinforces cost awareness through posters, videos, reward mechanisms (e.g., internal "Youzan Coins"), feedback channels, and regular tracking of cost‑saving actions.

5. Operational Results

After six months, over 32% of cost‑bearing users took cost‑reduction actions, with more than 1,400 actions recorded and a 38% self‑service rate. The initiative cleared >2 PB of data, decommissioned >300 tasks, and saved >3 million RMB annually.

6. Summary & Outlook

Data governance revolves around three pillars: quantification, product support, and operation. Future directions include expanding quality coverage to real‑time data (Kafka, HBase), improving service reliability, deepening business‑level quality, achieving full‑cost visibility for upstream MySQL tables, and strengthening operational investment to sustain high quality at low cost.

Final motto: "High quality, low cost, making data more valuable."
