Youzan’s Blueprint: Data Governance, Quality Scoring, and Cost Reduction for AI
At Youzan, data governance turns massive data assets into AI-ready resources through systematic data assetization, quantitative quality scoring, cost measurement, and targeted operational tactics, enabling precise quality monitoring, cost allocation, and continuous improvement that drive both data value and cost efficiency.
1. Data Governance Overview
Data governance at Youzan treats all collected data—whether from systems or manual input—as valuable assets that must be managed for quality, stability, accuracy, lifecycle control, and cost reduction. The goal is to turn these assets into AI‑ready resources.
Data: large-scale data generated in complex business scenarios.
Govern: focus on data quality, stability, and lifecycle cost control.
Manage: clarify data origins, destinations, and usage.
The governance framework includes three core practices:
Data assetization: collecting, managing, and monitoring data quality and security to treat data as assets.
Quantification and operation: assigning asset and security levels, calculating quality scores and costs, and providing personal dashboards for data owners.
Value exploitation: using data maps to discover valuable data, perform key‑path analysis, notifications, and industry insights.
2. Quality Assurance System
What is data quality? It covers not only content accuracy in the narrow sense but also broader dimensions: accuracy, conformity, timeliness, and acceptance. Youzan built a product called “Quality Score” to evaluate these dimensions.
Content quality verification is tightly coupled with tasks that generate data. Verification includes predefined checks (automatic, e.g., data volume fluctuations, file uniqueness) and custom checks (configured by data owners, e.g., non‑null, range, custom SQL).
Verification results trigger different actions: normal results allow downstream tasks; acceptable anomalies send email and WeChat alerts; unacceptable anomalies block downstream execution and trigger phone alerts.
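A minimal sketch of how such a check runner might dispatch the three outcomes. The rule names, thresholds, and statistics dictionary are invented for illustration and are not Youzan's actual implementation:

```python
from dataclasses import dataclass
from typing import Callable

# Severity of a failed check decides the follow-up action
# (mirroring the three outcomes described above).
NORMAL, ACCEPTABLE, BLOCKING = "normal", "acceptable", "blocking"

@dataclass
class Check:
    name: str
    predicate: Callable[[dict], bool]  # True = data passed
    severity: str                      # action if the check fails

def run_checks(stats: dict, checks: list[Check]) -> str:
    """Evaluate all checks against table statistics; return the worst outcome."""
    outcome = NORMAL
    for check in checks:
        if check.predicate(stats):
            continue
        if check.severity == BLOCKING:
            return BLOCKING        # block downstream tasks, phone alert
        outcome = ACCEPTABLE       # email/WeChat alert, downstream continues
    return outcome

# Example rules: a predefined volume-fluctuation check and a custom non-null check.
checks = [
    Check("row_count_fluctuation",
          lambda s: abs(s["rows"] - s["rows_yesterday"]) / s["rows_yesterday"] < 0.5,
          ACCEPTABLE),
    Check("order_id_not_null",
          lambda s: s["null_order_ids"] == 0,
          BLOCKING),
]

stats = {"rows": 980_000, "rows_yesterday": 1_000_000, "null_order_ids": 0}
print(run_checks(stats, checks))  # -> "normal": downstream tasks may run
```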
Quality Score combines four weighted dimensions (conformity, acceptance, accuracy, and timeliness) summed into an overall score, with the weights adjusted so they total 100%.
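A toy calculation of the weighted sum. The specific weights and dimension scores below are assumptions, since the article does not publish Youzan's weighting:

```python
# Hypothetical dimension scores on a 0-100 scale and weights summing to 1.0.
weights = {"conformity": 0.25, "acceptance": 0.25, "accuracy": 0.30, "timeliness": 0.20}
scores  = {"conformity": 90,   "acceptance": 80,   "accuracy": 95,   "timeliness": 70}

assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights must total 100%

quality_score = sum(weights[d] * scores[d] for d in weights)
print(f"quality score: {quality_score:.1f}")  # -> 85.0
```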
Improvement measures include:
Prevention: DDL entry restrictions, deadline warnings, and static checks before changes (sketched after this list).
Anomaly detection: alerts for task timeouts or validation failures.
Quality dashboard: visualizes scores, prompts optimizations, and encourages usage.
Optimization rollout: operational actions beyond awareness, such as targeted improvements and continuous monitoring.
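To illustrate the “static checks before changes” idea, here is a hypothetical DDL linter; the two rules (required COMMENTs and a declared lifecycle) are plausible examples, not Youzan's actual rule set:

```python
import re

def lint_ddl(ddl: str) -> list[str]:
    """Toy static checks run before a DDL change is accepted."""
    problems = []
    if "COMMENT" not in ddl.upper():
        problems.append("table/columns lack COMMENTs")
    # Assumes a warehouse dialect with a LIFECYCLE (retention) clause.
    if not re.search(r"LIFECYCLE\s+\d+", ddl, re.IGNORECASE):
        problems.append("no lifecycle (retention) declared")
    return problems

ddl = "CREATE TABLE dw.orders (id BIGINT) COMMENT 'orders fact'"
print(lint_ddl(ddl))  # -> ['no lifecycle (retention) declared']
```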
Impact: After applying the quality score, synonym-table duplication dropped by 99%, business-attribution and ownership coverage rates rose above 95%, failure-prone rules decreased by over 95%, and the overall failure rate fell from 11% to 1.25%.
3. Cost Reduction Operations
Asset cost quantification distributes infrastructure costs (CPU, memory, disk) to individual tables based on resource consumption. Resources are priced according to total cost, scarcity, and a reasonable utilization factor (e.g., 80% CPU load).
Key formulas:
cpu_price = total_cost * cpu_weight / (total_cpu * cpu_load)
memory_price = total_cost * memory_weight / (total_memory * memory_load)
cost = cpu_price * use_cpu + memory_price * use_memory + disk_price * use_disk

Cost billing provides three capabilities:
Analysis: multi-dimensional views (personal, department, business domain, line, global), drill-down, trend analysis.
Cost reduction: identifies data to decommission, suggests delayed start for low-priority tasks, optimizes scheduling frequency, reduces data skew, and curtails unnecessary data volume.
Business line allocation: splits costs between exclusive (dedicated to a single business) and shared (e.g., data-warehouse middle layer) to make business units aware of their consumption.
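A minimal sketch tying the pricing formulas above together. The cluster totals and cost weights are invented placeholders; the 0.8 load factors follow the 80% “reasonable utilization” example:

```python
# Invented cluster-level numbers for illustration only.
TOTAL_COST = 100_000.0                                 # monthly bill (RMB)
CPU_WEIGHT, MEM_WEIGHT, DISK_WEIGHT = 0.5, 0.3, 0.2    # assumed cost split
TOTAL_CPU, TOTAL_MEM, TOTAL_DISK = 4_000, 16_000, 500_000  # cores, GB, GB
CPU_LOAD = MEM_LOAD = DISK_LOAD = 0.8                  # reasonable utilization

cpu_price  = TOTAL_COST * CPU_WEIGHT  / (TOTAL_CPU  * CPU_LOAD)
mem_price  = TOTAL_COST * MEM_WEIGHT  / (TOTAL_MEM  * MEM_LOAD)
disk_price = TOTAL_COST * DISK_WEIGHT / (TOTAL_DISK * DISK_LOAD)

def table_cost(use_cpu: float, use_mem: float, use_disk: float) -> float:
    """Cost attributed to one table from its measured resource use."""
    return cpu_price * use_cpu + mem_price * use_mem + disk_price * use_disk

print(f"{table_cost(use_cpu=10, use_mem=64, use_disk=2_000):.2f} RMB")  # -> 406.25 RMB
```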
Cost sharing follows a three‑step process: set default sharing ratios based on order volume, adjust with platform‑level sharing, then incorporate exclusive tool costs to compute final allocation.
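A sketch of the three-step sharing process under stated assumptions; the order volumes, the 10% platform-level share, and the exclusive tool costs are all made up:

```python
# Hypothetical numbers throughout; the three steps mirror the process above.
shared_cost = 50_000.0                 # e.g., data-warehouse middle layer (RMB)
order_volume = {"retail": 6_000_000, "beauty": 3_000_000, "edu": 1_000_000}

# Step 1: default sharing ratios proportional to order volume.
total_orders = sum(order_volume.values())
ratios = {line: v / total_orders for line, v in order_volume.items()}

# Step 2: platform-level adjustment (assumed: the platform absorbs 10% first).
platform_share = 0.10
allocatable = shared_cost * (1 - platform_share)

# Step 3: add each line's exclusive tool costs to its shared slice.
exclusive = {"retail": 8_000.0, "beauty": 2_500.0, "edu": 500.0}
allocation = {line: allocatable * ratios[line] + exclusive[line] for line in ratios}

for line, cost in allocation.items():
    print(f"{line}: {cost:,.2f} RMB")
# -> retail: 35,000.00 RMB / beauty: 16,000.00 RMB / edu: 5,000.00 RMB
```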
Continuous operation ensures the governance system stays active through:
Awareness campaigns (posters, short videos) to promote cost‑saving practices.
Incentive mechanisms (internal “Youzan coins” rewards).
Feedback loops (suggestion boxes, Q&A groups, monitoring dashboards).
Results after six months: over 32% of cost‑bearing users took cost‑saving actions, with 38% of those actions being self‑initiated; more than 2 PB of data and 300 tasks were cleaned up, saving over 3 million RMB annually.
4. Summary and Outlook
Quantification, product support, and operation are the three “horses” pulling data governance for quality and cost. Quantification makes quality and cost visible; products enable analysis and cost-saving actions; operations sustain the loop, turning data into high-value, low-cost assets for AI.
Future directions include expanding quality coverage beyond offline tables to real‑time streams (Kafka) and online stores (HBase), improving service reliability and latency, and extending cost quantification to upstream MySQL tables for finer‑grained sharing.
Key takeaways: high quality, low cost, and continuous operation are essential to unlock data value for AI applications.