
Youzan Data Governance System Overview

Youzan’s data governance system treats petabyte‑scale data as a strategic asset. It combines comprehensive collection, standardized management, lineage tracing, quality monitoring, and security auditing; quantifies asset value through grading and dashboards; and delivers actionable services such as impact analysis and one‑click notifications, while still facing challenges in onboarding cost and value evaluation.

Youzan Coder

1. Background Introduction

The concept of big data has been around for more than a decade, spawning countless theories, technologies, and practices. As the industry has grown rapidly, demand for "data governance" has risen sharply. In 2019, China's Ministry of Industry and Information Technology emphasized strengthening data governance as part of the national big data development strategy. In an AI‑driven future, effective data governance is what ensures large‑scale data can be managed with high quality and efficiency.

Data governance, in simple terms, means "taking good care of data": improving data quality, stability, accuracy, and lifecycle control while reducing cost, and organizing the basic information, status, and relationships of data generated in complex business scenarios.

Youzan, after seven years of SaaS services, has accumulated petabytes of data across industries and is increasingly focusing on data governance to unlock data value and support business growth.

2. Youzan Data Governance System

The system follows three pillars: data assetization, asset quantification & operation, and value realization.

2.1 Data Assetization

Unmanaged data is wasteful. Youzan treats data as an asset, asking: who are you, where do you come from, where are you going, and what is your purpose? Data is often transformed into different media to meet diverse business needs.

Typical scenario: user behavior analysis via event logs passes through multiple systems and tasks, generating various data types. Governance starts by identifying what data exists, who owns it, how it is produced, and whether it is effectively used.

2.1.1 Data Collection

Data collection aims for completeness – capturing all types and volumes of data. Two methods are used: scheduled interface pulls (non‑intrusive, lower timeliness) and SDK reporting (higher timeliness, requires correct usage). Both require generic, extensible interfaces.

All data types are abstracted as "tables" with common fields (name, size, owner, etc.) and extensible JSON fields. Additional tables store common metadata such as production time, latency, and change logs.
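The "table" abstraction above can be sketched as a record with common fields plus an extensible JSON blob for type-specific attributes. This is a minimal illustration, not Youzan's actual schema; field names like `ext` are assumptions.

```python
from dataclasses import dataclass, field
import json

@dataclass
class TableAsset:
    """Common fields shared by every collected data type (the 'table' abstraction)."""
    name: str
    owner: str
    size_bytes: int
    # Type-specific attributes go into an extensible JSON blob, so Hive tables,
    # Kafka topics, ES indices, etc. can all fit one storage schema.
    ext: dict = field(default_factory=dict)

    def to_record(self) -> dict:
        """Flatten into the row shape a metadata store could persist."""
        return {
            "name": self.name,
            "owner": self.owner,
            "size_bytes": self.size_bytes,
            "ext_json": json.dumps(self.ext, sort_keys=True),
        }

t = TableAsset("dw.orders", "alice", 1 << 30, ext={"format": "orc", "partitions": 365})
record = t.to_record()
```

Keeping the variable part as serialized JSON lets one relational table hold every asset type without schema migrations for each new type.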

2.1.2 Data Management

Management includes standardized naming, business‑domain association, tag management, monitoring of large tables and volatility, and unified indexing in Elasticsearch for search by type, name, comment, domain, tags, etc. Features like "follow/collect" and heat‑map statistics aid data usage.
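The unified search described above can be approximated with an inverted index over names, types, domains, tags, and comments. The production system ships this metadata to Elasticsearch; the dict-based stand-in below is only a sketch of the indexing idea.

```python
from collections import defaultdict

class AssetIndex:
    """Toy unified metadata index; Elasticsearch plays this role in production."""
    def __init__(self):
        self._by_token = defaultdict(set)   # token -> asset names
        self._assets = {}

    def add(self, name, *, type_, domain, tags, comment=""):
        self._assets[name] = {"type": type_, "domain": domain,
                              "tags": tags, "comment": comment}
        # Index every searchable facet: type, domain, tags, name parts, comment words.
        tokens = {type_, domain, *tags, *name.split("."), *comment.split()}
        for tok in tokens:
            self._by_token[tok.lower()].add(name)

    def search(self, query):
        """Exact-token lookup; a real engine adds analyzers, ranking, fuzziness."""
        return sorted(self._by_token.get(query.lower(), set()))

idx = AssetIndex()
idx.add("dw.orders", type_="hive", domain="trade", tags=["core"],
        comment="daily order facts")
idx.add("rt.clicks", type_="kafka", domain="traffic", tags=["event"])
```

A query such as `idx.search("trade")` then resolves to every asset tagged with that business domain, which is exactly the search-by-facet behavior the text describes.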

2.1.3 Lineage Flow

Beyond individual data points, lineage connects data production and consumption across tasks and applications. This is divided into lineage collection and application management.

Lineage collection (tables, fields, tasks) – automatic parsing for HQL tasks and manual specification for scripts, Flink jobs, etc.

Application management – currently manual, under exploration for more efficient solutions.
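At its core, lineage is a directed graph from producing assets to consuming assets, annotated with the task that moves the data. A minimal store might look like the sketch below; the table and task names are hypothetical, and in practice the edges come from parsed HQL plus manual declarations.

```python
from collections import defaultdict

class LineageGraph:
    """Minimal table-level lineage store: edges point producer -> consumer."""
    def __init__(self):
        self.downstream = defaultdict(set)  # src -> {(dst, task)}
        self.upstream = defaultdict(set)    # dst -> {(src, task)}

    def add_edge(self, src, dst, via_task):
        """Record that `via_task` reads `src` and writes `dst`."""
        self.downstream[src].add((dst, via_task))
        self.upstream[dst].add((src, via_task))

g = LineageGraph()
g.add_edge("ods.order_log", "dw.orders", via_task="hql_daily_orders")
g.add_edge("dw.orders", "ads.gmv_report", via_task="hql_gmv")
```

Storing both directions makes "where does this table come from" and "who consumes this table" equally cheap, which later powers impact and value analysis.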

2.1.4 Quality Monitoring

Data quality is the primary value indicator. Youzan provides table‑level and field‑level validation (e.g., uniqueness of order IDs, statistical bounds on user behavior) with alerting. System‑wide checks include volatility monitoring.

Quality assessment covers accuracy (validation failures), timeliness (production latency), conformity (naming standards), and adoption (usage, follows). Continuous improvement is driven by targeted projects, incentives, and dashboards.
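Two of the checks mentioned above can be sketched in a few lines: a field-level uniqueness rule and a table-level volatility bound against recent history. Thresholds and rule shapes are illustrative assumptions, not Youzan's actual rule engine.

```python
import statistics

def check_unique(rows, key):
    """Field-level rule: values of `key` must be unique (e.g. order IDs)."""
    vals = [r[key] for r in rows]
    return len(vals) == len(set(vals))

def check_bounds(value, history, sigmas=3.0):
    """Table-level rule: today's metric must stay within `sigmas` standard
    deviations of its historical mean (simple volatility monitoring)."""
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    return abs(value - mean) <= sigmas * sd

rows = [{"order_id": 1}, {"order_id": 2}, {"order_id": 2}]
dup_ok = check_unique(rows, "order_id")            # duplicate order ID -> fails
vol_ok = check_bounds(1020, [1000, 990, 1010, 1005])
```

A failed rule would feed the alerting channel, and the pass/fail history rolls up into the accuracy dimension of the quality score.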

2.1.5 Security Audit

Security is treated as the lifeline of data. Measures include:

Sensitive data identification (automatic + manual labeling).

Permission control at table and field levels, with masking for sensitive fields.

Operation audit logs with query tools.

Cross‑cluster data backup.

Strict data export authorization and internal review processes.
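Field-level masking, the second measure in the list, can be sketched as a policy that rewrites sensitive columns for readers without the needed permission. The masking pattern and field names below are illustrative assumptions.

```python
def mask_phone(value: str) -> str:
    """Keep the first 3 and last 4 digits, mask the middle."""
    if len(value) < 7:
        return "*" * len(value)
    return value[:3] + "*" * (len(value) - 7) + value[-4:]

def apply_field_policy(row, sensitive_fields, has_permission):
    """Return the row untouched for authorized users, masked otherwise."""
    if has_permission:
        return dict(row)
    return {k: (mask_phone(v) if k in sensitive_fields else v)
            for k, v in row.items()}

row = {"user": "alice", "phone": "13812345678"}
masked = apply_field_policy(row, {"phone"}, has_permission=False)
```

The same hook point is where an audit log entry would be written, tying masking and the operation audit together.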

2.2 Asset Quantification & Operation

After centralizing assets, their status and value can be quantified similar to a balance sheet. Metrics include asset grading (importance‑based), security grading (confidentiality levels), quality scores, and dashboards for quality, security, and cost.

Personal workbenches provide individual views of asset health, cost, security, and quality.
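A quality score like the one feeding these dashboards is typically a weighted sum over the assessment dimensions named earlier (accuracy, timeliness, conformity, adoption). The weights below are illustrative, not Youzan's actual formula.

```python
def asset_score(metrics, weights=None):
    """Weighted 0-100 score over quality dimensions; weights are assumptions."""
    weights = weights or {"accuracy": 0.4, "timeliness": 0.3,
                          "conformity": 0.2, "adoption": 0.1}
    return round(sum(metrics[dim] * w for dim, w in weights.items()), 1)

score = asset_score({"accuracy": 95, "timeliness": 80,
                     "conformity": 100, "adoption": 60})  # -> 88.0
```

Rolling these per-asset scores up by owner or domain yields exactly the personal workbench and dashboard views the text describes.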

2.3 Realizing Data Value

Data becomes valuable when transformed into actionable services. Youzan offers:

Data maps – visualizing end‑to‑end data flow.

Impact analysis – assessing upstream/downstream dependencies.

Value analysis – combining quality, usage, cost, and business importance.

Critical path analysis – identifying bottlenecks in complex data pipelines.

One‑click notifications – informing stakeholders of data migrations or deprecations.

Additional potentials include circular dependency checks, zombie data analysis, energy consumption analysis, regression testing, and industry insights.
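Two of these services fall straight out of the lineage graph: impact analysis is a downstream reachability search, and critical path analysis is the slowest cumulative path through the pipeline. The edges and latencies below are a hypothetical example.

```python
from collections import deque

# producer -> [(consumer, per-hop latency in minutes)]; illustrative pipeline.
EDGES = {
    "ods.order_log": [("dw.orders", 30)],
    "dw.orders": [("ads.gmv_report", 15), ("ads.user_ltv", 45)],
}

def impact_set(start):
    """Impact analysis: every asset downstream of `start` (BFS)."""
    seen, queue = set(), deque([start])
    while queue:
        for nxt, _ in EDGES.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def critical_latency(start):
    """Critical path analysis: slowest cumulative latency from `start`
    (safe here because lineage should be a DAG)."""
    return max((cost + critical_latency(nxt)
                for nxt, cost in EDGES.get(start, [])), default=0)

affected = impact_set("ods.order_log")
slowest = critical_latency("ods.order_log")   # 30 + 45
```

The same traversal drives one-click notifications: compute the impact set of a table being migrated, then message every downstream owner.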

3. Summary, Challenges & Plans

Achievements to date:

Collected and managed ~100,000 data types (tables) and ~100,000 lineage relationships.

Established basic security mechanisms covering ~10 sensitive data categories.

Implemented a quality evaluation system and improvement workflow.

Provided initial quantification dashboards (grading, quality, security, cost).

Delivered preliminary analysis services and tools, continuously expanding.

Key challenges ahead:

Reducing data onboarding cost (currently requires front‑end, back‑end, and data adaptation).

Objectively evaluating data value.

Fully exploiting data value.

4. Conclusion

Data governance is an emerging field gaining attention at both national and enterprise levels. Youzan continues to explore and advance its practices.

Recruitment notice: Youzan’s big‑data team is hiring for components, platform systems, data warehouses, data products, and algorithms. Send resumes to [email protected].

Extended Reading

Real‑time Computing Practice at Youzan – Efficiency Improvement

Youzan Data Warehouse Metadata System Practice

How We Redesigned the NSQ – Features and Future Plans

HBase Write Throughput Quantitative Analysis and Optimization

Youzan Big Data Platform Security Construction Practice

Flink Practice in Youzan Real‑time Computing

Big Data Development Platform (Data Platform) Best Practice at Youzan

Youzan Data Middle‑Platform Construction Practice

Written by Youzan Coder, the official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.