Metadata Governance and Collection in a Data Asset Platform
The platform implements comprehensive metadata governance by extracting, standardizing, and ingesting basic, trend, resource, lineage, and task metadata from offline and real-time systems through a Kafka-based SDK. Unified storage, monitoring, and alerting improve data asset visibility and quality, and lay the groundwork for future automation.
Data asset governance requires comprehensive metadata that describes data types, volumes, and the full lifecycle of data flow. Collecting metadata is the core foundation for managing data assets.
The early collection system focused on data warehouse metadata, connecting directly to Hive and MySQL tables via APIs. As business needs grew, collection expanded to cover the entire data chain, including offline and real-time platforms, internal tools, and task metadata. This expansion brought challenges: diverse data categories, numerous platform components, long collection cycles, and low integration efficiency.
What is Metadata?
Metadata is "data that describes data." For example, a photo’s EXIF information (filename, size, resolution, camera model, etc.) is metadata. In the asset governance platform, metadata for Hive tables includes table name, field list, owner, and scheduling information.
Collecting full‑link metadata helps answer questions like: What data exists? Who uses it? How much storage does it occupy? How can it be discovered? How does it flow? Lineage information enables impact analysis.
Types of Collected Metadata
Basic metadata: table name, description, fields, owner, business domain, cluster, project.
Trend data: size, row count, file count, partition count, job duration, production time.
Resource data: cluster throughput, QPS, CPU/memory consumption.
Lineage data: upstream/downstream table/field dependencies and task input‑output relationships.
Task data: offline/real‑time job name, owner, deadline alerts, scripts, configuration.
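Taken together, these categories suggest simple record shapes. The following is a purely illustrative sketch; the field names are assumptions drawn from the list above, not the platform's actual schema:

import java.util.List;
import lombok.Data;

// Illustrative only: one basic-metadata record per table.
@Data
public class BasicMeta {
    private String tableName;
    private String description;
    private List<String> fields;
    private String owner;
    private String businessDomain;
    private String cluster;
    private String project;
}

// Illustrative only: one trend snapshot per table per day.
@Data
public class TableTrend {
    private String tableName;
    private long sizeBytes;
    private long rowCount;
    private long fileCount;
    private long partitionCount;
    private long jobDurationMs;
    private String productionTime;
}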
To date, the platform covers over 10 data types and more than 100,000 basic metadata entries across offline components (Hive/MySQL), real‑time components (Flume, Kafka, HBase, Kylin, ES, Presto, Spark, Flink), and internal tools (BI reporting, metric library, OneService, QA systems).
Metadata Extraction
Extraction methods include:
Accessing the Hive Metastore to retrieve basic metadata stored in relational databases (sketched after this list).
Fetching component resource metrics from monitoring services (e.g., Kafka, KP, RP, RDS) and aggregating them.
Obtaining business metrics from internal platforms (KP, RP, RDS, DP) that expose APIs for tables, tasks, and resource usage.
Collecting lineage data from DP and RP platforms by parsing task configurations or using an ANTLR4‑based SQL parser.
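For the first of these, a minimal sketch using Hive's HiveMetaStoreClient might look like the following. The Metastore URI, database, and table names are placeholders; the source does not show the platform's actual collector:

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.FieldSchema;
import org.apache.hadoop.hive.metastore.api.Table;

public class MetastoreExtractor {
    public static void main(String[] args) throws Exception {
        HiveConf conf = new HiveConf();
        // Assumed Metastore thrift endpoint; replace with a real URI.
        conf.setVar(HiveConf.ConfVars.METASTOREURIS, "thrift://metastore-host:9083");

        HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
        try {
            // Basic metadata: table definition, owner, and field list.
            Table table = client.getTable("dw", "orders");
            System.out.println("owner: " + table.getOwner());
            for (FieldSchema field : table.getSd().getCols()) {
                System.out.println(field.getName() + " : " + field.getType());
            }
        } finally {
            client.close();
        }
    }
}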
Metadata Collection SDK Design
The SDK supports reporting basic, trend, and lineage metadata and consists of a client side and a server side.
Architecture
The client defines generic schemas (MetaSchema, TrendSchema, LineageSchema) and pushes data to Kafka via ReportService. The server consumes Kafka messages, authenticates each record using appId/appName/token signatures, and routes data through adapters to a unified ingestion service that writes to MySQL and Elasticsearch.
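A minimal sketch of the client-side flow, assuming JSON-serialized schema objects and a shared Kafka topic (the topic name, envelope format, and signing scheme are assumptions):

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ReportServiceSketch {
    private static final String TOPIC = "meta-report"; // assumed topic name

    private final KafkaProducer<String, String> producer;
    private final String appId;
    private final String appName;
    private final String token;

    public ReportServiceSketch(String brokers, String appId, String appName, String token) {
        Properties props = new Properties();
        props.put("bootstrap.servers", brokers);
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        this.producer = new KafkaProducer<>(props);
        this.appId = appId;
        this.appName = appName;
        this.token = token;
    }

    // Wraps a serialized schema payload with identity fields so the
    // server side can verify the appId/appName/token before ingesting.
    public void report(String schemaJson) {
        String envelope = String.format(
                "{\"appId\":\"%s\",\"appName\":\"%s\",\"token\":\"%s\",\"payload\":%s}",
                appId, appName, token, schemaJson);
        producer.send(new ProducerRecord<>(TOPIC, appId, envelope));
    }
}

In practice the token would more plausibly be used to sign the payload (e.g., an HMAC) rather than travel in the clear; on the server side, the signature check happens before the record is routed to an adapter.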
Common Models
Unified metadata models are derived from Hive’s table model and include:
Common metadata model: owner info, table basics, business domain, extensions.
Common trend model: table definition, trend metrics, extensions.
Common lineage model: nodes (tables or tasks) and edges (dependencies) with extensible JSON fields.
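The lineage classes shown next reference table and job node types. A minimal sketch of what those nodes might carry, based on the description above (field names are assumptions):

import lombok.Data;

// Hypothetical node types referenced by the lineage schemas below.
@Data
public class TableNode {
    private String cluster;     // e.g., a Hive or HBase cluster
    private String database;
    private String tableName;
    private String extParam;    // extensible JSON field
}

@Data
public class JobNode {
    private String jobName;
    private String owner;
    private String platform;    // e.g., an offline or real-time scheduling platform
    private String extParam;    // extensible JSON field
}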
The lineage model definitions in Java (Lombok's @Data generates getters, setters, equals, and hashCode):

import java.util.List;
import lombok.Data;

@Data
public class TableLineageSchema<T extends TableNode> {
    private T current;          // the table whose lineage is described
    private List<T> parents;    // upstream tables it depends on
    private List<T> childs;     // downstream tables that depend on it
    private String extParam;    // extensible JSON field for extra attributes
}

@Data
public class JobLineageSchema<Job extends JobNode, Table extends TableNode> {
    private Job task;              // the offline or real-time job
    private List<Table> inputs;    // tables the task reads
    private List<Table> outputs;   // tables the task writes
    private String extParam;       // extensible JSON field for extra attributes
}

Monitoring and Alerting
Three levels of service interfaces (core, important, normal) are annotated with owners. Exceptions trigger phone alerts for core services and email alerts for others. Daily service reports are sent to owners.
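One way to implement such owner/level tagging is a method-level annotation. The annotation name and fields below are hypothetical, not the platform's actual API:

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical annotation; the platform's real one may differ.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
public @interface AlertOn {
    enum Level { CORE, IMPORTANT, NORMAL }

    Level level() default Level.NORMAL;

    String owner();

    String backup() default "";
}

An interceptor (e.g., Spring AOP) can then catch exceptions from annotated methods, read the level and owner, and route a phone alert for CORE or an email otherwise, matching the behavior described above.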
Example alert log:
[Warning][prod][data-dict] - Data Asset Platform alert
The [metadata collection] module you own (backup: XXX) raised an [important]-level problem. Method: [com.youzan.bigdata.crystal.controller.HiveMetaController.getHiveDb], exception message: null
host: XXXXXX
Handling link: https://XXXX

Kafka backlog alerts are also configured to detect SDK ingestion issues.
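A backlog check can be sketched with Kafka's AdminClient by comparing the consumer group's committed offsets against log-end offsets. The broker address, group name, and threshold below are assumptions:

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class BacklogCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092"); // assumed broker
        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets for the ingestion consumer group (name assumed).
            Map<TopicPartition, OffsetAndMetadata> committed = admin
                    .listConsumerGroupOffsets("meta-ingest")
                    .partitionsToOffsetAndMetadata().get();

            // Log-end offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> request = new HashMap<>();
            committed.keySet().forEach(tp -> request.put(tp, OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> ends =
                    admin.listOffsets(request).all().get();

            // Alert when total lag crosses a threshold (value assumed).
            long lag = committed.entrySet().stream()
                    .mapToLong(e -> ends.get(e.getKey()).offset() - e.getValue().offset())
                    .sum();
            if (lag > 100_000) {
                System.out.println("Kafka backlog alert: lag=" + lag);
            }
        }
    }
}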
Future Work and Outlook
Planned improvements include:
Automated metadata collection via work‑order integration.
Enhanced task management (search, enable/disable).
Higher metadata quality assurance.
Support for business‑level metadata and unstructured data.
The team continues to recruit talent for data platform development, data warehousing, product, and algorithm roles.
