Metadata Governance and Collection in a Data Asset Platform
The platform implements comprehensive metadata governance by extracting, standardizing, and ingesting basic, trend, resource, lineage, and task metadata from offline and real-time systems through a Kafka-based SDK. Unified storage, monitoring, and alerting improve data asset visibility and quality, and lay the groundwork for future automation.
Data asset governance requires comprehensive metadata that describes data types, volumes, and the full lifecycle of data flow. Collecting metadata is the core foundation for managing data assets.
The early collection system focused on data warehouse metadata, connecting directly to Hive and MySQL tables via APIs. As business needs grew, collection expanded to cover the entire data chain, including offline and real-time platforms, internal tools, and task metadata. This expansion brought challenges: diverse data categories, numerous platform components, long collection cycles, and low integration efficiency.
What is Metadata?
Metadata is "data that describes data." For example, a photo’s EXIF information (filename, size, resolution, camera model, etc.) is metadata. In the asset governance platform, metadata for Hive tables includes table name, field list, owner, and scheduling information.
Collecting full‑link metadata helps answer questions like: What data exists? Who uses it? How much storage does it occupy? How can it be discovered? How does it flow? Lineage information enables impact analysis.
Types of Collected Metadata
Basic metadata: table name, description, fields, owner, business domain, cluster, project.
Trend data: size, row count, file count, partition count, job duration, production time.
Resource data: cluster throughput, QPS, CPU/memory consumption.
Lineage data: upstream/downstream table/field dependencies and task input‑output relationships.
Task data: offline/real‑time job name, owner, deadline alerts, scripts, configuration.
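Taken together, these categories suggest simple record shapes. The following is a purely illustrative sketch; the field names are assumptions drawn from the list above, not the platform's actual schema:

import java.util.List;
import lombok.Data;

// Illustrative only: one basic-metadata record per table.
@Data
public class BasicMeta {
    private String tableName;
    private String description;
    private List<String> fields;
    private String owner;
    private String businessDomain;
    private String cluster;
    private String project;
}

// Illustrative only: one trend snapshot per table per day.
@Data
public class TableTrend {
    private String tableName;
    private long sizeBytes;
    private long rowCount;
    private long fileCount;
    private long partitionCount;
    private long jobDurationMs;
    private String productionTime;
}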
To date, the platform covers over 10 data types and more than 100,000 basic metadata entries across offline components (Hive/MySQL), real‑time components (Flume, Kafka, HBase, Kylin, ES, Presto, Spark, Flink), and internal tools (BI reporting, metric library, OneService, QA systems).
Metadata Extraction
Extraction methods include:
Accessing the Hive Metastore to retrieve basic metadata stored in relational databases (sketched after this list).
Fetching component resource metrics from monitoring services (e.g., Kafka, KP, RP, RDS) and aggregating them.
Obtaining business metrics from internal platforms (KP, RP, RDS, DP) that expose APIs for tables, tasks, and resource usage.
Collecting lineage data from DP and RP platforms by parsing task configurations or using an ANTLR4‑based SQL parser.
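For the first of these, a minimal sketch using Hive's HiveMetaStoreClient might look like the following. The Metastore URI, database, and table names are placeholders; the source does not show the platform's actual collector:

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.FieldSchema;
import org.apache.hadoop.hive.metastore.api.Table;

public class MetastoreExtractor {
    public static void main(String[] args) throws Exception {
        HiveConf conf = new HiveConf();
        // Assumed Metastore thrift endpoint; replace with a real URI.
        conf.setVar(HiveConf.ConfVars.METASTOREURIS, "thrift://metastore-host:9083");

        HiveMetaStoreClient client = new HiveMetaStoreClient(conf);
        try {
            // Basic metadata: table definition, owner, and field list.
            Table table = client.getTable("dw", "orders");
            System.out.println("owner: " + table.getOwner());
            for (FieldSchema field : table.getSd().getCols()) {
                System.out.println(field.getName() + " : " + field.getType());
            }
        } finally {
            client.close();
        }
    }
}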
Metadata Collection SDK Design
The SDK supports reporting basic, trend, and lineage metadata and consists of a client side and a server side.
Architecture
The client defines generic schemas (MetaSchema, TrendSchema, LineageSchema) and pushes data to Kafka via ReportService. The server consumes Kafka messages, authenticates each record using appId/appName/token signatures, and routes data through adapters to a unified ingestion service that writes to MySQL and Elasticsearch.
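A minimal sketch of the client-side flow, assuming JSON-serialized schema objects and a shared Kafka topic (the topic name, envelope format, and signing scheme are assumptions):

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ReportServiceSketch {
    private static final String TOPIC = "meta-report"; // assumed topic name

    private final KafkaProducer<String, String> producer;
    private final String appId;
    private final String appName;
    private final String token;

    public ReportServiceSketch(String brokers, String appId, String appName, String token) {
        Properties props = new Properties();
        props.put("bootstrap.servers", brokers);
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        this.producer = new KafkaProducer<>(props);
        this.appId = appId;
        this.appName = appName;
        this.token = token;
    }

    // Wraps a serialized schema payload with identity fields so the
    // server side can verify the appId/appName/token before ingesting.
    public void report(String schemaJson) {
        String envelope = String.format(
                "{\"appId\":\"%s\",\"appName\":\"%s\",\"token\":\"%s\",\"payload\":%s}",
                appId, appName, token, schemaJson);
        producer.send(new ProducerRecord<>(TOPIC, appId, envelope));
    }
}

In practice the token would more plausibly be used to sign the payload (e.g., an HMAC) rather than travel in the clear; on the server side, the signature check happens before the record is routed to an adapter.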
Common Models
Unified metadata models are derived from Hive’s table model and include:
Common metadata model: owner info, table basics, business domain, extensions.
Common trend model: table definition, trend metrics, extensions.
Common lineage model: nodes (tables or tasks) and edges (dependencies) with extensible JSON fields.
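The lineage classes shown next reference table and job node types. A minimal sketch of what those nodes might carry, based on the description above (field names are assumptions):

import lombok.Data;

// Hypothetical node types referenced by the lineage schemas below.
@Data
public class TableNode {
    private String cluster;     // e.g., a Hive or HBase cluster
    private String database;
    private String tableName;
    private String extParam;    // extensible JSON field
}

@Data
public class JobNode {
    private String jobName;
    private String owner;
    private String platform;    // e.g., an offline or real-time scheduling platform
    private String extParam;    // extensible JSON field
}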
The lineage model definitions in Java (Lombok's @Data generates getters, setters, equals, and hashCode):

import java.util.List;
import lombok.Data;

@Data
public class TableLineageSchema<T extends TableNode> {
    private T current;          // the table whose lineage is described
    private List<T> parents;    // upstream tables it depends on
    private List<T> childs;     // downstream tables that depend on it
    private String extParam;    // extensible JSON field for extra attributes
}

@Data
public class JobLineageSchema<Job extends JobNode, Table extends TableNode> {
    private Job task;              // the offline or real-time job
    private List<Table> inputs;    // tables the task reads
    private List<Table> outputs;   // tables the task writes
    private String extParam;       // extensible JSON field for extra attributes
}

Monitoring and Alerting
Three levels of service interfaces (core, important, normal) are annotated with owners. Exceptions trigger phone alerts for core services and email alerts for others. Daily service reports are sent to owners.
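One way to implement such owner/level tagging is a method-level annotation. The annotation name and fields below are hypothetical, not the platform's actual API:

import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical annotation; the platform's real one may differ.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
public @interface AlertOn {
    enum Level { CORE, IMPORTANT, NORMAL }

    Level level() default Level.NORMAL;

    String owner();

    String backup() default "";
}

An interceptor (e.g., Spring AOP) can then catch exceptions from annotated methods, read the level and owner, and route a phone alert for CORE or an email otherwise, matching the behavior described above.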
Example alert log:
[Warning][prod][data-dict] - Data Asset Platform alert
The [metadata collection] module you own (backup: XXX) raised an [important]-level problem. Method: [com.youzan.bigdata.crystal.controller.HiveMetaController.getHiveDb], exception message: null
host: XXXXXX
Handling link: https://XXXX

Kafka backlog alerts are also configured to detect SDK ingestion issues.
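A backlog check can be sketched with Kafka's AdminClient by comparing the consumer group's committed offsets against log-end offsets. The broker address, group name, and threshold below are assumptions:

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class BacklogCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker:9092"); // assumed broker
        try (AdminClient admin = AdminClient.create(props)) {
            // Committed offsets for the ingestion consumer group (name assumed).
            Map<TopicPartition, OffsetAndMetadata> committed = admin
                    .listConsumerGroupOffsets("meta-ingest")
                    .partitionsToOffsetAndMetadata().get();

            // Log-end offsets for the same partitions.
            Map<TopicPartition, OffsetSpec> request = new HashMap<>();
            committed.keySet().forEach(tp -> request.put(tp, OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> ends =
                    admin.listOffsets(request).all().get();

            // Alert when total lag crosses a threshold (value assumed).
            long lag = committed.entrySet().stream()
                    .mapToLong(e -> ends.get(e.getKey()).offset() - e.getValue().offset())
                    .sum();
            if (lag > 100_000) {
                System.out.println("Kafka backlog alert: lag=" + lag);
            }
        }
    }
}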
Future Work and Outlook
Planned improvements include:
Automated metadata collection via work‑order integration.
Enhanced task management (search, enable/disable).
Higher metadata quality assurance.
Support for business‑level metadata and unstructured data.
The team continues to recruit talent for data platform development, data warehousing, product, and algorithm roles.
