Kuaishou Metadata Platform: Evolution, Architecture, and Application Scenarios

This article introduces the development history, current architecture, abstraction methods, and key application scenarios of Kuaishou's metadata platform, highlighting challenges such as heterogeneous data integration, large-scale asset management, and the platform's role in data search, lineage, governance, and future enhancements.

DataFunTalk

Jan 21, 2021

Background Introduction

Metadata is information about data organization, domains, and relationships; in this article we focus on metadata of data assets generated during big‑data production, such as tables, jobs, and lineage.

Challenges in Building a Metadata Platform

End‑to‑end metadata integration across data collection, ETL, and consumption.

Heterogeneous metadata and complex relationships across dozens of platforms and billions of entities.

Extracting value from metadata by collaborating with business and data teams for governance and model evaluation.

Why Build a Metadata Platform?

As the business grows, each stage exposes problems that call for a robust metadata system: finding data quickly, tracing upstream and downstream lineage accurately, driving governance, and managing assets (ownership, classification, privacy).

Construction Process and Current Status

1. Metadata Platform Evolution

The platform evolved in three stages:

Early stage (before 2018): Hive was the only engine, with a few thousand tables; metadata was synced to MySQL through a simple PostHook.

Growth stage (2018‑2019): Multiple compute engines, rapid growth in tables and jobs, Elasticsearch (ES) introduced for search, and lineage built offline.

Current stage (2020 onward): Over ten asset types, hundreds of thousands of tables and tasks, an offline warehouse for governance, real‑time SQL‑based lineage, and a knowledge map for onboarding.

2. Abstraction and Management

Metadata is abstracted into core concepts:

| Concept | Description | Example |
| --- | --- | --- |
| Entity | An instance of a metadata type with a unique identifier and attributes. | Hive table, metric, scheduling task |
| Attribute | Basic unit of an entity, can be simple or complex. | Table name, metric type, security level |
| Relation | Link between two entities, physical or logical. | Table‑task relation, metric binding |
| URN | Three‑part globally unique identifier. | ks:hive/table:db/table |
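
To make these concepts concrete, here is a minimal sketch of how entities, relations, and URNs could be modeled. It assumes the three URN parts are separated by colons, as the example ks:hive/table:db/table suggests; the class and field names (Urn, namespace, type_path, name, Entity, Relation) are hypothetical and not taken from Kuaishou's implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class Urn:
    """Three-part identifier; the field names are illustrative assumptions."""
    namespace: str  # e.g. "ks"
    type_path: str  # e.g. "hive/table"
    name: str       # e.g. "db/table"

    @classmethod
    def parse(cls, text: str) -> "Urn":
        # Assumption: parts are colon-separated; the last part may contain colons.
        parts = text.split(":", 2)
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected three colon-separated parts: {text!r}")
        return cls(*parts)

    def __str__(self) -> str:
        return f"{self.namespace}:{self.type_path}:{self.name}"


@dataclass
class Entity:
    """An instance of a metadata type, identified by its URN."""
    urn: Urn
    attributes: Dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class Relation:
    """A directed link between two entities."""
    source: Urn
    target: Urn
    kind: str  # e.g. "table-task"


table = Entity(Urn.parse("ks:hive/table:db/table"), {"table_name": "table"})
```

Making Urn frozen keeps it hashable, so it can serve as a dictionary key or as a node identifier in a graph store.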

Attributes are classified into four categories:

Basic metadata from engines/platforms.

Asset metadata maintained by developers.

Security metadata from security center.

Derived metadata computed from other attributes.
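
The four categories can be sketched as a tagging step that records where each attribute came from and computes derived attributes last. This is an illustration only: the function build_attributes, the rule is_documented, and the attribute names are assumptions, not part of the platform described here.

```python
from enum import Enum
from typing import Any, Callable, Dict, Tuple


class AttributeCategory(Enum):
    BASIC = "basic"        # reported by engines and platforms
    ASSET = "asset"        # maintained by developers
    SECURITY = "security"  # supplied by the security center
    DERIVED = "derived"    # computed from other attributes


Rule = Callable[[Dict[str, Any]], Any]


def build_attributes(
    basic: Dict[str, Any],
    asset: Dict[str, Any],
    security: Dict[str, Any],
    derived_rules: Dict[str, Rule],
) -> Dict[str, Tuple[AttributeCategory, Any]]:
    """Tag each attribute with its category, then evaluate derived rules."""
    tagged: Dict[str, Tuple[AttributeCategory, Any]] = {}
    sources = (
        (AttributeCategory.BASIC, basic),
        (AttributeCategory.ASSET, asset),
        (AttributeCategory.SECURITY, security),
    )
    for category, values in sources:
        for name, value in values.items():
            tagged[name] = (category, value)
    # Derived rules see the plain values of all non-derived attributes.
    plain = {name: value for name, (_, value) in tagged.items()}
    for name, rule in derived_rules.items():
        tagged[name] = (AttributeCategory.DERIVED, rule(plain))
    return tagged


attrs = build_attributes(
    basic={"table_name": "orders"},
    asset={"owner": "alice", "description": ""},
    security={"security_level": "L2"},
    # Hypothetical derived attribute: documented only if owner and description are set.
    derived_rules={"is_documented": lambda a: bool(a.get("owner")) and bool(a.get("description"))},
)
```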

3. Current System Architecture

The system consists of an ingestion layer, a service layer, and a storage layer built on a graph model. The ingestion layer adapts to the various metadata producers, normalizes their data, and emits change messages. The service layer provides point queries, complex queries, and analytical capabilities. The storage layer supports graph queries, statistics, and analysis.
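
The ingestion step can be sketched roughly as follows: a producer-specific adapter maps a raw payload to a common shape, and a diff against the previously stored version yields change messages. The payload fields (db, table, columns), the message layout, and both function names are assumptions made for illustration.

```python
from typing import Any, Dict, List, Optional


def normalize_hive_event(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Adapter for one producer: map a hypothetical Hive payload to a common shape."""
    return {
        "urn": f"ks:hive/table:{raw['db']}/{raw['table']}",
        "attributes": {
            "table_name": raw["table"],
            "columns": list(raw.get("columns", [])),
        },
    }


def diff_to_change_messages(
    previous: Optional[Dict[str, Any]], current: Dict[str, Any]
) -> List[Dict[str, Any]]:
    """Emit one change message per attribute whose value differs."""
    old = previous["attributes"] if previous else {}
    new = current["attributes"]
    messages = []
    for name in sorted(set(old) | set(new)):
        if old.get(name) != new.get(name):
            messages.append(
                {
                    "urn": current["urn"],
                    "attribute": name,
                    "old": old.get(name),
                    "new": new.get(name),
                }
            )
    return messages


first = normalize_hive_event({"db": "dw", "table": "orders", "columns": ["id"]})
second = normalize_hive_event({"db": "dw", "table": "orders", "columns": ["id", "amount"]})
changes = diff_to_change_messages(first, second)
```

Emitting per-attribute changes rather than whole snapshots lets downstream consumers, such as a search indexer, react only to what actually changed.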

Application Scenarios

1. Data Discovery (Search)

Metadata is indexed in Elasticsearch to support keyword search across basic info (field names, dimensions, timestamps), descriptive info (Chinese name, description), and relationship info (task links, bindings, tags). After coarse recall, three fine‑ranking rules are applied: metadata completeness, downstream dependency count, and operational rules. Metrics such as zero‑click rate, average click rank, and negative feedback rate evaluate search quality.
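
As a rough sketch of the fine-ranking step and the quality metrics, the code below re-scores recalled candidates by metadata completeness, downstream dependency count, and an operational boost, then computes zero-click rate and average click rank from click logs. The weights, the required fields, and the session format are all assumptions; the article does not give the actual formulas.

```python
import math
from typing import Dict, List, Optional, Sequence


def completeness(doc: dict, required=("description", "owner", "chinese_name")) -> float:
    """Fraction of the (assumed) required descriptive fields that are filled in."""
    return sum(1 for f in required if doc.get(f)) / len(required)


def fine_rank(candidates: List[dict], boosts: Optional[Dict[str, float]] = None) -> List[dict]:
    """Re-rank coarse-recall results; the weights 1.0 and 0.5 are illustrative."""
    boosts = boosts or {}

    def score(doc: dict) -> float:
        return (
            doc["recall_score"]
            + 1.0 * completeness(doc)
            + 0.5 * math.log1p(doc.get("downstream_count", 0))
            + boosts.get(doc["urn"], 0.0)  # operational rule, e.g. a manual pin
        )

    return sorted(candidates, key=score, reverse=True)


def zero_click_rate(sessions: Sequence[Sequence[int]]) -> float:
    """Share of searches with no click; each session lists clicked ranks (1-based)."""
    if not sessions:
        return 0.0
    return sum(1 for clicks in sessions if not clicks) / len(sessions)


def average_click_rank(sessions: Sequence[Sequence[int]]) -> Optional[float]:
    """Mean rank of all clicked results, or None when nothing was clicked."""
    ranks = [r for clicks in sessions for r in clicks]
    return sum(ranks) / len(ranks) if ranks else None


docs = [
    {"urn": "a", "recall_score": 1.0, "downstream_count": 0},
    {"urn": "b", "recall_score": 1.0, "description": "x", "owner": "o",
     "chinese_name": "n", "downstream_count": 10},
]
ranked = fine_rank(docs)
```

The logarithm dampens the dependency count so that a table with thousands of downstream consumers does not drown out the other signals.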

2. Full‑Link Lineage

Lineage is captured from lifecycle events of production tasks and from reports submitted by custom platforms. These inputs are parsed, either as SQL or through user‑defined rules, to extract input‑output links down to field‑level dependencies. The lineage service stores the links in the graph engine and supports queries for data and task lineage, impact analysis, priority inference, and decommission checks.
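
The idea of extracting input-output links and then walking them for impact analysis can be illustrated with a deliberately naive sketch. It uses regular expressions instead of a real SQL parser, so it only handles simple INSERT ... SELECT statements and works at table level; the function names are hypothetical, and a production system would rely on the engine's own parser.

```python
import re
from collections import defaultdict, deque
from typing import List, Set, Tuple

# Naive patterns: adequate for simple statements, not a substitute for a SQL parser.
_TARGET = re.compile(r"\binsert\s+(?:into|overwrite)\s+(?:table\s+)?([\w.]+)", re.I)
_SOURCE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.I)


def extract_lineage(sql: str) -> List[Tuple[str, str]]:
    """Return (source_table, target_table) edges found in one statement."""
    targets = set(_TARGET.findall(sql))
    sources = set(_SOURCE.findall(sql)) - targets
    return [(s, t) for t in sorted(targets) for s in sorted(sources)]


def downstream(edges: List[Tuple[str, str]], start: str) -> Set[str]:
    """Breadth-first walk over lineage edges: everything affected by `start`."""
    children = defaultdict(set)
    for source, target in edges:
        children[source].add(target)
    seen: Set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in children[node]:
            if child not in seen and child != start:
                seen.add(child)
                queue.append(child)
    return seen


edges = extract_lineage(
    "INSERT OVERWRITE TABLE dw.daily "
    "SELECT * FROM ods.orders o JOIN ods.users u ON o.uid = u.id"
)
edges += extract_lineage("insert into ads.report select * from dw.daily")
impacted = downstream(edges, "ods.orders")
```

Under this sketch, a table whose downstream set is empty would be a candidate for a decommission check, while a non-empty set lists what an upstream change may affect.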

3. Data Governance Platform

Governance addresses resource waste, non‑standard production, missing metadata, and quality monitoring. A scoring system evaluates data assets across four dimensions (data standards, model design, product delivery, resource utilization) using 19 metrics, producing a leaderboard to drive continuous improvement.
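
A scoring scheme of this kind can be sketched as below. The four dimension names come from the text, but the 19 individual metrics are not listed in the article, so the example takes arbitrary metric values in [0, 1] and assumes equal weighting within and across dimensions.

```python
from statistics import mean
from typing import Dict, List, Tuple

DIMENSIONS = (
    "data_standards",
    "model_design",
    "product_delivery",
    "resource_utilization",
)


def asset_score(metrics_by_dimension: Dict[str, List[float]]) -> float:
    """Average metrics within each dimension, then across dimensions (0-100).

    Assumption: equal weights; a dimension with no metrics scores 0.
    """
    dimension_scores = []
    for dimension in DIMENSIONS:
        values = metrics_by_dimension.get(dimension, [])
        dimension_scores.append(mean(values) if values else 0.0)
    return round(100 * mean(dimension_scores), 1)


def leaderboard(assets: Dict[str, Dict[str, List[float]]]) -> List[Tuple[str, float]]:
    """Rank assets from highest to lowest score."""
    scored = [(name, asset_score(metrics)) for name, metrics in assets.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


assets = {
    "tmp.scratch": {"data_standards": [0.5, 0.5]},
    "dw.daily": {dimension: [1.0] for dimension in DIMENSIONS},
}
board = leaderboard(assets)
```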

4. Other Scenarios

Metadata queries for development platforms, metric models, BI tools.

Asset management with lifecycle, security level, and ownership.

Impact analysis for downstream propagation.

Value assessment of data assets.

Future Plans

Enhanced search experience leveraging graph queries.

Higher‑quality metadata through broader ingestion and enrichment.

Offline analytical capabilities on the new metadata store.

Finer‑granularity lineage (field‑level, sub‑field) with improved accuracy.
