Kuaishou Metadata Platform: Evolution, Architecture, and Application Scenarios
This article covers the development history, current architecture, abstraction model, and main application scenarios of Kuaishou's metadata platform. It highlights challenges such as heterogeneous data integration and large-scale asset management, describes the platform's role in data search, lineage, and governance, and outlines planned enhancements.
Background Introduction
Metadata is information about data organization, domains, and relationships; in this article we focus on metadata of data assets generated during big‑data production, such as tables, jobs, and lineage.
Challenges in Building a Metadata Platform
End‑to‑end metadata integration across data collection, ETL, and consumption.
Heterogeneous metadata and complex relationships across dozens of platforms and billions of entities.
Extracting value from metadata by collaborating with business and data teams for governance and model evaluation.
Why Build a Metadata Platform?
Different business stages expose problems that require a robust metadata system, including fast data discovery, accurate upstream/downstream lineage, governance drivers, and asset management (ownership, classification, privacy).
Construction Process and Current Status
1. Metadata Platform Evolution
The platform evolved in three stages:
Early stage (pre‑2018): Only Hive engine, a few thousand tables, simple MySQL sync via PostHook.
Growth stage (2018‑2019): Multiple compute engines, rapid table and job growth, introduction of ES for search, offline lineage construction.
Current stage (post‑2020): Over ten asset types, hundreds of thousands of tables and tasks, offline warehouse for governance, real‑time SQL‑based lineage, knowledge‑map for onboarding.
2. Abstraction and Management
Metadata is abstracted into core concepts:
Concept | Description | Example
Entity | An instance of a metadata type with a unique identifier and attributes. | Hive table, metric, scheduling task
Attribute | Basic unit of an entity; can be simple or complex. | Table name, metric type, security level
Relation | Link between two entities, physical or logical. | Table-task relation, metric binding
URN | Three-part globally unique identifier. | ks:hive/table:db/table
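As a rough illustration of these concepts, the sketch below models a URN, an entity, and a relation as Python dataclasses. It assumes the three URN parts are colon-separated, as the example ks:hive/table:db/table suggests. The field names (namespace, entity_type, path), the relation label, and the scheduler URN are hypothetical and are not taken from Kuaishou's implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class Urn:
    """Three-part identifier; the part names here are assumptions."""
    namespace: str    # e.g. "ks"
    entity_type: str  # e.g. "hive/table"
    path: str         # e.g. "db/table"

    @classmethod
    def parse(cls, text: str) -> "Urn":
        parts = text.split(":")
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected three non-empty parts, got {text!r}")
        return cls(*parts)

    def __str__(self) -> str:
        return f"{self.namespace}:{self.entity_type}:{self.path}"


@dataclass
class Entity:
    urn: Urn
    attributes: Dict[str, Any] = field(default_factory=dict)


@dataclass(frozen=True)
class Relation:
    source: Urn
    target: Urn
    kind: str  # hypothetical label, e.g. "produced_by"


# A Hive table entity, a scheduling task entity, and the relation between them.
table = Entity(Urn.parse("ks:hive/table:db/table"), {"name": "table"})
task = Entity(Urn.parse("ks:scheduler/task:etl/daily_load"))  # made-up URN
link = Relation(table.urn, task.urn, kind="produced_by")
```

Making the URN immutable and hashable lets it serve directly as a dictionary key or as a node identifier in a graph store.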
Attributes are classified into four categories:
Basic metadata from engines/platforms.
Asset metadata maintained by developers.
Security metadata from security center.
Derived metadata computed from other attributes.
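The four categories can be sketched as an enumeration, with a derived attribute computed from the others. The attribute names and the completeness formula below are illustrative assumptions, not the platform's actual definitions.

```python
from enum import Enum


class AttributeCategory(Enum):
    BASIC = "basic"        # reported by engines and platforms
    ASSET = "asset"        # maintained by developers
    SECURITY = "security"  # supplied by the security center
    DERIVED = "derived"    # computed from other attributes


# Hypothetical attribute registry: attribute name -> category.
CATEGORY_OF = {
    "table_name": AttributeCategory.BASIC,
    "description": AttributeCategory.ASSET,
    "owner": AttributeCategory.ASSET,
    "security_level": AttributeCategory.SECURITY,
    "completeness": AttributeCategory.DERIVED,
}


def derive_completeness(attributes: dict) -> float:
    """Fraction of developer-maintained (asset) attributes that are filled in."""
    asset_keys = [k for k, c in CATEGORY_OF.items() if c is AttributeCategory.ASSET]
    filled = sum(1 for k in asset_keys if attributes.get(k))
    return filled / len(asset_keys)


# "owner" is empty, so only one of the two asset attributes counts as filled.
attrs = {"table_name": "orders", "description": "Daily orders", "owner": ""}
attrs["completeness"] = derive_completeness(attrs)
```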
3. Current System Architecture
The system consists of three layers: ingestion, service, and storage (a graph model). The ingestion layer adapts to the various metadata producers, normalizes their data, and emits change messages. The service layer provides point queries, complex queries, and analytical capabilities. The storage layer supports graph queries, statistics, and analysis.
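A minimal sketch of the ingestion flow described above: adapt a producer-specific payload, normalize it to a common shape, and emit a change message only when something actually changed. The adapter functions, payload fields, message format, and the in-memory store and queue are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, object]


# Hypothetical adapters: each maps a producer-specific payload to a common shape.
def from_hive(raw: Record) -> Record:
    return {"urn": f"ks:hive/table:{raw['db']}/{raw['tbl']}",
            "attributes": {"name": raw["tbl"], "owner": raw.get("owner")}}


def from_scheduler(raw: Record) -> Record:
    return {"urn": f"ks:scheduler/task:{raw['task_id']}",
            "attributes": {"name": raw["title"]}}


ADAPTERS: Dict[str, Callable[[Record], Record]] = {
    "hive": from_hive,
    "scheduler": from_scheduler,
}


class Ingestor:
    def __init__(self) -> None:
        self.store: Dict[str, object] = {}  # stands in for the storage layer
        self.changes: List[Record] = []     # stands in for a message queue

    def ingest(self, producer: str, raw: Record) -> Optional[Record]:
        entity = ADAPTERS[producer](raw)
        urn = str(entity["urn"])
        old = self.store.get(urn)
        if old == entity["attributes"]:
            return None  # nothing changed, so no message is emitted
        self.store[urn] = entity["attributes"]
        message = {"urn": urn,
                   "op": "create" if old is None else "update",
                   "before": old,
                   "after": entity["attributes"]}
        self.changes.append(message)
        return message
```

Comparing against the stored state before emitting keeps repeated syncs from flooding downstream consumers with no-op messages.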
Application Scenarios
1. Data Discovery (Search)
Metadata is indexed in Elasticsearch to support keyword search across basic info (field names, dimensions, timestamps), descriptive info (Chinese name, description), and relationship info (task links, bindings, tags). After coarse recall, three fine‑ranking rules are applied: metadata completeness, downstream dependency count, and operational rules. Metrics such as zero‑click rate, average click rank, and negative feedback rate evaluate search quality.
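The fine-ranking step and two of the quality metrics can be sketched as follows. The weights, the log scaling of the dependency count, and the pinned flag standing in for an operational rule are assumptions. The article names the three ranking signals but not how they are combined.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    name: str
    completeness: float    # 0.0 to 1.0
    downstream_count: int
    pinned: bool = False   # hypothetical operational rule


def fine_rank(candidates: List[Candidate]) -> List[Candidate]:
    def score(c: Candidate) -> float:
        # Weights are illustrative assumptions, not values from the talk.
        s = 0.5 * c.completeness + 0.5 * math.log1p(c.downstream_count) / 10.0
        return s + (1.0 if c.pinned else 0.0)
    return sorted(candidates, key=score, reverse=True)


def zero_click_rate(clicked_ranks: List[Optional[int]]) -> float:
    """Share of searches with no click; None marks a search without a click."""
    return sum(r is None for r in clicked_ranks) / len(clicked_ranks)


def average_click_rank(clicked_ranks: List[Optional[int]]) -> float:
    """Mean 1-based rank of the clicked result, ignoring zero-click searches."""
    ranks = [r for r in clicked_ranks if r is not None]
    return sum(ranks) / len(ranks)
```

A lower average click rank means users find what they need nearer the top of the results, while a high zero-click rate suggests that recall or ranking is missing the user's intent.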
2. Full‑Link Lineage
Lineage relationships are captured from production task lifecycle events and from reports submitted by custom platforms. They are then parsed, either from SQL or with user-defined rules, to extract input-output links, down to field-level dependencies. The lineage service stores these links in the graph engine and supports queries for data and task lineage, impact analysis, priority inference, and decommission checks.
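To make the table-level case concrete, the sketch below extracts input-output edges from a simple INSERT ... SELECT statement and answers impact and decommission queries with a breadth-first traversal. The regular expressions are a deliberately naive stand-in for a real SQL parser, and an in-memory dictionary stands in for the graph engine. Neither reflects the platform's actual implementation.

```python
import re
from collections import defaultdict, deque
from typing import Dict, List, Set, Tuple

# Deliberately naive: real SQL needs a proper parser (CTEs, subqueries, quoting).
_TARGET = re.compile(r"insert\s+(?:overwrite|into)\s+(?:table\s+)?([\w.]+)", re.I)
_SOURCE = re.compile(r"(?:from|join)\s+([\w.]+)", re.I)


def extract_edges(sql: str) -> List[Tuple[str, str]]:
    """Return (input_table, output_table) pairs for one INSERT ... SELECT."""
    target = _TARGET.search(sql)
    if target is None:
        return []
    out = target.group(1)
    return [(src, out) for src in sorted(set(_SOURCE.findall(sql))) if src != out]


class LineageGraph:
    def __init__(self) -> None:
        self.downstream: Dict[str, Set[str]] = defaultdict(set)

    def add_sql(self, sql: str) -> None:
        for src, dst in extract_edges(sql):
            self.downstream[src].add(dst)

    def impact(self, table: str) -> Set[str]:
        """All tables reachable downstream of `table` (breadth-first)."""
        seen: Set[str] = set()
        queue = deque([table])
        while queue:
            for nxt in self.downstream.get(queue.popleft(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen - {table}

    def safe_to_decommission(self, table: str) -> bool:
        """A table with no direct downstream consumers is a removal candidate."""
        return not self.downstream.get(table)
```

Field-level lineage would need column resolution on top of this, which is where a full SQL parser becomes unavoidable.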
3. Data Governance Platform
Governance addresses resource waste, non‑standard production, missing metadata, and quality monitoring. A scoring system evaluates data assets across four dimensions (data standards, model design, product delivery, resource utilization) using 19 metrics, producing a leaderboard to drive continuous improvement.
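One way such a scoring system could be assembled is shown below: average the metric scores within each dimension, combine the four dimensions with weights, and sort the results into a leaderboard. The equal weights, the 0-100 scale, and ranking by individual asset are assumptions. The article gives neither the 19 metric definitions nor the aggregation formula.

```python
from statistics import mean
from typing import Dict, List, Tuple

# Equal weights are an assumption; the talk does not give the actual weights.
DIMENSION_WEIGHTS = {
    "data_standards": 0.25,
    "model_design": 0.25,
    "product_delivery": 0.25,
    "resource_utilization": 0.25,
}

# asset -> dimension -> metric scores on an assumed 0-100 scale
Scores = Dict[str, Dict[str, List[float]]]


def asset_score(dimensions: Dict[str, List[float]]) -> float:
    """Weighted sum of the per-dimension mean metric scores."""
    return sum(weight * mean(dimensions[dim])
               for dim, weight in DIMENSION_WEIGHTS.items())


def leaderboard(scores: Scores) -> List[Tuple[str, float]]:
    """Assets ordered from highest to lowest overall score."""
    ranked = ((asset, round(asset_score(dims), 2)) for asset, dims in scores.items())
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)
```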
4. Other Scenarios
Metadata queries for development platforms, metric models, BI tools.
Asset management with lifecycle, security level, and ownership.
Impact analysis for downstream propagation.
Value assessment of data assets.
Future Plans
Enhanced search experience leveraging graph queries.
Higher‑quality metadata through broader ingestion and enrichment.
Offline analytical capabilities on the new metadata store.
Finer‑granularity lineage (field‑level, sub‑field) with improved accuracy.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
