
Data Lineage System Design and Implementation for Big Data Platforms

This article presents Data-Lineage, a comprehensive data lineage system for big data platforms that addresses the challenges of heterogeneous data sources, multiple execution engines, and complex dependencies through a hook-based architecture and modular design.

Beijing SF i-TECH City Technology Team

This paper introduces a comprehensive data lineage system designed to track data flow across complex big data platforms. The system addresses challenges in heterogeneous data sources, multiple execution engines, and complex dependencies through a hook-based architecture.

The paper begins by establishing the context of data lineage as a critical component for data provenance, quality assessment, and metadata management. It describes how data flows through various layers in a typical big data platform, from raw data sources through processing stages to final consumption.

The proposed architecture consists of four main modules: Hook, Collector, Lineage, and Common modules. The Hook module uses plugin-based development to intercept execution engine operations and extract lineage information. The Collector module receives and processes lineage data from various sources. The Lineage module provides query interfaces and SQL parsing capabilities. The Common module offers shared utilities and a custom logging framework.
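The plugin-based Hook module described above can be sketched as a small contract: each engine-specific hook implements a shared interface, and a registry hands the right implementation to the Collector side. All class and method names below are hypothetical illustrations of the pattern, not the system's actual API.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plugin contract for the Hook module: one implementation
// per execution engine, all exposing the same extraction method.
interface LineageHook {
    List<String> extractEdges(String job);  // edges as "source->target"
}

// Registry the Collector side could consult by engine name, so new
// engines plug in without changing the dispatch code.
class HookRegistry {
    private final Map<String, LineageHook> hooks = new HashMap<>();

    void register(String engine, LineageHook hook) {
        hooks.put(engine, hook);
    }

    LineageHook forEngine(String engine) {
        LineageHook hook = hooks.get(engine);
        if (hook == null) {
            throw new IllegalArgumentException("no hook for engine: " + engine);
        }
        return hook;
    }
}
```

Because `LineageHook` has a single method, engine adapters can be registered as lambdas during testing while real deployments register full hook classes.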

Technical implementations are detailed for Hive, DataX, Flink, and Impala execution engines. For Hive, the system leverages Post-execution Hooks to capture query plans and extract source/destination table information. For DataX, it implements custom hook functions to parse job configurations. For Flink, it modifies the source code to add hook functionality. For Impala, it parses built-in lineage logs.
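Hive's real post-execution hooks receive a context object exposing the query plan's input and output entities. The sketch below mimics that shape with simplified stand-in types (the `FakeHookContext` class and its fields are invented here so the example runs standalone): every input table is paired with every output table to form lineage edges.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Simplified stand-in for Hive's hook context: the real one exposes the
// query plan, whose inputs/outputs carry the tables read and written.
class FakeHookContext {
    final Set<String> inputTables;   // tables the query read
    final Set<String> outputTables;  // tables the query wrote

    FakeHookContext(Set<String> in, Set<String> out) {
        this.inputTables = in;
        this.outputTables = out;
    }
}

// Post-execution hook sketch: cross inputs with outputs to form
// lineage edges, which would then be sent on to the Collector module.
class PostExecLineageHook {
    List<String> run(FakeHookContext ctx) {
        List<String> edges = new ArrayList<>();
        for (String src : ctx.inputTables) {
            for (String dst : ctx.outputTables) {
                edges.add(src + " -> " + dst);
            }
        }
        return edges;
    }
}
```

The same edge-extraction shape applies to the other engines: DataX job configurations, Flink's modified source, and Impala's lineage logs each yield input/output sets that reduce to the cross-product step shown here.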

The system employs a factory pattern for handling different data source types and uses HTTP-based communication between modules. It includes SQL parsing capabilities for permission verification and metadata management. The Common module provides entity classes, exception handling, enums, utilities, and a custom logging framework specifically designed for hook operations.
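The factory pattern mentioned above can be sketched as follows: a static factory dispatches on the data source type string and returns the matching handler. The handler classes and method names are hypothetical, a minimal illustration of the pattern under the assumption that each source type gets its own handler class.

```java
import java.util.Locale;

// Hypothetical common interface for per-source handlers.
interface SourceHandler {
    String describe();
}

class HiveSourceHandler implements SourceHandler {
    public String describe() { return "hive"; }
}

class ImpalaSourceHandler implements SourceHandler {
    public String describe() { return "impala"; }
}

// Factory that maps a data source type string to its handler, so
// callers never depend on concrete handler classes.
class SourceHandlerFactory {
    static SourceHandler create(String type) {
        switch (type.toLowerCase(Locale.ROOT)) {
            case "hive":   return new HiveSourceHandler();
            case "impala": return new ImpalaSourceHandler();
            default:
                throw new IllegalArgumentException("unsupported source type: " + type);
        }
    }
}
```

Adding a new source (MySQL, Oracle, Kafka, as the paper's future work proposes) then means adding one handler class and one `case`, leaving callers untouched.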

Future work includes extending support to additional data sources like MySQL, Oracle, and Kafka, as well as implementing data tagging and popularity analysis based on lineage data.

Data Quality, data lineage, SQL parsing, metadata management, big data architecture, data provenance, execution engine integration, hook mechanism
Written by

Beijing SF i-TECH City Technology Team

Official tech channel of Beijing SF i-TECH City. A publishing platform for technology innovation, practical implementation, and frontier tech exploration.
