
Building Data Lineage at Ctrip: Architecture, Implementation, and Real‑World Applications

This article describes how Ctrip built a data lineage system for its big data platform, covering the concept of data lineage, collection methods, open‑source tools such as Apache Atlas and DataHub, the in‑house table‑level and field‑level solutions, implementation details for Hive, Spark and Presto, storage in JanusGraph, and practical applications in data governance, metadata management, scheduling and sensitivity labeling.

Ctrip Technology

Data lineage records the origin, transformations, and movement of data over time, providing visibility that simplifies root‑cause analysis in analytics pipelines.

In the era of big data, lineage becomes a crucial component of metadata management, data governance, and data quality, enabling traceability of source tables, ETL jobs, and downstream reports.

Typical uses of data lineage include tracing data provenance for error analysis, evaluating data value based on usage metrics, visualizing data lifecycle, and enforcing security policies by propagating sensitivity tags downstream.

Approaches to Collect Lineage

Option 1: Collect raw SQL statements after execution and analyze them offline. This works when the number of engines and syntax variations is limited; Hive's built‑in LineageLogger, for example, can parse table‑ and column‑level relationships from completed queries.

Option 2: Perform runtime analysis of SQL, sending lineage information to Kafka immediately. This yields more accurate, real‑time data but requires custom parsers for each engine.
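As a minimal sketch of what a runtime‑collected lineage event might look like before it is published to Kafka (the field names and structure here are illustrative assumptions, not Ctrip's actual schema):

```python
import json
import time

def build_lineage_event(engine, query_id, inputs, outputs):
    """Assemble a table-level lineage event; in the real hooks this
    payload would be serialized and sent to a Kafka topic."""
    return {
        "engine": engine,           # e.g. "hive", "spark", "presto"
        "queryId": query_id,
        "timestamp": int(time.time() * 1000),
        "inputs": sorted(inputs),   # fully qualified source tables
        "outputs": sorted(outputs), # fully qualified target tables
    }

event = build_lineage_event(
    engine="hive",
    query_id="hive_20210114_0001",
    inputs={"ods.orders", "ods.users"},
    outputs={"dw.order_summary"},
)
payload = json.dumps(event)  # what the hook would hand to the Kafka producer
```

Emitting one self‑describing event per query keeps the downstream consumer engine‑agnostic: the same Kafka topic can carry events from every hook.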

Open‑Source Solutions

Apache Atlas: Provides metadata governance for Hadoop ecosystems, supporting Hive, HBase, Sqoop, Storm, Kafka, and Falcon. It captures lineage via hooks that send data to Kafka, stores relationships in JanusGraph, and exposes a REST API. Hive Hook supports table‑ and column‑level lineage; Spark support is limited to table‑level.

LinkedIn DataHub (formerly WhereHows): Aggregates metadata from multiple sources, offers a demo instance, integrates well with Airflow, and supports dataset‑level lineage, with field‑level lineage on the roadmap for Q3 2021.

Ctrip’s Own Solution

Ctrip adopted the runtime collection method (Option 2) and built custom hooks for Hive, Spark, Presto, and the DataX transfer tool. The first version (2016‑2017) captured table‑level lineage using a Hive Hook that posted to Kafka; results were stored in Neo4j and exposed via a metadata service.

The second version (2019) added field‑level lineage across Hive, Spark, Presto, and DataX, writing relationships to the distributed graph database JanusGraph in real time.

Hive Implementation

The Hive hook implements the ExecuteWithHookContext interface, extracts input and output tables from the execution plan, and sends lineage events to Kafka. Issues addressed included incorrect start times, null execution plans, and handling of DROP TABLE commands.
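The hook itself is Java; as an illustrative sketch of its core logic in Python (the entity structure and operation names are assumptions for the example, not Hive's actual API):

```python
def extract_tables(entities):
    """Keep only table entities from the plan's read/write entity lists
    and return fully qualified names, skipping e.g. DFS directories."""
    return {f"{e['db']}.{e['table']}" for e in entities if e.get("type") == "TABLE"}

def on_query_complete(plan):
    """Sketch of the hook body: guard against a null execution plan,
    then emit either a write event or a delete event for DROP TABLE."""
    if plan is None:
        return None  # null execution plan: skip rather than crash
    inputs = extract_tables(plan.get("inputs", []))
    outputs = extract_tables(plan.get("outputs", []))
    if plan.get("operation") == "DROPTABLE":
        # DROP TABLE removes lineage rather than adding it
        return {"op": "delete", "tables": outputs}
    return {"op": "write", "inputs": inputs, "outputs": outputs}
```

The null‑plan guard and the DROP TABLE branch correspond to the issues mentioned above: both cases arrive through the same hook path but must not be recorded as ordinary write lineage.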

Spark Implementation

The Spark implementation is a QueryExecutionListener that captures the logical plan on successful execution, extracts attribute IDs, and maps them to column‑level relationships. Challenges included working from the analyzed plan, handling hiveconf/hivevar variables, and dealing with DROP operations.

Presto Implementation

The Presto implementation is an EventListener plugin that parses QueryCompletedEvent. A class‑loading conflict with Kafka's StringSerializer was resolved by temporarily setting the thread context class loader to null.

JanusGraph Storage

JanusGraph serves as the distributed graph store, backed by Cassandra for storage and Elasticsearch for indexing. Vertices are keyed by database+table+column to enable fast get‑or‑create operations. Relationship edges use two labels: WRITE and WRITE_TTL, the latter leveraging Cassandra's cell‑level TTL for automatic expiration.
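A sketch of the get‑or‑create pattern combined with vertex‑id caching (the key format mirrors the database+table+column scheme above; the graph client is a stand‑in for illustration, not the JanusGraph API):

```python
class FakeGraph:
    """Stand-in for the graph client; real code would issue JanusGraph
    lookups and vertex creations here."""
    def __init__(self):
        self.vertices = {}
        self.next_id = 0

    def get(self, key):
        return self.vertices.get(key)

    def create(self, key):
        self.next_id += 1
        self.vertices[key] = self.next_id
        return self.next_id

class VertexCache:
    """Caches vertex ids keyed by 'db.table.column' so repeated lineage
    events avoid round trips to the graph (the vertex-id caching
    optimization mentioned below)."""
    def __init__(self, graph):
        self.graph = graph
        self.cache = {}

    def get_or_create(self, db, table, column):
        key = f"{db}.{table}.{column}"
        if key not in self.cache:
            vid = self.graph.get(key)
            if vid is None:
                vid = self.graph.create(key)
            self.cache[key] = vid
        return self.cache[key]
```

Since lineage events for hot tables recur constantly, a local id cache turns most vertex resolutions into dictionary hits.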

For example, stale WRITE edges older than a cutoff can be purged with a Gremlin traversal (note that Date takes milliseconds, so the epoch‑seconds cutoff must be multiplied by 1000):

g.E().hasLabel("WRITE").has("x",eq("y")).has("publishedDate",P.lte(new Date(1610640000000L))).drop().iterate()

Performance optimizations include vertex‑id caching, batch deletions, and TTL‑based edge cleanup.

Coverage and Limitations

Supported engines: Hive, Spark (SQL CLI, Thrift Server, DataFrame/Dataset API), Presto.

Supported transfer tool: DataX.

Integrated with scheduling platform Zeus, ad‑hoc query platform, reporting platform, metadata platform, and GPU‑based PySpark platform.

Limitations: No lineage for MapReduce or Spark RDD jobs; HDFS audit‑log integration is a future direction.

Impact

First version displayed lineage as a graph, which became cluttered with many relationships; the second version switched to a tree‑structured table view for clearer navigation.

Lineage data drives data governance (e.g., cleaning unused temporary tables), metadata management (e.g., assessing table creation time, detecting anomalies, measuring usage heat), scheduling (e.g., recommending upstream dependencies, validating output tables), and automatic propagation of sensitivity tags from source databases.
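Sensitivity‑tag propagation amounts to a traversal from tagged source tables to everything downstream of them in the lineage graph; a minimal breadth‑first sketch (the edge and tag representation is an assumption for illustration):

```python
from collections import deque

def propagate_tags(edges, tagged_sources):
    """edges: {table: set of downstream tables} built from WRITE edges.
    tagged_sources: tables carrying a sensitivity tag at the source.
    Returns every table that should inherit the tag, sources included."""
    tagged = set(tagged_sources)
    queue = deque(tagged_sources)
    while queue:
        node = queue.popleft()
        for downstream in edges.get(node, ()):
            if downstream not in tagged:
                tagged.add(downstream)
                queue.append(downstream)
    return tagged

edges = {"ods.users": {"dw.user_profile"},
         "dw.user_profile": {"rpt.daily_users"}}
tagged = propagate_tags(edges, {"ods.users"})
```

The visited‑set check makes the traversal safe even if the lineage graph contains cycles introduced by self‑referencing jobs.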

Conclusion

Ctrip’s end‑to‑end data lineage system, evolving from table‑level to field‑level coverage and from batch to near‑real‑time, provides essential visibility for data governance, metadata management, quality assurance, and security compliance in a large‑scale big‑data environment.

Tags: big data, metadata, Kafka, data lineage, Hive, Spark, JanusGraph