
Construction and Application of Tencent Oula Data Lineage Platform

This article presents a comprehensive overview of Tencent Oula's data lineage system, detailing its background, goals, architecture, modular construction, key technologies such as graph databases and SQL parsing, and various internal application scenarios including data governance, cost insight, and baseline monitoring.

DataFunTalk

The article introduces the background and objectives of Tencent Oula's data lineage module, which serves the three sub‑products of the Oula data platform: asset factory, governance engine, and data discovery.

It explains why data lineage is needed, describing current limitations such as insufficient coverage, coarse granularity, and lack of advanced graph mining models, and outlines the goals of expanding both breadth (covering production, processing, and application) and depth (task, table, field, and value lineage).

The architecture section covers technology selection: the team chose an internally built solution based on EasyGraph, Elasticsearch, and Meepo over open‑source options such as Apache Atlas. It then describes the data flow: metadata ingestion from UniMeta, ETL processing, SQL parsing with a custom engine built on Calcite and ANTLR, graph construction with GraphX, and storage in graph and KV stores.
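To make the SQL parsing step concrete, here is a minimal sketch of table‑level lineage extraction. It is not the Oula engine: the article names Calcite and ANTLR, which work on a real syntax tree, while this toy version swaps in two regular expressions and handles only simple INSERT ... SELECT statements. The function name `table_lineage` and the sample table names are assumptions made for illustration.

```python
import re

# Toy patterns (assumption): a production parser would walk an AST instead of using regexes.
_TARGET = re.compile(r"\binsert\s+(?:into|overwrite)\s+(?:table\s+)?([\w.]+)", re.IGNORECASE)
_SOURCE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def table_lineage(sql: str) -> list[tuple[str, str]]:
    """Return (source_table, target_table) edges for one INSERT ... SELECT statement."""
    target_match = _TARGET.search(sql)
    if target_match is None:
        return []  # not a write statement, so no lineage edge is produced
    target = target_match.group(1)
    sources = []
    for name in _SOURCE.findall(sql):
        if name != target and name not in sources:
            sources.append(name)  # keep first-seen order, drop duplicates
    return [(source, target) for source in sources]


# Hypothetical statement used only to exercise the sketch.
example_sql = """
INSERT OVERWRITE TABLE dw.user_daily
SELECT u.id, o.amount
FROM ods.users u
JOIN ods.orders o ON u.id = o.user_id
"""
edges = table_lineage(example_sql)
```

Each returned pair is one edge that a graph‑construction step could load. Subqueries, CTEs, and field‑level lineage are out of reach for this pattern‑based approach, which is the usual reason for parsing to an AST.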

Modular construction is covered, including a unified UID scheme for entities, decoupled node and edge management, atomic relationship modeling, and the use of AST‑level parsing to generate fine‑grained lineage graphs.
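The summary does not give the actual UID format or storage layout, so the following is only a sketch of how a unified UID, separately managed nodes and edges, and atomic relationships might fit together. The hash‑based `make_uid` scheme, the `Edge` and `LineageStore` classes, and the relation name `derives` are all hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


def make_uid(entity_type: str, qualified_name: str) -> str:
    """Derive a stable UID from entity type and qualified name (hypothetical scheme)."""
    digest = hashlib.sha1(f"{entity_type}:{qualified_name}".encode("utf-8")).hexdigest()
    return f"{entity_type}_{digest[:12]}"


@dataclass(frozen=True)
class Edge:
    """One atomic relationship: exactly one source, one target, one relation type."""
    src_uid: str
    dst_uid: str
    relation: str


@dataclass
class LineageStore:
    """Nodes and edges live in separate collections so either can change independently."""
    nodes: dict = field(default_factory=dict)  # uid -> attribute dict
    edges: set = field(default_factory=set)    # set of Edge

    def upsert_node(self, entity_type: str, qualified_name: str, **attrs) -> str:
        uid = make_uid(entity_type, qualified_name)
        base = {"type": entity_type, "name": qualified_name}
        self.nodes.setdefault(uid, base).update(attrs)
        return uid

    def add_edge(self, src_uid: str, dst_uid: str, relation: str) -> None:
        self.edges.add(Edge(src_uid, dst_uid, relation))


store = LineageStore()
src = store.upsert_node("field", "ods.orders.amount")
dst = store.upsert_node("field", "dw.user_daily.total_amount")
store.add_edge(src, dst, "derives")
store.add_edge(src, dst, "derives")  # re-adding the same atomic edge is a no-op
```

Because the UID depends only on the entity type and name, any producer of metadata can compute it without a lookup, and edges can be written before or after the node attributes arrive.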

Application scenarios are enumerated, highlighting data governance (cold table/field cleanup, task optimization), lineage queries, data‑warehouse development (SQL visualization and inefficiency detection), baseline projects for task delay monitoring, and full‑link cost insight that allocates upstream costs to downstream nodes.
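As an illustration of full‑link cost insight, the sketch below pushes each node's cost down the lineage graph. The summary does not state the allocation rule, so the equal split among direct downstream consumers is an assumption, as are the function name `allocate_costs` and the sample tables and costs. Every upstream node is assumed to appear in `own_cost`.

```python
from graphlib import TopologicalSorter


def allocate_costs(own_cost: dict, upstream: dict) -> dict:
    """Return node -> full-link cost.

    own_cost: node -> cost incurred by the node itself.
    upstream: node -> list of its direct upstream nodes.
    Assumed rule: a node's full-link cost is split equally among its direct consumers.
    """
    # Count how many direct downstream consumers each node has.
    fan_out = {node: 0 for node in own_cost}
    for parents in upstream.values():
        for parent in parents:
            fan_out[parent] += 1

    total = {}
    graph = {node: upstream.get(node, []) for node in own_cost}
    # static_order() yields every node after all of its upstream nodes.
    for node in TopologicalSorter(graph).static_order():
        inherited = sum(total[p] / fan_out[p] for p in upstream.get(node, []))
        total[node] = own_cost[node] + inherited
    return total


# Hypothetical costs in arbitrary units.
own_cost = {"ods.orders": 10.0, "dw.order_daily": 4.0, "ads.report_a": 1.0, "ads.report_b": 2.0}
upstream = {
    "dw.order_daily": ["ods.orders"],
    "ads.report_a": ["dw.order_daily"],
    "ads.report_b": ["dw.order_daily"],
}
totals = allocate_costs(own_cost, upstream)
```

Under this rule the costs of the leaf nodes add up to the sum of all own costs, so no spending is lost or double counted when it is attributed to end consumers.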

The Q&A section addresses practical concerns such as the recommended construction order (task → table → field), accuracy evaluation methods, handling of JDBC sources, target user roles (data engineers and analysts), support for Spark DataFrame lineage, underlying data structures and algorithms, and plans for external exposure of the Oula platform.
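The summary mentions accuracy evaluation without naming a method. One common approach, shown here as an assumption and not as the method from the talk, is to compare the edges produced by the parser against a manually verified sample and report precision and recall. The function name `edge_accuracy` and the sample edges are hypothetical.

```python
def edge_accuracy(predicted, labeled):
    """Return (precision, recall) of predicted lineage edges against a labeled sample."""
    predicted, labeled = set(predicted), set(labeled)
    hits = predicted & labeled  # edges the parser found that reviewers confirmed
    precision = len(hits) / len(predicted) if predicted else 0.0
    recall = len(hits) / len(labeled) if labeled else 0.0
    return precision, recall


# Hypothetical edges: the parser emits one spurious edge and misses one real edge.
predicted_edges = [
    ("ods.users", "dw.user_daily"),
    ("ods.orders", "dw.user_daily"),
    ("ods.items", "dw.user_daily"),
    ("tmp.scratch", "dw.user_daily"),
]
labeled_edges = [
    ("ods.users", "dw.user_daily"),
    ("ods.orders", "dw.user_daily"),
    ("ods.items", "dw.user_daily"),
    ("ods.coupons", "dw.user_daily"),
]
precision, recall = edge_accuracy(predicted_edges, labeled_edges)
```

The same comparison works at task, table, or field level as long as both sides identify entities the same way.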

Big Data, graph database, Data Lineage, SQL parsing, Data Governance, Metadata Management, cost analysis
