Industry Insights 18 min read

How to Build Accurate Data Asset Lineage for Data Warehouse Governance

This article explains the challenges of data asset lineage in large data warehouses, presents a comprehensive approach using business‑level instrumentation, SQL interceptor plugins, and ETL script parsing to generate fine‑grained lineage graphs, and demonstrates measurable improvements in coverage and zombie‑table cleanup.

HomeTech

Nov 15, 2023

How to Build Accurate Data Asset Lineage for Data Warehouse Governance

1. Background

As the automotive‑home OEM business expands, the volume of data assets and the complexity of the data warehouse increase, leading to non‑standard data models, difficult traceability, and analysis problems caused by insufficient lineage analysis. Strengthening lineage relationships is essential for effective data‑warehouse modeling, governance, and maximizing data value.

2. Problems Encountered

Inconsistent naming of data assets due to lack of enforced naming rules.

Missing dependencies in ETL task scheduling, resulting in incomplete or inaccurate lineage.

Multiple ETL script types (shell, Python, PySpark, HQL) increase parsing difficulty.

HQL files lack unified coding standards, with multiple INSERT statements and missing database names.

Unclear data asset usage at the application layer, making it hard to link assets to business scenarios.

3. Analysis Approach

3.1 Business Instrumentation

Business instrumentation embeds AOP‑based code to capture table usage and call counts, solving the unclear application‑layer asset usage problem.

a. Overall Flow

![Flow Diagram](https://mmbiz.qpic.cn/mmbiz_png/UPBmkBoiaVwtBQ0NjaCx0uNG41qv1QRj8uVjcibBdymPFt0CMGSbsu7Qwdiaxh6iaJqhKRSMb3uX957rib1Cvib0LmYQ/640)

The backend service injects a MyBatis interceptor via AOP, extracting the interceptor into a separate module to reduce coupling. The interceptor centralizes SQL handling, allowing easy addition or removal of plugins without affecting business modules.

b. Usage

<dependency>
    <groupId>com.autohome.index</groupId>
    <artifactId>common-sqlplugin</artifactId>
    <version>1.1-SNAPSHOT</version>
</dependency>

c. SQL Interceptor Implementation

@Signature(type = Executor.class, method = "query", args = {MappedStatement.class, Object.class, RowBounds.class, ResultHandler.class, CacheKey.class, BoundSql.class})
public class SqlPluginInterceptor implements Interceptor {
    @Override
    public Object intercept(Invocation invocation) throws Throwable {
        MappedStatement mappedStatement = (MappedStatement) invocation.getArgs()[0];
        Object parameter = null;
        if (invocation.getArgs().length > 1) {
            parameter = invocation.getArgs()[1];
        }
        BoundSql boundSql = mappedStatement.getBoundSql(parameter);
        long startTime = System.currentTimeMillis();
        try {
            return invocation.proceed();
        } finally {
            try {
                // Store SQL and table info into Redis
                saveSqlToRedis(retSQL, sqlCost, mappedStatement.getId());
            } catch (Exception e) {
                System.out.println("sql interceptor error!");
                e.printStackTrace();
            }
        }
    }
}

3.2 ETL Task Parsing

The ETL parsing pipeline handles both regular HQL scripts and unconventional scripts (shell, Python, PySpark). Regular scripts are parsed into lineage and stored in MySQL; failures are recorded as unconventional tasks, triggering alert emails and manual handling (less than 5% of total).

a. Task Parsing Flow

![Task Flow](https://mmbiz.qpic.cn/mmbiz_png/UPBmkBoiaVwtBQ0NjaCx0uNG41qv1QRj8ibcUiaXcTlcOgsZpAtsW0AojltbnLkE50iaxbeotEjw6gMic97jaGZriaoQ/640)

Regular ETL scripts are parsed via an SQL parser to extract lineage and write to MySQL. Unconventional scripts are aggregated, emailed for review, and processed manually.

try:
    file_content = hql_parse_util.parse_hql(hqlfile)
    table_dict = hql_parse_util.parse_output(file_content.lower())
    for out_put_table in table_dict:
        input_tables = table_dict[out_put_table]
        if out_put_table == 'tbnames':
            out_put_table = '.'
        for input_table in input_tables:
            if input_table.startswith('--'):
                continue
            sql_val.append([hqlfile, out_put_table.split('.',1)[0], out_put_table.split('.',1)[1],
                            input_table.split('.',1)[0], input_table.split('.',1)[1], uid, name, 'hive', job_status, date_time])
    cnt = cursor.executemany(insert_sql, sql_val)
    print('Inserted regular tasks: ' + str(cnt) + ' rows')
    conn.commit()
finally:
    cursor.close()

3.3 SQL Statement Analysis

SQL parsing distinguishes source tables (including WITH blocks) and target tables (INSERT statements). The parser extracts table names, resolves database names via USE statements, and maps WITH aliases to actual tables.

With‑block Parsing

def sql_parse(sql):
    sql = sql.lower()
    with_tb_dict = {}
    sql = re.sub('with[-_a-z]', '', sql)
    if 'with' in sql:
        with_all_arr = sql.split('with')[1:]
        for with_content in with_all_arr:
            # Extract alias and inner SELECT tables
            ...
    print(with_tb_dict)

Table Name Extraction

def sql_parse_get_tablenames(sql_query):
    table_names = re.findall(r'from\s+(\S+)', sql_query, re.IGNORECASE)
    table_names += re.findall(r'join\s+(\S+)', sql_query, re.IGNORECASE)
    tb_names = []
    for name in table_names:
        if '(' not in name:
            tb_names.append(name.replace(')', '').replace('`', '').replace(';', ''))
    return list(set(tb_names))

USE‑statement Handling

if 'use ' in sql:
    use_arrs = sql.split('use ')[1:]
    for arr in use_arrs:
        db_name = arr.split(';')[0]
        if 'insert' in arr:
            insert_arr = arr.split('insert ')[1:]
            for tb in insert_arr:
                if 'into' in tb:
                    insert_tb = tb[tb.index('into ') + len('into '):tb.index('
')].strip()
                else:
                    insert_tb = tb[tb.index('table ') + len('table '):tb.index('
')].strip()
                hive_target_tb = insert_tb.split(' ')[0]
                if '.' not in hive_target_tb:
                    hive_target_tb = db_name + '.' + hive_target_tb
                # further mapping logic ...

4. Analysis and Governance Results

4.1 Lineage Coverage Metric

A quantitative metric, "lineage coverage", measures the proportion of assets with at least one lineage link. Calculation steps:

Define the set of assets to monitor.

Identify lineage relationships via SQL parsing.

Compute coverage = (number of covered assets) / (total assets).

Compare coverage with other indicators to spot uncovered assets and missing links.

![Coverage Chart](https://mmbiz.qpic.cn/mmbiz_png/UPBmkBoiaVwtBQ0NjaCx0uNG41qv1QRj8Alm0GJDPjfTCR7fIFeicmbMK2hgDE9Uo9N6doJOQ7PITfrLMQt7EbFg/640)

4.2 Zombie‑Table Governance

By counting table usage frequency from business instrumentation and comparing with the full catalog, tables unused for long periods ("zombie tables") are identified. The cleanup process backs up data, drops the table, waits a month for any issues, then permanently deletes from backup. Approximately ten zombie tables are removed weekly.

![Zombie Table Chart](https://mmbiz.qpic.cn/mmbiz_png/UPBmkBoiaVwtBQ0NjaCx0uNG41qv1QRj8Alm0GJDPjfTCR7fIFeicmbMK2hgDE9Uo9N6doJOQ7PITfrLMQt7EbFg/640)

5. Summary

Data asset lineage analysis helps organizations understand data flow, assess risks, and improve governance. The process includes defining goals, collecting assets, parsing relationships, analyzing lineage, assessing risk, generating reports, and providing optimization recommendations.

6. Future Plans

Upcoming work will focus on data‑quality assurance, multi‑source integration, and building visual lineage graphs. Specific actions include establishing stricter data‑quality policies, integrating internal and external data sources, continuously optimizing the lineage model, visualizing relationships, and defining standards for reproducible lineage analysis.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Data quality Data Warehouse Data Lineage SQL interceptor Data Governance ETL parsing

Written by

HomeTech

HomeTech tech sharing

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.