Big Data 12 min read

How Ele.me Built a Scalable Metadata Governance System for Big Data

This article explains how Ele.me tackles big‑data challenges by designing a metadata governance platform that collects SQL execution data, parses lineage with Antlr, stores graph relationships in Neo4j, and enables table/column lineage queries, DAG scheduling, and hot‑data analysis.

dbaplus Community

Jul 25, 2018

How Ele.me Built a Scalable Metadata Governance System for Big Data

Background

Ele.me processes massive volumes of data across heterogeneous storage engines and schedules tasks at minute, hourly, and daily granularity. To ensure data quality, governance, and efficient reuse, the platform requires a unified metadata service that can trace data lineage, enforce access control, and expose upstream/downstream dependencies at the table and column level.

Metadata Model and Value

The metadata service captures both static catalog information (databases, tables, columns, partitions) and dynamic execution context (tasks, job IDs, owners, engine parameters). It also records lifecycle events (creation, modification, deletion) and ETL scheduling details. This unified model enables:

Construction of data graphs and DAGs for task orchestration.

Dictionary look‑ups and resource‑consumption dashboards.

Searchable asset inventories for tables, columns, and partitions.

Open‑Source Alternatives

Two well‑known projects address similar problems:

WhereHows (LinkedIn) – relies on Azkaban for job logs and provides only table‑level lineage.

Apache Atlas – collects metadata via hooks, pushes events to Kafka, stores relations in Titan, and offers REST APIs; column‑level support is limited.

Ele.me Metadata System Architecture

The platform stores raw SQL, basic task metadata, and engine context in a relational database (MySQL). An Extract component iterates over stored SQL statements, parses them into table‑ and column‑level lineage, and combines the result with task and context information into a DataSet . The DataSet is persisted in two downstream stores:

Neo4j (Gremlin‑compatible) for graph queries such as upstream/downstream traversal.

Elasticsearch for full‑text search of tables, columns, and metadata attributes.

SQL Collection and Parsing

SQL is captured primarily at execution time. Ele.me’s scheduling system (EDW, similar to Apache Airflow) records the following for each job: jobId, appId, owner, submitTime, sqlText Engine‑specific listeners (Hive, Spark, Presto) implement hook interfaces to capture runtime context, metadata, and statistics, which are also written to the relational store.

Parsing uses an Antlr‑generated lexer and parser based on Hive’s grammar. The workflow is:

Lexical analysis creates a token stream from the SQL string.

The parser builds an abstract syntax tree (AST).

A visitor traverses the AST to extract source tables, target tables, column expressions, and any temporary objects.

Schema information is consulted to resolve SELECT *, CTAS, and UDF calls, ensuring accurate lineage.

Supported SQL Operations and Lineage Extraction

The parser recognises the following statements, which cover >99 % of production workloads:

SELECT / INSERT / INSERT OVERWRITE

CREATE TABLE

CREATE TABLE AS SELECT (CTAS)

DROP TABLE

CREATE VIEW / ALTER VIEW

For each statement the system produces a triple (input, operation, output) that can be materialised as graph edges.

Lineage Example

Given a statement that inserts the result of A + B into table C, the extracted lineage is:

input:  A (table/column)

input:  B (table/column)

operation:  A + B

output:  C

Column‑level lineage is derived similarly; for example, coalesce(name, count(id)) creates edges from name and id to the derived column lineage_name.

Graph Storage

Each input and output becomes a node; the operation becomes a directed edge. Neo4j stores these nodes and edges, enabling Gremlin traversals such as “find all upstream tables of column X”. The community edition runs on a single node; OrientDB is being evaluated as an alternative.

Key Use Cases at Ele.me

Metadata Search : Hive Metastore tables (DBS, TBLS, COLUMNS_V2, PARTITIONS) are indexed in Elasticsearch for fast lookup of table/column attributes.

Dynamic Dependency Graphs : Graph nodes represent tables; edges encode operation type and carry task execution IDs and timestamps, enabling DAG visualisation and re‑orchestration.

Hot‑Data Analysis : Counters on input/output occurrences identify frequently accessed tables and columns for capacity planning.

Task Scheduling : Table‑level lineage feeds DAG builders that schedule downstream jobs and enforce quality gates.

Technical Q&A Highlights

Data lifecycle management : Tables with no access in the last three months are flagged for archival; temporary tables are identified by a temp flag.

Quality monitoring impact : Quality metrics are incorporated into DAG construction; tasks failing quality checks are excluded from downstream execution.

SQL lineage storage : MySQL stores jobId, raw SQL, owner, and timestamps. The DataSet concept aggregates lineage, task, and engine context before persisting to Neo4j/ES.

Supported lineage sources : Only SQL‑based lineage is currently captured; non‑SQL Spark RDD or MLlib outputs are not ingested.

Hotness analysis : Input/output table counters are incremented on each execution; top‑N tables/columns are derived from these counters.

Engine hooks : Hive, Spark, and Presto provide listener interfaces; implementing these interfaces is sufficient—no core engine modification is required.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

graph database data lineage SQL parsing Ele.me metadata governance

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.