How Ele.me Built a Scalable Metadata Governance System for Big Data
This article explains how Ele.me tackles big‑data challenges by designing a metadata governance platform that collects SQL execution data, parses lineage with Antlr, stores graph relationships in Neo4j, and enables table/column lineage queries, DAG scheduling, and hot‑data analysis.
Background
Ele.me processes massive volumes of data across heterogeneous storage engines and schedules tasks at minute, hourly, and daily granularity. To ensure data quality, governance, and efficient reuse, the platform requires a unified metadata service that can trace data lineage, enforce access control, and expose upstream/downstream dependencies at the table and column level.
Metadata Model and Value
The metadata service captures both static catalog information (databases, tables, columns, partitions) and dynamic execution context (tasks, job IDs, owners, engine parameters). It also records lifecycle events (creation, modification, deletion) and ETL scheduling details. This unified model enables:
Construction of data graphs and DAGs for task orchestration.
Dictionary look‑ups and resource‑consumption dashboards.
Searchable asset inventories for tables, columns, and partitions.
Open‑Source Alternatives
Two well‑known projects address similar problems:
WhereHows (LinkedIn) – relies on Azkaban for job logs and provides only table‑level lineage.
Apache Atlas – collects metadata via hooks, pushes events to Kafka, stores relations in Titan, and offers REST APIs; column‑level support is limited.
Ele.me Metadata System Architecture
The platform stores raw SQL, basic task metadata, and engine context in a relational database (MySQL). An Extract component iterates over stored SQL statements, parses them into table‑ and column‑level lineage, and combines the result with task and context information into a DataSet . The DataSet is persisted in two downstream stores:
Neo4j (Gremlin‑compatible) for graph queries such as upstream/downstream traversal.
Elasticsearch for full‑text search of tables, columns, and metadata attributes.
SQL Collection and Parsing
SQL is captured primarily at execution time. Ele.me’s scheduling system (EDW, similar to Apache Airflow) records the following for each job: jobId, appId, owner, submitTime, sqlText Engine‑specific listeners (Hive, Spark, Presto) implement hook interfaces to capture runtime context, metadata, and statistics, which are also written to the relational store.
Parsing uses an Antlr‑generated lexer and parser based on Hive’s grammar. The workflow is:
Lexical analysis creates a token stream from the SQL string.
The parser builds an abstract syntax tree (AST).
A visitor traverses the AST to extract source tables, target tables, column expressions, and any temporary objects.
Schema information is consulted to resolve SELECT *, CTAS, and UDF calls, ensuring accurate lineage.
Supported SQL Operations and Lineage Extraction
The parser recognises the following statements, which cover >99 % of production workloads:
SELECT / INSERT / INSERT OVERWRITE
CREATE TABLE
CREATE TABLE AS SELECT (CTAS)
DROP TABLE
CREATE VIEW / ALTER VIEW
For each statement the system produces a triple (input, operation, output) that can be materialised as graph edges.
Lineage Example
Given a statement that inserts the result of A + B into table C, the extracted lineage is:
input: A (table/column) input: B (table/column) operation: A + B output: CColumn‑level lineage is derived similarly; for example, coalesce(name, count(id)) creates edges from name and id to the derived column lineage_name.
Graph Storage
Each input and output becomes a node; the operation becomes a directed edge. Neo4j stores these nodes and edges, enabling Gremlin traversals such as “find all upstream tables of column X”. The community edition runs on a single node; OrientDB is being evaluated as an alternative.
Key Use Cases at Ele.me
Metadata Search : Hive Metastore tables (DBS, TBLS, COLUMNS_V2, PARTITIONS) are indexed in Elasticsearch for fast lookup of table/column attributes.
Dynamic Dependency Graphs : Graph nodes represent tables; edges encode operation type and carry task execution IDs and timestamps, enabling DAG visualisation and re‑orchestration.
Hot‑Data Analysis : Counters on input/output occurrences identify frequently accessed tables and columns for capacity planning.
Task Scheduling : Table‑level lineage feeds DAG builders that schedule downstream jobs and enforce quality gates.
Technical Q&A Highlights
Data lifecycle management : Tables with no access in the last three months are flagged for archival; temporary tables are identified by a temp flag.
Quality monitoring impact : Quality metrics are incorporated into DAG construction; tasks failing quality checks are excluded from downstream execution.
SQL lineage storage : MySQL stores jobId, raw SQL, owner, and timestamps. The DataSet concept aggregates lineage, task, and engine context before persisting to Neo4j/ES.
Supported lineage sources : Only SQL‑based lineage is currently captured; non‑SQL Spark RDD or MLlib outputs are not ingested.
Hotness analysis : Input/output table counters are incremented on each execution; top‑N tables/columns are derived from these counters.
Engine hooks : Hive, Spark, and Presto provide listener interfaces; implementing these interfaces is sufficient—no core engine modification is required.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
