How Alibaba’s DChain Data Converger Auto‑Generates Real‑Time Wide Tables with SQL Pipelines
This article explains how the ADC (Alibaba DChain Data Converger) project automatically creates real‑time wide tables: users configure metrics on the front‑end, and the system generates and publishes SQL through a pipeline that leverages design patterns, priority queues, and tree‑based data structures for efficient cross‑database processing.
Overview
The ADC (Alibaba DChain Data Converger) project provides a tool that automatically builds a wide, real‑time table after users configure the desired metrics on a front‑end UI. The system ingests data from heterogeneous source databases, transforms it into a unified internal format, and generates the necessary SQL and table definitions for downstream real‑time query engines.
System Architecture
Data originates from a metadata centre, passes through a metadata‑adaptation layer that normalises it, and is handed to a scheduling centre. The scheduler analyses the data‑access requirements, builds a logical model and invokes a SQL generator to create resource definitions and executable SQL statements. Supporting components include an alarm centre (for error and latency alerts), a reconciliation centre (for data‑alignment checks), a resource‑management centre (for lifecycle handling), and basic adapters that integrate Alibaba Cloud services such as MaxCompute, Flink, and AnalyticDB.
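The sketch below shows one way the component boundaries described above could be wired together; all class, record, and method names here are hypothetical illustrations, not the actual ADC interfaces.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical component boundaries; names are illustrative, not from the ADC codebase.
record SourceDescriptor(String name, String type) {}
record Meta(String table, List<String> columns) {}
record GeneratedArtifacts(List<String> ddl, List<String> dml) {}

interface MetadataAdapter {
    // Normalises a heterogeneous source description into the unified internal format.
    Meta adapt(SourceDescriptor source);
}

interface SqlGenerator {
    // Produces resource definitions and executable SQL from the normalised metadata.
    GeneratedArtifacts generate(List<Meta> model);
}

class SchedulingCentre {
    private final MetadataAdapter adapter;
    private final SqlGenerator generator;

    SchedulingCentre(MetadataAdapter adapter, SqlGenerator generator) {
        this.adapter = adapter;
        this.generator = generator;
    }

    // Analyse the data-access requirements, build the logical model, and delegate SQL generation.
    GeneratedArtifacts schedule(List<SourceDescriptor> sources) {
        List<Meta> model = sources.stream().map(adapter::adapt).collect(Collectors.toList());
        return generator.generate(model);
    }
}
```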
Technical Implementation
Requirement Analysis
Support multiple fact (stream) tables and multiple dimension tables; one fact table is designated as the primary table.
Any change in a dimension table must trigger an update of the final wide table.
The relationship between the primary fact table and each auxiliary table is either 1:1 or N:1, with the fact table at the finest granularity.
When several stream tables share the same primary key, they are combined with a full join; tables with different keys are combined with a left join (see the sketch after this list).
Only join operations and user‑defined functions (UDFs) are required; no GROUP BY clauses are used.
Low synchronization latency is required, while query QPS is not a constraint because the target storage engines are built for high‑throughput reads.
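A minimal sketch of the join‑type rule from the requirements above; `JoinType`, `StreamTable`, and the table names are illustrative, not ADC identifiers.

```java
import java.util.List;

enum JoinType { FULL_JOIN, LEFT_JOIN }

record StreamTable(String name, List<String> primaryKey) {}

class JoinPlanner {
    // Stream tables sharing the primary table's key are merged with a full join;
    // tables keyed differently are attached with a left join.
    static JoinType pick(StreamTable primary, StreamTable other) {
        return primary.primaryKey().equals(other.primaryKey())
                ? JoinType.FULL_JOIN
                : JoinType.LEFT_JOIN;
    }

    public static void main(String[] args) {
        StreamTable order = new StreamTable("order", List.of("order_id"));
        StreamTable payment = new StreamTable("payment", List.of("order_id"));
        StreamTable logistics = new StreamTable("logistics", List.of("lg_order_code"));
        System.out.println(pick(order, payment));    // FULL_JOIN
        System.out.println(pick(order, logistics));  // LEFT_JOIN
    }
}
```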
System Flow
Synchronous generation of SQL statements and table definitions.
Asynchronous publishing of the generated SQL and table metadata to the execution engine.
The synchronous stage performs fast validation and fails early if the input metadata is inconsistent. The asynchronous publishing stage records execution state, supports rollback and retry, and may wait for metric collection before completing.
Check Stage
Validate required parameters (e.g., presence of at least one fact table).
Validate source‑type compatibility.
Validate that a partition field for the wide table is provided.
Validate join constraints (key compatibility, cardinality).
Check primary‑key uniqueness for the fact table.
Verify that necessary metadata (e.g., HBase configuration) exists.
Correct primary‑key definitions for dimension tables to ensure they are truly unique.
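A condensed sketch of how the fail‑fast checks above might be chained in the synchronous stage; the request fields and error messages are assumptions for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative fail-fast validation for the check stage; field names are hypothetical.
record WideTableRequest(List<String> factTables, List<String> dimensionTables,
                        String partitionField, boolean hbaseConfigPresent) {}

class CheckStage {
    static void validate(WideTableRequest req) {
        List<String> errors = new ArrayList<>();
        if (req.factTables() == null || req.factTables().isEmpty()) {
            errors.add("at least one fact table is required");
        }
        if (req.partitionField() == null || req.partitionField().isBlank()) {
            errors.add("a partition field for the wide table must be provided");
        }
        if (!req.hbaseConfigPresent()) {
            errors.add("HBase configuration metadata is missing");
        }
        // Fail early in the synchronous stage so nothing is published downstream.
        if (!errors.isEmpty()) {
            throw new IllegalArgumentException(String.join("; ", errors));
        }
    }
}
```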
Data Synchronization
Generate a priority queue that orders tasks such as leaf‑node sync, join creation, and publishing.
Synchronously load source tables into HBase, performing type conversion (e.g., MySQL → HBase) and ETL transformations.
Remove duplicate columns that appear in multiple sources.
Mark columns that should not appear in the final wide table (blank‑column flag).
Populate ordering fields using either source timestamps or the system ingestion time.
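A sketch of the per‑row transformation performed while syncing a source table into HBase, covering the duplicate‑column removal and ordering‑field population described above; the column names and the `ORDER_FIELD` constant are illustrative assumptions.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class SyncTransform {
    static final String ORDER_FIELD = "_adc_order_ts"; // hypothetical ordering column

    // Copies source columns, drops duplicates already provided by another source,
    // and populates the ordering field from the source timestamp when available.
    static Map<String, Object> toHbaseRow(Map<String, Object> sourceRow,
                                          Set<String> duplicateColumns,
                                          Long sourceTimestampMillis) {
        Map<String, Object> row = new LinkedHashMap<>();
        sourceRow.forEach((col, value) -> {
            if (!duplicateColumns.contains(col)) {
                row.put(col, value);
            }
        });
        long orderTs = sourceTimestampMillis != null
                ? sourceTimestampMillis              // prefer the source event time
                : Instant.now().toEpochMilli();      // fall back to ingestion time
        row.put(ORDER_FIELD, orderTs);
        return row;
    }
}
```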
Compute Stage
Create intermediate tables by executing full joins of leaf nodes.
Upgrade join relationships to left joins where required.
Populate reverse‑index tables that record the mapping between fact‑table primary keys and dimension‑table join keys, so that dimension updates can locate the affected fact rows.
Insert message‑queue entries to trigger downstream processing when intermediate tables are updated.
Fill the final wide table with aligned join keys, ETL metadata, and partition fields.
Generate Flink‑compatible DDL/DML SQL for source tables, intermediate tables, and the result wide table (a simplified DDL sketch follows this list).
Persist the generated SQL and table definitions in a metadata store for reuse.
Asynchronous publishing pushes the generated SQL to Flink for execution.
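The following is a simplified sketch of how a Flink‑style `CREATE TABLE` statement for the wide table might be assembled from collected column metadata; the class, table, and column names are illustrative, and a real Flink DDL would also need a connector `WITH` clause.

```java
import java.util.List;
import java.util.stream.Collectors;

record Column(String name, String type) {}

class WideTableDdlGenerator {
    static String createTableSql(String tableName, List<Column> columns, String partitionField) {
        String columnDefs = columns.stream()
                .map(c -> "  `" + c.name() + "` " + c.type())
                .collect(Collectors.joining(",\n"));
        // PARTITIONED BY mirrors the partition field validated in the check stage.
        return "CREATE TABLE " + tableName + " (\n"
                + columnDefs + "\n"
                + ") PARTITIONED BY (`" + partitionField + "`)";
    }

    public static void main(String[] args) {
        List<Column> cols = List.of(
                new Column("order_id", "BIGINT"),
                new Column("buyer_id", "BIGINT"),
                new Column("warehouse_name", "VARCHAR"),
                new Column("ds", "VARCHAR"));
        System.out.println(createTableSql("dwd_order_wide", cols, "ds"));
    }
}
```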
Reverse Index Reasoning
When a dimension table (B) lags behind or leads the fact table (A), a reverse‑index table stores the mapping between A’s primary key and B’s foreign key. Updates to B can then be propagated to the wide table by looking up the corresponding A keys in the reverse‑index and re‑emitting the changes through Flink.
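A minimal in‑memory sketch of the reverse‑index idea: given a key of dimension table B, find the fact‑table (A) primary keys whose wide‑table rows must be re‑emitted. In ADC this mapping lives in a reverse‑index table rather than in memory; the class and method names are assumptions.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class ReverseIndex {
    private final Map<String, Set<String>> dimKeyToFactKeys = new HashMap<>();

    // Called while processing fact rows: record that fact key `factPk` references `dimKey`.
    void record(String dimKey, String factPk) {
        dimKeyToFactKeys.computeIfAbsent(dimKey, k -> new HashSet<>()).add(factPk);
    }

    // Called when a dimension row changes: look up every affected fact key so the
    // corresponding wide-table rows can be refreshed through Flink.
    Set<String> affectedFactKeys(String dimKey) {
        return dimKeyToFactKeys.getOrDefault(dimKey, Set.of());
    }
}
```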
Design Patterns
The system adopts a pipeline (PipeLine) design pattern. A global PipeLineContainer manages multiple pipelines, each consisting of reusable valve components. A pipeline represents a logical stage (e.g., synchronous SQL generation, asynchronous publishing) and tracks its execution state, allowing rollback or retry. Valves can be combined to implement new processing steps without modifying existing code.
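A compact sketch of the pattern described above, assuming a container that holds pipelines, a pipeline that is an ordered list of reusable valves, and per‑stage failure handling; the interface names are illustrative rather than the actual ADC classes.

```java
import java.util.ArrayList;
import java.util.List;

interface Valve {
    void invoke(PipelineContext context) throws Exception;
}

class PipelineContext {
    // Shared state passed between valves (metadata, generated SQL, execution status, ...).
}

class Pipeline {
    private final String name;
    private final List<Valve> valves = new ArrayList<>();

    Pipeline(String name) { this.name = name; }

    Pipeline addValve(Valve valve) { valves.add(valve); return this; }

    // Runs each valve in order; a failure stops the pipeline so it can be rolled back or retried.
    boolean run(PipelineContext context) {
        for (Valve valve : valves) {
            try {
                valve.invoke(context);
            } catch (Exception e) {
                System.err.println("pipeline " + name + " failed at valve " + valve + ": " + e);
                return false;
            }
        }
        return true;
    }
}

class PipelineContainer {
    private final List<Pipeline> pipelines = new ArrayList<>();

    void register(Pipeline pipeline) { pipelines.add(pipeline); }

    // New processing steps are added by composing valves into pipelines, not by editing existing code.
    boolean runAll(PipelineContext context) {
        return pipelines.stream().allMatch(p -> p.run(context));
    }
}
```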
Data Structures and Algorithms
The core problem is to represent the relationships among tables (Meta nodes) as either full joins or left joins and to execute them in a deterministic order. The solution builds a priority queue of tasks, then constructs a tree where leaf nodes correspond to synchronized data sources, internal nodes represent join operations, and the root node is the final wide table.
Algorithm Steps
Define four priority levels:
Priority 1 – Synchronize the leaf nodes (the source tables).
Priority 2 – Execute full‑join tasks that combine leaf nodes into intermediate tables.
Priority 3 – Execute left‑join tasks that merge intermediate tables into the final root.
Priority 4 – Publish the root (wide table) to the execution engine.
Consume all Priority 1 tasks, loading each source table into HBase.
Consume Priority 2 tasks, creating intermediate tables via full joins.
Consume Priority 3 tasks, performing left joins to produce the root wide table.
Consume the single Priority 4 task, persisting the wide‑table metadata and publishing the SQL to Flink.
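A minimal sketch of the four‑level priority queue that drives this execution order; the task descriptions and table names are illustrative only.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

record Task(int priority, String description) {}

class TaskScheduler {
    public static void main(String[] args) {
        PriorityQueue<Task> queue = new PriorityQueue<>(Comparator.comparingInt(Task::priority));
        queue.add(new Task(4, "publish wide-table SQL to Flink"));
        queue.add(new Task(2, "full join order + payment into intermediate table"));
        queue.add(new Task(1, "sync source table `order` into HBase"));
        queue.add(new Task(1, "sync source table `logistics` into HBase"));
        queue.add(new Task(3, "left join intermediate table with logistics into root"));

        // Tasks are consumed strictly by priority, so every leaf is synchronized before
        // any join runs, and the publish step always executes last.
        while (!queue.isEmpty()) {
            System.out.println(queue.poll());
        }
    }
}
```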
Summary
The ADC SQL generator transforms heterogeneous source tables into a single wide, real‑time table by modelling table relationships as a priority‑queue‑driven tree. The pipeline architecture, reverse‑index mechanism, and explicit validation stages ensure low‑latency, fault‑tolerant execution while keeping the implementation extensible for future data‑source or compute‑engine integrations.