Big Data 26 min read

Designing an Open ETL System: Baidu Waimai’s Scalable Data Pipeline Practices

In this talk, a Baidu Waimai engineer explains the motivations, requirements, and architectural choices behind their open‑source ETL platform, covering data flow patterns, logical mappings, storage options, scheduling, metadata management, and quality monitoring to achieve scalable, transparent, and explainable data delivery.

ITPUB

Sep 29, 2017

Designing an Open ETL System: Baidu Waimai’s Scalable Data Pipeline Practices

Introduction

A Baidu employee who joined the company in 2014 and later worked on data warehouse and ETL for Baidu Waimai introduces the session, stating that the talk will cover four parts: ETL demand sources and business scenarios, required ETL characteristics and corresponding designs, the data delivery system ensuring availability and transparency, and how data lineage enables data explainability.

ETL Demand Sources and Business Scenarios

ETL serves three major data demand categories: (1) analytical reports such as daily business summaries and custom dashboards; (2) product‑related data for merchant dashboards, sales‑lead generation, KPI assessment, and internal finance or budgeting needs; (3) strategic applications like ranking, recommendation, and profiling, which benefit from well‑structured warehouse data.

Data flows fall into two directions: extracting whole tables or specific fields from business databases (often across sharded tables) and moving data between warehouse tables (e.g., aggregating fact tables with dimension tables).

Logical Mapping and Business Perspective

Logical mappings include simple 1‑to‑1 field copies (e.g., original price) and complex many‑to‑one calculations (e.g., final price = order price – merchant subsidy – platform subsidy – payment subsidy). When supplemental fields are stored as JSON, extra parsing logic is required.

From a business view, data can be classified as facts (immutable attributes like order ID), dimensions (city, delivery type), and “late facts” (attributes that change over time, such as order status), which introduce update and consistency challenges.

ETL Evolution Stages

The Baidu Waimai ETL system has progressed through four stages:

Script + SQL on a single relational database.

Distributed processing with MapReduce/Spark on the Hadoop ecosystem.

Scheduler‑centric platform providing monitoring, retries, and performance analysis.

Fine‑grained ETL with data‑availability monitoring, transparent production progress, and metadata‑driven lineage at field granularity.

Overall Data Production Architecture

The architecture consists of four layers: data sources (MySQL, HDFS/FTP files, Kafka/NMQ, APIs), production layer (ETL system, Spark Streaming for logs), warehouse layer, and data‑application layer. ETL jobs are triggered remotely by a scheduler and are composed of jobs → transformers → operators.

Data Production Flow

Data production starts with fact and dimension tables, creates intermediate (lightweight aggregation) tables, and then builds aggregated tables to avoid redundant queries. Pre‑computed aggregation tables follow the same principle as Kylin, reducing HDFS scans.

Ad‑hoc query logs are analyzed to identify hot tables/fields, which informs the creation of field‑level data permissions and lineage tracking. The lineage information dramatically reduces the cost of data explanation.

ODS Role and Storage Options

ODS acts as an isolation layer between business systems and the warehouse, storing data in a format close to the source to simplify extraction. It also supports low‑latency point queries and back‑fill updates.

Three storage options were evaluated:

Row‑incremental (e.g., Druid) – high write throughput but no updates.

Columnar MPP (e.g., Vertica, Greenplum) – excellent analytical performance but poor DML latency.

Hybrid OLTP + OLAP (e.g., Phoenix on HBase, Kudu) – balances low‑latency writes with analytical queries. Baidu Waimai uses Phoenix for row‑level updates and Impala/Hive for heavy analytics.

Scheduler Architecture

The scheduler models a job as a directed acyclic graph of operators. Each operator is the smallest execution unit and can be an ETL, Sqoop, or WebService operator. The scheduler selects the least‑loaded node, triggers tasks via time or dependency, and supports multiple operator types through configurable parameters.

ETL Framework Design

The framework distinguishes two major scenario categories covering >90% of requirements:

Multi‑field many‑to‑many mappings that can be expressed as INSERT … SELECT statements, possibly across heterogeneous sources (handled by “open‑SQL”).

Field‑level transformations that cannot be expressed in SQL, handled by “open‑index” with custom functions.

Configuration for open‑SQL includes source connection, select SQL, target connection, target table, and optional custom function flags. Open‑index configuration defines source/target tables, index columns, and a handle_flag pointing to a user‑defined function that processes a two‑dimensional array of source rows.

ETL Execution Process

The process flow is:

Initialize parameters and verify target partitions.

Write a start signal to the signal‑lamp service.

Execute any custom code if present.

Run open‑SQL configurations sequentially (or a specific one if specified).

Run open‑index configurations: extract, apply custom functions, merge results, and upsert.

Invoke monitoring services to validate results.

Write a completion signal.

Open‑SQL and open‑index examples are shown in the slides (images retained).

Enhanced Sqoop Workflow

The team wrapped open‑source Sqoop with additional features: automatic partition creation, concurrency control, schema‑diff detection with version‑controlled schema updates, record‑count verification, split‑by and map‑count tuning, field‑type mapping (e.g., MySQL unsigned bigint → Hive string), and optional compression. After data extraction, files are cached in HDFS and metadata is refreshed in Impala.

Data Delivery and Signal‑Lamp Service

Signal‑lamp APIs emit start and end signals for each table. Because the ETL system supports field‑level production, signals can also be emitted at field granularity, enabling downstream consumers to depend only on the fields they actually use.

Data Explainability

Since most ETL logic is configuration‑driven, the system can automatically generate data lineage (table‑level and field‑level dependencies). This lineage, combined with data‑definition metadata, provides clear explanations of how each data point is produced, addressing a common pain point for data engineers and analysts.

Conclusion

The presentation covered the full lifecycle of Baidu Waimai’s open‑ETL system, from demand analysis and architectural design to implementation details, monitoring, lineage, and delivery, demonstrating a practical, scalable approach to large‑scale data engineering.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Data Engineering Big Data data pipeline Metadata Scheduling ETL

Written by

ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.