Designing Scalable Open‑Source ETL Systems: Lessons from Baidu Waimai
This talk details Baidu Waimai's end‑to‑end ETL design, covering demand sources, data flow patterns, multi‑stage system evolution, storage choices, scheduling architecture, configuration‑driven processing, quality monitoring, and how data lineage enables transparent, self‑service data delivery.
Introduction
The presenter, a former Baidu Quality Department engineer who later joined Baidu Waimai, shares practical experiences on building and operating an open‑source ETL system for a large e‑commerce and real‑time logistics platform.
ETL Demand Sources and Business Scenarios
Data needs are divided into three categories: (1) analytical reports such as daily city‑level dashboards; (2) product‑related metrics for merchant dashboards, BD lead generation, and KPI evaluation; (3) strategic data for ranking, recommendation, and profiling. These drive the design of extraction, transformation, and loading processes.
Two main data flows exist: full‑table or column‑level extraction from business MySQL databases, and internal warehouse table transformations (e.g., aggregating fact tables with dimension tables).
Logical Mapping and Business Viewpoints
Simple 1‑to‑1 mappings (e.g., raw order price) coexist with complex many‑to‑one calculations (e.g., net price after multiple subsidies). Some fields are stored as compressed JSON, requiring custom parsing logic. Data is classified as fact, dimension, or slowly changing facts, each with distinct update and consistency challenges.
ETL System Evolution Stages
Script + SQL on a single machine.
Distributed batch processing using MapReduce/Spark on Hadoop.
Scheduler‑centric platform offering monitoring, retries, and performance analysis.
Fine‑grained ETL with data‑availability monitoring, transparent production progress, metadata management, and field‑level lineage.
Overall Data Production Architecture
The pipeline consists of four layers: data sources (MySQL, HDFS/FTP files, Kafka/NMQ, APIs), production (ETL system, Spark Streaming for logs), warehouse, and data applications.
ODS Layer and Storage Choices
ODS acts as an isolation layer between source systems and the warehouse, supporting primary‑key queries, OLTP‑style access, and back‑track updates. Three storage options were evaluated: row‑incremental (e.g., Druid), columnar MPP (e.g., Vertica, Greenplum), and hybrid OLTP + OLAP (e.g., Phoenix on HBase, Kudu). The team uses Phoenix for row‑level updates and Kudu for experimental high‑performance writes.
Scheduling, Operators, and Execution Model
Jobs are defined as DAGs of job → transformer → operator. Each operator is the smallest execution unit, launched on the least‑loaded node. Operators include ETL, Sqoop, and WebService types, with parameters supplied via predefined configuration.
ETL Framework Process Design
Two primary configuration patterns cover >90% of use cases:
Simple many‑to‑many field mappings that can be expressed as INSERT … SELECT statements, possibly across heterogeneous sources.
Open‑index configurations for field‑level ETL that require custom functions (e.g., JSON parsing, complex arithmetic). Each open‑index entry defines source connection, target table, column mapping, and an optional custom handler.
The execution flow initializes parameters, checks target partitions, writes a start signal, runs custom code if present, processes open‑SQL and open‑index configurations, merges results, writes a completion signal, and updates monitoring services.
OpenSQL and OpenIndex Examples
Sample OpenSQL config shows source DB connection, target table, and field‑by‑field mapping. OpenIndex example demonstrates a custom function that receives a two‑dimensional array of primary‑key‑indexed rows, parses JSON fields, performs arithmetic, and returns new columns for upsert.
Enhanced Sqoop Workflow
The team wrapped the open‑source Sqoop with additional steps: schema diff detection, partition creation, record count for concurrency control, split‑by and mapper settings, field‑type mapping (e.g., MySQL unsigned bigint → Hive string), optional compression, and post‑load Hive/Impala metadata refresh.
Data Quality Monitoring and Checkpoints
Both upstream and downstream checkpoints verify input expectations and output standards. Custom monitoring items include table‑structure checks, data‑availability checks (SQL‑based comparisons, trend thresholds, null‑rate checks), and alert escalation (email, SMS, phone) based on severity.
Data Delivery, Signal Service, and Fine‑Grained Dependencies
Signal services expose APIs to write and query production signals, enabling downstream consumers to poll for data availability. Field‑level signals allow precise dependency management, reducing unnecessary waiting for unrelated columns.
Data Explainability via Lineage
Because most ETL logic is configuration‑driven, the system can automatically generate table‑level and field‑level lineage graphs, exposing upstream dependencies and simplifying explanations for data consumers.
Conclusion
The presentation wraps up by emphasizing the importance of a transparent, configurable, and monitorable ETL platform that supports both batch and real‑time needs while keeping data lineage and quality at the forefront.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
