From Data Integration to the Modern Data Stack: Concepts, Tools, and Practices
This article explains data integration fundamentals, compares data integration tools such as Stitch, Fivetran, and Airbyte, describes the concepts of data warehouses and data lakes, outlines ETL vs ELT processes, and explores building modern data stacks with Flink CDC and cloud services.
Introduction
The article is organized into four parts: Data Integration, Data Integration Tools, Modern Data Stack, and Modern Data Stack Practice.
1. Data Integration
Data integration combines multiple disparate data sources into a unified logical or physical view, breaking down data silos to support enterprise decision‑making. The practice dates back to 1991, when early systems such as IPUMS extracted, transformed, and loaded census data into a unified schema.
A Data Warehouse, in Bill Inmon's classic definition, is a subject‑oriented, integrated, time‑variant, and non‑volatile collection of data in support of management decisions. Its primary goal is to integrate data from heterogeneous sources.
The Data Lake, a term coined by James Dixon in 2010, is centralized storage for structured, semi‑structured, and unstructured data in its native format, offering lower storage cost and schema‑on‑read access.
ETL (Extract, Transform, Load) is the main workflow for data integration, consisting of data ingestion, cleaning/transforming, and loading into warehouses or lakes.
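The three ETL stages can be sketched in a few lines of plain Python. This is a minimal illustration, not a production pipeline; the source rows, field names, and the in‑memory "warehouse" dict are all hypothetical.

```python
# Minimal ETL sketch (hypothetical data and helpers): extract raw rows,
# clean/transform them, then load the result into a target store.

def extract():
    # Pretend source system: raw user records with inconsistent fields.
    return [
        {"id": "1", "name": " Alice ", "signup": "2023-01-05"},
        {"id": "2", "name": "BOB", "signup": "2023-02-11"},
    ]

def transform(rows):
    # Clean: cast ids to int, normalize names, keep the signup date.
    return [
        {"id": int(r["id"]), "name": r["name"].strip().title(), "signup": r["signup"]}
        for r in rows
    ]

def load(rows, warehouse):
    # Load: append the cleaned rows into the target table.
    warehouse.setdefault("users", []).extend(rows)

warehouse = {}
load(transform(extract()), warehouse)
print(warehouse["users"][0]["name"])  # → Alice
```

In a real stack, `extract` would read from an operational database or API, and `load` would write to a warehouse or lake; only the shape of the flow is the point here.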
2. Data Integration Tools
Gartner’s Magic Quadrant evaluates traditional vendors (e.g., SAP) and modern cloud‑native providers (e.g., Talend, Fivetran). The article highlights three typical tools: Stitch, Fivetran, and Airbyte.
All three focus on data ingestion (ETL/ELT) and support a large number of sources (Stitch >130, Fivetran >150, Airbyte >120).
They support destinations such as major databases, data warehouses, and data lakes.
Custom connectors are possible, and all support CDC incremental replication.
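CDC incremental replication means starting from a full snapshot and then applying a change log of insert/update/delete events, rather than re‑copying the whole source. A minimal sketch, with a hypothetical event format loosely modeled on what CDC tools emit:

```python
# Sketch of CDC-style incremental replication: start from a full snapshot,
# then apply a log of change events to keep the replica in sync.
# The event format ({"op": ..., "key": ..., "row": ...}) is hypothetical.

def apply_changes(replica, events):
    for ev in events:
        op, key = ev["op"], ev["key"]
        if op in ("insert", "update"):
            replica[key] = ev["row"]   # upsert the changed row
        elif op == "delete":
            replica.pop(key, None)     # drop the deleted row
    return replica

snapshot = {1: {"name": "Alice"}, 2: {"name": "Bob"}}
log = [
    {"op": "update", "key": 1, "row": {"name": "Alicia"}},
    {"op": "delete", "key": 2},
    {"op": "insert", "key": 3, "row": {"name": "Carol"}},
]
replica = apply_changes(dict(snapshot), log)
print(sorted(replica))  # → [1, 3]
```

Real CDC implementations read these events from the database's transaction log (binlog, WAL, redo log), which is what lets them capture deletes and updates without querying the source tables.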
Pricing varies: Stitch charges by rows replicated per month, Fivetran by monthly active rows, while Airbyte offers a free open‑source version alongside usage‑based cloud pricing.
Integration with other stack components differs: Airbyte offers broader integration (dbt, Kubernetes, Airflow).
ETL vs ELT: ETL performs transformation before loading, while ELT loads raw data and lets the warehouse or lake handle transformation. Modern ELT emphasizes pushing transformation to the storage layer.
3. Modern Data Stack
The modern data stack builds on the traditional data stack by leveraging cloud‑native tools for better elasticity, scalability, and ease of use. A data stack covers extraction, transformation, and storage; the modern data stack delivers those same layers as cloud‑based managed services.
Key benefits include faster processing, reduced cost (no hardware maintenance), automation via fully managed services, and improved usability.
Examples: Fivetran for ingestion, dbt for transformation, Snowflake/Redshift for storage, and BI tools for analytics.
Airbyte’s open‑source stack separates ingestion, storage, transformation, metadata management, and analysis, often using ClickHouse for storage and Flink or Presto for transformation.
4. Modern Data Stack Practice
Various company implementations are shown, illustrating both traditional and modern stacks centered on Flink CDC.
Flink CDC supports a wide range of source types and offers powerful stream and table APIs for transformation. Its advantages over SaaS tools include a unified full‑snapshot‑plus‑incremental framework, exactly‑once semantics, and sub‑second latency.
Typical modern stack architecture: data captured by Flink CDC (optionally via Kafka) → real‑time warehouse (e.g., Hologres) with layered ODS/DWD/DWS tables → downstream analytics for dashboards, reports, or applications.
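The ODS/DWD/DWS layering described above can be sketched with plain Python in place of warehouse tables. The table names, fields, and filter rule are hypothetical; the point is the flow from raw captured rows to cleaned detail to aggregated summaries:

```python
# Sketch of the ODS → DWD → DWS layering (hypothetical tables):
# ODS holds raw captured rows, DWD holds cleaned detail rows,
# DWS holds aggregated summaries consumed by dashboards and reports.

ods_orders = [
    {"order_id": "1", "amount": "20", "status": "paid"},
    {"order_id": "2", "amount": "15", "status": "cancelled"},
    {"order_id": "3", "amount": "5",  "status": "paid"},
]

# DWD: clean and filter the operational-layer data.
dwd_orders = [
    {"order_id": int(r["order_id"]), "amount": float(r["amount"])}
    for r in ods_orders
    if r["status"] == "paid"
]

# DWS: aggregate the detail layer for reporting.
dws_summary = {
    "paid_orders": len(dwd_orders),
    "revenue": sum(r["amount"] for r in dwd_orders),
}
print(dws_summary)  # → {'paid_orders': 2, 'revenue': 25.0}
```

In the architecture described in the text, each layer would be a table in a real‑time warehouse such as Hologres, with Flink jobs (rather than list comprehensions) maintaining the DWD and DWS layers continuously.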
Alibaba Cloud provides an integrated Flink CDC solution with enterprise‑grade features such as full‑incremental sync, schema change handling, and seamless integration with Hologres and Hudi.
Q&A Highlights
Transformation inside a warehouse can increase cost; materialized views can mitigate storage overhead.
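The trade‑off behind that answer is that a materialized view spends storage on a precomputed result so queries avoid rescanning the base table, and can be maintained incrementally. A toy sketch of the idea (class and method names are hypothetical, not any warehouse's API):

```python
# Sketch of materialized-view maintenance (hypothetical names):
# compute an aggregate once, then refresh it incrementally as new
# rows arrive instead of rescanning all the base data.

class MaterializedSum:
    def __init__(self, rows):
        self.total = sum(rows)       # initial full computation

    def refresh(self, new_rows):
        self.total += sum(new_rows)  # incremental maintenance: O(new rows)

mv = MaterializedSum([1, 2, 3])
mv.refresh([4])
print(mv.total)  # → 10
```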
Flink CDC supports Oracle; other tools like Stitch, Fivetran, and Airbyte also support Oracle.
ELT means loading raw data into the lake/warehouse and performing transformation there, with Flink acting only as a compute engine.
Flink CDC’s strengths: full‑incremental sync, exactly‑once, sub‑second latency, and flexible connector ecosystem.
Community version of Flink CDC provides DDL change events; cloud version integrates schema sync with Hologres/Hudi.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.