Evolution of Data Platform Technology: From Data Warehouse to Lakehouse Architecture
The article traces the evolution of data platforms from early data warehouses—using schema‑on‑write, columnar storage, and MPP engines—to data lakes that retain raw data with schema‑on‑read, and finally to lakehouse architectures that merge warehouse and lake capabilities on separated storage and compute, offering unified metadata, versioning, and support for BI, big‑data, AI, and HPC workloads.
This article provides a comprehensive overview of data platform technology evolution, covering the historical development and key technological transformations in data engineering.
1. Data Value and Platform Composition
Data is compared to crude oil: valuable, but requiring refinement before its value can be extracted. A data platform consists of three core components: storage systems (handling long time spans, distributed sources, and centralized storage), compute engines (TensorFlow/PyTorch/PaddlePaddle for deep learning, Hadoop MapReduce/Spark for offline computing, Apache Doris for BI analysis), and interfaces (primarily SQL).
2. Data Warehouse Technology
Data warehouses emerged from Business Intelligence needs, built on OLAP technology. Key characteristics include a distributed architecture, columnar storage, and MPP (Massively Parallel Processing) engines. Data warehouses follow a "Schema-on-Write" model: data must pass through an ETL (Extract, Transform, Load) process before it is stored.
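The Schema-on-Write idea can be sketched in a few lines: the table schema is fixed up front, and raw records must be transformed to match it before they can be loaded. This is a minimal toy illustration using Python's built-in sqlite3 as a stand-in warehouse; the table and field names are invented for the example.

```python
import sqlite3

# Toy ETL pipeline: extract raw events, transform them to match a fixed
# schema, then load. Types are enforced at write time ("Schema-on-Write").
raw_events = [
    {"user": "alice", "amount": "19.99", "ts": "2024-01-05"},
    {"user": "bob",   "amount": "5.00",  "ts": "2024-01-06"},
]

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales (
        user   TEXT NOT NULL,
        amount REAL NOT NULL,   -- column types fixed before any data lands
        ts     TEXT NOT NULL
    )
""")

def transform(event):
    # Transform step: cast raw strings to the types the schema demands.
    return (event["user"], float(event["amount"]), event["ts"])

conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [transform(e) for e in raw_events])

# SQL is the interface layer: BI queries run against the typed, cleaned data.
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(round(total, 2))
```

The key point is the ordering: transformation happens before storage, so malformed records are rejected at load time rather than at query time.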
3. Data Lake Technology
Data lakes preserve raw data in its original format, following a "Schema-on-Read" model. The evolution went through two phases: integrated storage-compute (Hadoop-based) and separated storage-compute (object-storage-based). The separated architecture addresses the limitations of the coupled design: storage and compute could not scale independently, the HDFS NameNode became a bottleneck, and storage costs were high.
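Schema-on-Read inverts the warehouse model: heterogeneous raw records land in the lake untouched, and structure is imposed only when a consumer reads them. A minimal sketch, using an in-memory buffer of JSON lines as a stand-in for lake storage (the record fields are invented for the example):

```python
import io
import json

# Toy data lake: raw records are appended as-is, no schema enforced on write.
lake_file = io.StringIO()
for raw in ['{"user": "alice", "amount": "19.99"}',
            '{"user": "bob", "clicks": 3}']:   # heterogeneous records are fine
    lake_file.write(raw + "\n")

# "Schema-on-Read": projection and type casting happen only at query time.
def read_purchases(f):
    for line in f:
        rec = json.loads(line)
        if "amount" in rec:                    # filter to records that fit the schema
            yield rec["user"], float(rec["amount"])

lake_file.seek(0)
purchases = list(read_purchases(lake_file))
print(purchases)  # → [('alice', 19.99)]
```

Because the schema lives in the reader, different consumers can interpret the same raw files differently, at the cost of pushing data-quality problems downstream.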
4. Lakehouse Architecture
The modern data platform represents a convergence of data warehouse and data lake approaches. Key challenges addressed include data quality management, metadata governance, data versioning (with table formats like Apache Iceberg, Apache Hudi, Delta Lake), and data interoperability. The lakehouse architecture combines object storage with metadata and acceleration layers, supporting diverse compute engines including data warehouse, big data, AI, and HPC workloads.
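The data-versioning idea behind table formats such as Apache Iceberg, Apache Hudi, and Delta Lake can be illustrated with a toy model: every write produces an immutable data file plus a new metadata snapshot listing which files are visible, so older versions remain readable ("time travel"). This is a conceptual sketch only, not any real table-format API; the class and method names are invented.

```python
# Toy snapshot-based table, illustrating the metadata layer that table
# formats (Iceberg / Hudi / Delta Lake) put on top of object storage.
class VersionedTable:
    def __init__(self):
        self.data_files = {}   # file id -> rows (immutable once written)
        self.snapshots = []    # each snapshot lists the file ids visible in it

    def commit(self, rows):
        # A write never mutates existing files; it adds a file and a snapshot.
        file_id = len(self.data_files)
        self.data_files[file_id] = rows
        visible = (self.snapshots[-1] if self.snapshots else []) + [file_id]
        self.snapshots.append(visible)
        return len(self.snapshots) - 1   # snapshot (version) id

    def scan(self, version=-1):
        # Readers resolve a snapshot to its file list, then read those files.
        return [row for fid in self.snapshots[version]
                for row in self.data_files[fid]]

t = VersionedTable()
v0 = t.commit([{"id": 1}])
v1 = t.commit([{"id": 2}])
print(t.scan())    # latest version sees both commits
print(t.scan(v0))  # "time travel" back to the first snapshot
```

Real table formats add much more (ACID commits, schema evolution, partition pruning), but the snapshot-over-immutable-files structure is the core mechanism that gives a lakehouse warehouse-style versioning on cheap object storage.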
Baidu Geek Talk