
Evolution of Data Platform Technology: From Data Warehouse to Lakehouse Architecture

The article traces the evolution of data platforms from early data warehouses (schema-on-write, columnar storage, MPP engines), through data lakes that retain raw data under a schema-on-read model, to lakehouse architectures that unify warehouse and lake capabilities on shared storage, offering unified metadata, data versioning, and support for BI, big-data, AI, and HPC workloads.

Baidu Geek Talk

This article provides a comprehensive overview of data platform technology evolution, covering the historical development and key technological transformations in data engineering.

1. Data Value and Platform Composition

Data is compared to crude oil: valuable, but requiring refinement before its value can be extracted. A data platform consists of three core components: storage systems (centralizing data that spans long time ranges and distributed sources), compute engines (TensorFlow/PyTorch/PaddlePaddle for deep learning, Hadoop MapReduce and Spark for offline computing, Apache Doris for BI analysis), and interfaces (primarily SQL).
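The offline compute model named above (MapReduce) can be shown in miniature. This is a single-process sketch, not a distributed implementation: map emits (key, value) pairs, a shuffle groups values by key, and reduce aggregates each group. All function names here are illustrative.

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    # Map: emit a (word, 1) pair for every word in the document.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework would do
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values into a final result.
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data platform", "data lake data warehouse"]
counts = reduce_phase(shuffle(chain.from_iterable(map(map_phase, docs))))
print(counts["data"])  # 3
```

Real engines run the same three phases, but distribute map and reduce tasks across machines and spill the shuffle to disk or network.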

2. Data Warehouse Technology

Data warehouses emerged from Business Intelligence (BI) needs and are built on OLAP technology. Key characteristics include distributed architecture, columnar storage, and MPP (Massively Parallel Processing) engines. Data warehouses follow a "Schema-on-Write" model: data must pass through an ETL (Extract, Transform, Load) process before it is stored.
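A minimal sketch of the Schema-on-Write flow, using sqlite3 as a stand-in for a warehouse: the table schema is declared before any data arrives, and raw rows are validated and cast (the "T" in ETL) before loading; nonconforming rows never reach storage. The row data and the `transform` helper are invented for illustration.

```python
import sqlite3

# Schema is declared up front; only conforming rows may be loaded.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT NOT NULL, amount REAL NOT NULL)")

raw_rows = [
    {"region": "north", "amount": "1200.50"},  # amount arrives as a string
    {"region": "south", "amount": "980"},
    {"region": None, "amount": "55.0"},        # fails validation, dropped
]

def transform(row):
    # Validate and cast a raw row to the declared schema, or reject it.
    if row["region"] is None:
        return None
    return (row["region"], float(row["amount"]))

cleaned = [t for r in raw_rows if (t := transform(r)) is not None]
conn.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)

total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 2180.5
```

The cost of this model is visible even in the toy: the rejected row is simply gone, which is exactly the raw-data loss that motivates the data lake's opposite choice below.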

3. Data Lake Technology

Data lakes preserve raw data in its original formats, following a "Schema-on-Read" model. The evolution went through two phases: integrated storage-compute (Hadoop-based) and separated storage-compute (object-storage-based). The separated architecture addresses challenges such as independent scaling of storage and compute, HDFS NameNode bottlenecks, and storage costs.
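The Schema-on-Read model can be sketched as the mirror image of the warehouse flow: raw, heterogeneous events land untouched, and a schema is imposed only at query time. The event fields and the `read_with_schema` helper are hypothetical, chosen to show missing and mistyped fields being coerced on read.

```python
import io
import json

# Raw events land in the lake as-is, in heterogeneous shapes.
raw_file = io.StringIO(
    '{"user": "a", "clicks": 3}\n'
    '{"user": "b", "clicks": "7", "referrer": "ads"}\n'
    '{"user": "c"}\n'
)

def read_with_schema(line):
    # Schema applied at read time: missing fields get defaults,
    # mistyped fields are coerced; nothing was lost at write time.
    rec = json.loads(line)
    return {"user": str(rec.get("user", "")), "clicks": int(rec.get("clicks", 0))}

records = [read_with_schema(line) for line in raw_file]
print(sum(r["clicks"] for r in records))  # 10
```

Because the raw lines are kept, a later query can apply a different schema (say, keeping `referrer`) without re-ingesting anything, which is the core trade against the warehouse's up-front validation.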

4. Lakehouse Architecture

The modern data platform represents a convergence of the data warehouse and data lake approaches. Key challenges addressed include data quality management, metadata governance, data versioning (via table formats such as Apache Iceberg, Apache Hudi, and Delta Lake), and data interoperability. The lakehouse architecture combines object storage with metadata and acceleration layers, supporting diverse compute engines across data warehouse, big data, AI, and HPC workloads.
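The versioning idea behind table formats like Iceberg, Hudi, and Delta Lake can be sketched as a snapshot log: each commit adds immutable files and records a new snapshot, and old snapshots stay readable ("time travel"). This toy class is an illustration of the concept only; its names do not correspond to any real table-format API.

```python
class ToyTable:
    """Toy snapshot-based table, in the spirit of lakehouse table formats."""

    def __init__(self):
        self._files = {}      # file_id -> list of rows, immutable once written
        self._snapshots = []  # snapshot i -> set of file_ids visible at i

    def commit(self, rows):
        # A commit writes a new immutable file and a new snapshot that
        # includes it plus everything the previous snapshot could see.
        file_id = len(self._files)
        self._files[file_id] = list(rows)
        visible = set(self._snapshots[-1]) if self._snapshots else set()
        visible.add(file_id)
        self._snapshots.append(visible)
        return len(self._snapshots) - 1  # snapshot id

    def scan(self, snapshot_id=-1):
        # Read the table as of a given snapshot (default: latest).
        return [row for fid in sorted(self._snapshots[snapshot_id])
                for row in self._files[fid]]

t = ToyTable()
s0 = t.commit([("a", 1)])
s1 = t.commit([("b", 2)])
print(t.scan(s0))  # [('a', 1)]
print(t.scan())    # [('a', 1), ('b', 2)]
```

Real formats add manifests, statistics, and atomic metadata swaps on object storage, but the same snapshot log is what gives the lakehouse both versioned reads for BI engines and consistent incremental reads for big-data and AI pipelines.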
