Evolution of Data Platform Technology: From Data Warehouse to Lakehouse Architecture
The article traces the evolution of data platforms from early data warehouses—using schema‑on‑write, columnar storage, and MPP engines—to data lakes that retain raw data with schema‑on‑read, and finally to lakehouse architectures that merge warehouse and lake capabilities on separated storage and compute, offering unified metadata, versioning, and support for BI, big‑data, AI, and HPC workloads.
This article provides a comprehensive overview of data platform technology evolution, covering the historical development and key technological transformations in data engineering.
1. Data Value and Platform Composition
Data is compared to crude oil: valuable, but requiring refinement before its value can be extracted. A data platform consists of three core components: storage systems (handling long time spans, distributed sources, and centralized storage), compute engines (TensorFlow/PyTorch/PaddlePaddle for deep learning, Hadoop MapReduce/Spark for offline computing, Apache Doris for BI analysis), and interfaces (primarily SQL).
2. Data Warehouse Technology
Data warehouses emerged from Business Intelligence needs, built on OLAP technology. Key characteristics include a distributed architecture, columnar storage, and MPP (Massively Parallel Processing) engines. Data warehouses follow a "Schema-on-Write" model: data must pass through an ETL (Extract, Transform, Load) process before it is stored.
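The Schema-on-Write idea can be sketched in a few lines: the table schema is fixed up front, and raw records must be transformed to match it before they can be loaded. This is a minimal toy illustration using Python's built-in sqlite3 as a stand-in warehouse; the table and field names are invented for the example.

```python
import sqlite3

# Toy ETL pipeline: extract raw events, transform them to match a fixed
# schema, then load. Types are enforced at write time ("Schema-on-Write").
raw_events = [
    {"user": "alice", "amount": "19.99", "ts": "2024-01-05"},
    {"user": "bob",   "amount": "5.00",  "ts": "2024-01-06"},
]

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales (
        user   TEXT NOT NULL,
        amount REAL NOT NULL,   -- column types fixed before any data lands
        ts     TEXT NOT NULL
    )
""")

def transform(event):
    # Transform step: cast raw strings to the types the schema demands.
    return (event["user"], float(event["amount"]), event["ts"])

conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [transform(e) for e in raw_events])

# SQL is the interface layer: BI queries run against the typed, cleaned data.
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(round(total, 2))
```

The key point is the ordering: transformation happens before storage, so malformed records are rejected at load time rather than at query time.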
3. Data Lake Technology
Data lakes preserve raw data in its original format, following a "Schema-on-Read" model. The evolution went through two phases: integrated storage-compute (Hadoop-based) and separated storage-compute (object-storage-based). The separated architecture addresses the limitations of the coupled design: storage and compute could not scale independently, the HDFS NameNode became a bottleneck, and storage costs were high.
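Schema-on-Read inverts the warehouse model: heterogeneous raw records land in the lake untouched, and structure is imposed only when a consumer reads them. A minimal sketch, using an in-memory buffer of JSON lines as a stand-in for lake storage (the record fields are invented for the example):

```python
import io
import json

# Toy data lake: raw records are appended as-is, no schema enforced on write.
lake_file = io.StringIO()
for raw in ['{"user": "alice", "amount": "19.99"}',
            '{"user": "bob", "clicks": 3}']:   # heterogeneous records are fine
    lake_file.write(raw + "\n")

# "Schema-on-Read": projection and type casting happen only at query time.
def read_purchases(f):
    for line in f:
        rec = json.loads(line)
        if "amount" in rec:                    # filter to records that fit the schema
            yield rec["user"], float(rec["amount"])

lake_file.seek(0)
purchases = list(read_purchases(lake_file))
print(purchases)  # → [('alice', 19.99)]
```

Because the schema lives in the reader, different consumers can interpret the same raw files differently, at the cost of pushing data-quality problems downstream.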
4. Lakehouse Architecture
The modern data platform represents a convergence of data warehouse and data lake approaches. Key challenges addressed include data quality management, metadata governance, data versioning (with table formats like Apache Iceberg, Apache Hudi, Delta Lake), and data interoperability. The lakehouse architecture combines object storage with metadata and acceleration layers, supporting diverse compute engines including data warehouse, big data, AI, and HPC workloads.
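The data-versioning idea behind table formats such as Apache Iceberg, Apache Hudi, and Delta Lake can be illustrated with a toy model: every write produces an immutable data file plus a new metadata snapshot listing which files are visible, so older versions remain readable ("time travel"). This is a conceptual sketch only, not any real table-format API; the class and method names are invented.

```python
# Toy snapshot-based table, illustrating the metadata layer that table
# formats (Iceberg / Hudi / Delta Lake) put on top of object storage.
class VersionedTable:
    def __init__(self):
        self.data_files = {}   # file id -> rows (immutable once written)
        self.snapshots = []    # each snapshot lists the file ids visible in it

    def commit(self, rows):
        # A write never mutates existing files; it adds a file and a snapshot.
        file_id = len(self.data_files)
        self.data_files[file_id] = rows
        visible = (self.snapshots[-1] if self.snapshots else []) + [file_id]
        self.snapshots.append(visible)
        return len(self.snapshots) - 1   # snapshot (version) id

    def scan(self, version=-1):
        # Readers resolve a snapshot to its file list, then read those files.
        return [row for fid in self.snapshots[version]
                for row in self.data_files[fid]]

t = VersionedTable()
v0 = t.commit([{"id": 1}])
v1 = t.commit([{"id": 2}])
print(t.scan())    # latest version sees both commits
print(t.scan(v0))  # "time travel" back to the first snapshot
```

Real table formats add much more (ACID commits, schema evolution, partition pruning), but the snapshot-over-immutable-files structure is the core mechanism that gives a lakehouse warehouse-style versioning on cheap object storage.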
Baidu Geek Talk