Can Data Lakes and Data Warehouses Coexist? Exploring the Lake‑Warehouse Fusion
This article traces 20 years of big‑data evolution, compares data lakes and data warehouses, defines both concepts, examines their technical trade‑offs, and presents Alibaba Cloud’s lake‑warehouse (lakehouse) solution that unifies flexible storage with enterprise‑grade performance and governance.
1. Evolution of the Big Data Field Over 20 Years
Big data has grown rapidly for two decades, driven by massive data generation in e‑commerce, finance, and other sectors. Five key trends emerge: continuous data volume growth, recognition of data as a new production factor, rising importance of data‑management capabilities, convergence of engine technologies, and the diverging paths of data lakes versus data warehouses.
1.1 Overview
Since the early 2000s, data volume has exploded, with Alibaba’s own data scaling 60‑80% annually and many startups seeing >200% growth.
1. Data keeps growing at high speed – driven by the 5V characteristics of big data (volume, velocity, variety, veracity, and value).
2. Data is recognized as a new production factor – enterprises now treat data as a core asset for reporting, analytics, and AI.
3. Data‑management capability becomes a focus – data‑mid‑platform capabilities are essential for competitive advantage.
4. Engine technology enters a convergence phase – Spark, Flink, HBase, Presto, Elasticsearch, and Kafka have matured, shifting from rapid innovation to performance and stability improvements.
5. Platform technology evolves into two trends: Data Lake vs. Data Warehouse – both address storage and management but follow different design philosophies.
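To put the growth rates quoted in 1.1 in perspective, the short Python sketch below computes how quickly data volume doubles under constant annual growth. The constant-rate compounding model and the function names are illustrative assumptions; only the 60%, 80%, and 200% rates come from the text above.

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years needed for volume to double at a constant annual growth rate.

    annual_growth is a fraction, e.g. 0.6 for 60% per year.
    Assumes simple compounding: volume(t) = volume(0) * (1 + g) ** t.
    """
    if annual_growth <= 0:
        raise ValueError("annual_growth must be positive")
    return math.log(2) / math.log(1 + annual_growth)


def projected_volume(initial: float, annual_growth: float, years: int) -> float:
    """Volume after the given number of years of compound growth."""
    return initial * (1 + annual_growth) ** years


if __name__ == "__main__":
    for rate in (0.6, 0.8, 2.0):
        print(
            f"{rate:.0%} growth: doubles in {doubling_time_years(rate):.2f} years, "
            f"grows {projected_volume(1.0, rate, 5):.1f}x in 5 years"
        )
```

Under this model, 60% annual growth means the data estate doubles in roughly a year and a half, which is why storage cost and management capability become first-order concerns.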
2. What Is a Data Lake?
A data lake stores raw data in its native format, as files or binary objects, keeping all enterprise data (structured, semi‑structured, and unstructured) in a unified repository. Wikipedia defines it as a system that holds raw copies of source data alongside transformed data used for reporting, visualization, analytics, and machine learning. AWS defines a data lake as a centralized repository that accepts data “as‑is” and supports diverse analyses.
Four essential characteristics emerge:
Unified storage system
Retention of raw data
Rich compute models/paradigms
Independence from any specific cloud provider
Open‑source Hadoop HDFS exemplifies a classic data lake, while modern cloud‑native lakes (e.g., built on Amazon S3 or Alibaba Cloud OSS) adopt an architecture that separates storage from compute.
3. The Birth of Data Warehouses and Their Relation to Data Mid‑Platforms
Data warehouses originated in the database era of the 1990s, focusing on complex queries and analytics. As big‑data technologies matured, they borrowed heavily from databases (the SQL language, query optimizers, and similar concepts), giving rise to big‑data warehouses that became the backbone of modern analytics platforms.
Wikipedia defines a data warehouse as a central repository for integrated data from multiple sources, supporting reporting, BI, and decision‑making. The academic definition traces back to W. H. Inmon (1990), emphasizing ETL/ELT processes, schema‑based storage, and support for OLAP, data mining, and DSS.
Key warehouse characteristics:
Built‑in storage system exposing tables or views, not raw files.
Data is cleaned and transformed via ETL/ELT.
Emphasis on modeling and data management for BI.
Modern cloud warehouses (Amazon Redshift, Google BigQuery, Alibaba MaxCompute) inherit these traits and provide dedicated interfaces for data import (e.g., Redshift COPY, BigQuery Data Transfer Service, MaxCompute Tunnel).
4. Data Lake vs. Data Warehouse
Both architectures address large‑scale data storage, but they differ in design priorities:
Flexibility (Data Lake): Open file storage (e.g., HDFS, S3) accepts any data format, enabling diverse engines to read and write without strict schemas. However, fine‑grained security, unified governance, and performance optimizations are harder to achieve.
Growth‑Oriented (Data Warehouse): Abstracted data services enforce schemas, provide strong security and fine‑grained access control, and enable cost‑effective performance at scale.
The choice depends on the enterprise stage: startups favor lake flexibility; mature companies benefit from warehouse growth and governance.
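The trade‑off above is often described as schema‑on‑read versus schema‑on‑write. The following Python sketch illustrates the difference under simplifying assumptions: TinyLake and TinyWarehouse are hypothetical in‑memory toys, not the API of HDFS, S3, or any warehouse product.

```python
import json
from typing import Any, Dict, List


class TinyLake:
    """Schema-on-read: stores raw payloads untouched and interprets them when read."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}

    def put(self, key: str, payload: bytes) -> None:
        # Any format is accepted; nothing is validated at write time.
        self._objects[key] = payload

    def read_json(self, key: str) -> Any:
        # Malformed data only surfaces here, when a reader tries to parse it.
        return json.loads(self._objects[key])


class TinyWarehouse:
    """Schema-on-write: rows must match the declared columns and types."""

    def __init__(self, schema: Dict[str, type]) -> None:
        self.schema = schema
        self.rows: List[Dict[str, Any]] = []

    def insert(self, row: Dict[str, Any]) -> None:
        if set(row) != set(self.schema):
            raise ValueError(
                f"columns {sorted(row)} do not match schema {sorted(self.schema)}"
            )
        for column, expected in self.schema.items():
            if not isinstance(row[column], expected):
                raise TypeError(f"{column!r} must be {expected.__name__}")
        self.rows.append(row)


if __name__ == "__main__":
    lake = TinyLake()
    lake.put("events/1.json", b'{"user": "a", "clicks": 3}')
    lake.put("events/2.txt", b"plain text, not JSON")  # accepted without complaint

    warehouse = TinyWarehouse({"user": str, "clicks": int})
    warehouse.insert(lake.read_json("events/1.json"))
    try:
        warehouse.insert({"user": "b", "clicks": "many"})
    except TypeError as err:
        print("rejected at write time:", err)
```

The lake defers every check to the reader, which keeps ingestion flexible; the warehouse pays the validation cost up front, which is what makes governance and optimization tractable later.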
5. Next‑Generation Direction: Lake‑Warehouse Integration (Lakehouse)
Recognizing the complementary strengths of lakes and warehouses, the authors propose a unified “lake‑warehouse” architecture that allows data and compute to flow freely between the two, reducing total cost of ownership.
Three key challenges must be solved:
Seamless data/metadata integration without manual intervention.
Unified development experience across storage systems.
Automated caching/moving of hot data from lake to warehouse.
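The first two challenges amount to giving users one namespace over both systems. A minimal Python sketch of such a metadata layer is shown below; UnifiedCatalog, TableLocation, and the example paths are hypothetical names chosen for illustration and do not reflect any real product's interface.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class TableLocation:
    system: str  # "lake" or "warehouse"
    path: str    # file path for lake tables, internal name for warehouse tables


class UnifiedCatalog:
    """Toy metadata service exposing lake and warehouse tables in one namespace."""

    def __init__(self) -> None:
        self._tables: Dict[Tuple[str, str], TableLocation] = {}

    def register_warehouse_table(self, project: str, table: str) -> None:
        self._tables[(project, table)] = TableLocation("warehouse", f"{project}.{table}")

    def map_lake_database(self, project: str, lake_tables: Dict[str, str]) -> None:
        """Expose every table of a lake database under an external project name."""
        for table, path in lake_tables.items():
            self._tables[(project, table)] = TableLocation("lake", path)

    def resolve(self, qualified_name: str) -> TableLocation:
        """Look up 'project.table' regardless of where the data physically lives."""
        project, table = qualified_name.split(".", 1)
        return self._tables[(project, table)]


if __name__ == "__main__":
    catalog = UnifiedCatalog()
    catalog.register_warehouse_table("sales_dw", "orders")
    # Hypothetical lake path, used only for illustration.
    catalog.map_lake_database("ext_logs", {"clicks": "hdfs://namenode/logs.db/clicks"})
    for name in ("sales_dw.orders", "ext_logs.clicks"):
        print(name, "->", catalog.resolve(name))
```

Because queries address tables only by project and table name, a job written against this catalog does not need to change when a table moves between the lake and the warehouse.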
6. Alibaba Cloud Lake‑Warehouse Solution
6.1 Overall Architecture
MaxCompute (Alibaba’s cloud data warehouse) integrates open‑source lake components and cloud‑native storage, presenting a unified storage access layer and metadata service. Users can query tables across both lake and warehouse transparently, while benefiting from unified security, management, and governance.
Four core technical capabilities:
Fast Access: PrivateAccess network connects MaxCompute with IDC/ECS/EMR clusters, offering low latency and dedicated bandwidth.
Unified Data/Metadata Management: One‑click DB metadata mapping links Hive Metastore databases to MaxCompute projects, enabling real‑time bidirectional metadata sync.
Unified Development Experience: Hive databases appear as external MaxCompute projects, allowing DataWorks to manage lake and warehouse jobs uniformly; MaxCompute is compatible with Hive and Spark.
Automatic Warehouse: Intelligent cache analyzes historical job patterns, automatically moving hot lake data into warehouse‑optimized formats, eliminating manual data‑tiering.
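The source does not describe how the intelligent cache selects data, so the following Python sketch shows only one plausible heuristic: count how often past jobs read each lake table, then admit frequently read tables into a fixed‑size cache in order of reads per GB. The function name, parameters, and thresholds are assumptions made for illustration, not MaxCompute's actual algorithm.

```python
from collections import Counter
from typing import Dict, Iterable, List


def pick_tables_to_cache(
    job_history: Iterable[List[str]],
    table_sizes_gb: Dict[str, float],
    cache_capacity_gb: float,
    min_reads: int = 3,
) -> List[str]:
    """Choose lake tables to copy into the warehouse cache.

    job_history holds, for each past job, the lake tables that job read.
    Tables read by at least min_reads jobs are candidates; they are admitted
    greedily by reads per GB until the cache capacity is used up.
    Table sizes are assumed to be positive.
    """
    # Count each table once per job, even if the job lists it several times.
    reads = Counter(table for job in job_history for table in set(job))
    candidates = [
        t for t, n in reads.items() if n >= min_reads and t in table_sizes_gb
    ]
    # Prefer tables that give the most reads per GB of cache spent;
    # the table name breaks ties so the result is deterministic.
    candidates.sort(key=lambda t: (-reads[t] / table_sizes_gb[t], t))

    chosen: List[str] = []
    used_gb = 0.0
    for table in candidates:
        size = table_sizes_gb[table]
        if used_gb + size <= cache_capacity_gb:
            chosen.append(table)
            used_gb += size
    return chosen


if __name__ == "__main__":
    history = [
        ["clicks", "users"], ["clicks"], ["clicks", "orders"],
        ["users", "orders"], ["users"], ["orders"], ["tmp"],
    ]
    sizes = {"clicks": 10.0, "users": 2.0, "orders": 50.0, "tmp": 1.0}
    print(pick_tables_to_cache(history, sizes, cache_capacity_gb=15.0))
```

A production system would also need eviction, freshness checks against the lake copy, and conversion into the warehouse's storage format, all of which this sketch leaves out.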
6.2 Building a Lake‑Warehouse Data Mid‑Platform
DataWorks abstracts the underlying heterogeneity, offering a single data‑mid‑platform where one dataset and one job can be scheduled across lake and warehouse seamlessly.
Enterprises can store raw data in the lake for flexibility, while high‑frequency, performance‑critical workloads are automatically cached in the warehouse, achieving optimal cost‑performance balance.
6.3 Customer Case: Sina Weibo AI Computing Mid‑Platform
Weibo’s machine‑learning platform needed both the flexibility of an open‑source Hadoop lake (HDFS + Hive/Spark/Flink) and the performance‑cost benefits of MaxCompute. By adopting Alibaba’s lake‑warehouse solution, Weibo achieved:
Unified AI compute platform without data movement or job migration.
Significant performance gains for SQL‑based data processing on MaxCompute.
Elastic resource sharing between MaxCompute and EMR, reducing queuing and overall cost.
7. Conclusion
Data lakes and data warehouses represent two design directions for big‑data systems: flexibility versus enterprise‑grade growth. Their boundaries are blurring as lakes improve governance and warehouses extend to external storage. Alibaba’s MaxCompute lake‑warehouse (lakehouse) unifies the strengths of both, offering a lower total cost of ownership and representing the next evolution of big‑data platforms.
Open Source Linux
Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.
