Why Data Lakes and Data Warehouses Must Converge: The Rise of Lakehouse Architecture

This article traces 20 years of big‑data evolution, defines data lakes and data warehouses, compares their trade‑offs, and explains how lakehouse solutions, exemplified by Alibaba Cloud MaxCompute, merge flexibility with enterprise‑grade performance to lower total cost of ownership.

Architects' Tech Alliance

Dec 7, 2022

20 Years of Big Data Evolution

The big‑data field has evolved for two decades and can be summarized in five major trends:

Data continues to grow at high speed; Alibaba’s ecosystem has seen 60‑80% annual growth for the past five years.

Big data is now recognized as a core production factor, widely adopted across enterprises and governments.

Data‑management capabilities have become a new focus, with data‑warehouse (middle‑platform) solutions gaining popularity.

Engine technologies have entered a convergence phase: Spark, Flink, HBase, Presto, Elasticsearch, Kafka, and similar engines have matured, and development now focuses on deepening performance and stability rather than on new paradigms.

Platform evolution shows two divergent directions—data lake versus data warehouse.
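
To put the 60‑80% annual growth figure from the first trend in perspective, the short calculation below compounds those rates over five years. It is only arithmetic on the stated range: the function name is illustrative, and the starting volume is normalized to 1 because the article gives no absolute figure.

```python
def compound_growth(annual_rate: float, years: int) -> float:
    """Return the multiplier on data volume after `years` of growth at `annual_rate`."""
    return (1.0 + annual_rate) ** years


if __name__ == "__main__":
    # Lower and upper bounds of the growth range quoted in the article.
    for rate in (0.60, 0.80):
        factor = compound_growth(rate, years=5)
        print(f"{rate:.0%} per year for 5 years -> about {factor:.1f}x the starting volume")
```

Sustained for five years, growth in that range multiplies the starting volume roughly 10 to 19 times, which is why storage cost and management capability become first-order concerns.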

What Is a Data Lake?

Wikipedia defines a data lake as a system that stores data in its natural binary or file format, keeping raw copies and transformed data for reporting, visualization, analytics, and machine learning. It supports structured, semi‑structured, and unstructured data and can be built on Hadoop HDFS, Azure Data Lake, or AWS Lake Formation. The four essential characteristics are:

Unified storage system.

Retention of raw data.

Rich compute models/paradigms.

Independence from any specific cloud.

Data lake architecture has evolved through three stages:

Stage 1 – Self‑built open‑source Hadoop lake (HDFS + Hadoop/Spark engines).

Stage 2 – Cloud‑hosted Hadoop lake (e.g., EMR) where the cloud provider manages the underlying servers and software.

Stage 3 – Pure cloud‑hosted lake where storage (e.g., S3 or OSS) is fully managed and multiple engines (Athena, Spark, Flink, etc.) can access the data.

What Is a Data Warehouse?

According to Wikipedia, a data warehouse is a central repository for reporting and analysis, integrating data from multiple sources. It originated in the 1990s (Inmon’s definition) and focuses on ETL/ELT, schema‑based storage, and strong modeling for business‑intelligence workloads.

Modern cloud data‑warehouse products (Amazon Redshift, Google BigQuery, Snowflake, Alibaba Cloud MaxCompute) provide integrated storage, compute, and management capabilities, often exposing only service interfaces rather than raw file systems.

Key advantages of data warehouses include:

Deep engine understanding of data, enabling storage‑compute optimizations.

Full lifecycle data management with lineage.

Fine‑grained security and governance.

Robust metadata services that simplify building enterprise data middle platforms.

Data Lake vs. Data Warehouse

Data lakes prioritize flexibility by exposing raw file storage, allowing any data type (structured, semi‑structured, unstructured) to be ingested without upfront schema. This flexibility, however, makes fine‑grained permission control, unified file management, and consistent interfaces difficult.

Data warehouses prioritize efficiency, security, and governance by abstracting storage behind service APIs, enforcing schemas, and providing mature metadata, which yields higher performance and lower operational risk at scale.

Enterprises at different maturity stages may prefer one approach over the other: startups value the flexibility of a lake, while mature organizations need performance and governance that keep pace with growth, which favors a warehouse.
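
The trade‑off described above is often summarized as schema‑on‑read (lake) versus schema‑on‑write (warehouse). The following is a minimal in‑memory sketch of that contrast, not a model of any real product: `LakeStore`, `WarehouseTable`, and `SchemaError` are hypothetical names introduced only for illustration.

```python
import json
from dataclasses import dataclass, field


class SchemaError(ValueError):
    """Raised when a record does not match the declared table schema."""


@dataclass
class LakeStore:
    """Lake-style storage: accepts any bytes and interprets them only at read time."""
    objects: dict = field(default_factory=dict)

    def put(self, path: str, raw: bytes) -> None:
        self.objects[path] = raw  # no validation on ingest

    def read_json(self, path: str) -> dict:
        # Schema-on-read: parsing, and any failure, happens in the consumer.
        return json.loads(self.objects[path].decode("utf-8"))


@dataclass
class WarehouseTable:
    """Warehouse-style table: the schema is declared up front and enforced on write."""
    schema: dict  # column name -> expected Python type
    rows: list = field(default_factory=list)

    def insert(self, record: dict) -> None:
        if set(record) != set(self.schema):
            raise SchemaError(f"expected columns {sorted(self.schema)}, got {sorted(record)}")
        for column, expected_type in self.schema.items():
            if not isinstance(record[column], expected_type):
                raise SchemaError(f"column {column!r} must be {expected_type.__name__}")
        self.rows.append(record)


lake = LakeStore()
lake.put("events/1.json", b'{"user": "a", "clicks": "three"}')  # stored as-is

table = WarehouseTable(schema={"user": str, "clicks": int})
table.insert({"user": "a", "clicks": 3})  # matches the schema, so it is stored
# table.insert({"user": "a", "clicks": "three"}) would raise SchemaError
```

The lake accepts the malformed `clicks` value without complaint and leaves every reader to cope with it, while the table rejects it at the door. That is the flexibility the lake buys and the guarantee the warehouse's engine can rely on when optimizing.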

Next‑Generation Lakehouse

The article proposes a lakehouse that unifies the strengths of both architectures. Three key challenges must be solved:

Seamless, automated data and metadata exchange between lake and warehouse.

A unified development experience that works across both storage systems.

Intelligent caching/moving of hot data to the warehouse while keeping cold data in the lake.
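
The first challenge, automated metadata exchange, can be pictured as a diff between two catalogs. The sketch below assumes a simplified one‑directional case in which the warehouse mirrors the lake's table definitions. The `Catalog` shape, the action names, and `plan_metadata_sync` are all assumptions made for illustration and do not correspond to any real metastore API.

```python
from typing import Dict, List, Tuple

# A catalog is modeled as {table_name: {column_name: type_name}}.
Catalog = Dict[str, Dict[str, str]]


def plan_metadata_sync(lake: Catalog, warehouse: Catalog) -> List[Tuple[str, str]]:
    """Return the (action, table) steps that make the warehouse's view match the lake."""
    plan: List[Tuple[str, str]] = []
    for table in sorted(lake.keys() - warehouse.keys()):
        plan.append(("create_mapping", table))  # table is new in the lake
    for table in sorted(lake.keys() & warehouse.keys()):
        if lake[table] != warehouse[table]:
            plan.append(("update_schema", table))  # column definitions drifted
    for table in sorted(warehouse.keys() - lake.keys()):
        plan.append(("drop_mapping", table))  # table no longer exists in the lake
    return plan


lake_catalog = {
    "clicks": {"user_id": "bigint", "ts": "timestamp"},
    "orders": {"order_id": "bigint", "amount": "double"},
}
warehouse_catalog = {
    "orders": {"order_id": "bigint"},
    "legacy": {"id": "bigint"},
}
sync_plan = plan_metadata_sync(lake_catalog, warehouse_catalog)
```

Here `sync_plan` contains one step of each kind: create a mapping for `clicks`, update the schema of `orders`, and drop the mapping for `legacy`. A production system would also need to handle partitions, permissions, and concurrent changes, which this sketch ignores.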

Alibaba Cloud Lakehouse Solution

Alibaba Cloud extends MaxCompute (a cloud data‑warehouse) with open‑source lake capabilities, delivering an integrated architecture:

Fast Access: PrivateAccess network connects VPC, IDC, ECS, and EMR clusters with low latency and dedicated bandwidth.

Unified Data/Metadata Management: One‑click DB metadata mapping links Hive Metastore databases to MaxCompute projects, enabling real‑time metadata synchronization.

Unified Development Experience: DataWorks provides a single platform for lake and warehouse development, tracking, and governance; MaxCompute is compatible with Hive and Spark.

Automatic Warehousing: Intelligent caching analyzes historical jobs to identify hot data and automatically caches it in the warehouse, improving performance without user intervention.
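
The automatic warehousing idea, identifying hot data from historical jobs, can be illustrated with a simple frequency heuristic. The article does not describe the algorithm MaxCompute actually uses, so the sketch below is an assumption: it counts how often each table appears in past jobs and greedily fills a fixed cache budget. The function name, parameters, and sample data are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def pick_hot_tables(
    job_history: Iterable[List[str]],
    table_sizes_gb: Dict[str, float],
    cache_budget_gb: float,
) -> List[str]:
    """Greedily choose the most frequently read tables that fit in the cache budget."""
    reads = Counter(table for job in job_history for table in job)
    chosen: List[str] = []
    used = 0.0
    # Most-read first; ties are broken by name so the result is deterministic.
    for table, _count in sorted(reads.items(), key=lambda kv: (-kv[1], kv[0])):
        size = table_sizes_gb.get(table)
        if size is None:
            continue  # unknown size: skip rather than guess
        if used + size <= cache_budget_gb:
            chosen.append(table)
            used += size
    return chosen


# Each inner list holds the tables read by one historical job.
history = [["orders", "users"], ["orders"], ["orders", "logs"], ["users"]]
sizes = {"orders": 50.0, "users": 10.0, "logs": 500.0}
hot = pick_hot_tables(history, sizes, cache_budget_gb=100.0)
```

With these inputs `hot` is `["orders", "users"]`: both are read often and fit within the budget, while the large and rarely read `logs` table stays in the lake. A real system would likely weigh recency, scan cost, and freshness requirements as well.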

Customer Case: Sina Weibo

Weibo’s machine‑learning platform needed both flexible lake storage for AI workloads and high‑performance warehouse capabilities. Their existing open‑source Hadoop lake could not meet scale and cost requirements.

By integrating MaxCompute with EMR Hadoop via the lakehouse solution, Weibo achieved:

Elimination of data‑movement overhead; jobs run seamlessly on MaxCompute or EMR.

Significant performance gains for SQL‑based data processing.

Resource complementarity: MaxCompute’s elastic resources and EMR’s compute capacity balance each other, reducing queue times and overall cost.

Conclusion

Data lakes and data warehouses represent two design philosophies for big‑data platforms: the lake favors flexibility, while the warehouse favors the performance and governance an enterprise needs as it grows. Their boundaries are converging as lakehouse architectures combine lake flexibility with warehouse performance, governance, and cost efficiency. Alibaba’s MaxCompute lakehouse exemplifies this trend and is positioned as a next‑generation big‑data platform.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
