
Is Storage‑Compute Separation the Future? Unpacking the Lakehouse Debate

The article examines the concepts of storage‑compute separation and the lake‑warehouse (lakehouse) model, tracing their evolution from physical Hadoop clusters to containerized compute and object storage, and argues that true separation requires MPP systems to adopt open standards, effectively merging lake and warehouse architectures.

ITPUB

Sep 11, 2024

Storage‑Compute Separation

Traditional Hadoop clusters in banking are deployed on physical servers where disks and CPUs share the same node. Storage‑compute separation moves persistent data to an external object store (e.g., S3, Ceph, OBS) while the original servers become stateless compute nodes that can be containerised (Docker, Kubernetes). This physical separation follows the original logical separation in Hadoop: HDFS provides the storage layer, while MapReduce, Spark, and YARN provide the processing and resource‑management layers.

Key technical requirements:

Data must be stored in an open, vendor‑neutral format (Parquet, ORC, Avro) so that any compute engine can read it without conversion.

The compute environment must be able to access the object store via standard APIs (S3‑compatible, HDFS‑compatible, or HTTP).

Resource scheduling remains independent of storage; YARN or Kubernetes can allocate CPU/memory based solely on workload needs.
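The third requirement can be illustrated with a toy placement routine. This is a minimal sketch, not how YARN or Kubernetes actually schedule: the `Node` and `Task` classes, the first-fit policy, and the `s3://example-bucket/...` URIs are all hypothetical. The point is only that placement looks at CPU and memory and never asks where the data lives.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpu: int       # cores still available
    free_mem_gb: int    # memory still available


@dataclass
class Task:
    name: str
    cpu: int
    mem_gb: int
    input_uri: str      # data lives in the object store, not on any node


def schedule(tasks, nodes):
    """First-fit placement using CPU and memory only.

    No data-locality check is made: every node reaches the object store
    over the network, so any node with free capacity is acceptable.
    """
    placement = {}
    for task in tasks:
        for node in nodes:
            if node.free_cpu >= task.cpu and node.free_mem_gb >= task.mem_gb:
                node.free_cpu -= task.cpu
                node.free_mem_gb -= task.mem_gb
                placement[task.name] = node.name
                break
        else:
            placement[task.name] = None   # no capacity: task stays pending
    return placement


# Hypothetical cluster and workload, for illustration only.
nodes = [Node("worker-a", free_cpu=4, free_mem_gb=16),
         Node("worker-b", free_cpu=8, free_mem_gb=32)]
tasks = [Task("etl", cpu=4, mem_gb=8, input_uri="s3://example-bucket/raw/"),
         Task("report", cpu=2, mem_gb=4, input_uri="s3://example-bucket/curated/"),
         Task("train", cpu=8, mem_gb=64, input_uri="s3://example-bucket/features/")]
placement = schedule(tasks, nodes)
```

In a co-located Hadoop cluster the scheduler would also weigh which node holds the input blocks; here that input is simply absent from the decision.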

Logical vs. Physical Separation

Logical separation has been part of Hadoop’s design from the start: HDFS (storage) is decoupled from the execution engines (MapReduce, and later Spark) and, since Hadoop 2, from the scheduler (YARN). Physical separation is achieved when the storage tier moves off the compute nodes onto dedicated storage systems (object stores, distributed file systems) and the compute tier runs as containerised workloads.

Because the logical contract is defined by open standards, any physical implementation that respects those standards can share data across platforms without a format conversion.
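One way to picture the logical contract is as an interface that the compute side depends on, with the physical layout hidden behind it. The sketch below is an illustration under stated assumptions: `BlobStore`, `LocalDiskStore`, `InMemoryObjectStore`, and `count_lines` are hypothetical names, and the in-memory class merely stands in for a remote object store.

```python
import tempfile
from pathlib import Path
from typing import Protocol


class BlobStore(Protocol):
    """Hypothetical storage contract: the only thing an engine may rely on."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class LocalDiskStore:
    """Co-located layout: data sits on the node's own disk."""

    def __init__(self, root: str) -> None:
        self.root = Path(root)

    def put(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


class InMemoryObjectStore:
    """Stand-in for a remote object store: a flat key -> bytes mapping."""

    def __init__(self) -> None:
        self._objects = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


def count_lines(store: BlobStore, key: str) -> int:
    """A tiny 'engine' that knows the contract but not the backend."""
    return len(store.get(key).decode("utf-8").splitlines())


payload = b"id,amount\n1,10\n2,25\n"
results = []
with tempfile.TemporaryDirectory() as tmp:
    for store in (LocalDiskStore(tmp), InMemoryObjectStore()):
        store.put("sales/part-0.csv", payload)
        results.append(count_lines(store, "sales/part-0.csv"))
```

The engine function is unchanged when the backend is swapped, which is the property that physical separation relies on.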

Lake vs. Warehouse

Data lakes emerged to address workloads that traditional data warehouses struggle with, such as:

Large volumes of semi‑structured or unstructured data (logs, JSON, video).

Low‑latency streaming ingestion and processing.

Machine‑learning pipelines that require direct access to raw files.

Technically, a lake is built on the Hadoop ecosystem (or a compatible open‑source stack), which relies on open file formats and APIs. Commercial distributions that retain these standards are still considered part of the lake architecture.

Data warehouses are typically based on massively parallel processing (MPP) engines (e.g., Teradata, Greenplum, GaussDB, GBase). These systems are often closed; their openness is judged by whether they can consume open formats like Parquet directly without an ETL conversion step.
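Part of what makes these formats consumable by any engine is that their layouts are publicly specified, down to the magic bytes that identify a file. As a rough sketch, an engine could recognise the three formats from the first and last bytes of a file: Parquet files begin and end with `PAR1`, ORC files begin with `ORC`, and Avro object container files begin with `Obj` followed by the byte `0x01`. The function name `sniff_open_format` is hypothetical, and the byte strings used below are hand-made samples, not real data files.

```python
def sniff_open_format(header: bytes, trailer: bytes = b"") -> str:
    """Guess an open file format from its leading and trailing bytes.

    This only inspects magic numbers; it does not validate the file.
    """
    if header[:4] == b"PAR1" and trailer[-4:] == b"PAR1":
        return "parquet"
    if header[:3] == b"ORC":
        return "orc"
    if header[:4] == b"Obj\x01":
        return "avro"
    return "unknown"


# Hand-made byte strings that carry only the magic numbers.
samples = {
    "parquet": sniff_open_format(b"PAR1\x15\x00", b"\x00\x00PAR1"),
    "orc": sniff_open_format(b"ORC\x08\x03"),
    "avro": sniff_open_format(b"Obj\x01\x04"),
    "other": sniff_open_format(b"\x00\x01\x02\x03"),
}
```

A proprietary warehouse format offers no such public entry point, which is why direct Parquet support is a reasonable test of openness.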

Comparison Table

Component      | Lake (Hadoop ecosystem)                 | Warehouse (MPP)
---------------|-----------------------------------------|--------------------------------------------
Storage format | Open (Parquet, ORC, Avro)               | Often proprietary; may support Parquet
Compute engine | Spark, Flink, Presto, Hive              | Native MPP SQL engine
Scalability    | Horizontal scaling of storage & compute | Scale‑out via additional nodes
Flexibility    | Supports batch, streaming, ML           | Optimized for ANSI‑SQL analytics
Openness       | Open‑source components, standards       | Varies; closed unless Parquet support added

Implications for Unified Platforms

To achieve a true unified data platform, MPP warehouses must adopt the same open standards that define the Hadoop ecosystem. When an MPP engine can read/write Parquet (or other open formats) directly from object storage, the distinction between “lake” and “warehouse” collapses: the same data can be queried with both SQL‑centric MPP workloads and Spark‑style processing.

In practice, this means:

Deploy an object store (e.g., Amazon S3, Alibaba OSS, MinIO) as the single source of truth.

Configure the MPP engine to use the object store as an external table source (e.g., CREATE EXTERNAL TABLE … LOCATION 's3://bucket/...').

Run containerised Spark or Flink jobs that read the same Parquet files from the object store.

Maintain a unified metadata layer (e.g., Hive Metastore, AWS Glue) so both engines share schema definitions.

When these steps are in place, data residency, governance, and performance can be managed uniformly across both lake‑style and warehouse‑style workloads.
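The role of the shared metadata layer can be sketched in a few lines. Everything here is a stand-in chosen so the example runs with the standard library alone: a plain dictionary plays the metastore, a CSV file in a temporary directory plays the Parquet data in the object store, and the two functions play a warehouse-style aggregate and a lake-style pipeline step. All names (`CATALOG`, `TableEntry`, `read_table`, and so on) are hypothetical.

```python
import csv
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class TableEntry:
    location: str    # where the single copy of the data lives
    columns: tuple   # (name, type) pairs shared by every engine


# Hypothetical in-process catalog; a real deployment would query a metastore service.
CATALOG = {}


def read_table(name):
    """Resolve a table through the catalog and yield typed rows."""
    entry = CATALOG[name]
    casts = dict(entry.columns)
    with open(entry.location, newline="") as fh:
        for row in csv.DictReader(fh):
            yield {col: cast(row[col]) for col, cast in casts.items()}


def warehouse_total_by_region(name):
    """Warehouse-style query: SUM(amount) grouped by region."""
    totals = {}
    for row in read_table(name):
        totals[row["region"]] = totals.get(row["region"], 0.0) + row["amount"]
    return totals


def lake_large_orders(name, threshold):
    """Lake-style pipeline step: filter rows from the same files."""
    return [row["order_id"] for row in read_table(name) if row["amount"] > threshold]


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sales.csv"   # CSV stands in for Parquet here
    path.write_text("order_id,region,amount\n1,north,10.5\n2,south,99.0\n3,north,20.0\n")
    CATALOG["sales"] = TableEntry(
        str(path), (("order_id", int), ("region", str), ("amount", float)))
    totals = warehouse_total_by_region("sales")
    large = lake_large_orders("sales", threshold=15.0)
```

Both functions resolve `sales` through the same catalog entry, so there is one copy of the data and one schema definition, and neither workload needs its own load step.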
