
The Flourishing Big Data Ecosystem and the Rise of Delta Lake

The article reviews the evolution of the big‑data ecosystem from 2017 to 2019, highlights Spark’s dominance, examines storage‑layer challenges of traditional Hive‑based warehouses, and explains how Delta Lake’s metadata‑driven library simplifies architecture, adds ACID features, and competes with Hudi and Iceberg.

Big Data Technology Architecture

May 10, 2020

The Flourishing Big Data Ecosystem

2017 and 2018 were the boom years for compute engines; by 2019 the market had become a crowded "red ocean". Spark emerged as the leading compute engine, with the strongest overall metrics and a rich ecosystem; while other engines were still competing in ETL, interactive queries, and streaming, Spark was already pushing deep into the AI field.

In 2017‑2018 the upper and lower layers of the compute stack showed little progress, but 2019 brought a shift: the storage layer was revitalized by Delta Lake, which solved many data‑warehouse pain points and turned warehouses into data lakes, while the interactive application layer was led by Linkis, establishing interaction standards and tightly coupling the surrounding ecosystem.

Problematic Data Storage Layer

Earlier Hive‑based warehouses or traditional file storage formats such as Parquet/ORC suffered from long‑standing issues:

Small‑file problem

Concurrent read/write challenges

Limited support for updates

Massive metadata (e.g., partitions) overwhelming the Hive metastore

Each of these issues spawns numerous application‑level problems. For example, concurrent reads/writes and update limitations make real‑time warehouses difficult to implement. The small‑file issue forces developers to write custom merge code, which can render data unreadable during the merge process.

To compensate for these inherent deficiencies, architects resort to complex designs (often called lambda architectures). Updating data may require writing first to another system such as HBase, then exporting HBase data to Parquet/Hive tables for downstream consumption. The long pipelines involve schema transformations and disk reads, inflating operational costs, CPU/IO waste, and maintenance overhead.

The root cause of all these problems is a weak storage layer; the only way to “save the day” has been to mask the deficiency with elaborate architectures. Delta Lake cuts straight to the chase by addressing storage‑layer shortcomings, dramatically simplifying architecture, reducing operational costs, and lowering server expenses.

Delta Lake Arrives at the Right Time

Traditional data warehouses have suffered for a long time; Delta Lake emerged to solve the storage‑layer issues. Its key features include:

Metadata‑driven design: table metadata is kept as files alongside the data on HDFS, which relieves the metastore bottleneck.

Support for richer update modes such as Merge, Update, and Delete, enabling seamless streaming writes and reads.

Unified batch‑and‑stream processing on the same table.

Versioning that allows time‑travel and recovery from accidental operations.
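The versioning feature can be illustrated with a toy model, written here in plain Python. It assumes a table is described by an ordered log of commits, each listing data files added or removed; the action names (`add`, `remove`), the file names, and the `snapshot` helper are hypothetical and much simpler than anything Delta Lake actually stores. The sketch shows why a small-file merge can be published as a single commit, and why older versions stay readable afterwards.

```python
from typing import Dict, List, Set

# Each commit is a list of actions; its index in `log` is the table version.
Commit = List[Dict[str, str]]


def snapshot(log: List[Commit], version: int) -> Set[str]:
    """Replay commits 0..version and return the set of live data files."""
    live: Set[str] = set()
    for commit in log[: version + 1]:
        for action in commit:
            if action["op"] == "add":
                live.add(action["file"])
            elif action["op"] == "remove":
                live.discard(action["file"])
    return live


log: List[Commit] = [
    # Versions 0 and 1: a streaming job appends two small files.
    [{"op": "add", "file": "part-0.parquet"}],
    [{"op": "add", "file": "part-1.parquet"}],
    # Version 2: compaction swaps both small files for one merged file.
    [
        {"op": "remove", "file": "part-0.parquet"},
        {"op": "remove", "file": "part-1.parquet"},
        {"op": "add", "file": "merged-0.parquet"},
    ],
]

latest = snapshot(log, len(log) - 1)   # readers of the newest version
before_compaction = snapshot(log, 1)   # "time travel" back to version 1
```

In this model a reader sees either the two small files or the merged file, never a half-finished mix, because the swap becomes visible only when the whole commit is appended.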

Delta Lake Is Just a Library

Delta Lake is a library, not a standalone service; unlike HBase it does not require separate deployment and currently only supports the Spark engine. This means you can use Delta Lake exactly like a regular Parquet file: simply add the Delta package to your Spark project and use the standard Spark datasource API, resulting in very low deployment and usage cost.
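As a rough PySpark sketch of what "use the standard Spark datasource API" means in practice: the helper below writes a small DataFrame and reads it back, with the storage format passed as a plain string. The function name `roundtrip` and the path are illustrative assumptions, and `spark` is assumed to be an existing SparkSession that was started with the Delta package available (the package coordinates depend on your Spark and Scala versions and are not shown here).

```python
def roundtrip(spark, path, fmt="delta"):
    """Write a tiny DataFrame in the given format and read it back.

    `spark` is an already-created SparkSession; session setup is omitted.
    """
    df = spark.range(5)  # DataFrame with a single `id` column
    # Same writer API as for Parquet; only the format string differs.
    df.write.format(fmt).mode("overwrite").save(path)
    return spark.read.format(fmt).load(path)
```

Calling `roundtrip(spark, "/tmp/events", fmt="parquet")` and `roundtrip(spark, "/tmp/events_delta", fmt="delta")` would exercise the same code path against the two formats.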

What Is Delta Lake

Parquet file + Meta file + a set of operation APIs = Delta Lake.
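To make the equation concrete, here is a toy table directory in plain Python: data files, a log directory of numbered JSON commit files, and a small commit function. The directory name and zero-padded file names are loosely modeled on Delta Lake's transaction log, but the JSON contents and the helpers `commit` and `latest_version` are hypothetical simplifications. The sketch also shows one way optimistic concurrency can work: a writer claims a version number by creating its file exclusively, and a writer that loses the race retries on the next version.

```python
import json
import os
import tempfile


def commit(table_dir: str, version: int, actions: list) -> bool:
    """Try to publish `actions` as commit `version`; False means the race was lost."""
    log_dir = os.path.join(table_dir, "_delta_log")
    os.makedirs(log_dir, exist_ok=True)
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        # Mode "x" creates the file only if it does not exist yet, so exactly
        # one writer can claim a given version number.
        with open(path, "x") as f:
            json.dump(actions, f)
        return True
    except FileExistsError:
        return False


def latest_version(table_dir: str) -> int:
    """Return the highest committed version, or -1 for an empty table."""
    log_dir = os.path.join(table_dir, "_delta_log")
    if not os.path.isdir(log_dir):
        return -1
    versions = [int(name.split(".")[0]) for name in os.listdir(log_dir)]
    return max(versions, default=-1)


table = tempfile.mkdtemp()
first = commit(table, 0, [{"op": "add", "file": "part-0.parquet"}])
# Two writers both read version 0 and both try to write version 1.
writer_a = commit(table, 1, [{"op": "add", "file": "part-a.parquet"}])
writer_b = commit(table, 1, [{"op": "add", "file": "part-b.parquet"}])
# The loser re-reads the log and retries on the next free version.
retry_b = commit(table, latest_version(table) + 1,
                 [{"op": "add", "file": "part-b.parquet"}])
```

A real implementation also has to make the write itself atomic and decide whether the winning commit logically conflicts with the retry; this sketch only covers the race for a version number.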

Thus Delta Lake is not mysterious; from the user's side it differs from plain Parquet only in the format name. In Spark you replace the format string: parquet → delta

Integration with Hive

Out of habit and for historical reasons, many users still want to use Delta Lake the way they use Hive, without dealing with Spark’s datasource API. At the time of writing, official support is not yet available, though Alibaba engineers are working on an integration, and recent Delta Lake releases allow external engines such as Presto to read data via the Manifest mechanism.

Competitors

Delta Lake’s main competitors are Apache Hudi and Apache Iceberg; together the three form the “three horsemen” of data lake technology. Which one will prevail remains to be seen.

Iceberg aims to define a standard, open, and universal data format, abstracting away underlying storage differences and providing a unified API for multiple engines. Hudi focuses on fast ingestion of streaming data and supports upsert semantics for delayed corrections. Delta Lake, which originated at Databricks, concentrates on solving the inherent issues of Parquet/ORC at the Spark layer and adds many capabilities. All three provide ACID guarantees, implement optimistic concurrency control, and offer time‑travel functionality.

Source: https://mp.weixin.qq.com/s/KqQzjgJyZmKrh8XsqjFIKg
