Delta Lake: Architecture, Features, and Hands‑On Tutorial

This article explains the origins and motivations of Delta Lake, details its ACID transaction support, schema enforcement, metadata handling, versioning, and unified batch‑and‑stream processing, and provides a step‑by‑step Maven and Spark code tutorial for creating, updating, and querying Delta tables.

Big Data Technology & Architecture

Oct 17, 2019

On October 16, 2019, at the Spark + AI Summit Europe in Amsterdam, Databricks and the Linux Foundation announced that the open‑source Delta Lake project would be hosted by the Linux Foundation.

Delta Lake is a storage layer that brings ACID transactions, schema enforcement, scalable metadata handling, time‑travel, and unified batch‑and‑stream processing to data lakes built on HDFS or cloud storage.
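Schema enforcement can be pictured as a check that runs before a write is accepted. The sketch below is a simplified illustration in plain Java, not Delta Lake's code: the class SchemaCheckSketch, the method firstViolation, and the string type names are all hypothetical, and it assumes only two rules (no columns unknown to the table, no changed column types) while ignoring nullability, nested types, and case handling.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a toy schema check, not Delta Lake's implementation.
final class SchemaCheckSketch {

    // Returns a description of the first problem found, or null if the
    // incoming schema may be written to the table.
    static String firstViolation(Map<String, String> tableSchema,
                                 Map<String, String> incomingSchema) {
        for (Map.Entry<String, String> column : incomingSchema.entrySet()) {
            String expectedType = tableSchema.get(column.getKey());
            if (expectedType == null) {
                return "unknown column: " + column.getKey();
            }
            if (!expectedType.equals(column.getValue())) {
                return "type mismatch for column " + column.getKey()
                        + ": expected " + expectedType + ", got " + column.getValue();
            }
        }
        return null; // columns missing from the incoming data are tolerated here
    }

    public static void main(String[] args) {
        Map<String, String> table = new LinkedHashMap<>();
        table.put("id", "long");

        Map<String, String> incoming = new LinkedHashMap<>();
        incoming.put("id", "string");

        System.out.println(firstViolation(table, incoming));
    }
}
```

Rejecting the write up front, rather than letting mismatched files land in the table directory, is what keeps later readers from tripping over inconsistent data.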

The article describes the motivations behind Delta Lake, the shortcomings of traditional data lakes (unreliable reads/writes, low data quality, performance degradation, difficult updates), and how Delta Lake addresses them.

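Key features include ACID transactions, schema management, scalable metadata handling, data versioning and time travel, and a unified batch and streaming sink. Updates, deletes, and data expectations are listed as upcoming, although the DeltaTable API shown in the tutorial below already supports updates, deletes, and merges.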
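Versioning and time travel follow from keeping an ordered log of commits, where each commit records which data files it adds and which it logically removes. A minimal sketch of that idea in plain Java is shown here; the class LogReplaySketch, its Commit type, and the file names are hypothetical, and the real transaction log carries more than this (metadata, protocol, and checkpoint information, for example).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: a toy model of log replay, not Delta Lake's actual classes.
final class LogReplaySketch {

    // One commit: the data files it adds and the data files it logically removes.
    static final class Commit {
        final List<String> added;
        final List<String> removed;

        Commit(List<String> added, List<String> removed) {
            this.added = added;
            this.removed = removed;
        }
    }

    // Replays commits 0..version (inclusive) and returns the files visible at that version.
    static Set<String> filesAt(List<Commit> log, int version) {
        if (version < 0 || version >= log.size()) {
            throw new IllegalArgumentException("no such version: " + version);
        }
        Set<String> files = new TreeSet<>();
        for (int v = 0; v <= version; v++) {
            Commit commit = log.get(v);
            files.removeAll(commit.removed);
            files.addAll(commit.added);
        }
        return files;
    }

    public static void main(String[] args) {
        List<Commit> log = new ArrayList<>();
        // Version 0: initial write.
        log.add(new Commit(Arrays.asList("part-0.parquet"), Collections.<String>emptyList()));
        // Version 1: overwrite, which removes the old file and adds a new one.
        log.add(new Commit(Arrays.asList("part-1.parquet"), Arrays.asList("part-0.parquet")));

        System.out.println(filesAt(log, 0)); // [part-0.parquet]
        System.out.println(filesAt(log, 1)); // [part-1.parquet]
    }
}
```

Because an overwrite only marks old files as removed in the log, an earlier version stays readable for as long as its files are still physically present.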

Delta Lake’s ACID guarantees rely on storage‑level atomic visibility, mutual exclusion, and consistent listings.
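To make the mutual-exclusion requirement concrete, here is a toy commit routine on a local file system. It is an illustration under stated assumptions rather than Delta Lake's implementation: the class CommitLogSketch and its methods are hypothetical, it relies on java.nio's CREATE_NEW option so that only one writer can create the file for a given version, and it skips the conflict detection a real writer performs before retrying. It also does not demonstrate atomic visibility, because a reader could observe the file while it is still being written.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative only: a toy commit log on a local file system.
final class CommitLogSketch {

    // One log file per version, named with a 20-digit zero-padded version number.
    static Path versionFile(Path logDir, long version) {
        return logDir.resolve(String.format("%020d.json", version));
    }

    // Tries to publish `version`. Returns false if another writer already claimed it.
    static boolean tryCommit(Path logDir, long version, String actions) {
        try {
            // CREATE_NEW fails if the file already exists, so one writer wins each version.
            Files.write(versionFile(logDir, version),
                    actions.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Optimistic retry: move to the next version number until one is free.
    static long commit(Path logDir, long startVersion, String actions) {
        long version = startVersion;
        while (!tryCommit(logDir, version, actions)) {
            version++;
        }
        return version;
    }

    public static void main(String[] args) throws IOException {
        Path logDir = Files.createTempDirectory("commit-log-sketch");
        System.out.println(tryCommit(logDir, 0, "{}")); // true: version 0 was free
        System.out.println(tryCommit(logDir, 0, "{}")); // false: version 0 is taken
        System.out.println(commit(logDir, 0, "{}"));    // 1: first free version
    }
}
```

Storage systems that lack a create-if-absent primitive cannot provide this guarantee on their own, which is why the requirement is stated at the storage level.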

A quick‑start example shows adding the Maven dependency

<dependency>
  <groupId>io.delta</groupId>
  <artifactId>delta-core_2.11</artifactId>
  <version>0.4.0</version>
</dependency>

and creating a Delta table with Spark, followed by code snippets for inserting data, updating, deleting, and performing upserts using the DeltaTable API:

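import java.util.HashMap;

import io.delta.tables.DeltaTable;
import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;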

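SparkSession spark = ...; // create SparkSession

// Create the table. spark.range returns Dataset<Long>, so convert it to a DataFrame.
Dataset<Row> data = spark.range(0, 5).toDF();
data.write().format("delta").save("/tmp/delta-table");

// Overwrite the table contents with ids 5 to 9 (reuse the variable; redeclaring it does not compile).
data = spark.range(5, 10).toDF();
data.write().format("delta").mode("overwrite").save("/tmp/delta-table");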

DeltaTable deltaTable = DeltaTable.forPath("/tmp/delta-table");
// update even ids
deltaTable.update(functions.expr("id % 2 == 0"), new HashMap<String, Column>() {{
    put("id", functions.expr("id + 100"));
}});
// delete even ids
deltaTable.delete(functions.expr("id % 2 == 0"));
// upsert (merge)
Dataset<Row> newData = spark.range(0, 20).toDF();
deltaTable.as("oldData")
    .merge(newData.as("newData"), "oldData.id = newData.id")
    .whenMatched().update(new HashMap<String, Column>() {{
        put("id", functions.col("newData.id"));
    }})
    .whenNotMatched().insert(new HashMap<String, Column>() {{
        put("id", functions.col("newData.id"));
    }})
    .execute();

Finally, reading the table with spark.read().format("delta").load("/tmp/delta-table") returns the updated rows, illustrating how Delta Lake enables reliable, versioned data processing on top of Apache Spark.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
