Big Data 13 min read

Delta Lake: ACID Transactions, Schema Management, and Unified Batch‑Streaming for Data Lakes

Delta Lake adds ACID transaction support, schema enforcement, data versioning, and unified batch‑and‑stream processing to Apache Spark‑based data lakes, addressing reliability, quality, performance, and update challenges of traditional data lake architectures.

Big Data Technology & Architecture

Oct 19, 2020

Delta Lake: ACID Transactions, Schema Management, and Unified Batch‑Streaming for Data Lakes

Delta Lake is a storage layer that brings ACID transaction capabilities to Apache Spark and big‑data workloads by using optimistic concurrency control and snapshot isolation, providing consistent reads during writes and built‑in data versioning for reliable data lakes on HDFS or cloud storage.

Why Delta Lake Is Needed

Many organizations rely on data lakes for massive, heterogeneous data storage, but they suffer from unreliable reads/writes, low data quality, poor performance as data grows, and cumbersome update processes. These issues often cause big‑data projects to fail, creating a need for a solution that improves reliability and quality without replacing existing lakes.

Key Features of Delta Lake

ACID Transactions : Guarantees atomic, consistent, isolated, and durable operations across concurrent writes, with a transaction log that records write order and throws concurrency exceptions when conflicts occur.

Schema Management : Validates DataFrame schemas against table schemas, automatically adds new columns, and stores metadata in the transaction log for fast metadata handling.

Data Versioning : Keeps historical snapshots, allowing time‑travel queries by timestamp or version number to reproduce or roll back data.

Unified Batch and Streaming Sink : Works as a high‑performance streaming sink for Structured Streaming while retaining ACID guarantees.

Parquet Storage Format : Stores all data in open‑source Apache Parquet, leveraging its compression and encoding efficiencies.

Update and Delete (DML) : Supports merge, update, and delete commands, enabling efficient row‑level modifications without rewriting entire partitions.

Data Anomaly Handling : Provides APIs to define anomaly detection rules and alert thresholds during writes.

Spark API Compatibility : Works with existing Spark pipelines with minimal code changes.

Basic Usage

Create a Table

val data = spark.range(0, 5)
data.write.format("delta").save("/tmp/delta-table")

// Partitioned table
df.write.format("delta").partitionBy("date").save("/delta/events")

Read a Table

val df = spark.read.format("delta").load("/tmp/delta-table")
df.show()

Update a Table

// Overwrite
val data = spark.range(5, 10)
data.write.format("delta").mode("overwrite").save("/tmp/delta-table")

// Update rows
import io.delta.tables._
import org.apache.spark.sql.functions._

val deltaTable = DeltaTable.forPath("/tmp/delta-table")

deltaTable.update(
  condition = expr("id % 2 == 0"),
  set = Map("id" -> expr("id + 100"))
)

deltaTable.delete(condition = expr("id % 2 == 0"))

// Upsert (merge) new data
val newData = spark.range(0, 20).as("newData").toDF

deltaTable.as("oldData")
  .merge(newData, "oldData.id = newData.id")
  .whenMatched.update(Map("id" -> col("newData.id")))
  .whenNotMatched.insert(Map("id" -> col("newData.id")))
  .execute()

deltaTable.toDF.show()

Delete from a Table

import io.delta.tables._
val deltaTable = DeltaTable.forPath(spark, pathToEventsTable)

deltaTable.delete("date < '2017-01-01'")

import org.apache.spark.sql.functions._
import spark.implicits._

deltaTable.delete($"date" < "2017-01-01")

Streaming Support

Query Historical Snapshots

Delta Lake time‑travel lets you read previous snapshots using either a timestamp or a version number:

val df1 = spark.read.format("delta").option("timestampAsOf", "2022-01-01").load("/delta/events")
val df2 = spark.read.format("delta").option("versionAsOf", 5).load("/delta/events")

Schema Evolution

New columns can be added automatically when writes use .option("mergeSchema", "true"), and NullType columns are dropped because Parquet does not support them.

Transactional Metadata

Delta Lake maintains an ordered, atomic transaction log on top of files; each commit creates a new JSON log file, enabling fast metadata queries and reliable concurrent modifications.

Terminology

ACID : Atomicity, Consistency, Isolation, Durability guarantees for transactions.

Snapshot : A point‑in‑time view of the data, including file list, metadata, and timestamp.

MetaData : Table‑level information such as id, name, format, schema, and creation time.

Transaction Log : The JSON‑based log (DeltaLog) that records all table operations.

CheckSum : JSON file summarizing a snapshot’s size, file count, metadata, and transaction count for integrity verification.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Data Lake Apache Spark Scala Delta Lake ACID Transactions

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.