Delta Lake: ACID Transactions, Schema Management, and Unified Batch‑Streaming for Data Lakes
Delta Lake adds ACID transaction support, schema enforcement, data versioning, and unified batch‑and‑stream processing to Apache Spark‑based data lakes, addressing reliability, quality, performance, and update challenges of traditional data lake architectures.
Delta Lake is a storage layer that brings ACID transaction capabilities to Apache Spark and big‑data workloads by using optimistic concurrency control and snapshot isolation, providing consistent reads during writes and built‑in data versioning for reliable data lakes on HDFS or cloud storage.
Why Delta Lake Is Needed
Many organizations rely on data lakes for massive, heterogeneous data storage, but they suffer from unreliable reads/writes, low data quality, poor performance as data grows, and cumbersome update processes. These issues often cause big‑data projects to fail, creating a need for a solution that improves reliability and quality without replacing existing lakes.
Key Features of Delta Lake
ACID Transactions : Guarantees atomic, consistent, isolated, and durable operations across concurrent writes, with a transaction log that records write order and throws concurrency exceptions when conflicts occur.
Schema Management : Validates each DataFrame's schema against the table schema on write and rejects mismatches, can add new columns when schema evolution is enabled, and stores metadata in the transaction log for fast metadata handling.
Data Versioning : Keeps historical snapshots, allowing time‑travel queries by timestamp or version number to reproduce or roll back data.
Unified Batch and Streaming Sink : Works as a high‑performance streaming sink for Structured Streaming while retaining ACID guarantees.
Parquet Storage Format : Stores all data in open‑source Apache Parquet, leveraging its compression and encoding efficiencies.
Update and Delete (DML) : Supports merge, update, and delete commands, enabling efficient row‑level modifications without rewriting entire partitions.
Data Anomaly Handling : Provides APIs to define anomaly detection rules and alert thresholds during writes.
Spark API Compatibility : Works with existing Spark pipelines with minimal code changes.
Basic Usage
Create a Table
val data = spark.range(0, 5)
data.write.format("delta").save("/tmp/delta-table")
// Partitioned table (df stands for any DataFrame that has a "date" column)
df.write.format("delta").partitionBy("date").save("/delta/events")
Read a Table
val df = spark.read.format("delta").load("/tmp/delta-table")
df.show()
Update a Table
// Overwrite
val data = spark.range(5, 10)
data.write.format("delta").mode("overwrite").save("/tmp/delta-table")
// Update rows: add 100 to every even id
import io.delta.tables._
import org.apache.spark.sql.functions._
val deltaTable = DeltaTable.forPath("/tmp/delta-table")
deltaTable.update(
  condition = expr("id % 2 == 0"),
  set = Map("id" -> expr("id + 100"))
)
// Delete every row with an even id
deltaTable.delete(condition = expr("id % 2 == 0"))
// Upsert (merge) new data
val newData = spark.range(0, 20).as("newData").toDF
deltaTable.as("oldData")
  .merge(newData, "oldData.id = newData.id")
  .whenMatched.update(Map("id" -> col("newData.id")))
  .whenNotMatched.insert(Map("id" -> col("newData.id")))
  .execute()
deltaTable.toDF.show()
Delete from a Table
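import io.delta.tables._
// pathToEventsTable is a placeholder for the location of the events table
val deltaTable = DeltaTable.forPath(spark, pathToEventsTable)
// Predicate given as a SQL-formatted string
deltaTable.delete("date < '2017-01-01'")
// Equivalent predicate built with Spark SQL functions and implicits
import org.apache.spark.sql.functions._
import spark.implicits._
deltaTable.delete($"date" < "2017-01-01")
Streaming Support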
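As a minimal sketch of using a Delta table as a Structured Streaming sink, the snippet below streams Spark's built-in "rate" test source into a table and then reads it back in batch mode. It assumes a spark-shell session with Delta Lake on the classpath (so `spark` already exists); the two paths are hypothetical and chosen only for illustration.

```scala
// Assumes spark-shell with Delta Lake available; `spark` is the shell's session.
val streamPath = "/tmp/delta-stream-table"                  // hypothetical path
val checkpointPath = "/tmp/delta-stream-table/_checkpoints" // hypothetical path

// The built-in "rate" source generates (timestamp, value) rows for testing.
val events = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()

// Each micro-batch is committed to the Delta table as one transaction.
val query = events.writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", checkpointPath)
  .start(streamPath)

query.awaitTermination(10000) // let the stream run for about ten seconds
query.stop()

// A batch read sees only micro-batches that were fully committed.
// This assumes at least one micro-batch was committed before the stop.
spark.read.format("delta").load(streamPath).show(5)
```

The same table can also serve as a streaming source through `spark.readStream.format("delta").load(streamPath)`, which is what lets one table back both batch and streaming jobs.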
Query Historical Snapshots
Delta Lake time‑travel lets you read previous snapshots using either a timestamp or a version number:
val df1 = spark.read.format("delta").option("timestampAsOf", "2022-01-01").load("/delta/events")
val df2 = spark.read.format("delta").option("versionAsOf", 5).load("/delta/events")
Schema Evolution
New columns can be added automatically when a write uses .option("mergeSchema", "true"). NullType columns are dropped from the DataFrame on write because Parquet does not support NullType.
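The following sketch illustrates the mergeSchema option on a small table. It assumes a spark-shell session with Delta Lake on the classpath; the path and the `source` column are hypothetical names introduced only for this example.

```scala
import org.apache.spark.sql.functions.lit

val path = "/tmp/delta-schema-evolution" // hypothetical path

// Start with a table that has a single `id` column.
spark.range(0, 5).toDF("id")
  .write.format("delta").mode("overwrite").save(path)

// A later batch carries an extra column, `source` (hypothetical).
val withExtra = spark.range(5, 10).toDF("id")
  .withColumn("source", lit("backfill"))

// Without mergeSchema this append is rejected by schema enforcement:
// withExtra.write.format("delta").mode("append").save(path)

// With mergeSchema the new column is added to the table schema.
withExtra.write.format("delta")
  .mode("append")
  .option("mergeSchema", "true")
  .save(path)

// Rows written before the change read back with source = null.
spark.read.format("delta").load(path).printSchema()
```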
Transactional Metadata
Delta Lake maintains an ordered, atomic transaction log on top of files; each commit creates a new JSON log file, enabling fast metadata queries and reliable concurrent modifications.
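To make the idea of an ordered, atomic log concrete, here is a toy in-memory model in plain Scala. It is not Delta Lake's implementation: `ToyLog`, `Commit`, and the file names are hypothetical, and a put-if-absent map stands in for the atomic creation of one numbered JSON file per commit. The sketch only shows how claiming the next version number can serialize concurrent writers and how a snapshot can be rebuilt by replaying commits.

```scala
import java.util.concurrent.ConcurrentHashMap

// One log entry: the data files a commit adds and removes (hypothetical model).
final case class Commit(version: Long, addedFiles: Set[String], removedFiles: Set[String])

final class ToyLog {
  private val entries = new ConcurrentHashMap[Long, Commit]()

  // Highest contiguous version present, or -1 for an empty log.
  def latestVersion: Long = {
    var v = -1L
    while (entries.containsKey(v + 1)) v += 1
    v
  }

  // Put-if-absent: only the first writer to claim a version succeeds.
  def tryCommit(c: Commit): Boolean =
    entries.putIfAbsent(c.version, c) == null

  // A snapshot is the set of live files after replaying commits 0..version.
  def snapshot(version: Long): Set[String] =
    (0L to version).foldLeft(Set.empty[String]) { (files, v) =>
      val c = entries.get(v)
      (files -- c.removedFiles) ++ c.addedFiles
    }
}

val log = new ToyLog
log.tryCommit(Commit(0, Set("part-0.parquet"), Set.empty))

// Two writers both read version 0 and race to create version 1.
val base = log.latestVersion
val w1 = log.tryCommit(Commit(base + 1, Set("part-1.parquet"), Set.empty))
val w2 = log.tryCommit(Commit(base + 1, Set("part-2.parquet"), Set.empty))
```

In this model the losing writer would re-read the log, check whether the winning commit conflicts with its own changes, and then either retry at the next version or fail, which corresponds to the concurrency exception described under Key Features.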
Terminology
ACID : Atomicity, Consistency, Isolation, Durability guarantees for transactions.
Snapshot : A point‑in‑time view of the data, including file list, metadata, and timestamp.
MetaData : Table‑level information such as id, name, format, schema, and creation time.
Transaction Log : The JSON‑based log (DeltaLog) that records all table operations.
CheckSum : JSON file summarizing a snapshot’s size, file count, metadata, and transaction count for integrity verification.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
