
Unlocking Delta Lake: Key Features, Architecture, and EMR Integration

This article introduces Delta Lake as an open‑source lakehouse storage framework, explains its core features, file and metadata structures, details Alibaba Cloud EMR's enhancements and deep integration with DLF, and presents G‑SCD and CDC solutions for real‑time incremental data warehousing.

Alibaba Cloud Developer

Jun 17, 2022


Delta Lake Overview

Delta Lake is an open‑source storage framework from Databricks for building lakehouse architectures. It works with Spark, Flink, Hive, PrestoDB, and Trino, and provides ACID transactions, data versioning, Parquet‑based storage, unified batch and streaming access, schema evolution, and rich DML (upsert, delete, merge).

File Structure

A Delta table keeps its metadata in a self‑managed _delta_log directory and its data files in the surrounding table directory. Each commit writes a JSON log file, and every ten commits a Parquet checkpoint file is generated to speed up metadata parsing and to allow older log files to be cleaned up periodically. Data files follow Hive‑style partitioning, but only the files referenced by the latest snapshot are part of the current table.
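To make the layout concrete, the following minimal sketch lists the file names one would expect under _delta_log after a given number of commits. It assumes the open‑source naming convention (20‑digit, zero‑padded version numbers) and the ten‑commit checkpoint interval mentioned above; the function names are hypothetical, and log cleanup is ignored.

```python
from typing import List

# Assumed interval, taken from the "every ten commits" statement above.
CHECKPOINT_INTERVAL = 10


def commit_file(version: int) -> str:
    """JSON commit file name for a table version (20-digit, zero-padded)."""
    return f"{version:020d}.json"


def checkpoint_file(version: int) -> str:
    """Parquet checkpoint file name for a table version."""
    return f"{version:020d}.checkpoint.parquet"


def log_listing(latest_version: int, interval: int = CHECKPOINT_INTERVAL) -> List[str]:
    """Files expected under _delta_log/ after commits 0..latest_version.

    Illustrative only: assumes no log cleanup has run yet.
    """
    files: List[str] = []
    for version in range(latest_version + 1):
        files.append(commit_file(version))
        # A checkpoint accompanies every `interval`-th commit (not version 0).
        if version > 0 and version % interval == 0:
            files.append(checkpoint_file(version))
    return files


if __name__ == "__main__":
    for name in log_listing(10):
        print(name)
```

A reader that wants the state at version 12 would, under these assumptions, start from the checkpoint at version 10 and then read only the JSON files for versions 11 and 12.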

Metadata Structure

Each snapshot has three components: the protocol version, the table metadata (schema and configuration), and the list of active data files, which is derived from AddFile and RemoveFile actions. To load a snapshot, the reader starts from the nearest checkpoint and then applies the log files committed after it.
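The replay step can be sketched in a few lines of plain Python. This is a simplified illustration, not the actual Delta Lake reader: it assumes each commit is newline‑delimited JSON with one action per line, keyed as protocol, metaData, add, or remove, and it treats the checkpoint as an already materialized state dictionary. The function name and the sample file paths are hypothetical.

```python
import json
from typing import Dict, Iterable, Optional


def replay(commits: Iterable[str], checkpoint_state: Optional[dict] = None) -> dict:
    """Rebuild a snapshot by applying commits on top of an optional checkpoint.

    The returned state holds the three snapshot components: protocol,
    metaData, and the active files keyed by path.
    """
    state: Dict[str, object] = {"protocol": None, "metaData": None, "files": {}}
    if checkpoint_state is not None:
        state["protocol"] = checkpoint_state.get("protocol")
        state["metaData"] = checkpoint_state.get("metaData")
        state["files"] = dict(checkpoint_state.get("files", {}))

    files: dict = state["files"]  # type: ignore[assignment]
    for commit in commits:  # commits must be supplied in version order
        for line in commit.splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "protocol" in action:
                state["protocol"] = action["protocol"]
            elif "metaData" in action:
                state["metaData"] = action["metaData"]
            elif "add" in action:
                files[action["add"]["path"]] = action["add"]
            elif "remove" in action:
                files.pop(action["remove"]["path"], None)
            # Other actions (for example commit info) do not affect this sketch.
    return state


if __name__ == "__main__":
    commit_0 = "\n".join([
        json.dumps({"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}}),
        json.dumps({"metaData": {"id": "demo", "configuration": {}}}),
        json.dumps({"add": {"path": "part-0.parquet", "size": 10}}),
    ])
    commit_1 = "\n".join([
        json.dumps({"remove": {"path": "part-0.parquet"}}),
        json.dumps({"add": {"path": "part-1.parquet", "size": 12}}),
    ])
    print(sorted(replay([commit_0, commit_1])["files"]))
```

Because a remove action drops a path from the active set without deleting the file, older versions remain readable for as long as their data files stay on disk.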

EMR DeltaLake

Alibaba Cloud EMR has integrated Delta Lake since 2019, adding feature iterations, performance optimizations, ecosystem integration, and usability improvements while keeping compatibility with Spark 2 (Delta 0.6) and Spark 3 (Delta 1.x).

Key enhancements over open‑source Delta Lake 1.1.0

Feature iteration

DML syntax enhancements: TIME‑TRAVEL (VERSION/TIMESTAMP AS OF), SHOW/DROP PARTITION, dynamic partition overwrite.

Metadata synchronization with Hive metastore.

Automated lake‑table management: auto‑optimize small files, auto‑vacuum, savepoints, rollback, adaptive file sizing.

Performance optimization

Min‑max statistics and data skipping.

Dynamic partition pruning (DPP).

Runtime file pruning.

Custom manifest support for Hive/Presto/Trino/Impala.
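As an illustration of the min‑max statistics and data skipping item in the list above, the sketch below prunes files whose value range cannot overlap a query range. It is a generic example of the technique, not EMR's implementation; the FileStats class, its fields, and the file names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileStats:
    """Per-file statistics for one column (hypothetical structure)."""
    path: str
    min_value: int
    max_value: int


def files_for_range(files: List[FileStats], low: int, high: int) -> List[str]:
    """Return paths of files whose [min, max] overlaps the range [low, high].

    Files outside the range cannot contain matching rows, so they are skipped
    without being opened.
    """
    return [f.path for f in files if f.max_value >= low and f.min_value <= high]


if __name__ == "__main__":
    stats = [
        FileStats("part-a.parquet", 1, 10),
        FileStats("part-b.parquet", 11, 20),
        FileStats("part-c.parquet", 21, 30),
    ]
    # A filter such as `id BETWEEN 12 AND 15` only needs part-b.parquet.
    print(files_for_range(stats, 12, 15))
```

The benefit depends on how well the data is clustered: if every file spans the full value range, no file can be skipped.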

Ecosystem integration

Support for OSS and JindoData.

Deep integration with Alibaba Cloud DLF.

Applied scenarios

SCD Type 2 incremental lakehouse solution.

CDC solution built on DeltaLake.

Deep Integration with DLF

Data Lake Formation (DLF) provides managed metadata, security, and data ingestion. EMR DeltaLake automatically syncs table metadata to DLF’s metastore, enabling direct queries from Hive, Presto, Impala, MaxCompute, and Hologres without extra steps.

G‑SCD Solution

G‑SCD (Based‑Granularity Slowly Changing Dimension) keeps one snapshot per business granularity, such as a day or an hour, holding the latest state for that period. It relies on DeltaLake’s versioning instead of storing a full copy of the table for every period, and historical snapshots are retrieved with time‑travel queries.
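One way to picture this is a small index from business granularity to the table version that holds its latest state. The article does not describe how EMR stores this association, so the class below is purely a hypothetical sketch: the name GranularitySnapshots, its methods, and the day‑string keys are all assumptions.

```python
from typing import Dict, Optional


class GranularitySnapshots:
    """Hypothetical index: business granularity (here a day) -> table version."""

    def __init__(self) -> None:
        self._latest: Dict[str, int] = {}

    def record_commit(self, business_day: str, version: int) -> None:
        """Remember the newest version committed for a business day."""
        previous = self._latest.get(business_day)
        if previous is None or version > previous:
            self._latest[business_day] = version

    def version_for(self, business_day: str) -> Optional[int]:
        """Version to time-travel to for a business day, or None if unknown."""
        return self._latest.get(business_day)


if __name__ == "__main__":
    index = GranularitySnapshots()
    index.record_commit("2022-06-16", 3)
    index.record_commit("2022-06-16", 5)  # supersedes version 3 for that day
    index.record_commit("2022-06-17", 8)
    print(index.version_for("2022-06-16"))
```

A downstream query would then pass the resolved version to a time‑travel read (the VERSION AS OF syntax listed earlier), so only one version per day needs to be tracked rather than a full table copy.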

CDC Solution

EMR DeltaLake can act as a streaming source, emitting ChangeData for every write operation. Upstream Kafka data is ingested, partitioned by business snapshot, committed, and saved as snapshots. Downstream queries use the snapshot associated with a given business time to perform time‑travel reads.

Future Plans

Investment in DeltaLake will continue, with a focus on deeper DLF integration, richer lake‑table management, lower onboarding cost, further performance tuning, and broader ecosystem support within Alibaba Cloud’s big‑data stack.
