Unlocking Delta Lake: Key Features, Architecture, and EMR Integration
This article introduces Delta Lake, an open-source lakehouse storage framework. It explains the core features and the file and metadata structures, describes Alibaba Cloud EMR's enhancements and its deep integration with DLF, and presents the G-SCD and CDC solutions for real-time incremental data warehousing.
Delta Lake Overview
Delta Lake is an open-source storage framework from Databricks for building lakehouse architectures. It works with Spark, Flink, Hive, PrestoDB, and Trino, and provides ACID transactions, data versioning, Parquet-based storage, unified batch and streaming access, schema evolution, and rich DML (upsert, delete, merge).
File Structure
A Delta table stores its metadata in a self-managed _delta_log directory under the table root, with the data files alongside it. Each commit writes one JSON log file. By default, every ten commits also produce a Parquet checkpoint file, which speeds up metadata parsing and allows older log files to be cleaned up periodically. Data files are laid out in Hive-style partition directories, but a file on disk is part of the current table only if the latest snapshot references it.
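The layout above can be sketched in a few lines of Python. This is a minimal illustration, not Delta Lake's own code: the helper names are hypothetical, and it assumes a checkpoint exists at every tenth version and that no log files have been cleaned up. The 20-digit zero-padded file names follow the open-source Delta log convention.

```python
from typing import List


def commit_file(version: int) -> str:
    """JSON commit file name for a table version (20 digits, zero-padded)."""
    return f"{version:020d}.json"


def checkpoint_file(version: int) -> str:
    """Parquet checkpoint file name written at a table version."""
    return f"{version:020d}.checkpoint.parquet"


def files_to_read(target_version: int, checkpoint_interval: int = 10) -> List[str]:
    """Log files needed to rebuild the snapshot at target_version.

    Assumption: a checkpoint exists at every positive multiple of
    checkpoint_interval, and every JSON commit is still present.
    """
    last_checkpoint = (target_version // checkpoint_interval) * checkpoint_interval
    files: List[str] = []
    start = 0
    if last_checkpoint > 0:
        # Start from the checkpoint instead of replaying from version 0.
        files.append(checkpoint_file(last_checkpoint))
        start = last_checkpoint + 1
    files.extend(commit_file(v) for v in range(start, target_version + 1))
    return files
```

Under these assumptions, files_to_read(12) yields the checkpoint at version 10 followed by the JSON commits for versions 11 and 12, so a reader never replays more than one checkpoint interval of commits.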
Metadata Structure
Each snapshot contains three components: the protocol version, the table metadata (schema and configuration), and the set of active data files, which is derived from AddFile and RemoveFile actions. Loading a snapshot first reads the most recent checkpoint at or before the target version and then applies the log files committed after it, in order.
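To make the replay step concrete, here is a small sketch that applies a sequence of JSON actions and returns the three snapshot components. The replay function and the sample log are hypothetical; the action keys (protocol, metaData, add, remove) follow the open-source Delta log format, and real log entries carry more fields than shown here.

```python
import json
from typing import Dict, Iterable, Optional, Tuple


def replay(
    log_lines: Iterable[str],
    active: Optional[Dict[str, dict]] = None,
) -> Tuple[Optional[dict], Optional[dict], Dict[str, dict]]:
    """Apply JSON actions in commit order.

    Returns (protocol, metadata, active_files). `active` can carry the
    file set restored from a checkpoint; it is copied, not mutated.
    """
    files = dict(active or {})
    protocol = None
    metadata = None
    for line in log_lines:
        action = json.loads(line)
        if "protocol" in action:
            protocol = action["protocol"]
        elif "metaData" in action:
            metadata = action["metaData"]
        elif "add" in action:
            files[action["add"]["path"]] = action["add"]
        elif "remove" in action:
            files.pop(action["remove"]["path"], None)
        # Other actions (for example commitInfo) do not change the file set.
    return protocol, metadata, files


# Illustrative log: two files are added, then the first one is removed.
sample_log = [
    '{"protocol": {"minReaderVersion": 1, "minWriterVersion": 2}}',
    '{"metaData": {"id": "demo", "partitionColumns": ["dt"]}}',
    '{"add": {"path": "dt=2024-01-01/part-0.parquet", "size": 100}}',
    '{"add": {"path": "dt=2024-01-01/part-1.parquet", "size": 120}}',
    '{"remove": {"path": "dt=2024-01-01/part-0.parquet"}}',
]
protocol, metadata, active_files = replay(sample_log)
```

After the replay, only part-1.parquet remains active, even though part-0.parquet may still exist on disk until it is vacuumed.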
EMR DeltaLake
Alibaba Cloud EMR has integrated Delta Lake since 2019 and has added new features, performance optimizations, ecosystem integrations, and usability improvements, while remaining compatible with Spark 2 (Delta 0.6) and Spark 3 (Delta 1.x).
Key enhancements over open-source Delta Lake 1.1.0
Feature iteration
DML syntax enhancements: time travel (VERSION AS OF / TIMESTAMP AS OF), SHOW/DROP PARTITION, dynamic partition overwrite.
Metadata synchronization with Hive metastore.
Automated lake‑table management: auto‑optimize small files, auto‑vacuum, savepoints, rollback, adaptive file sizing.
Performance optimization
Min‑max statistics and data skipping.
Dynamic partition pruning (DPP).
Runtime file pruning.
Custom manifest support for Hive/Presto/Trino/Impala.
Ecosystem integration
Support for OSS and JindoData.
Deep integration with Alibaba Cloud DLF.
Production scenarios
SCD Type 2 incremental lakehouse solution.
CDC solution built on DeltaLake.
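The min-max statistics and data skipping item in the list above rests on a simple idea: if a file's recorded value range cannot overlap the query predicate, the file need not be read. The sketch below illustrates that idea only. FileStats and files_for_range are hypothetical names, and it does not reflect how EMR stores or evaluates its statistics.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileStats:
    """Per-file statistics for one column (hypothetical structure)."""
    path: str
    min_value: int
    max_value: int


def files_for_range(files: List[FileStats], lo: int, hi: int) -> List[str]:
    """Paths of files that may contain rows with lo <= column <= hi.

    A file is skipped when its [min_value, max_value] range lies
    entirely below lo or entirely above hi.
    """
    return [f.path for f in files if f.max_value >= lo and f.min_value <= hi]


stats = [
    FileStats("part-0.parquet", 1, 100),
    FileStats("part-1.parquet", 101, 200),
    FileStats("part-2.parquet", 201, 300),
]
candidates = files_for_range(stats, 150, 220)
```

For the predicate 150 <= column <= 220, part-0.parquet is skipped and only the other two files are scanned. Skipping is most effective when data is clustered so that file ranges overlap as little as possible.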
Deep Integration with DLF
Data Lake Formation (DLF) provides managed metadata, security, and data ingestion. EMR DeltaLake automatically syncs table metadata to DLF’s metastore, enabling direct queries from Hive, Presto, Impala, MaxCompute, and Hologres without extra steps.
G‑SCD Solution
G-SCD (Granularity-Based Slowly Changing Dimension) retains only the latest version within each business granularity (for example, day or hour), relying on Delta Lake's versioning. This avoids storing a full snapshot per period while still allowing time-travel queries to retrieve historical snapshots.
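One way to picture this is an index from business granularity to table version: each period keeps only its latest committed version, and a historical query resolves a business time to a version before issuing a time-travel read. The sketch below is an assumption-laden illustration, not EMR's implementation. GranularityIndex is a hypothetical class, the granularity is a day written as an ISO date string, and falling back to the nearest earlier day is a design choice made here.

```python
from bisect import bisect_right
from typing import Dict, Optional


class GranularityIndex:
    """Maps a business day (ISO date string) to the latest table version for it."""

    def __init__(self) -> None:
        self._latest: Dict[str, int] = {}

    def record_commit(self, business_day: str, version: int) -> None:
        """Remember only the highest version committed for a business day."""
        current = self._latest.get(business_day, -1)
        self._latest[business_day] = max(current, version)

    def version_for(self, business_day: str) -> Optional[int]:
        """Version to read for business_day.

        Falls back to the nearest earlier day when the requested day has
        no commit of its own; returns None if nothing earlier exists.
        """
        days = sorted(self._latest)  # ISO dates sort chronologically
        i = bisect_right(days, business_day)
        return self._latest[days[i - 1]] if i else None


index = GranularityIndex()
index.record_commit("2024-01-01", 3)
index.record_commit("2024-01-01", 5)  # later commit for the same day wins
index.record_commit("2024-01-02", 8)
```

Here index.version_for("2024-01-01") resolves to version 5, which could then be supplied to a VERSION AS OF query to read that day's snapshot.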
CDC Solution
EMR DeltaLake can act as a streaming source, emitting change data for every write operation. Upstream data from Kafka is ingested, partitioned by business snapshot, committed, and saved as snapshots. Downstream queries then perform time-travel reads against the snapshot associated with a given business time.
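To show what change data for a single write might look like, the sketch below diffs the keyed table state before and after a commit. This is an illustration under stated assumptions: change_data is a hypothetical function, the article does not specify EMR's record format, and the change-type labels are borrowed from open-source Delta Lake's Change Data Feed. A real implementation would derive changes from the write itself rather than from a full comparison.

```python
from typing import Any, Dict, List, Tuple

Change = Tuple[str, Any, Any]  # (change_type, key, row)


def change_data(before: Dict[Any, Any], after: Dict[Any, Any]) -> List[Change]:
    """Row-level changes that turn `before` into `after`, ordered by key."""
    changes: List[Change] = []
    for key in sorted(before.keys() | after.keys()):
        if key not in before:
            changes.append(("insert", key, after[key]))
        elif key not in after:
            changes.append(("delete", key, before[key]))
        elif before[key] != after[key]:
            # An update is emitted as a pair: old row image, then new row image.
            changes.append(("update_preimage", key, before[key]))
            changes.append(("update_postimage", key, after[key]))
    return changes


before = {1: {"city": "Hangzhou"}, 2: {"city": "Beijing"}}
after = {1: {"city": "Shanghai"}, 3: {"city": "Shenzhen"}}
changes = change_data(before, after)
```

In this example, key 1 produces an update pair, key 2 a delete, and key 3 an insert. A downstream consumer can apply such records incrementally instead of rescanning the table.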
Future Plans
Investment in DeltaLake will continue, with a focus on deeper DLF integration, richer lake-table management, lower onboarding cost, performance tuning, and broader ecosystem support within Alibaba Cloud's big-data stack.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
