
Unlocking Delta Lake: Key Features, Architecture, and EMR Integration

Delta Lake, an open-source storage layer from Databricks, provides ACID transactions, data versioning, schema evolution, and unified batch and stream processing. This article walks through its file structure and metadata mechanism, then describes how Alibaba Cloud EMR extends it with enhanced DML, performance optimizations, deep DLF integration, and solutions for G-SCD and CDC.

Alibaba Cloud Developer

Delta Lake Overview

Delta Lake is an open-source storage framework from Databricks for building lake-house architectures. It works with Spark, Flink, Hive, PrestoDB, Trino and other query/compute engines, and unifies batch and streaming workloads while aiming for reliability, security, and high performance.

Key Features

ACID transactions with multiple isolation levels for concurrent read/write pipelines.

Data version management via snapshots, enabling time‑travel queries and audit of data and metadata.

Open file format based on Parquet, providing high‑performance compression.

Unified batch and streaming reads/writes.

Schema evolution allowing merges or rewrites to adapt to changing data structures.

Rich DML support (Upsert, Delete, Merge) for scenarios such as CDC.

File Structure

Delta tables consist of two main parts:

_delta_log directory: stores all table metadata. Each commit (a data operation or a metadata change) writes a new JSON log file describing its actions, such as files added or removed. By default, every ten commits are consolidated into a Parquet checkpoint file, which speeds up metadata parsing and allows older log files to be cleaned up periodically.

Data directory/files: contains the actual table data. Partitions follow the Hive-style directory layout, and only files referenced by the latest snapshot in _delta_log belong to the current table; other files may belong to older versions or be awaiting cleanup.
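The naming pattern inside _delta_log can be sketched in a few lines of Python. This is an illustration rather than Delta Lake's own code: it assumes the open-source convention of zero-padding the commit version to 20 digits, assumes a checkpoint is written at every multiple of the interval except version 0, and ignores log cleanup. The function names are hypothetical.

```python
from typing import List

# Assumption: matches the "every ten commits" cadence described above.
CHECKPOINT_INTERVAL = 10


def commit_file_name(version: int) -> str:
    """JSON log file for one commit: the version zero-padded to 20 digits."""
    return f"{version:020d}.json"


def checkpoint_file_name(version: int) -> str:
    """Parquet checkpoint that consolidates all commits up to `version`."""
    return f"{version:020d}.checkpoint.parquet"


def expected_log_listing(latest_version: int,
                         interval: int = CHECKPOINT_INTERVAL) -> List[str]:
    """File names one would expect in _delta_log, ignoring cleanup.

    Assumption of this sketch: a checkpoint exists at every version that is
    a positive multiple of `interval`.
    """
    names = []
    for version in range(latest_version + 1):
        names.append(commit_file_name(version))
        if version > 0 and version % interval == 0:
            names.append(checkpoint_file_name(version))
    return names


if __name__ == "__main__":
    # Show the tail of the listing for a table whose latest commit is 12.
    for name in expected_log_listing(12)[-5:]:
        print(name)
```

Zero-padding keeps the files in version order under a plain lexicographic listing, which is what makes "find the newest checkpoint, then read forward" cheap on object stores.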

Delta Lake file structure

Metadata Mechanism

Delta Lake manages table versions through snapshots. To load a specific snapshot, it locates the nearest checkpoint file at or before that version and then applies the subsequent log files in order. The reconstructed metadata includes the protocol version, the table schema and configuration, and the list of active data files, which is derived from the AddFile and RemoveFile actions.
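A minimal sketch of that replay logic follows. It is a simplified model, not the actual Delta Lake reader: checkpoints are represented as in-memory dictionaries instead of Parquet files, commits as lists of JSON lines, and only the protocol, metaData, add, and remove actions are handled. The function name load_snapshot and the sample data are hypothetical.

```python
import json
from typing import Dict, List


def load_snapshot(checkpoints: Dict[int, dict],
                  commits: Dict[int, List[str]],
                  target_version: int) -> dict:
    """Rebuild table state at `target_version` from a checkpoint plus later commits."""
    usable = [v for v in checkpoints if v <= target_version]
    if usable:
        start = max(usable)  # nearest checkpoint at or before the target
        base = checkpoints[start]
        state = {"protocol": base["protocol"],
                 "metaData": base["metaData"],
                 "files": dict(base["files"])}
        first_commit = start + 1
    else:
        state = {"protocol": None, "metaData": None, "files": {}}
        first_commit = 0

    for version in range(first_commit, target_version + 1):
        for line in commits[version]:
            action = json.loads(line)
            if "protocol" in action:
                state["protocol"] = action["protocol"]
            elif "metaData" in action:
                state["metaData"] = action["metaData"]
            elif "add" in action:
                state["files"][action["add"]["path"]] = action["add"]
            elif "remove" in action:
                state["files"].pop(action["remove"]["path"], None)
            # Other action types are ignored in this sketch.
    return state


# Hypothetical sample data: one checkpoint at version 10, two later commits.
checkpoints = {
    10: {"protocol": {"minReaderVersion": 1, "minWriterVersion": 2},
         "metaData": {"id": "demo-table"},
         "files": {"part-a.parquet": {"path": "part-a.parquet"},
                   "part-b.parquet": {"path": "part-b.parquet"}}},
}
commits = {
    11: ['{"add": {"path": "part-c.parquet"}}'],
    12: ['{"remove": {"path": "part-a.parquet"}}',
         '{"add": {"path": "part-d.parquet"}}'],
}

if __name__ == "__main__":
    print(sorted(load_snapshot(checkpoints, commits, 12)["files"]))
```

Because a removed file only disappears from the active list and not from storage, the same replay run against an earlier target version still finds it, which is what makes time travel possible.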

EMR DeltaLake Enhancements

Alibaba Cloud EMR has integrated Delta Lake since 2019, adding numerous features:

Feature Iteration

Enhanced DML syntax: time‑travel SQL (VERSION/TIMESTAMP AS OF), partition management, dynamic partition overwrite.

Metadata Synchronization

Metastore integration with DLF/Hive.

Automated Table Management

Auto‑optimize small files.

Auto‑vacuum for expired files.

Savepoint and rollback support.

Automatic file size adjustment.

Performance Optimization

Min-max statistics, data skipping, dynamic partition pruning, runtime filters, and custom manifests that speed up Hive/Presto/Trino/Impala queries.

Ecosystem Integration

Support for Presto, Trino, Impala, MaxCompute, Hologres, OSS, JindoData, and deep integration with DLF.

Use Cases

Slowly changing dimension (SCD) Type 2 solutions and CDC pipelines.

Deep Integration with DLF

DLF (Data Lake Formation) provides managed metadata, security, and data ingestion capabilities. EMR DeltaLake automatically syncs table metadata to DLF’s metastore, allowing direct queries via Hive, Presto, Impala, MaxCompute, and Hologres without extra configuration. DLF also supports ingesting MySQL, RDS, and Kafka data directly into Delta tables.

G‑SCD Solution

G-SCD (Granularity-based Slowly Changing Dimension) manages dimension changes at a fixed business granularity (e.g., daily or hourly) using Delta Lake's versioning. Instead of storing a full copy of the dimension table for every period, each period is mapped to a table version, which reduces storage and lets queries benefit from Delta's checkpointing, Z-ordering, and data skipping.
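One way to picture the idea is as an index from each business period to the last table version committed for it, which can then be read with the VERSION AS OF syntax mentioned earlier. The sketch below is a hypothetical illustration under that assumption; it is not EMR's actual G-SCD implementation, and the function names, table name, and commit records are invented for the example.

```python
from typing import Dict, Iterable, Tuple


def versions_by_granule(commits: Iterable[Tuple[int, str]]) -> Dict[str, int]:
    """Map each business period (e.g. a day) to the last version committed for it.

    `commits` holds (table_version, business_period) pairs.
    """
    latest: Dict[str, int] = {}
    for version, granule in commits:
        if version > latest.get(granule, -1):
            latest[granule] = version
    return latest


def snapshot_query(table: str, granule: str, index: Dict[str, int]) -> str:
    """Build a time-travel query for the snapshot that closes `granule`."""
    return f"SELECT * FROM {table} VERSION AS OF {index[granule]}"


# Hypothetical commit history: two commits for Jan 1, one for Jan 2.
history = [(3, "2021-01-01"), (4, "2021-01-01"), (5, "2021-01-02")]
index = versions_by_granule(history)

if __name__ == "__main__":
    print(snapshot_query("dim_customer", "2021-01-01", index))
```

Only the version numbers are stored per period; the data files shared between adjacent versions are not duplicated, which is where the storage saving comes from.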

G‑SCD architecture

CDC Solution

EMR DeltaLake can act as a streaming source, generating ChangeData alongside regular DML operations. Upstream Kafka data is batched, committed with snapshot identifiers, and retained as savepoints. Downstream layers (e.g., DWS) read ChangeData to update aggregates, enabling real‑time incremental data warehouses.
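How a downstream layer might fold change data into an aggregate can be sketched as follows. The row layout is an assumption: the change-type labels (insert, update_preimage, update_postimage, delete) and the _change_type column are borrowed from open-source Delta Lake's Change Data Feed, and EMR's ChangeData format may differ. The key and amount fields and the function name are hypothetical.

```python
from typing import Dict, Iterable


def apply_change_data(totals: Dict[str, int],
                      changes: Iterable[dict]) -> Dict[str, int]:
    """Return updated per-key sums after applying a batch of change rows."""
    updated = dict(totals)  # leave the caller's aggregate untouched
    for row in changes:
        kind = row["_change_type"]
        key, amount = row["key"], row["amount"]
        if kind in ("insert", "update_postimage"):
            # New value enters the aggregate.
            updated[key] = updated.get(key, 0) + amount
        elif kind in ("delete", "update_preimage"):
            # Old value leaves the aggregate.
            updated[key] = updated.get(key, 0) - amount
        else:
            raise ValueError(f"unknown change type: {kind}")
    return updated


# Hypothetical batch: key "a" changes from 10 to 15, "b" is new, "c" is deleted.
batch = [
    {"_change_type": "update_preimage", "key": "a", "amount": 10},
    {"_change_type": "update_postimage", "key": "a", "amount": 15},
    {"_change_type": "insert", "key": "b", "amount": 7},
    {"_change_type": "delete", "key": "c", "amount": 3},
]

if __name__ == "__main__":
    print(apply_change_data({"a": 10, "c": 3}, batch))
```

Carrying both the old and the new image of an updated row is what lets the aggregate be adjusted incrementally instead of being recomputed from the full upstream table.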

CDC pipeline

Future Plans

Delta Lake remains a core lake format in EMR, with ongoing investments to deepen DLF integration, enhance table management, lower lake‑onboarding costs, optimize read/write performance, and expand the Alibaba Cloud big‑data ecosystem, supporting customers in building unified lake‑house architectures.
