
Exploring JD's Real‑Time Data Lake with Delta Lake: Architecture, Challenges, and Practical Insights

This article introduces JD's real‑time data warehouse evolution, outlines the limitations of traditional Lambda‑based warehouses, compares open‑source lake formats (Delta, Hudi, Iceberg), explains Delta Lake's transaction‑log architecture and read flow, and demonstrates how a unified batch‑stream development model simplifies data processing and improves reliability.


Guest: Wang Riyu, Big Data Architect at JD
Editor: Liu Ming
Platform: DataFunTalk

Introduction

This talk reviews the past and future of JD's real-time data warehouse, explains how Delta Lake enables incremental offline updates and simplifies the traditional warehouse architecture, and shares business scenarios, implementation experience, and technical challenges encountered on the data lake.

01 Traditional Data Warehouse Challenges

Traditional warehouses rely on a layered Lambda architecture with separate offline and real‑time pipelines. While this design supports batch and streaming workloads, increasing real‑time demands expose several drawbacks:

Inability to guarantee ACID semantics, leading to read‑write conflicts.

Potential unreliability of offline ingestion (e.g., missing data from distributed MySQL sources).

Lack of fine‑grained update/delete operations in Hive, requiring full table or partition rewrites.

Complex data flow paths, causing duplicated logic and inconsistencies between batch and streaming results.

These four challenges motivate a shift toward a data‑lake‑based solution.

02 Exploring Real‑Time Data Lake

Open-source lake formats that have gained popularity since 2019 include Delta, Hudi, and Iceberg. Each has a distinct emphasis: Hudi on incremental upserts, Iceberg on engine-neutral table metadata, and Delta on deep Spark integration; the talk gives a brief comparison of the three.

Delta Lake was chosen because it meets functional requirements (ACID, versioning, multi‑version concurrency), aligns with JD's Spark‑centric development, and allows borrowing useful features from Hudi and Iceberg.

03 Delta Lake Core Principles

1. Delta Lake Overview

Delta Lake provides an open-source storage layer with ACID semantics, composed of data files (Parquet) and a transaction log (_delta_log).

2. Transaction Log Details

The log records when, by whom, and how each commit was made, including data file paths, sizes, and timestamps, along with table metadata (schema, format, properties).
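As a rough illustration, a commit file in _delta_log is newline-delimited JSON, one action per line. The field names below follow Delta's open log protocol, while the file path and values are hypothetical:

```python
import json

# A commit file such as _delta_log/00000000000000000001.json contains
# one JSON "action" object per line.
commit_actions = [
    # commitInfo: who made the commit, when, and what operation it was.
    {"commitInfo": {"timestamp": 1618300800000,
                    "userName": "etl_user",
                    "operation": "WRITE",
                    "operationParameters": {"mode": "Append"}}},
    # add: a data file added by this commit, with path, size, and
    # modification time (path here is made up for illustration).
    {"add": {"path": "part-00000-example.parquet",
             "size": 1024,
             "modificationTime": 1618300800000,
             "dataChange": True}},
]

# Serialize as newline-delimited JSON, the on-disk log format.
commit_json = "\n".join(json.dumps(a) for a in commit_actions)

# Reading the log back: each line parses into one action.
parsed = [json.loads(line) for line in commit_json.splitlines()]
```

Replaying these actions in version order is how a reader reconstructs the table state for any commit.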

3. Reading a Delta Table

To read a table, Spark first locates the latest checkpoint via the _last_checkpoint file, then reads the JSON log files committed after that checkpoint, merges their actions with the checkpoint state, and resolves the exact set of data files for the requested version.

Checkpoints are compacted Parquet files that aggregate earlier JSON logs, reducing read latency.
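The read flow above can be sketched in plain Python. This is simplified (real checkpoints are Parquet files and actions carry many more fields), but the replay logic is the same idea:

```python
def replay(checkpoint_files, log_commits):
    """Reconstruct the live file set for a table version: start from
    the checkpoint snapshot, then apply each later commit's add/remove
    actions in version order."""
    live = set(checkpoint_files)
    for commit in log_commits:  # commits sorted by version number
        for action in commit:
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live

# Example: the checkpoint knows f1 and f2; a later commit compacts
# them into a single file f3.
files = replay(
    ["f1.parquet", "f2.parquet"],
    [[{"remove": {"path": "f1.parquet"}},
      {"remove": {"path": "f2.parquet"}},
      {"add": {"path": "f3.parquet"}}]],
)
```

After replay, only f3.parquet remains live, which is exactly the set of files a query against that version would scan.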

04 Delta Lake Features

Unified batch‑stream read/write

Full ACID guarantees

Support for update/delete operations

Historical versioning and audit

Abstracted storage interface

Improved query performance

05 Integrated Batch‑Stream Development Process

After adopting Delta Lake, JD simplified its pipeline to a single data flow: business DB binlog → Kafka → Spark Streaming → Delta Lake. Both real-time and batch jobs read from the same Delta tables, which reduces development and storage costs and eases debugging and rollback.
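In Spark, the ingestion step is typically expressed as a streaming merge into the Delta table. The upsert semantics it implements can be modeled in a few lines of plain Python (the event shape here is a hypothetical simplification of binlog CDC records):

```python
def apply_binlog(table, events):
    """Apply CDC events (insert/update/delete, keyed by primary key)
    to a dict modeling one snapshot of a Delta table."""
    for ev in events:
        if ev["op"] in ("insert", "update"):
            table[ev["pk"]] = ev["row"]      # upsert the row
        elif ev["op"] == "delete":
            table.pop(ev["pk"], None)        # drop it if present
    return table

# A short binlog stream: row 1 is inserted then updated,
# row 2 is inserted then deleted.
snapshot = apply_binlog(
    {},
    [{"op": "insert", "pk": 1, "row": {"amount": 10}},
     {"op": "update", "pk": 1, "row": {"amount": 15}},
     {"op": "insert", "pk": 2, "row": {"amount": 7}},
     {"op": "delete", "pk": 2}],
)
```

Because Delta supports row-level update and delete with ACID guarantees, both the streaming writer and any batch reader see a consistent snapshot, without the dual-pipeline duplication of the Lambda design.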

Summary

Delta Lake offers powerful features such as SQL-based time travel, dynamic file pruning, and Z-ordering, but it still faces challenges such as small-file proliferation, Hive connector compatibility, and multi-engine support.

Effectively managing small and historical files, along with a custom Hive connector, is required for production stability.
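A common remedy for small-file proliferation is periodic compaction: group small files into roughly target-sized bins and rewrite each bin as one larger file. A minimal greedy planner, assuming only file names and sizes are known, might look like:

```python
def plan_compaction(file_sizes, target=128 * 1024 * 1024):
    """Greedy first-fit planner: pack small files into bins of at most
    roughly `target` bytes; each bin would be rewritten as one file."""
    bins, current, current_size = [], [], 0
    # Smallest files first, so many tiny files merge together.
    for name, size in sorted(file_sizes.items(), key=lambda kv: kv[1]):
        if current and current_size + size > target:
            bins.append(current)             # close the full bin
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        bins.append(current)
    return bins

# Three 40-byte files with a 100-byte target: two fit in one bin,
# the third spills into a second bin.
plan = plan_compaction({"a": 40, "b": 40, "c": 40}, target=100)
```

In production the rewrite itself would be a Spark job that reads each bin and commits the replacement files through the transaction log, so readers never observe a partially compacted table.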

Thank you for listening.

Tags: big data, ACID, real-time data warehouse, data lake, Delta Lake
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
