How JD Transformed Its Data Warehouse with Delta Lake for Real‑Time Analytics
This article examines JD's shift from a traditional Lambda‑based data warehouse to a Delta Lake‑powered real‑time data lake, detailing the challenges of legacy architectures, the evaluation of open‑source table formats, Delta Lake's core mechanisms, and the resulting simplified batch‑stream development workflow.
1. Challenges of Traditional Data Warehouse
JD's traditional data warehouse relies on a Lambda architecture that separates offline (day‑level) and real‑time (second‑level) processing. While this design supports large‑scale analytics, it increasingly shows four major drawbacks:
ACID semantics cannot be guaranteed: a table cannot be safely written and read at the same time, so consistency depends on scheduling read and write jobs at different times.
Potential unreliability of offline ingestion: T+1 batch jobs may miss data if any of the thousands of source MySQL instances fails, causing gaps in downstream analysis.
Lack of fine‑grained update capability: Hive tables require a full rewrite of the affected partition (typically one day of data) for any row‑level change, incurring high I/O and latency.
Complex data‑flow paths: maintaining parallel batch and real‑time pipelines duplicates logic and can produce inconsistent results when business logic changes.
2. Exploration and Experience of a Real‑Time Data Lake
2.1 Open‑source data‑lake candidates
Since 2019 the community has converged on three major table‑format projects: Delta Lake, Apache Hudi, and Apache Iceberg. Their strengths differ in ACID support, update mechanisms, and integration with processing engines.
2.2 Why Delta Lake was chosen
Delta Lake offered the most complete feature set for JD’s requirements—full ACID guarantees, versioned history, and strong Spark integration. The team, already responsible for Spark SQL and shuffle optimizations, leveraged Delta while borrowing useful ideas from Hudi and Iceberg.
3. Core Principles of Delta Lake
3.1 Delta Lake overview
Delta Lake adds an ACID‑compliant storage layer on top of Parquet files. Each Delta table consists of data files and a _delta_log directory that stores the transaction log as JSON files, one per commit.
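The layout described above can be sketched with a few lines of plain Python. This is only an illustration of the directory structure: the table name events and the data file name are made up, and the files are empty placeholders rather than real Parquet or log content. The one detail taken from Delta Lake itself is that commit files are named with a 20‑digit, zero‑padded version number.

```python
import tempfile
from pathlib import Path


def commit_file_name(version: int) -> str:
    """Return the commit file name for a version (20-digit, zero-padded)."""
    return f"{version:020d}.json"


def build_example_table(root: Path) -> list[str]:
    """Create an empty stand-in for a Delta table directory and list its files."""
    log_dir = root / "_delta_log"
    log_dir.mkdir(parents=True)
    # Placeholder data file (hypothetical name); a real table stores Parquet here.
    (root / "part-00000-example.snappy.parquet").touch()
    # Three empty commit files standing in for versions 0, 1 and 2.
    for version in range(3):
        (log_dir / commit_file_name(version)).touch()
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


with tempfile.TemporaryDirectory() as tmp:
    layout = build_example_table(Path(tmp) / "events")
    for entry in layout:
        print(entry)
```

Running it prints the three commit files under _delta_log followed by the placeholder data file.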
3.2 Transaction log details
The log records three dimensions of each commit: when (timestamp), who (user or process), and how (files added, deleted, and table metadata). Example entries show timestamps, file paths, sizes, and schema information.
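To make the when/who/how breakdown concrete, here is a minimal sketch that parses one commit file and summarizes it. The commit content is invented for illustration (paths, sizes, user name, and schema are all hypothetical). The action names commitInfo, metaData, add, and remove follow the Delta transaction log format, but the fields inside commitInfo are free‑form, so the userName field should be read as an assumption.

```python
import json
from datetime import datetime, timezone

# Hypothetical commit: one JSON action per line, as in a _delta_log/*.json file.
EXAMPLE_SCHEMA = json.dumps({
    "type": "struct",
    "fields": [{"name": "order_id", "type": "long", "nullable": True, "metadata": {}}],
})
EXAMPLE_COMMIT = "\n".join(json.dumps(action) for action in [
    {"commitInfo": {"timestamp": 1600000000000, "userName": "etl_job", "operation": "WRITE"}},
    {"metaData": {"id": "example-table-id", "schemaString": EXAMPLE_SCHEMA, "partitionColumns": []}},
    {"add": {"path": "part-00001.parquet", "size": 1024, "dataChange": True}},
    {"remove": {"path": "part-00000.parquet", "dataChange": True}},
])


def summarize_commit(text: str) -> dict:
    """Collect when, who, and how (files added/removed, metadata) from a commit."""
    summary = {"when": None, "who": None, "added": [], "removed": [], "metadata_updated": False}
    for line in text.splitlines():
        if not line.strip():
            continue
        action = json.loads(line)
        if "commitInfo" in action:
            info = action["commitInfo"]
            # Timestamps in the log are milliseconds since the Unix epoch.
            summary["when"] = datetime.fromtimestamp(info["timestamp"] / 1000, tz=timezone.utc)
            summary["who"] = info.get("userName")  # assumed field, may be absent
        elif "add" in action:
            summary["added"].append((action["add"]["path"], action["add"]["size"]))
        elif "remove" in action:
            summary["removed"].append(action["remove"]["path"])
        elif "metaData" in action:
            summary["metadata_updated"] = True
    return summary


commit_summary = summarize_commit(EXAMPLE_COMMIT)
print(commit_summary)
```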
3.3 Reading a Delta table
Reading follows these steps:
Read the _last_checkpoint file in _delta_log to find the most recent checkpoint Parquet file.
Read the JSON log files whose version numbers are greater than the checkpoint's (e.g., versions 11 and 12 for a checkpoint at version 10).
Merge checkpoint data with subsequent logs to reconstruct the table’s current state.
Checkpoints aggregate earlier logs, eliminate redundant entries, and are stored as Parquet to improve Spark read performance.
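The read steps above amount to replaying add and remove actions on top of the checkpoint. The sketch below shows that replay under simplifying assumptions: the checkpoint is modeled as an in‑memory set of file paths rather than a Parquet file, commits are passed in as a dictionary instead of being read from _delta_log, and all file names are hypothetical.

```python
def latest_snapshot(checkpoint_version: int, checkpoint_files: set, commits: dict) -> set:
    """Rebuild the set of live data files from a checkpoint plus later commits.

    commits maps a version number to the list of actions in that commit.
    Commits at or below the checkpoint version are already reflected in the
    checkpoint, so they are skipped.
    """
    live = set(checkpoint_files)
    for version in sorted(v for v in commits if v > checkpoint_version):
        for action in commits[version]:
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


# Checkpoint at version 10 (hypothetical), followed by commits 11 and 12.
checkpoint_files = {"a.parquet", "b.parquet"}
commits = {
    9: [{"add": {"path": "a.parquet"}}],  # already covered by the checkpoint
    11: [{"add": {"path": "c.parquet"}}],
    12: [{"remove": {"path": "a.parquet"}}, {"add": {"path": "d.parquet"}}],
}

snapshot = latest_snapshot(10, checkpoint_files, commits)
print(sorted(snapshot))
```

Passing a lower upper bound on the versions replayed would give the table state as of an earlier commit, which is the idea behind versioned history.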
4. Batch‑Stream Integrated Development Process
After adopting Delta Lake, JD reduced the architecture to a single data‑flow pipeline: binlog events are streamed to Kafka, consumed by Spark Streaming, parsed, and written directly into the Delta lake. The same Delta tables can then be queried by both real‑time and batch jobs, lowering development and storage costs and simplifying rollback and debugging of dirty data.
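The parse‑and‑write step of this pipeline can be illustrated without Kafka or Spark. The sketch below is plain Python, not JD's implementation: it assumes a hypothetical change‑event shape (op, key, row) for parsed binlog records and applies the events in order to a table keyed by primary key. In a Spark job this logic would more likely be expressed as a merge against the Delta table.

```python
def apply_changes(table: dict, events: list) -> dict:
    """Apply parsed change events in order and return the new table state.

    Each event is a dict with "op" ("insert", "update" or "delete"), a
    primary "key", and for inserts/updates a "row" of column values.
    This event shape is an assumption made for the example.
    """
    result = dict(table)
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            # Merge new column values over any existing row for this key.
            result[key] = {**result.get(key, {}), **event["row"]}
        elif op == "delete":
            result.pop(key, None)
        else:
            raise ValueError(f"unknown op: {op!r}")
    return result


orders = {1: {"status": "created", "amount": 30}}
events = [
    {"op": "update", "key": 1, "row": {"status": "paid"}},
    {"op": "insert", "key": 2, "row": {"status": "created", "amount": 12}},
    {"op": "insert", "key": 3, "row": {"status": "created", "amount": 5}},
    {"op": "delete", "key": 3},
]

updated_orders = apply_changes(orders, events)
print(updated_orders)
```

Because the function returns a new dictionary and leaves the input untouched, the previous state stays available, loosely mirroring how versioned tables make rollback of dirty data easier.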
5. Summary and Open Issues
Delta Lake brings ACID semantics, fine‑grained updates, versioned history, abstract storage interfaces, and query‑performance gains to JD’s data platform. Remaining challenges include:
Proliferation of small files that stress the HDFS NameNode.
Limited Hive connector support, requiring custom adaptations for production‑grade versions.
Complexities when integrating Delta tables with external engines such as Presto, which need separate external table definitions.
Overall, the transition demonstrates how a modern data‑lake architecture can simplify batch‑stream development while addressing the shortcomings of legacy Lambda‑based warehouses.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
