How Delta Lake Transformed Our Offline Data Warehouse Performance
This article details how Zuoyebang's engineering team migrated its Hive-based offline data warehouse to Delta Lake, tackling latency, scalability, and query-performance challenges through stream-to-batch processing, data-lake architecture, and optimizations such as DPP and Z-ordering.
Business Background
Zuoyebang uses technology to continuously upgrade its online education platforms. It operates several middle-platform systems, such as teaching and tutoring, as well as a big-data middle platform that builds a company-wide data warehouse, providing metrics such as retention, attendance, and active users to support operational decisions.
Problem & Pain Points
The offline data warehouse built on Hive suffers from three main issues as data volume grows:
ADS (application-layer) table generation is increasingly delayed, so T+1 tables become visible only at T+2.
Hourly tables cannot meet timeliness due to resource constraints, sometimes requiring 24× more compute.
Data exploration is slow: Hive queries take minutes to hours, and Presto runs out of memory on large or complex queries.
Solution & Analysis
The root cause is insufficient Hive compute performance. Two key reasons are:
Hive lacks incremental updates, forcing full recomputation and causing duplicate work.
Hive has no indexes, so even simple queries are executed as full MapReduce jobs.
Solution Research
Three data‑lake options were evaluated: Delta Lake, Iceberg, and Hudi. Considering ease of use, feature set (Delta Lake supports Z‑ordering and data‑skipping), and Alibaba Cloud support, Delta Lake was selected.
Delta Lake‑Based Offline Warehouse
Data flow: Binlog → Canal → custom data‑distribution service → Kafka (preserving table‑level order) → Spark writes to Delta Lake → Spark SQL serves queries.
Key Technical Challenges
1. Stream‑to‑Batch Conversion
ETL jobs trigger when a day’s partition is ready. Since Delta Lake receives streaming writes, a batch snapshot is created only after the day’s data is fully ingested.
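The trigger condition above can be sketched as a small readiness check. This is a minimal illustration, assuming (hypothetically) that ingestion progress is tracked as the maximum binlog timestamp written so far:

```python
from datetime import datetime, timedelta

def day_partition_ready(day: str, max_ingested_ts: datetime,
                        grace: timedelta = timedelta(minutes=10)) -> bool:
    # A day's partition is considered fully ingested once the stream has
    # progressed past the end of that day plus a grace period for
    # late-arriving events; only then is the batch snapshot cut and the
    # downstream ETL job triggered.
    day_end = datetime.strptime(day, "%Y-%m-%d") + timedelta(days=1)
    return max_ingested_ts >= day_end + grace
```

The grace period here is an assumed knob, not a value from the article; in practice the cut-over point is driven by the readiness signals described in the next section.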
2. Defining Data Readiness
Declaring a day's ODS data ready requires handling out-of-order events and using heartbeat tables to detect stalled or missing binlog streams; only then is a Delta Lake savepoint created to guarantee ODS readiness.
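The heartbeat idea can be sketched as follows (names hypothetical): each upstream binlog stream periodically writes a heartbeat row, and readiness for a day is declared only when every stream's latest heartbeat has passed the end of that day, proving no stream silently stalled before its data finished arriving:

```python
from datetime import datetime

def ods_ready(day_end: datetime, last_heartbeat: dict) -> bool:
    # last_heartbeat maps each binlog stream (e.g. a MySQL shard) to the
    # timestamp of its most recent heartbeat row. A stream whose heartbeat
    # lags behind day_end may still hold undelivered events for that day,
    # so the savepoint must wait.
    return all(ts >= day_end for ts in last_heartbeat.values())
```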
3. Maintaining Order Across Sharded Tables
Each logical table is written to a dedicated Kafka topic, with partitioning keyed on the physical (shard) table name, guaranteeing order within each partition.
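A hash-based key choice along these lines (a sketch, not the team's actual producer code) keeps every event of one physical shard in one partition:

```python
import hashlib

def partition_for(physical_table: str, num_partitions: int) -> int:
    # Hash the physical table name (e.g. a hypothetical "orders_0042") so
    # that all of a shard's binlog events map to the same Kafka partition;
    # Kafka then guarantees ordering within that partition, which is all
    # the downstream merge needs.
    digest = hashlib.md5(physical_table.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```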
Performance Optimizations
1) DPP for Write Performance
Using Spark Structured Streaming with MERGE INTO and Dynamic Partition Pruning (DPP) reduces the number of files scanned during merges, cutting batch processing latency from over 20 minutes to about 3 minutes.
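The effect of pruning can be illustrated with a simplified pure-Python model of Delta Lake's file-level min/max statistics (data and names hypothetical): during a merge, only files whose key range overlaps an incoming key must be read and rewritten, and the rest are skipped:

```python
def files_to_rewrite(file_stats, merge_keys):
    # file_stats: (path, min_key, max_key) recorded per data file in the
    # transaction log. A file whose [min, max] range contains no incoming
    # merge key cannot hold a matching row, so the merge never touches it;
    # skipping those files is what shrinks the batch latency.
    return [path for path, lo, hi in file_stats
            if any(lo <= k <= hi for k in merge_keys)]
```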
2) Z‑Ordering for Read Performance
Applying Z-ordering on frequently filtered columns reduces query latency from roughly 10 minutes to about 24 seconds, an improvement of about 25×.
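Z-ordering maps several columns onto one sort key by interleaving their bits (a Morton code). A minimal illustration for two integer columns:

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    # Interleave the bits of two column values: sorting rows by this key
    # clusters them in both dimensions at once, so per-file min/max stats
    # stay tight on either filter column and data skipping works for both.
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x occupies even bit slots
        z |= ((y >> i) & 1) << (2 * i + 1)  # y occupies odd bit slots
    return z
```

Delta Lake builds these keys (with additional handling for non-integer types and value ranges) when clustering data files; this sketch only shows the core bit-interleaving idea.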
3) Z‑Order Build Time Optimization
Single‑pass Z‑order construction lowers build time from ~30 minutes to ~20 minutes, and hotspot‑splitting reduces skewed‑column build time from ~90 minutes to ~30 minutes.
Overall Impact
After six months of development:
ODS readiness advanced from 02:00‑03:00 AM to around 00:10 AM, saving over two hours.
Hourly full‑load tables now finish in ~10 minutes instead of ~40 minutes.
Analyst ad‑hoc query latency dropped from tens of minutes to ~3 minutes.
Future Plans
Remaining challenges include improving data‑correction efficiency, supporting Hive engine compatibility, enabling Flink integration, and reducing the operational cost of maintaining both Spark and Flink ecosystems.
Acknowledgements
Thanks to the Alibaba Cloud EMR Data Lake team for their expertise and support during the migration.
Zuoyebang Tech Team
Sharing technical practices from Zuoyebang