How Delta Lake Transformed Our Offline Data Warehouse Performance
This article details how Zuoyebang's engineering team migrated its Hive-based offline data warehouse to Delta Lake, tackling latency, scalability, and query-performance challenges through stream-to-batch processing, data-lake architecture, and optimizations such as DPP and Z-ordering.
Business Background
Zuoyebang uses technology to continuously upgrade its online education platforms. It operates several middle-platform systems, such as teaching and tutoring, as well as a big-data middle platform that builds a company-wide data warehouse, providing metrics such as retention, attendance, and active users to support operational decisions.
Problem & Pain Points
The offline data warehouse built on Hive suffers from three main issues as data volume grows:
ADS (application-layer) table generation is increasingly delayed, so T+1 tables become visible only at T+2.
Hourly tables cannot meet timeliness due to resource constraints, sometimes requiring 24× more compute.
Data exploration is slow: Hive queries take minutes to hours, and Presto runs out of memory on large or complex queries.
Solution & Analysis
The root cause is insufficient Hive compute performance. Two key reasons are:
Hive lacks incremental updates, forcing full recomputation and causing duplicate work.
Hive has no indexes, so even simple queries are executed as full MapReduce jobs.
Solution Research
Three data‑lake options were evaluated: Delta Lake, Iceberg, and Hudi. Considering ease of use, feature set (Delta Lake supports Z‑ordering and data‑skipping), and Alibaba Cloud support, Delta Lake was selected.
Delta Lake‑Based Offline Warehouse
Data flow: Binlog → Canal → custom data‑distribution service → Kafka (preserving table‑level order) → Spark writes to Delta Lake → Spark SQL serves queries.
Key Technical Challenges
1. Stream‑to‑Batch Conversion
ETL jobs trigger when a day’s partition is ready. Since Delta Lake receives streaming writes, a batch snapshot is created only after the day’s data is fully ingested.
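The trigger condition above can be sketched as a small readiness check. This is a minimal illustration, assuming (hypothetically) that ingestion progress is tracked as the maximum binlog timestamp written so far:

```python
from datetime import datetime, timedelta

def day_partition_ready(day: str, max_ingested_ts: datetime,
                        grace: timedelta = timedelta(minutes=10)) -> bool:
    # A day's partition is considered fully ingested once the stream has
    # progressed past the end of that day plus a grace period for
    # late-arriving events; only then is the batch snapshot cut and the
    # downstream ETL job triggered.
    day_end = datetime.strptime(day, "%Y-%m-%d") + timedelta(days=1)
    return max_ingested_ts >= day_end + grace
```

The grace period here is an assumed knob, not a value from the article; in practice the cut-over point is driven by the readiness signals described in the next section.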
2. Defining Data Readiness
Declaring a day's ODS data ready requires handling out-of-order events and using heartbeat tables to detect stalled or missing binlog streams; only then is a Delta Lake savepoint created to guarantee ODS readiness.
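The heartbeat idea can be sketched as follows (names hypothetical): each upstream binlog stream periodically writes a heartbeat row, and readiness for a day is declared only when every stream's latest heartbeat has passed the end of that day, proving no stream silently stalled before its data finished arriving:

```python
from datetime import datetime

def ods_ready(day_end: datetime, last_heartbeat: dict) -> bool:
    # last_heartbeat maps each binlog stream (e.g. a MySQL shard) to the
    # timestamp of its most recent heartbeat row. A stream whose heartbeat
    # lags behind day_end may still hold undelivered events for that day,
    # so the savepoint must wait.
    return all(ts >= day_end for ts in last_heartbeat.values())
```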
3. Maintaining Order Across Sharded Tables
Each logical table is written to a dedicated Kafka topic, with partitioning keyed on the physical (shard) table name, guaranteeing order within each partition.
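A hash-based key choice along these lines (a sketch, not the team's actual producer code) keeps every event of one physical shard in one partition:

```python
import hashlib

def partition_for(physical_table: str, num_partitions: int) -> int:
    # Hash the physical table name (e.g. a hypothetical "orders_0042") so
    # that all of a shard's binlog events map to the same Kafka partition;
    # Kafka then guarantees ordering within that partition, which is all
    # the downstream merge needs.
    digest = hashlib.md5(physical_table.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions
```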
Performance Optimizations
1) DPP for Write Performance
Using Spark Structured Streaming with MERGE INTO and Dynamic Partition Pruning (DPP) reduces the number of files scanned during merges, cutting batch processing latency from over 20 minutes to about 3 minutes.
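The effect of pruning can be illustrated with a simplified pure-Python model of Delta Lake's file-level min/max statistics (data and names hypothetical): during a merge, only files whose key range overlaps an incoming key must be read and rewritten, and the rest are skipped:

```python
def files_to_rewrite(file_stats, merge_keys):
    # file_stats: (path, min_key, max_key) recorded per data file in the
    # transaction log. A file whose [min, max] range contains no incoming
    # merge key cannot hold a matching row, so the merge never touches it;
    # skipping those files is what shrinks the batch latency.
    return [path for path, lo, hi in file_stats
            if any(lo <= k <= hi for k in merge_keys)]
```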
2) Z‑Ordering for Read Performance
Applying Z-ordering on frequently filtered columns reduces query latency from roughly 10 minutes to about 24 seconds, an improvement of about 25×.
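Z-ordering maps several columns onto one sort key by interleaving their bits (a Morton code). A minimal illustration for two integer columns:

```python
def z_value(x: int, y: int, bits: int = 16) -> int:
    # Interleave the bits of two column values: sorting rows by this key
    # clusters them in both dimensions at once, so per-file min/max stats
    # stay tight on either filter column and data skipping works for both.
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # x occupies even bit slots
        z |= ((y >> i) & 1) << (2 * i + 1)  # y occupies odd bit slots
    return z
```

Delta Lake builds these keys (with additional handling for non-integer types and value ranges) when clustering data files; this sketch only shows the core bit-interleaving idea.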
3) Z‑Order Build Time Optimization
Single‑pass Z‑order construction lowers build time from ~30 minutes to ~20 minutes, and hotspot‑splitting reduces skewed‑column build time from ~90 minutes to ~30 minutes.
Overall Impact
After six months of development:
ODS readiness advanced from 02:00‑03:00 AM to around 00:10 AM, saving over two hours.
Hourly full‑load tables now finish in ~10 minutes instead of ~40 minutes.
Analyst ad‑hoc query latency dropped from tens of minutes to ~3 minutes.
Future Plans
Remaining challenges include improving data‑correction efficiency, supporting Hive engine compatibility, enabling Flink integration, and reducing the operational cost of maintaining both Spark and Flink ecosystems.
Acknowledgements
Thanks to the Alibaba Cloud EMR Data Lake team for their expertise and support during the migration.
Zuoyebang Tech Team
Sharing technical practices from Zuoyebang