Big Data 17 min read

Boosting Offline Data Warehouse Performance with DeltaLake: Key Strategies

This article details how Zuoyebang migrated its Hive‑based offline data warehouse to DeltaLake, addressing latency, incremental updates, and query performance through stream‑to‑batch processing, dynamic partition pruning, and Z‑order optimization, resulting in faster data readiness and analyst queries.

Alibaba Cloud Big Data AI Platform

Jul 21, 2022

Boosting Offline Data Warehouse Performance with DeltaLake: Key Strategies

1. Business Background

Zuoyebang is an online education company offering tools such as Zuoyebang, Zuoyebang Oral Calculation, live K12 courses, and quality‑education products like Little Deer Programming, Little Deer Writing, and Little Deer Art, as well as smart learning hardware. Its big‑data middle platform builds a company‑wide data warehouse that provides business‑oriented metrics (retention, attendance, active users) to improve operational decision efficiency.

The data middle platform overview is divided into three layers:

Data Product & Empowerment Layer : data tools and products built on domain data, supporting BI, trend analysis, etc.

Global Data Layer : standardized modeling via OneModel, creating subject‑oriented data for different latency scenarios, improving upstream product efficiency.

Data Development Layer : systems and platforms for data integration, task development, quality, services, and governance, supporting all internal data development.

This sharing focuses on performance issues of the offline warehouse (day‑level, hour‑level) during production and usage.

2. Problems & Pain Points

The offline warehouse built on Hive provides data from ODS to ADS layers, serving BI and analyst queries.

As business and data volume grew, major issues emerged:

ADS table production delay is increasing : full‑chain build time from ODS to ADS grows; short‑term resource skewing is not scalable, causing T+2 visibility for T+1 tables.

Hour‑level tables are hard to meet : high data volume and resource contention require up to 24× resources for hour‑level calculations on day‑level data.

Data exploration is slow, query stability poor : Hive queries take minutes to hours; Presto acceleration is limited by memory OOM and lack of Hive UDF/VIEW support.

3. Solution

3.1 Problem Analysis

Both slow full‑chain production and slow query performance stem from insufficient Hive compute capability.

Root cause of slow chain : Hive lacks incremental updates; binlog data must be de‑duplicated with full data, causing massive recomputation and latency.

Root cause of slow queries : Hive has no indexes, so even simple queries translate to MR jobs, slowing exploration.

3.2 Solution Research

Addressing incremental updates and indexing can improve performance.

HBase+ORC solution : Real‑time writes via HBase, periodic export to ORC on HDFS. However, non‑key column queries are poor, export timing is complex, and operational cost rises.

Data Lake : Formats like DeltaLake, Iceberg, Hudi provide upserts, versioning, and can be read by engines like Flink/Spark without extra services.

Considering ease of use and features (DeltaLake supports Z‑order/Data Skipping), we selected DeltaLake.

4. Offline Warehouse Based on DeltaLake

Architecture after DeltaLake adoption:

Binlog is captured by Canal, written to Kafka via a custom distribution system that preserves per‑table order, then Spark writes batches to DeltaLake; Spark SQL reads data for analytics.

Key technical challenges:

4.1 Stream‑to‑Batch

ETL jobs trigger on partition readiness. Hive supports date partitioning; DeltaLake requires converting streaming writes to batch partitions.

1. Defining data readiness: ensure ordered streams so that when day‑D data appears, day‑(D‑1) is complete.

Solution splits into:

Defining batch boundaries after ordered streams.

Ensuring stream ordering.

Approach: set logical partition field dt, generate dt from event time, create snapshots for each version, and read corresponding snapshots.

To guarantee ordering across sharded MySQL tables, we route each logical table to its own Kafka topic and hash by physical table, ensuring per‑topic order.

Result: binlog streaming into DeltaLake with partition readiness latency <10 minutes.

4.2 Read/Write Performance Optimization

1. Using DPP to improve write performance.

DeltaLake writes via SparkStreamingSQL using MERGE INTO. Default merge reads all files, causing high latency.

We upgraded DeltaLake with Dynamic Partition Pruning (DPP) to prune files before merge.

Analyze merge condition to locate partition fields.

Enumerate partition values.

Build filter to prune file list.

Read only necessary files for merge.

After DPP, batch processing latency reduced from up to 20 minutes to ~3 minutes.

2. Using Z‑order to improve read performance.

Without Z‑order, queries on 200 billion rows took ~11 minutes; with Z‑order latency dropped to ~24 seconds, a ~25× speedup.

Z‑order benefits:

Data Skipping : file‑level min/max statistics enable filter push‑down.

Z‑order layout : reorders data to keep Z‑order columns locally clustered.

Z‑order build time optimization:

Multi‑column Z‑order built in a single pass, reducing build time from ~30 min to ~20 min.

For skewed data, hotspot dispersion reduces build time from ~90 min to ~30 min.

4.3 Overall Effect

Readiness time improved: ODS to DeltaLake shifted from 02:00‑03:00 to around 00:10, over 2 hours earlier.

Hourly full‑load capability added: latency reduced from ~40 min to ~10 min.

Query speed for analysts boosted from tens of minutes to ~3 minutes.

5. Future Plans

Improve data repair efficiency (DeltaLake rollback currently requires reverting all versions).

Full Hive engine compatibility (features like GSCD, Data Skipping not yet in Hive).

Support Flink integration to reduce ecosystem maintenance.

6. Acknowledgments

Thanks to Alibaba Cloud EMR Data Lake team for their expertise and support during the migration.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Hive presto Spark DeltaLake

Written by

Alibaba Cloud Big Data AI Platform

The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.