Optimizing Real-Time Data Lake Queries on Huawei Cloud with Apache Hudi: Architecture, Indexing, and Performance Enhancements
This article surveys Huawei Cloud's real-time data lake query optimizations built on Apache Hudi: Hudi's query capabilities, clustering and metadata table (MDT) optimizations, index types (min-max, Lucene, bitmap), caching strategies, and planned performance work.
Huawei Cloud's data lake, built on HDFS and OBS, continues to evolve: batch data is ingested into Hudi tables via CDM, while incremental data is captured through CDL binlog streams. Queries are served through HetuEngine, an enhanced PrestoDB, which enables interactive analytics.
Apache Hudi provides rich query capabilities, including ACID transactions, incremental queries, and advanced features such as clustering, the metadata table (MDT), and Flink/Spark write paths. The talk highlights Hudi's evolution from a simple file-organization layer into a unified streaming-batch serving layer.
Performance optimizations focus on two main techniques: clustering and the MDT. Clustering reorganizes the data layout (linear sort, Z-order, or Hilbert curves) to enable file-level pruning, dramatically reducing scan volume. The MDT stores file metadata, column statistics, and Bloom filters, allowing query engines to skip irrelevant files; this is especially beneficial on object storage, where list operations are costly.
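To see why Z-order clustering keeps min-max statistics tight on more than one column, consider Morton-key interleaving. The following is a minimal pure-Python sketch of the idea, not Hudi's actual clustering implementation (the function name `interleave_bits` and the toy records are illustrative only):

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into a single Z-order (Morton) key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # even bit positions come from x
        z |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions come from y
    return z

# Records sorted by their Z-order key place nearby (x, y) pairs into the same
# files, so per-file min-max statistics on EITHER column stay narrow enough
# for pruning -- unlike a plain sort on x alone, which scatters y.
records = [(3, 7), (1, 1), (2, 5), (0, 0)]
records.sort(key=lambda r: interleave_bits(r[0], r[1]))
```

A Hilbert curve serves the same purpose with better locality guarantees, at the cost of a more involved key computation.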
Various index types are examined:
The min-max index (stored in the MDT) enables efficient range pruning when data is pre-sorted.
The Lucene secondary index offers powerful inverted-index search but incurs storage overhead and must be constructed per file.
The bitmap index provides fast equality filtering but can grow large and is less suited to range queries.
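The min-max pruning described above reduces to an interval-overlap check per file. Here is a minimal sketch of that check, assuming a simplified stand-in for the per-file column statistics that the MDT's column-stats partition maintains (the `FileStats` class and file names are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class FileStats:
    """Simplified per-file column statistics, as kept in the MDT."""
    path: str
    min_val: int
    max_val: int

def prune(files: list[FileStats], lo: int, hi: int) -> list[str]:
    """Keep only files whose [min, max] range can overlap the predicate [lo, hi]."""
    return [f.path for f in files if f.max_val >= lo and f.min_val <= hi]

stats = [
    FileStats("f1.parquet", 0, 99),
    FileStats("f2.parquet", 100, 199),
    FileStats("f3.parquet", 200, 299),
]
survivors = prune(stats, 120, 180)  # only f2.parquet can contain matches
```

This is exactly why pruning depends on clustering: if the data is unsorted, every file's [min, max] range is wide, every interval overlaps the predicate, and nothing is skipped.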
Index construction follows a lazy, asynchronous workflow: an index request is generated, scheduled, and executed without blocking data ingestion. Indexes are built per file to avoid the instability of global row IDs, which shift whenever files are rewritten.
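The decoupling of ingestion from index building can be sketched as a producer-consumer queue. This is a toy illustration of the scheduling pattern, not Hudi's async indexer; the queue, worker, and `.idx` suffix are all assumptions for the example:

```python
import queue
import threading

index_requests: "queue.Queue[str | None]" = queue.Queue()
built: list[str] = []

def index_worker() -> None:
    # Drains index requests and builds one index per data file;
    # the ingestion path never waits on this loop.
    while True:
        path = index_requests.get()
        if path is None:
            break
        built.append(f"{path}.idx")  # stand-in for real index construction

worker = threading.Thread(target=index_worker, daemon=True)
worker.start()

# Ingestion side: committing a file just enqueues a request and returns.
for f in ("f1.parquet", "f2.parquet"):
    index_requests.put(f)

index_requests.put(None)  # sentinel: shut the worker down after draining
worker.join()
```

Because each index covers a single file, a rewritten file simply invalidates and rebuilds its own index; no global structure needs repair.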
Caching strategies address bottlenecks such as MDT cold starts, large index loads, and Parquet metadata reads. By caching the MDT, column statistics, index files, and Parquet metadata on executors, query latency on multi-TB tables can be reduced to 1-2 seconds for the majority of queries.
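The payoff of executor-side metadata caching is easy to see in miniature: only the cold start pays the remote read, and every repeated query hits local memory. A minimal sketch using a simulated footer fetch (the `parquet_footer` function, counter, and path are hypothetical, and the returned dict stands in for real footer metadata):

```python
from functools import lru_cache

FOOTER_READS = {"count": 0}

@lru_cache(maxsize=1024)
def parquet_footer(path: str) -> dict:
    """Simulated footer fetch; on object storage each miss is a remote GET."""
    FOOTER_READS["count"] += 1
    return {"path": path, "row_groups": 4}  # illustrative metadata only

for _ in range(100):  # repeated queries reuse the executor-local cache
    parquet_footer("warehouse/t1/f1.parquet")
```

After the loop, `FOOTER_READS["count"]` is 1: a hundred lookups cost a single storage round-trip, which is the same effect the MDT, statistics, and index caches have at table scale.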
Future work includes improving hotspot data caching, building real‑time materialized views, and enhancing MOR table read performance using techniques like delete vectors, as well as further optimizing index and statistics handling.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.