Optimizing Real-Time Data Lake Queries on Huawei Cloud with Apache Hudi: Architecture, Indexing, and Performance Enhancements
This article surveys Huawei Cloud's real-time data lake query optimizations built on Apache Hudi: Hudi's query capabilities, clustering and metadata table (MDT) optimizations, index types (min-max, Lucene, bitmap), caching strategies, and planned performance work.
Huawei Cloud's data lake, built on HDFS and OBS, continues to evolve: batch data is ingested into Hudi tables via CDM, while incremental data is captured through CDL binlog streams. Queries are served through HetuEngine, an enhanced PrestoDB, which enables interactive analytics.
Apache Hudi provides rich query capabilities, including ACID transactions, incremental queries, and advanced features such as clustering, the metadata table (MDT), and Flink/Spark write paths. The talk highlights Hudi's evolution from a simple file-organization layer into a unified streaming-batch serving layer.
Performance optimizations focus on two main techniques: clustering and the MDT. Clustering reorganizes the data layout (linear sort, Z-order, or Hilbert curves) to enable file-level pruning, dramatically reducing scan volume. The MDT stores file metadata, column statistics, and Bloom filters, allowing query engines to skip irrelevant files; this is especially beneficial on object storage, where list operations are costly.
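To see why Z-order clustering keeps min-max statistics tight on more than one column, consider Morton-key interleaving. The following is a minimal pure-Python sketch of the idea, not Hudi's actual clustering implementation (the function name `interleave_bits` and the toy records are illustrative only):

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of x and y into a single Z-order (Morton) key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # even bit positions come from x
        z |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions come from y
    return z

# Records sorted by their Z-order key place nearby (x, y) pairs into the same
# files, so per-file min-max statistics on EITHER column stay narrow enough
# for pruning -- unlike a plain sort on x alone, which scatters y.
records = [(3, 7), (1, 1), (2, 5), (0, 0)]
records.sort(key=lambda r: interleave_bits(r[0], r[1]))
```

A Hilbert curve serves the same purpose with better locality guarantees, at the cost of a more involved key computation.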
Various index types are examined:
The min-max index (stored in the MDT) enables efficient range pruning when data is pre-sorted.
The Lucene secondary index offers powerful inverted-index search but incurs storage overhead and must be constructed per file.
The bitmap index provides fast equality filtering but can grow large and is less suited to range queries.
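The min-max pruning described above reduces to an interval-overlap check per file. Here is a minimal sketch of that check, assuming a simplified stand-in for the per-file column statistics that the MDT's column-stats partition maintains (the `FileStats` class and file names are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class FileStats:
    """Simplified per-file column statistics, as kept in the MDT."""
    path: str
    min_val: int
    max_val: int

def prune(files: list[FileStats], lo: int, hi: int) -> list[str]:
    """Keep only files whose [min, max] range can overlap the predicate [lo, hi]."""
    return [f.path for f in files if f.max_val >= lo and f.min_val <= hi]

stats = [
    FileStats("f1.parquet", 0, 99),
    FileStats("f2.parquet", 100, 199),
    FileStats("f3.parquet", 200, 299),
]
survivors = prune(stats, 120, 180)  # only f2.parquet can contain matches
```

This is exactly why pruning depends on clustering: if the data is unsorted, every file's [min, max] range is wide, every interval overlaps the predicate, and nothing is skipped.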
Index construction follows a lazy, asynchronous workflow: an index request is generated, scheduled, and executed without blocking data ingestion. Indexes are built per file to avoid the instability of global row IDs, which shift whenever files are rewritten.
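The decoupling of ingestion from index building can be sketched as a producer-consumer queue. This is a toy illustration of the scheduling pattern, not Hudi's async indexer; the queue, worker, and `.idx` suffix are all assumptions for the example:

```python
import queue
import threading

index_requests: "queue.Queue[str | None]" = queue.Queue()
built: list[str] = []

def index_worker() -> None:
    # Drains index requests and builds one index per data file;
    # the ingestion path never waits on this loop.
    while True:
        path = index_requests.get()
        if path is None:
            break
        built.append(f"{path}.idx")  # stand-in for real index construction

worker = threading.Thread(target=index_worker, daemon=True)
worker.start()

# Ingestion side: committing a file just enqueues a request and returns.
for f in ("f1.parquet", "f2.parquet"):
    index_requests.put(f)

index_requests.put(None)  # sentinel: shut the worker down after draining
worker.join()
```

Because each index covers a single file, a rewritten file simply invalidates and rebuilds its own index; no global structure needs repair.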
Caching strategies address bottlenecks such as MDT cold starts, large index loads, and Parquet metadata reads. By caching the MDT, column statistics, index files, and Parquet metadata on executors, query latency on multi-TB tables can be reduced to 1-2 seconds for the majority of queries.
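The payoff of executor-side metadata caching is easy to see in miniature: only the cold start pays the remote read, and every repeated query hits local memory. A minimal sketch using a simulated footer fetch (the `parquet_footer` function, counter, and path are hypothetical, and the returned dict stands in for real footer metadata):

```python
from functools import lru_cache

FOOTER_READS = {"count": 0}

@lru_cache(maxsize=1024)
def parquet_footer(path: str) -> dict:
    """Simulated footer fetch; on object storage each miss is a remote GET."""
    FOOTER_READS["count"] += 1
    return {"path": path, "row_groups": 4}  # illustrative metadata only

for _ in range(100):  # repeated queries reuse the executor-local cache
    parquet_footer("warehouse/t1/f1.parquet")
```

After the loop, `FOOTER_READS["count"]` is 1: a hundred lookups cost a single storage round-trip, which is the same effect the MDT, statistics, and index caches have at table scale.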
Future work includes improving hotspot data caching, building real‑time materialized views, and enhancing MOR table read performance using techniques like delete vectors, as well as further optimizing index and statistics handling.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.